Speeding up your Perl code


On Friday morning - on our Perl for Larger Projects course - I was looking at coding efficiency / run speed with delegates. As an example, we took a data file from our web server logs - some 23 Mbytes of data, comprising about 121,000 lines, of which 1099 contained the word "melksham" in lower case.

The control case - [source] - took 1370 seconds (that's about 23 minutes) to run. I read the whole file into a long string scalar:

  open FH,"ac_20110723" or die;
  read FH,$buffer,-s "ac_20110723";

and then looked through it for "melksham" lines, which I reported:

  $count = 0;
  while ($buffer =~ /.*melksham.*/g) {
    print "$&\n" if ($debug);
    $count++ ;
    }

Well - I didn't quite report them (I set the debug flag off) because I didn't want to skew my figures with the time taken to scroll the information past the user on the terminal.

I'm often asked about efficient coding ... whether
  $n = $n + 1;
is slower than
  $n += 1;
and if that is slower than
  $n++;
The answer is "yes, it probably gets very slightly more efficient to use one of the shorter forms", but in reality the difference is so slight that it makes no practical difference in most cases. Let's see how we can make a serious difference to our data file analysis example above.

1. Start regular expressions with a literal

All I did at first was to add \n at the start of my regular expression. And that meant that the regular expression handler wasn't trying to match at every start character in the string - it only had to start at each new line. So
  while ($buffer =~ /.*melksham.*/g) {
became
  while ($buffer =~ /\n.*melksham.*/g) {
See [full source]

Tiny code difference? Yes ... but my 1370 seconds runtime dropped to ... just 22 seconds. That's over 60 times faster!
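
As a minimal sketch of the change (the sample buffer below is made up for illustration and is not the real log file), the following counts matches with and without the leading literal. One side effect worth knowing about: a match that has to start with \n cannot start on the very first line of the buffer, since no newline precedes it. Prepending a newline to the buffer is one possible workaround, shown here as an assumption rather than as what the original test program did.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the log file: a few short lines in one string.
my $buffer = join "\n",
    "GET /melksham/index.html",
    "GET /other/page.html",
    "GET /resources/melksham.jpg",
    "GET /about.html",
    "";

# No leading literal: a match may be attempted from any character.
my $plain = 0;
$plain++ while $buffer =~ /.*melksham.*/g;

# Leading literal: a match can only start at a newline, so the first
# line of the buffer (which has no newline before it) is never counted.
my $literal = 0;
$literal++ while $buffer =~ /\n.*melksham.*/g;

# Prepending a newline is one way to bring the first line back in.
my $padded = "\n" . $buffer;
my $fixed  = 0;
$fixed++ while $padded =~ /\n.*melksham.*/g;

print "plain=$plain literal=$literal fixed=$fixed\n";
```

Note too that with the leading \n in the pattern, the matched text itself starts with a newline.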

2. Start regular expressions with a zero width assertion (anchor)

If you're not able to find an appropriate literal to which to key, an anchor is a good but perhaps slightly less efficient alternative:
  while ($buffer =~ /^.*melksham.*/gm) {
see [full source]

which cut the run time down from 1370 to 27 seconds - not as good as the 22 seconds of our first experiment, but still rather good.
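
The m modifier matters here. Without it, ^ only matches at the very start of the string; with it, ^ also matches just after each newline. A small sketch, using made-up sample data rather than the real log file:

```perl
use strict;
use warnings;

# Hypothetical three-line buffer; two of the lines contain "melksham".
my $buffer = "melksham on line one\nnothing here\nmelksham on line three\n";

# Without /m, ^ matches only at the start of the whole string.
my $without_m = () = $buffer =~ /^.*melksham.*/g;

# With /m, ^ matches at the start of every line.
my $with_m = () = $buffer =~ /^.*melksham.*/gm;

print "without /m: $without_m, with /m: $with_m\n";
```

Unlike the leading \n approach, the anchored form also picks up a match on the first line of the buffer.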

3. Don't use $' $` or $& - find an alternative

If you refer to any one of these three variables anywhere in your code, every regular expression match that you perform saves out all three variables in case they're used. So in our example, that's a copy of a 23 Mbyte string at every successful match. And my previous program never actually runs the code that makes the reference. Ouch!

Match becomes:
  while ($buffer =~ /^(.*melksham.*)/gm) {
and my reference to $& becomes a reference to $1:
  print "$1\n" if ($debug);
see [full source]

and my 27 seconds drops to 7 seconds.
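
Put together, the capture-group version of the loop looks something like the sketch below. The sample buffer and the @found array are additions for illustration only. As far as I'm aware, the perlvar documentation notes that from Perl 5.20 a copy-on-write scheme removes most of the penalty for using $&, so how much this change gains will depend on your Perl version.

```perl
use strict;
use warnings;

my $debug  = 0;    # set to 1 to print each matching line
my $buffer = "one melksham line\nno match here\nanother melksham line\n";

my $count = 0;
my @found;         # hypothetical: keeps the matched lines for later checking

while ($buffer =~ /^(.*melksham.*)/gm) {
    push @found, $1;            # $1 holds just the captured line
    print "$1\n" if $debug;     # replaces the earlier print of $&
    $count++;
}

print "$count matching lines\n";
```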

4. Should we read line by line, into a list, or into a single string?

I replaced my read with a while loop that reads the data line by line. Then I replaced that with a <> read into a list, which I processed with a foreach. See [here] and [here].

Incredibly ... the 7 seconds drops to less than a second with foreach and an incredibly fast 0.16 of a second with a while loop. Yes - that's code running 8500 times faster than my control.
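
The two variants can be sketched as follows. To keep the example self-contained it reads from an in-memory filehandle holding made-up data rather than from the real log file, and the variable names are my own for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the log file.
my $data = "first melksham line\nsomething else\nsecond melksham line\n";

# Variant 1: while loop - only one line is held in memory at a time.
open my $fh1, '<', \$data or die "Cannot open in-memory data: $!";
my $while_count = 0;
while (my $line = <$fh1>) {
    $while_count++ if $line =~ /melksham/;
}
close $fh1;

# Variant 2: <> in list context reads every line into a list first,
# and a foreach then works through that list.
open my $fh2, '<', \$data or die "Cannot open in-memory data: $!";
my @lines = <$fh2>;
close $fh2;

my $foreach_count = 0;
foreach my $line (@lines) {
    $foreach_count++ if $line =~ /melksham/;
}

print "while: $while_count, foreach: $foreach_count\n";
```

With a real file you would open the file name in place of \$data. The while form never holds more than one line at a time, which is one plausible reason for it coming out ahead on a large file.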

5. Does replacing a regular expression with a string function make it quicker?

Sometimes, yes ... but in this case, replacing
  while ($buffer =~ /^(.*melksham.*)/gm) {
by
  if (index($buffer,"melksham") >= 0) {
didn't make any noticeable difference.

There are three factors ....
a) The regex has got very simple, and Perl has probably optimised it anyway (a good lesson in encouraging you to use straightforward code)
b) The test machine I'm using (a MacBook Air) has a big chunk of memory rather than a disc, which probably does funny things to the stats. There's certainly no wait time for disc drives ....
c) Times are so short that they can't be measured reliably on a single cycle.

Using the Benchmark module and its timethese function, you can rerun, time, and average out a series of tests ... and that's good for continuing to optimise. See [here] for full source.
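
A minimal sketch of what such a test might look like, comparing a regular expression with index over a list of lines. The generated sample lines, the iteration count of 100 and the test names are all assumptions made for this illustration - they are not the data or settings from the linked source.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical data: 10,000 short lines, every 100th containing "melksham".
my @lines = map {
    $_ % 100 == 0 ? "GET /melksham/$_\n" : "GET /page/$_\n"
} 1 .. 10_000;

my ($regex_count, $index_count);

# Run each named piece of code 100 times and report the time taken.
timethese(100, {
    regex => sub {
        $regex_count = 0;
        for my $line (@lines) {
            $regex_count++ if $line =~ /melksham/;
        }
    },
    index => sub {
        $index_count = 0;
        for my $line (@lines) {
            $index_count++ if index($line, "melksham") >= 0;
        }
    },
});

print "regex found $regex_count, index found $index_count\n";
```

Checking that both versions find the same number of lines is a cheap safeguard against timing code that has quietly stopped doing the same job.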

Perhaps with a little more effort ... that original chunk of code could be 10,000 times faster ... and that's with just a little thought using the techniques described above.
(written 2011-07-30)

This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.
