Speeding up your Perl code

On Friday morning, on our Perl for Larger Projects course, I was looking at coding efficiency / run speed with delegates. As an example, we took a data file from our web server logs - some 23 Mbytes of data, comprising about 121,000 lines, of which 1099 contained the word "melksham" in lower case.

The control case - [source] - took 1370 seconds (that's about 23 minutes) to run. I read the whole file into a long string scalar:
  open FH,"ac_20110723" or die;
  read FH,$buffer,-s "ac_20110723";

and then looked through it for "melksham" lines, which I reported:
  $count = 0;
  while ($buffer =~ /.*melksham.*/g) {
    print "$&\n" if ($debug);
    $count++ ;
  }

well - I didn't quite report them (I set the debug flag off) because I didn't want to skew my figures with the time taken to scroll the information past the user on the terminal.

I'm often asked about efficient coding ... whether
  $n = $n + 1;
is slower than
  $n += 1;
and if that is slower than
  $n++;
The answer is "yes, it probably gets very slightly more efficient to use one of the shorter forms", but in reality the difference is so slight that it makes no practical difference in most cases. Let's see how we can make a serious difference to our data file analysis example above.

1. Start regular expressions with a literal

All I did at first was to add \n at the start of my regular expression. And that meant that the regular expression handler wasn't trying to match at every start character in the string - it only had to start at each new line. So
  while ($buffer =~ /.*melksham.*/g) {
became
  while ($buffer =~ /\n.*melksham.*/g) {
See [full source]

Tiny code difference? Yes ... but my 1370 seconds runtime dropped to ... just 22 seconds. That's over 60 times faster!
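One thing to watch with this trick: a pattern that starts with \n can never begin on the very first line of the buffer, and each match now includes the newline before the line. The following is only a minimal sketch on a made-up four-line buffer (the sample text and the variable names are my own illustration, not the course data); padding the buffer with a leading newline is one possible way around it.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the log file
my $buffer = "melksham first\nswindon\nbath and melksham\nmelksham last\n";

# Original pattern: the engine may start a match anywhere
my $plain = 0;
$plain++ while $buffer =~ /.*melksham.*/g;

# Keyed on a literal newline: the first line has no \n before it
my $keyed = 0;
$keyed++ while $buffer =~ /\n.*melksham.*/g;

# Assumed workaround: put a newline in front of the data
my $padded = "\n" . $buffer;
my $fixed = 0;
$fixed++ while $padded =~ /\n.*melksham.*/g;

print "unanchored: $plain, keyed on \\n: $keyed, padded: $fixed\n";
```

On a 121,000 line log file a single missed first line may not matter, but it is worth knowing that the two patterns are not strictly equivalent.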

2. Start regular expressions with a zero width assertion (anchor)

If you're not able to find an appropriate literal to which to key, an anchor is a good but perhaps slightly less efficient alternative:
  while ($buffer =~ /^.*melksham.*/gm) {
see [full source]

which cut down from 1370 to 27 seconds - not as good as the 22 seconds of our first experiment, but still rather good.
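To see what the anchor is doing, here is a small sketch on a made-up buffer (sample text and names are assumptions for illustration). With the m modifier, ^ matches at the start of the string and after every newline, so the first line is included and the match itself contains no newline. The @- array holds the offset at which the last match started.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the log file
my $buffer = "melksham first\nswindon\nbath and melksham\nmelksham last\n";

my @starts;
while ($buffer =~ /^.*melksham.*/gm) {
    push @starts, $-[0];   # offset at which this match began
}

print "matches began at offsets: @starts\n";
```

Every offset reported is the first character of a line, which shows that the engine only attempted matches at line starts.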

3. Don't use $' $` or $& - find an alternative

If you refer to any of these three variables anywhere in your code, every regular expression match that you perform saves out the three values in case they're used. So in our example, that's some 23 Mbytes copied at every successful match. And my previous program never actually runs the code that makes the reference. Ouch!

Match becomes:
  while ($buffer =~ /^(.*melksham.*)/gm) {
and my reference to $& becomes a reference to $1:
  print "$1\n" if ($debug);
see [full source]

and my 27 seconds drops to 7 seconds.
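As a sketch of the two alternatives I know of (sample data and array names are made up for illustration): a capture group with $1, as above, or the p modifier with ${^MATCH}, which perlvar documents as available from Perl 5.10. perlvar also notes that from Perl 5.20 the copy-on-write mechanism removes most of the $& penalty, so how much you gain depends on your Perl version.

```perl
use strict;
use warnings;
use 5.010;   # /p and ${^MATCH} need Perl 5.10 or later

# Hypothetical sample data standing in for the log file
my $buffer = "melksham first\nswindon\nbath and melksham\n";

# Alternative 1: capture the line and read it back from $1
my @with_capture;
while ($buffer =~ /^(.*melksham.*)/gm) {
    push @with_capture, $1;
}

# Alternative 2: ask for the matched text explicitly with /p
my @with_p;
while ($buffer =~ /^.*melksham.*/gmp) {
    push @with_p, ${^MATCH};
}

print scalar(@with_capture), " lines via capture, ",
      scalar(@with_p), " lines via /p\n";
```

Either way, only the matches that ask for the text pay for saving it.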

4. Should we read line by line, into a list, or into a single string?

I replaced my read with a while loop that reads the data line by line. Then I replaced that with a <> read into a list, which I parsed with a foreach. See [here] and [here].

Incredibly ... the 7 seconds drops to less than a second with foreach and an incredibly fast 0.16 of a second with a while loop. Yes - that's code running 8500 times faster than my control.
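The two approaches look roughly like this. This is my own minimal reconstruction rather than the linked source, and it reads from an in-memory string (an assumption, so that the sketch runs without the log file) instead of ac_20110723.

```perl
use strict;
use warnings;

# Hypothetical sample data; a real run would open the log file instead
my $data = "melksham first\nswindon\nbath and melksham\n";

# Line by line: only one line is held in memory at a time
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $count_while = 0;
while (my $line = <$fh>) {
    $count_while++ if $line =~ /melksham/;
}
close $fh;

# Whole file into a list, then loop over the list
open $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;
my $count_foreach = 0;
foreach my $line (@lines) {
    $count_foreach++ if $line =~ /melksham/;
}

print "while: $count_while, foreach: $count_foreach\n";
```

In both versions each regular expression only ever sees one short line, so there is no long buffer for the engine to scan or copy.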

5. Does replacing a regular expression with a string function make it quicker?

Sometimes, yes ... but in this case, replacing
  while ($buffer =~ /^(.*melksham.*)/gm) {
with
  if (index($buffer,"melksham") >= 0) {
didn't make any noticeable difference.

There are three factors ....
a) The regex has got very simple, and Perl has probably optimised it anyway (a good lesson in encouraging you to use straightforward code)
b) The test machine I'm using (a MacBook Air) has a big chunk of memory rather than a disc, which probably does funny things to the stats. There's certainly no waiting for disc drives ....
c) Times are so short that they can't be measured reliably on a single cycle.

Using the Benchmark module and the timethese method from it, you can rerun, time, and average out a series of tests ... and that's good for continuing to optimise. See [here] for full source.
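A minimal sketch of that approach, comparing a regular expression with index. The sample lines, the labels "regex" and "index", and the one second run time are all assumptions for illustration; use lines from your own data for figures that mean anything.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Hypothetical sample lines; substitute lines read from a real log file
my @lines = ("GET /melksham/index.html", "GET /swindon/index.html") x 500;

my ($by_regex, $by_index) = (0, 0);

# A negative count asks Benchmark to run each test for at least
# that many CPU seconds, rather than a fixed number of iterations
my $results = timethese(-1, {
    regex => sub {
        $by_regex = grep { /melksham/ } @lines;
    },
    index => sub {
        $by_index = grep { index($_, "melksham") >= 0 } @lines;
    },
});

# Print a table comparing the rates of the two tests
cmpthese($results);
```

Because each test is repeated many times, the result is far less sensitive to the single-run noise described above.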

Perhaps with a little more effort ... that original chunk of code could be 10,000 times faster ... and that's with just a little thought using the techniques described above.
(written 2011-07-30)

