On Friday morning, during our Perl for Larger Projects course, I was looking at coding efficiency and run speed with delegates. As an example, we took a data file from our web server logs - some 23 Mbytes of data, comprising about 121,000 lines, of which 1099 contained the word "melksham" in lower case.
The control case -
[source] - took 1370 seconds (that's about 23 minutes) to run. I read the whole file into a long string scalar:
open FH,"ac_20110723" or die;
read FH,$buffer,-s "ac_20110723";
and then looked through it for "melksham" lines, which I reported:
$count = 0;
while ($buffer =~ /.*melksham.*/g) {
print "$&\n" if ($debug);
$count++ ;
}
Well - I didn't quite report them (I left the debug flag off), because I didn't want to skew my figures with the time taken to scroll the information past the user on the terminal output.
I'm often asked about efficient coding ... whether
$n = $n + 1;
is slower than
$n += 1;
and if that is slower than
$n++;
The answer is "yes, it probably gets very slightly more efficient to use one of the shorter forms", but in reality the difference is so slight that it makes no practical difference in most cases. Let's see how we can make a serious difference to our data file analysis example above.
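If you want to check that claim on your own machine, the standard Benchmark module will time the three forms side by side. Here's a minimal sketch - the iteration count is an arbitrary choice of mine, and your figures will differ:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Time the three increment styles; each sub does one increment per call,
# so much of the measured time is really sub-call overhead - which is
# part of why the differences come out so small in practice.
timethese(5_000_000, {
    'long form' => sub { my $n = 0; $n = $n + 1 },
    'op assign' => sub { my $n = 0; $n += 1 },
    'autoinc'   => sub { my $n = 0; $n++ },
});
```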
1.
Start regular expressions with a literal
All I did at first was to add \n at the start of my regular expression. That meant that the regular expression engine wasn't trying to match from every character position in the string - it only had to start at each new line. So
while ($buffer =~ /.*melksham.*/g) {
became
while ($buffer =~ /\n.*melksham.*/g) {
See
[full source]
Tiny code difference? Yes ... but my 1370 seconds runtime dropped to ... just 22 seconds. That's over 60 times faster!
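Putting those fragments together, the faster version looks something like this - a sketch rather than the exact course example, with the counting wrapped in a sub and the file name passed in:

```perl
use strict;
use warnings;

# Count lines containing "melksham", slurping the whole file into one
# scalar and keying the regular expression to a literal \n so that the
# engine only attempts a match at each line boundary.
sub count_matches {
    my ($file) = @_;
    open my $fh, "<", $file or die "Cannot open $file: $!";
    my $buffer;
    read $fh, $buffer, -s $file;
    close $fh;
    my $count = 0;
    $count++ while $buffer =~ /\n.*melksham.*/g;
    return $count;
}

print count_matches($ARGV[0]), " matching lines\n" if @ARGV;
```

Note that keying on \n means a match on the very first line of the file would be missed - rarely a worry with a web access log, but the sort of corner case to keep in mind.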
2.
Start regular expressions with a zero width assertion (anchor)
If you're not able to find an appropriate literal to key on, an anchor is a good, though perhaps slightly less efficient, alternative:
while ($buffer =~ /^.*melksham.*/gm) {
see
[full source]
which cut down from 1370 to 27 seconds - not as good as the 22 seconds of our first experiment, but still rather good.
3.
Don't use $' $` or $& - find an alternative
If you refer to any one of these three variables anywhere in your code, every regular expression match that you perform saves out all three of them in case they're needed. In our example, that's some 25 Mbytes copied at every successful match - and my previous program never even ran the line of code that made the reference. Ouch!
The match becomes:
  while ($buffer =~ /^(.*melksham.*)/gm) {
and my reference to $& becomes a reference to $1:
  print "$1\n" if ($debug);
see
[full source]
and my 27 seconds drops to 7 seconds.
4.
Should we read line by line, into a list, or into a single string?
I replaced my read with a while loop that read the data in line by line. Then I replaced that with a <> read into a list, which I worked through with a foreach. See [here] and [here].
Incredibly ... the 7 seconds drops to less than a second with foreach, and to a remarkable 0.16 of a second with a while loop. Yes - that's code running some 8500 times faster than my control.
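For reference, the line-by-line version was along these lines (again a sketch, with the count wrapped in a sub):

```perl
use strict;
use warnings;

# Read the log one line at a time; the regular expression then only
# ever scans a short string, never the whole 23 Mbyte buffer.
sub count_lines {
    my ($file) = @_;
    open my $fh, "<", $file or die "Cannot open $file: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /melksham/;
    }
    close $fh;
    return $count;
}

print count_lines($ARGV[0]), " matching lines\n" if @ARGV;
```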
5.
Does replacing a regular expression with a string function make it quicker?
Sometimes, yes ... but in this case, replacing
  while ($buffer =~ /^(.*melksham.*)/gm) {
by
  if (index($buffer,"melksham") >= 0) {
didn't make any noticeable difference.
There are three factors ....
a) The regex has got very simple, and Perl has probably optimised it anyway (a good lesson in encouraging you to use straightforward code)
b) The test machine I'm using (a MacBook Air) has solid state storage rather than a spinning disc, which probably does funny things to the stats. There's certainly no wait time for disc drives ....
c) Times are so short that they can't be measured reliably on a single cycle.
Using the
Benchmark module and the
timethese method from it, you can rerun, time, and average out a series of tests ... and that's good for continuing to optimise. See
[here] for full source.
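By way of illustration, a timethese comparison of the regular expression against index might be set up like this - the sample lines here are made up, standing in for the real log data:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Made-up sample lines standing in for the real web server log.
my @lines = ("GET /melksham/news HTTP/1.1", "GET /index.html HTTP/1.1") x 5000;

# Rerun each approach 200 times over the sample and report averaged timings.
timethese(200, {
    'regex' => sub {
        my $count = 0;
        for my $line (@lines) { $count++ if $line =~ /melksham/ }
    },
    'index' => sub {
        my $count = 0;
        for my $line (@lines) { $count++ if index($line, "melksham") >= 0 }
    },
});
```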
Perhaps with a little more effort ... that original chunk of code could be 10,000 times faster ... and that's with just a little thought using the techniques described above.
(written 2011-07-30)
Associated topics are indexed as below, or enter http://melksh.am/nnnn for individual articles
P667 - Perl - Handling Huge Data [639] Progress bars and other dynamic reports - (2006-03-09)
[762] Huge data files - what happened earlier? - (2006-06-15)
[975] Answering ALL the delegate's Perl questions - (2006-12-09)
[1397] Perl - progress bar, supressing ^C and coping with huge data flows - (2007-10-20)
[1920] Progress Bar Techniques - Perl - (2008-12-03)
[1924] Preventing ^C stopping / killing a program - Perl - (2008-12-05)
[2376] Long job - progress bar techniques (Perl) - (2009-08-26)
[2805] How are you getting on? - (2010-06-13)
[2806] Macho matching - do not do it! - (2010-06-13)
[2834] Teaching examples in Perl - third and final part - (2010-06-27)
[3375] How to interact with a Perl program while it is processing data - (2011-07-31)