There's something vaguely macho about writing one grand regular expression that does all your filtering in a single line of code - but being macho may be less than efficient. It can be far better to do two shorter matches, with the first quickly rejecting records that don't need to be handled in detail, and the second then doing the rest of the data extraction only for the records where that's needed.
I wrote an example to show this on Thursday.
The scenario is that I have two log files, each containing 471100 records (yes - nearly half a million records each) and I need to scan them both - the first file for lines that refer to web accesses to a /course/ directory on our server, and the second file for lines that refer to arrivals from search engines. So that's nearly a million regular expression matches.
The original (macho) match for the first file was:
if ($line1 =~ m!\d+:(\d+):(\d+):(\d+).*GET /course/! ) {
and for the second file:
if ($line2 =~ /\d+:(\d+):(\d+):(\d+).*[?&]q=/ ) {
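To make the first of those concrete, here's a minimal sketch (not the original script - the two log lines are invented for illustration) applying the combined "macho" pattern exactly as quoted above:

```perl
use strict;
use warnings;

# Two made-up Apache-style log lines: one /course/ access, one not.
my @lines = (
    '192.0.2.1 - - [13/Jun/2010:14:22:07 +0100] "GET /course/perl.html HTTP/1.1" 200 5120',
    '192.0.2.2 - - [13/Jun/2010:14:22:09 +0100] "GET /index.html HTTP/1.1" 200 1024',
);

my $courses = 0;
for my $line (@lines) {
    # One big match: filter AND extract hour, minute, second in one go
    if ($line =~ m!\d+:(\d+):(\d+):(\d+).*GET /course/!) {
        $courses++;
        print "course hit at $1:$2:$3\n";   # prints "course hit at 14:22:07"
    }
}
```

Every line - matching or not - is run through the full pattern, captures and all; that's the cost we're about to attack.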
And when I ran my application, it found 4122 matching records in the first file, 54969 in the second - and 204 of those overlapped:
Dorothy-2:de grahamellis$ perl mt2
4122 54969 204
4.04 0.95 0 0
The CPU time taken was 4.04 seconds, plus 0.95 of a second in system calls. Not bad for nearly a million complex matches - but can we be quicker with longer code, as I suggested earlier? Let's split the matches down:
if ($line1 =~ m!GET /course/!) {
$line1 =~ /\d+:(\d+):(\d+):(\d+)/ ;
and
if ($line2 =~ m![?&]q=!) {
$line2 =~ /\d+:(\d+):(\d+):(\d+)/ ;
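The two-stage version can be sketched on the same two made-up sample lines (again, an illustration rather than the actual script): a cheap literal match throws out most records, and the timestamp extraction runs only on the survivors.

```perl
use strict;
use warnings;

# Two made-up Apache-style log lines: one /course/ access, one not.
my @lines = (
    '192.0.2.1 - - [13/Jun/2010:14:22:07 +0100] "GET /course/perl.html HTTP/1.1" 200 5120',
    '192.0.2.2 - - [13/Jun/2010:14:22:09 +0100] "GET /index.html HTTP/1.1" 200 1024',
);

my $hits = 0;
for my $line (@lines) {
    next unless $line =~ m!GET /course/!;       # stage 1: fast literal reject
    if ($line =~ /\d+:(\d+):(\d+):(\d+)/) {     # stage 2: extract, survivors only
        $hits++;
        print "course hit at $1:$2:$3\n";
    }
}
```

Most lines never reach the capturing pattern at all - that's where the saving comes from.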
Running that, I get the same results ... but 15% faster.
Dorothy-2:de grahamellis$ perl mt0
4122 54969 204
3.32 0.93 0 0
Can I do even better? It turns out that I can. If a regular expression starts with a pattern that can match in many places on each incoming line, it causes the engine within perl to start (and back off) what can be a considerable number of false beginnings - it can keep trying to match, failing, trying to match, failing ... at the expense of a lot of processor time. But if you're able to rephrase your regular expression so that it starts with a literal character, or something else which rarely matches, you can cut down on the number of these false dawns.
If I replace
/\d+:(\d+):(\d+):(\d+)/
with
/:(\d+):(\d+):(\d+)/
then I have changed my regular expression to start it with a literal.
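It's worth satisfying yourself that the two patterns really do capture the same fields - the literal ':' just gives the engine a character it can scan for directly instead of trying `\d+` at every offset. A quick check on an invented sample fragment:

```perl
use strict;
use warnings;

# Invented sample fragment containing an Apache-style timestamp.
my $line = '... [13/Jun/2010:14:22:07 +0100] ...';

# Original pattern: starts with \d+, so many false starts are possible.
my @old = $line =~ /\d+:(\d+):(\d+):(\d+)/;

# Revised pattern: starts with a literal ':' that rarely matches.
my @new = $line =~ /:(\d+):(\d+):(\d+)/;

print "old: @old\n";   # old: 14 22 07
print "new: @new\n";   # new: 14 22 07
```

Same captures, fewer attempted starting positions.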
Running - the same answers yet again, but a further 5% performance gain on the application:
Dorothy-2:de grahamellis$ perl mt1
4122 54969 204
3.15 0.97 0 0
My customers this week have a huge amount of data to analyse - and their task is not as simple as the example I created as a demonstration. Their job takes 4 days to run, and they have a target of reducing that to 2 or 3 days; before the course, they were wondering whether that would even be possible. But they went away quite optimistic - for if you scale up the timing tests I did, you'll see that I've already shown the sort of technique that can reduce 96 hours to just under 75 hours - nearly a day saved already.
Full demonstration source code -
[here].
There are a number of regular expression efficiency examples on our
Perl for larger projects course - the notes above apply to them too.
See ...
[here] for our "control case" from the course where we have not worried about efficiency.
Then ...
[here] is an improved example - simply by not using the global variables $&, $' and $`, which can make a significant saving in some circumstances.
Next ...
[here] by using the /o modifier, you can save Perl a lot of effort in re-analysing regular expressions, and with that also a lot of time (think also "compiled regular expressions" via qr//)
If ...
[as shown here] you start your regular expression with something much more specific, the number of look-ahead and backtrack cycles is dramatically reduced, and you can speed things up tremendously.
Finally ...
[here], we have modified the regular expression to begin with a start anchor, and that addition to the code causes the whole thing to eliminate non-matching cases very quickly indeed. The final code was five times faster (in the testing we did as we wrote the notes) than the original "control case" where - it must be admitted - we intentionally made a number of newbie errors!
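The last two ideas can be sketched together on invented sample lines (this is an illustration, not the course code linked above): the pattern is compiled just once with qr// - much the same once-only saving that the /o modifier gives an interpolated pattern - and anchored with ^ so non-matching lines are abandoned immediately instead of being retried at every position.

```perl
use strict;
use warnings;

# Compile once with qr//; the ^ anchor means the engine tests only
# position zero on each line and gives up at once on non-matches.
my $stamp = qr/^(\d+):(\d+):(\d+)/;

my @lines = ('14:22:07 GET /course/perl.html',
             'no leading timestamp on this line');

my $matched = 0;
for my $line (@lines) {
    if ($line =~ $stamp) {
        $matched++;
        print "matched at start: $1:$2:$3\n";   # matched at start: 14:22:07
    }
}
```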
(written 2010-06-13, updated 2010-06-18)