Home Accessibility Courses Twitter The Mouth Facebook Resources Site Map About Us Contact
 
Python and Tcl - public course schedule [here]
Private courses on your site - see [here]
Please ask about maintenance training for Perl, PHP, Lua, etc
 
Macho matching - do not do it!

There's something vaguely macho about doing a grand regular expression match to do all your filtering in a single line of code - but being macho may be less than efficient. It may be far better to do two shorter matches, with the first quickly rejecting records which don't need to be handled in detail, and the second then doing the rest of the data extraction only for records where that's needed.

I wrote an example to show this on Thursday.

The scenario is that I have two log files, each containing 471100 records (yes - nearly half a million records each) and I need to scan them both - the first file for lines that refer to web accesses to a /course/ directory on our server, and the second file for lines that refer to arrivals from search engines. So that's nearly a million regular expression matches.

The original (macho) match for the first file was:
  if ($line1 =~ m!\d+:(\d+):(\d+):(\d+).*GET /course/! ) {
and for the second file:
  if ($line2 =~ /\d+:(\d+):(\d+):(\d+).*[?&]q=/ ) {

And when I ran my application, it found 4122 matching records in the first file, 54,969 in the second - and 204 of those overlapped:

Dorothy-2:de grahamellis$ perl mt2
4122 54969 204
4.04 0.95 0 0


The cpu time taken was 4.05 seconds, plus 0.95 of a second in system calls. Not bad for nearly a million complex matches - but can we be quicker with longer code, as I suggested earlier? Let's split the matches down:
  if ($line1 =~ m!GET /course/!) {
    $line1 =~ /\d+:(\d+):(\d+):(\d+)/ ;

and
  if ($line2 =~ m![?&]q=!) {
    $line2 =~ /\d+:(\d+):(\d+):(\d+)/ ;


Running that, I get the same results ... but 15% faster.

Dorothy-2:de grahamellis$ perl mt0
4122 54969 204
3.32 0.93 0 0


Can I do even better? It turns out that I can. If a regular expression starts with a pattern that can match in many places on each incoming line, it causes the engine within perl to start (and back off) what can be a considerable number of false beginnings - it can keep trying to match, failing, trying to match, failing ... at the expense of a lot of processor time. But if you're able to rephrase your regular expression so that it starts with a literal character, or something else which rarely matches, you can cut down on the number of these false dawns.

If I replace
  /\d+:(\d+):(\d+):(\d+)/
with
  /:(\d+):(\d+):(\d+)/
then I have changed my regular expression to start it with a literal.

Running - the same answers yet again, but a further 5% performance gain on the application:

Dorothy-2:de grahamellis$ perl mt1
4122 54969 204
3.15 0.97 0 0


Learning how to sort a hashMy customers this week have a huge amount of data to analyse - and not as simply as the example that I created as a demonstration. Their job takes 4 days to run. They have a target to reduce it to 2 or 3 days, and prior to the course were wondering if that could be possible. But I they went away quite optimistic - for if you extend the time tests I did, you'll see that I've already shown the sort of technique that can be used to reduce 96 hours to just under 75 hours - that's nearly a day saved already.


Full demonstration source code - [here].




There are a number of regular expression efficiency examples on our Per for larger projects course - the notes above apply to them too.

See ... [here] for our "control case" from the course where we have not worried about efficiency.

Then ... [here] is an improved example - simply by not using the global variables $& S' and $` which can make a significant saving in some circumstances.

Next ... [here] using the /o modifier, you can save Perl a lot of effort in re-analysing regular expressions, and with that also a lot of time (think also "compiled regular expressions")

If ... [as shown here] you start your regular expression with something much more specific, the number of lookahead and regress series is dramatically reduced, and you can speed up most tremendously.

Finally ... [here], we have modified the regular expression to begin with a start anchor and that ADDITION to the code causes the whole thing to eliminate non-matching cases very quickly indeed. The final code was five times faster (in the testing we did as we wrote the notes) that the original "control case" where - it must be admitted - we intentionally made a number of newbie's errors!



(written 2010-06-13, updated 2010-06-18)

 
Associated topics are indexed as below, or enter http://melksh.am/nnnn for individual articles
P667 - Perl - Handling Huge Data
  [3375] How to interact with a Perl program while it is processing data - (2011-07-31)
  [3374] Speeding up your Perl code - (2011-07-30)
  [2834] Teaching examples in Perl - third and final part - (2010-06-27)
  [2805] How are you getting on? - (2010-06-13)
  [2376] Long job - progress bar techniques (Perl) - (2009-08-26)
  [1924] Preventing ^C stopping / killing a program - Perl - (2008-12-05)
  [1920] Progress Bar Techniques - Perl - (2008-12-03)
  [1397] Perl - progress bar, supressing ^C and coping with huge data flows - (2007-10-20)
  [975] Answering ALL the delegate's Perl questions - (2006-12-09)
  [762] Huge data files - what happened earlier? - (2006-06-15)
  [639] Progress bars and other dynamic reports - (2006-03-09)

Q804 - Object Orientation and General technical topics - Regular Expression Internals
  [3091] How do regular expressions work / Regular Expression diagrams - (2010-12-17)
  [3090] Matching to a string - what if it matches in many possible ways? - (2010-12-17)
  [2727] Making a Lua program run more than 10 times faster - (2010-04-16)
  [1480] Next course - 7th January 2008, Regular Expressions - (2007-12-21)


Back to
How are you getting on?
Previous and next
or
Horse's mouth home
Forward to
Canal through Melksham - the options and issues
Some other Articles
A course review - for the tutor to complete
Frankfurt in 90 minutes
From home to Nurnberg - journey pictures
Canal through Melksham - the options and issues
Macho matching - do not do it!
Regular Expression Myths
Travelling across Europe
After the Perl course in Nurnberg
Binary data handling with unpack in Perl
4759 posts, page by page
Link to page ... 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 at 50 posts per page


This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.

Link to Ezine home page (for reading).
Link to Blogging home page (to add comments).

You can Add a comment or ranking to this page

© WELL HOUSE CONSULTANTS LTD., 2019: 404 The Spa • Melksham, Wiltshire • United Kingdom • SN12 6QL
PH: 01225 708225 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/mouth/2806_Mac ... o-it-.html • PAGE BUILT: Sat May 27 16:49:10 2017 • BUILD SYSTEM: WomanWithCat