Home Accessibility Courses Twitter The Mouth Facebook Resources Site Map About Us Contact
For 2021 - online Python 3 training - see ((here)).

Our plans were to retire in summer 2020 and see the world, but Coronavirus has lead us into a lot of lockdown programming in Python 3 and PHP 7.
We can now offer tailored online training - small groups, real tutors - works really well for groups of 4 to 14 delegates. Anywhere in the world; course language English.

Please ask about private 'maintenance' training for Python 2, Tcl, Perl, PHP, Lua, etc.
Macho matching - do not do it!

There's something vaguely macho about doing a grand regular expression match to do all your filtering in a single line of code - but being macho may be less than efficient. It may be far better to do two shorter matches, with the first quickly rejecting records which don't need to be handled in detail, and the second then doing the rest of the data extraction only for records where that's needed.

I wrote an example to show this on Thursday.

The scenario is that I have two log files, each containing 471100 records (yes - nearly half a million records each) and I need to scan them both - the first file for lines that refer to web accesses to a /course/ directory on our server, and the second file for lines that refer to arrivals from search engines. So that's nearly a million regular expression matches.

The original (macho) match for the first file was:
  if ($line1 =~ m!\d+:(\d+):(\d+):(\d+).*GET /course/! ) {
and for the second file:
  if ($line2 =~ /\d+:(\d+):(\d+):(\d+).*[?&]q=/ ) {

And when I ran my application, it found 4122 matching records in the first file, 54,969 in the second - and 204 of those overlapped:

Dorothy-2:de grahamellis$ perl mt2
4122 54969 204
4.04 0.95 0 0

The cpu time taken was 4.05 seconds, plus 0.95 of a second in system calls. Not bad for nearly a million complex matches - but can we be quicker with longer code, as I suggested earlier? Let's split the matches down:
  if ($line1 =~ m!GET /course/!) {
    $line1 =~ /\d+:(\d+):(\d+):(\d+)/ ;

  if ($line2 =~ m![?&]q=!) {
    $line2 =~ /\d+:(\d+):(\d+):(\d+)/ ;

Running that, I get the same results ... but 15% faster.

Dorothy-2:de grahamellis$ perl mt0
4122 54969 204
3.32 0.93 0 0

Can I do even better? It turns out that I can. If a regular expression starts with a pattern that can match in many places on each incoming line, it causes the engine within perl to start (and back off) what can be a considerable number of false beginnings - it can keep trying to match, failing, trying to match, failing ... at the expense of a lot of processor time. But if you're able to rephrase your regular expression so that it starts with a literal character, or something else which rarely matches, you can cut down on the number of these false dawns.

If I replace
then I have changed my regular expression to start it with a literal.

Running - the same answers yet again, but a further 5% performance gain on the application:

Dorothy-2:de grahamellis$ perl mt1
4122 54969 204
3.15 0.97 0 0

My customers this week have a huge amount of data to analyse - and not as simply as the example that I created as a demonstration. Their job takes 4 days to run. They have a target to reduce it to 2 or 3 days, and prior to the course were wondering if that could be possible. But I they went away quite optimistic - for if you extend the time tests I did, you'll see that I've already shown the sort of technique that can be used to reduce 96 hours to just under 75 hours - that's nearly a day saved already.

Full demonstration source code - [here].

There are a number of regular expression efficiency examples on our Per for larger projects course - the notes above apply to them too.

See ... [here] for our "control case" from the course where we have not worried about efficiency.

Then ... [here] is an improved example - simply by not using the global variables $& S' and $` which can make a significant saving in some circumstances.

Next ... [here] using the /o modifier, you can save Perl a lot of effort in re-analysing regular expressions, and with that also a lot of time (think also "compiled regular expressions")

If ... [as shown here] you start your regular expression with something much more specific, the number of lookahead and regress series is dramatically reduced, and you can speed up most tremendously.

Finally ... [here], we have modified the regular expression to begin with a start anchor and that ADDITION to the code causes the whole thing to eliminate non-matching cases very quickly indeed. The final code was five times faster (in the testing we did as we wrote the notes) that the original "control case" where - it must be admitted - we intentionally made a number of newbie's errors!

(written 2010-06-13, updated 2010-06-18)

Associated topics are indexed as below, or enter http://melksh.am/nnnn for individual articles
Q804 - Object Orientation and General technical topics - Regular Expression Internals
  [1480] Next course - 7th January 2008, Regular Expressions - (2007-12-21)
  [2727] Making a Lua program run more than 10 times faster - (2010-04-16)
  [3090] Matching to a string - what if it matches in many possible ways? - (2010-12-17)
  [3091] How do regular expressions work / Regular Expression diagrams - (2010-12-17)

P667 - Perl - Handling Huge Data
  [639] Progress bars and other dynamic reports - (2006-03-09)
  [762] Huge data files - what happened earlier? - (2006-06-15)
  [975] Answering ALL the delegate's Perl questions - (2006-12-09)
  [1397] Perl - progress bar, supressing ^C and coping with huge data flows - (2007-10-20)
  [1920] Progress Bar Techniques - Perl - (2008-12-03)
  [1924] Preventing ^C stopping / killing a program - Perl - (2008-12-05)
  [2376] Long job - progress bar techniques (Perl) - (2009-08-26)
  [2805] How are you getting on? - (2010-06-13)
  [2834] Teaching examples in Perl - third and final part - (2010-06-27)
  [3374] Speeding up your Perl code - (2011-07-30)
  [3375] How to interact with a Perl program while it is processing data - (2011-07-31)

Back to
How are you getting on?
Previous and next
Horse's mouth home
Forward to
Canal through Melksham - the options and issues
Some other Articles
A course review - for the tutor to complete
Frankfurt in 90 minutes
From home to Nurnberg - journey pictures
Canal through Melksham - the options and issues
Macho matching - do not do it!
Regular Expression Myths
Travelling across Europe
After the Perl course in Nurnberg
Binary data handling with unpack in Perl
4759 posts, page by page
Link to page ... 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 at 50 posts per page

This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.

Link to Ezine home page (for reading).
Link to Blogging home page (to add comments).

You can Add a comment or ranking to this page

© WELL HOUSE CONSULTANTS LTD., 2021: 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/mouth/2806_Mac ... o-it-.html • PAGE BUILT: Sun Oct 11 16:07:41 2020 • BUILD SYSTEM: JelliaJamb