Training, Open Source Programming Languages


Our email: • Phone: 01144 1225 708225

For 2023 (and 2024 ...) - we are now fully retired from IT training.
We have made many, many friends over 25 years of teaching about Python, Tcl, Perl, PHP, Lua, Java, C and C++ - and MySQL, Linux and Solaris/SunOS too. Our training notes are now very much out of date, but due to upward compatibility most of our examples remain operational and even relevant, and you are welcome to make use of them "as seen" and at your own risk.

Lisa and I (Graham) now live in what was our training centre in Melksham - happy to meet with former delegates here - but do check ahead before coming round. We are far from inactive - rather, enjoying the times that we are retired but still healthy enough in mind and body to be active!

I am also active in many other areas and still look after a lot of web sites - you can find an index ((here))
Macho matching - do not do it!

There's something vaguely macho about doing a grand regular expression match to do all your filtering in a single line of code - but being macho may be less than efficient. It may be far better to do two shorter matches, with the first quickly rejecting records which don't need to be handled in detail, and the second then doing the rest of the data extraction only for records where that's needed.

I wrote an example to show this on Thursday.

The scenario is that I have two log files, each containing 471100 records (yes - nearly half a million records each) and I need to scan them both - the first file for lines that refer to web accesses to a /course/ directory on our server, and the second file for lines that refer to arrivals from search engines. So that's nearly a million regular expression matches.

The original (macho) match for the first file was:
  if ($line1 =~ m!\d+:(\d+):(\d+):(\d+).*GET /course/! ) {
and for the second file:
  if ($line2 =~ /\d+:(\d+):(\d+):(\d+).*[?&]q=/ ) {
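To make the discussion concrete, here is a minimal sketch of the first "macho" match in action. The log line is made up for illustration (the real files aren't reproduced here), but the pattern is the one above:

```perl
use strict;
use warnings;

# Made-up log line (an assumption - the real log files aren't shown here)
my $line1 = '10.0.0.1 - - [13/Jun/2010:14:27:06 +0100] "GET /course/perl.html HTTP/1.1" 200 1234';

# The single "macho" match: timestamp capture and /course/ filter in one pass
my ($hh, $mm, $ss);
if ($line1 =~ m!\d+:(\d+):(\d+):(\d+).*GET /course/!) {
    ($hh, $mm, $ss) = ($1, $2, $3);
    print "course hit at $hh:$mm:$ss\n";   # prints: course hit at 14:27:06
}
```

Note that every line - matching or not - pays the full cost of the timestamp capture and the greedy `.*` scan.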

And when I ran my application, it found 4122 matching records in the first file and 54969 in the second - and 204 of those overlapped:

Dorothy-2:de grahamellis$ perl mt2
4122 54969 204
4.04 0.95 0 0

The CPU time taken was 4.04 seconds, plus 0.95 of a second in system calls. Not bad for nearly a million complex matches - but can we be quicker with longer code, as I suggested earlier? Let's split the matches down:
  if ($line1 =~ m!GET /course/!) {
    $line1 =~ /\d+:(\d+):(\d+):(\d+)/ ;
    ...
    }

  if ($line2 =~ m![?&]q=!) {
    $line2 =~ /\d+:(\d+):(\d+):(\d+)/ ;
    ...
    }

Running that, I get the same results ... but 15% faster.

Dorothy-2:de grahamellis$ perl mt0
4122 54969 204
3.32 0.93 0 0
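The four figures on the second line of those runs look like the output of Perl's built-in times function, which returns user and system CPU seconds for the process and for its children. A minimal harness along those lines (the loop body and line format are my assumptions, not the original mt scripts) might be:

```perl
use strict;
use warnings;

# Hedged sketch: count matches over synthetic data, then report CPU
# time in the style of the runs above.  times() returns
# (user, system, child user, child system) CPU seconds.
my $count = 0;
for my $n (1 .. 100_000) {
    my $line = qq{10.0.0.1 - - [13/Jun/2010:14:27:33 +0100] "GET /course/page$n.html"};
    $count++ if $line =~ m!GET /course/!;
}
my ($user, $system, $cuser, $csystem) = times;
print "$count\n";
print "$user $system $cuser $csystem\n";
```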

Can I do even better? It turns out that I can. If a regular expression starts with a pattern that can match in many places on each incoming line, it causes the engine within perl to start (and back off) what can be a considerable number of false beginnings - it can keep trying to match, failing, trying to match, failing ... at the expense of a lot of processor time. But if you're able to rephrase your regular expression so that it starts with a literal character, or something else which rarely matches, you can cut down on the number of these false dawns.

If I rephrase the timestamp match so that it no longer starts with \d+ - which can begin matching at almost every position on the line - then I have changed my regular expression to start with a literal.
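The page doesn't show the exact rewritten pattern, but one plausible reconstruction (my assumption) is to drop the leading \d+ so the match starts at the literal colon:

```perl
use strict;
use warnings;

my $line = '10.0.0.1 - - [13/Jun/2010:14:27:06 +0100] "GET /course/ HTTP/1.1"';

# Assumed rewrite: starting at the literal ':' means the engine only
# attempts the match at colon positions, rather than starting (and
# backing off) at every run of digits on the line.
my ($hh, $mm, $ss);
if ($line =~ /:(\d+):(\d+):(\d+)/) {
    ($hh, $mm, $ss) = ($1, $2, $3);
    print "$hh:$mm:$ss\n";   # prints: 14:27:06
}
```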

Running - the same answers yet again, but a further 5% performance gain on the application:

Dorothy-2:de grahamellis$ perl mt1
4122 54969 204
3.15 0.97 0 0

My customers this week have a huge amount of data to analyse - and it's not as simple as the example that I created as a demonstration. Their job takes 4 days to run. They have a target to reduce it to 2 or 3 days, and prior to the course were wondering if that could be possible. But they went away quite optimistic - for if you extend the time tests I did, you'll see that I've already shown the sort of technique that can be used to reduce 96 hours to just under 75 hours - that's nearly a day saved already.

Full demonstration source code - [here].

There are a number of regular expression efficiency examples on our Perl for larger projects course - the notes above apply to them too.

See ... [here] for our "control case" from the course where we have not worried about efficiency.

Then ... [here] is an improved example - simply by not using the global variables $&, $' and $`, which can give a significant saving in some circumstances.
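For readers who haven't met them: using $&, $` or $' anywhere in a program historically forced Perl to save copies of the pre-match, match and post-match strings on every successful match. A minimal sketch (sample data is made up) of the cheap alternative - an explicit capture group:

```perl
use strict;
use warnings;

my $text = 'status=200 bytes=1234 time=0.02';

# Capture what you need with parentheses rather than reading $&,
# which would impose a string-copying cost on every match in the program.
my $bytes;
if ($text =~ /bytes=(\d+)/) {
    $bytes = $1;
    print "$bytes\n";   # prints: 1234
}
```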

Next ... [here] using the /o modifier, you can save Perl a lot of effort in re-analysing regular expressions, and with that also a lot of time (think also "compiled regular expressions").
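A related technique - a sketch under my own assumptions, since the course notes linked above aren't reproduced here - is to precompile a pattern that interpolates a variable using qr//, so Perl analyses it once rather than on each use:

```perl
use strict;
use warnings;

# Precompile once with qr// - without this (or /o), a pattern that
# interpolates a variable may be re-analysed on every iteration.
my $dir     = 'course';
my $pattern = qr!GET /$dir/!;

my $hits = 0;
for my $line ('GET /course/a', 'GET /other/b', 'GET /course/c') {
    $hits++ if $line =~ $pattern;
}
print "$hits\n";   # prints: 2
```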

If ... [as shown here] you start your regular expression with something much more specific, the number of false starts and backtracks is dramatically reduced, and you can speed things up tremendously.

Finally ... [here], we have modified the regular expression to begin with a start anchor, and that addition to the code causes the whole thing to eliminate non-matching cases very quickly indeed. The final code was five times faster (in the testing we did as we wrote the notes) than the original "control case" where - it must be admitted - we intentionally made a number of newbie errors!
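As an illustration of why the start anchor pays off (the sample lines are made up): a pattern anchored with ^ fails on a non-matching line after inspecting only its first few characters, instead of retrying at every position along it:

```perl
use strict;
use warnings;

# With ^, the engine tries the match at position 0 only - a line that
# doesn't start with digits is rejected almost immediately.
my @lines = ('14:27:06 GET /course/x', 'some non-matching junk', '15:00:01 GET /course/y');
my $matched = 0;
for my $line (@lines) {
    $matched++ if $line =~ /^(\d+):(\d+):(\d+)/;
}
print "$matched\n";   # prints: 2
```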

(written 2010-06-13, updated 2010-06-18)

Associated topics are indexed as below; follow the links for individual articles.
Q804 - Object Orientation and General technical topics - Regular Expression Internals
  [1480] Next course - 7th January 2008, Regular Expressions - (2007-12-21)
  [2727] Making a Lua program run more than 10 times faster - (2010-04-16)
  [3090] Matching to a string - what if it matches in many possible ways? - (2010-12-17)
  [3091] How do regular expressions work / Regular Expression diagrams - (2010-12-17)

P667 - Perl - Handling Huge Data
  [639] Progress bars and other dynamic reports - (2006-03-09)
  [762] Huge data files - what happened earlier? - (2006-06-15)
  [975] Answering ALL the delegate's Perl questions - (2006-12-09)
  [1397] Perl - progress bar, supressing ^C and coping with huge data flows - (2007-10-20)
  [1920] Progress Bar Techniques - Perl - (2008-12-03)
  [1924] Preventing ^C stopping / killing a program - Perl - (2008-12-05)
  [2376] Long job - progress bar techniques (Perl) - (2009-08-26)
  [2805] How are you getting on? - (2010-06-13)
  [2834] Teaching examples in Perl - third and final part - (2010-06-27)
  [3374] Speeding up your Perl code - (2011-07-30)
  [3375] How to interact with a Perl program while it is processing data - (2011-07-31)


This is a page archived from The Horse's Mouth at - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.


© WELL HOUSE CONSULTANTS LTD., 2024: 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY

PAGE: ... o-it-.html • PAGE BUILT: Sun Oct 11 16:07:41 2020 • BUILD SYSTEM: JelliaJamb