Regular Expression Efficiency

Posted by enquirer (enquirer), 30 September 2002
Here is a regex I wrote to parse our SunServer logfiles. From a quick glance, is it terribly inefficient?


Code:
#! /usr/bin/perl -w

use strict;

while (<DATA>) {

   $_ =~ m|^
            (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})        # capture  clientip
            \s                                          # followed by space
            ([\w-]+)\s                                  # capture '-'  or their membership id
            \[(\d{1,2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})   # then the date
            \s\+\d{4}\]\s"                              # the '  +0100] "' ready for the method on the next line
            (\w{3,4})\s                                 # ermm, the  method
            (\/.*?)\s                                   # The request
            (\w{4}\/\d\.\d)"\s                          # the protocol
            (\d{3})\s([\d-]+?)\s"                       # status & content length
            (.+?)"\s"                                   # referer
            (.*?)"\s"                                   # useragent will need post processing
            (.+?)"                                      # All cookie  string, will need post processing
          |x or next;    # skip lines that do not match, so $1..$11 are never stale

   my $cookies = cookieStringCleaner($11);

   my ($persistant, $session);
   foreach my $loopvar (@$cookies) {

       if ($loopvar =~ /^eBizDAn/i) {
           $persistant = $loopvar;
       }
       elsif ($loopvar =~ /^eBizCo/i) {
           $session = $loopvar;
       }
   }

   print "\n\n\nLINE: $.\nIP: $1\nMEMBER: $2\nDATE: $3\nMETHOD: $4",
         "\nREQUEST: $5\nPROTOCOL: $6\nSTATUS: $7\nCONLEN: $8\nREFERER: $9",
         "\nAGENT: $10\nCOOKIE Persist: $persistant\nCOOKIE Session: $session";

}

#
# SUBROUTINES
#
sub cookieStringCleaner {    # no () prototype: the sub takes one argument

   my $cookieString = shift;

   # clean up the data a bit, remove spaces and '-'
   # the '-' is an error by (other language)  random num generator.
   # taking it out will make lookups easier as they will just be a  number

   $cookieString =~ tr/ //d;
   $cookieString =~ tr/-//d;

   my @cookies = split(/;/, $cookieString);

   return \@cookies;
}

__DATA__




Posted by admin (Graham Ellis), 30 September 2002
Looks good, and it's surprisingly efficient: it starts off with an anchor, and there are a lot of very specific matches, which means there won't be much going forward and backtracking. All your counts are specific or sparse (non-greedy). Some authorities will tell you that sparse is slow; per character it IS, but because it saves so much forward and back stuff, overall it'll make for a quicker match.
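As a rough illustration of the sparse versus greedy point, here is a minimal sketch. The sample string and the helper name first_field are made up for the example and are not taken from the original log. A sparse count stops at the first closing quote, while a greedy count runs to the end of the line and then has to backtrack, finishing on the wrong quote.

```perl
use strict;
use warnings;

# Hypothetical tail of a log line: three quoted fields.
my $tail = '"http://example.com/" "Mozilla/4.0" "id=42"';

# Return the first capture of a pattern applied to some text, or undef.
sub first_field {
    my ($re, $text) = @_;
    my ($field) = $text =~ $re;
    return $field;
}

# Sparse (non-greedy) count: stops at the first quote-space-quote.
my $sparse = first_field(qr/^"(.+?)"\s"/, $tail);

# Greedy count: grabs everything, then backtracks to the LAST quote-space-quote.
my $greedy = first_field(qr/^"(.+)"\s"/, $tail);

print "sparse: $sparse\n";
print "greedy: $greedy\n";
```

Here the sparse version captures just the referer, while the greedy version swallows the referer and the user agent together.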

Specific comments:

1. Your match will fail if you use it on a data file from a server based in the USA, as they have a - not a + in the time zone difference field.

2. You don't need to specify $_ =~ - a match operates on $_ by default.

3. If you save the result of your match into a list, such as
     ($ip,$date,$method ... etc ... ) = (m| .... etc ...);
then you have named variables for each part of the match, which you might find easier to use and maintain later than $11 and things like that!
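The three comments above could be put together along these lines. This is only a sketch: the sample log line, the sub name parse_line and the hash keys are assumptions made for illustration, since the original DATA section was not archived.

```perl
use strict;
use warnings;

# Hypothetical sample line; note the negative time zone offset.
my $sample = '192.168.0.1 - [30/Sep/2002:10:15:30 -0500] '
           . '"GET /index.html HTTP/1.0" 200 1234 '
           . '"http://example.com/" "Mozilla/4.0" "eBizDAn=12-34; eBizCo=56"';

# Parse one log line into named fields; returns a hash ref, or nothing
# if the line does not match.
sub parse_line {
    my ($line) = @_;
    my ($ip, $member, $date, $method, $request, $protocol,
        $status, $conlen, $referer, $agent, $cookies) = $line =~ m|^
            (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) \s     # client IP
            ([\w-]+) \s                                 # '-' or membership id
            \[ (\d{1,2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})  # date and time
            \s [+-]\d{4} \] \s "                        # zone offset, either sign
            (\w{3,4}) \s                                # method
            (/.*?) \s                                   # request
            (\w{4}/\d\.\d) " \s                         # protocol
            (\d{3}) \s ([\d-]+?) \s "                   # status, content length
            (.+?) " \s "                                # referer
            (.*?) " \s "                                # user agent
            (.+?) "                                     # cookie string
        |x or return;
    return {
        ip => $ip, member => $member, date => $date, method => $method,
        request => $request, protocol => $protocol, status => $status,
        conlen => $conlen, referer => $referer, agent => $agent,
        cookies => $cookies,
    };
}

my $fields = parse_line($sample);
print "IP: $fields->{ip}\nDATE: $fields->{date}\nSTATUS: $fields->{status}\n"
    if $fields;
```

Inside the original while (<DATA>) loop the match could be written without any variable on the left at all, since it works on $_ by default.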

Just tiny things - hey, it's a gooden!!!

Posted by enquirer (enquirer), 30 September 2002
Would the 'o' modifier be good here, so that it compiles once?

Posted by admin (Graham Ellis), 30 September 2002
The o modifier only affects regular expressions that include a variable within the regular expression.  Without a variable in the regular expression, Perl already knows that the expression won't change during the life of the program, so it only compiles the regular expression once anyway.
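To see what the o modifier does when a variable IS involved, here is a small sketch. The sub names match_once and match_each are made up for the example. With /o the pattern is built from the first value of the variable and later values are ignored, which is why it needs care.

```perl
use strict;
use warnings;

# Pattern built from a variable, with /o: compiled on first use only,
# so later values of $word are ignored.
sub match_once {
    my ($word, $text) = @_;
    return $text =~ /$word/o ? 1 : 0;
}

# Same pattern without /o: follows the current value of $word.
sub match_each {
    my ($word, $text) = @_;
    return $text =~ /$word/ ? 1 : 0;
}

my @with_o    = map { match_once($_, 'catalog') } qw(cat dog);
my @without_o = map { match_each($_, 'catalog') } qw(cat dog);

print "with /o:    @with_o\n";      # the 'dog' test still uses 'cat'
print "without /o: @without_o\n";
```

The log file regex in this thread has no variables in it, so neither form applies there.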



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.