
Filter Large Log Files.

Posted by pr19939 (pr19939), 1 February 2005
Hi Ellis,
   I came across your following code for bulk filtering of data.
%wanted = (
      "members.aol.com" => 1,
      "www.geocities.com" => 1,
      "groups.yahoo.com" => 1,
      "home.earthlink.net" => 1
      );

# Parse the incoming data looking for matches

while ($line = <>) {
      ($server) = ($line =~ m!//(.*?)/!);
      if ($wanted{$server}) {
              print $line;
              }
      }

I tried the above code, but no luck.
What I need is the opposite of the above code: I want to filter out the lines containing .gif and .jpg extensions from a 350,000-line logfile.
Right now I use the index search method, and it takes 3 hours, so I would like to know the best approach. Run time is my main concern. Sample code:

$fileArray;
my $totlogfile = "$today-TotalLogFile";
my $totlogfile1 = "today-TotalLogFile1";
my $totlogfilebkup="TotalLogFileBkup";
open(total,">$totlogfilebkup") || die("Could not open out file!$outfile");#outfile is declared before
opendir(DIR, "logfiles") or die "couldn't open logs";
         while ( defined ($filename = readdir(DIR)) )
          {
            $index = index($filename,$yesterday);
             if ($index > -1)
             {
               $fileArray[$count] = $filename;
               $count = $count + 1;
                       print "The log file name is $filename.\n";
                     open(logfile,"$filename") || die("Couldx not open file! $logfilename");#$logfilename declared
                       while($line = <logfile>)
                       {
                           chomp($line);
                           unless(( $line =~ /\.gif/i ) || ( $line =~ /\.jpg/i ) || ( $line =~ /\.jpeg/i ) || ( $line =~ /\.js/i ) || ( $line =~ /\.css/i ) || ( $line =~ /tickerServlet/i ) || ( $line =~ /nagios/i ) || ( $line =~ /statusservlet/i ))
                           {
                             print total "$line\n";                                
                           }
                       }
                     close logfile;                        
               }
          }
       closedir(DIR);
close total;      


Can you please help?

Thanks

Thanks in advance.

Posted by admin (Graham Ellis), 1 February 2005
I'm not surprised that it's slow ... but 3 hours??  

I've just run some tests with my own log files and your filter; I have 700,000 lines in 31 log files for January, so that's twice the amount of data you have, and I ran the following as a "control" case - similar to your code but without the prints:

Code:
opendir (DH,".");

while ($file = readdir(DH)) {
       next if ($file !~ /^ac/);
       open (FH,$file);
       print "$file\n";
       while (<FH>) {
       chomp;
       unless (/\.jpg/i || /\.gif/i || /\.jpeg/i ||
               /\.js/i || /\.css/i || /tickerServlet/i ||
               /nagios/i || /statusservlet/i) {
               $wanted ++;
               }

       $all++;

       }
       print "$file $wanted $all\n";
}
@taken = times();
print "@taken\n";


That took 8.1 seconds

By removing the ignore case option on the regular expressions, that time dropped to 4.5 seconds.

By removing the chomp (why are you chomping when you add the \n back on later?) it came down to 4.3 seconds

By using separate tests rather than stringing them all together with a || it came down to 4.1 seconds

By using index rather than regular expressions, it came down to 2.9 seconds.   Here's the final code I was running:

Code:
opendir (DH,".");

while ($file = readdir(DH)) {
       next if ($file !~ /^ac/);
       open (FH,$file);
       print "$file\n";
       while (<FH>) {
       $all++;
       # index returns -1 when the substring is absent,
       # so >= 0 means the line contains it and should be skipped
       next if (index($_,'.jpg') >= 0);
       next if (index($_,'.gif') >= 0);
       next if (index($_,'.jpeg') >= 0);
       next if (index($_,'.js') >= 0);
       next if (index($_,'.css') >= 0);
       next if (index($_,'tickerServlet') >= 0);
       next if (index($_,'nagios') >= 0);
       next if (index($_,'statusservlet') >= 0);
       $wanted ++;

       }
       print "$file $wanted $all\n";
}
@taken = times();
print "@taken\n";
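If you want to repeat this kind of comparison on your own machine, the core Benchmark module can run the alternatives side by side. This is only a sketch: the sub names and the synthetic lines are made up for illustration, only three of the eight patterns are tested to keep it short, and the figures you get will not match the timings quoted in this thread.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-in for log lines (hypothetical data, not real logs):
# half of them are page requests, half are image requests.
my @lines = map {
    ("GET /page$_.html HTTP/1.0\n", "GET /images/pic$_.gif HTTP/1.0\n")
} 1 .. 5_000;

# Case-insensitive regular expressions, as in the first version.
sub count_regex_i {
    my $n = 0;
    for (@lines) { $n++ unless /\.jpg/i || /\.gif/i || /\.css/i }
    return $n;
}

# The same tests without the ignore-case option.
sub count_regex {
    my $n = 0;
    for (@lines) { $n++ unless /\.jpg/ || /\.gif/ || /\.css/ }
    return $n;
}

# Plain substring search; index returns -1 when nothing is found.
sub count_index {
    my $n = 0;
    for (@lines) {
        next if index($_, '.jpg') >= 0;
        next if index($_, '.gif') >= 0;
        next if index($_, '.css') >= 0;
        $n++;
    }
    return $n;
}

# Run each variant 100 times and print a comparison table.
cmpthese(100, {
    regex_i => \&count_regex_i,
    regex   => \&count_regex,
    index   => \&count_index,
});
```

All three subs count the same lines, so any difference in the table comes from how the test is done rather than from what is being tested.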


As a final test, I took out all the record filtering and my script ran in 2.1 seconds ... that really shows how the time was used!

Suggestion - try out my changes in your program as appropriate. I don't know where your 3 hours is going though ... seems very odd.   And use the times function (perhaps calling it in several places) to help you find where the efficiency issue is.
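
As a rough sketch of calling times in several places: the helper name cpu_seconds, the two stages, and the synthetic data below are all made up for illustration and are not taken from either script in this thread.

```perl
use strict;
use warnings;

# Return user + system CPU seconds used so far by this process.
# times() returns (user, system, child user, child system).
sub cpu_seconds {
    my ($user, $system) = times();
    return $user + $system;
}

my $start = cpu_seconds();

# Stage 1: build some lines, standing in for "read the log files".
my @lines = map { "GET /page$_.html\n" } 1 .. 50_000;
my $after_build = cpu_seconds();

# Stage 2: the filtering step; grep in scalar context returns a count.
my $kept = grep { index($_, '.gif') < 0 } @lines;
my $after_filter = cpu_seconds();

printf "build:  %.2f s\n", $after_build - $start;
printf "filter: %.2f s\n", $after_filter - $after_build;
print "kept $kept lines\n";
```

Note that times reports CPU time, not elapsed time, so a program that spends hours waiting on disk or network may show small numbers here; comparing against time() before and after may help tell the two apart.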

Posted by pr19939 (pr19939), 8 February 2005
Hi Ellis,
         I tried your sample code. The first code worked fine and printed the number of lines satisfying the search criteria, but I could not find a way to write the output into a new file. I am not able to use two file handles (one for opening the log file to be read and the other for writing the output into a new file).

The "next-if" structure did not work.
Please advise.

Thanks in advance

Posted by admin (Graham Ellis), 8 February 2005
on 02/08/05 at 09:34:55, pr19939 wrote:
Hi Ellis,
         I tried your sample code. The first code worked fine and printed the number of lines satisfying the search criteria, but I could not find a way to write the output into a new file. I am not able to use two file handles (one for opening the log file to be read and the other for writing the output into a new file).


You can open a file for writing using
     open (FHO,">abbcc.txt");
and write to it using
     print FHO;
or similar
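
Putting the two handles together might look something like the sketch below. The sub name filter_log, the file names, and the four sample lines are hypothetical, and it uses lexical file handles with three-argument open rather than the bareword style above; treat it as one possible arrangement, not as the code from my tests.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Substrings marking lines to drop (the list used earlier in this thread).
my @unwanted = ('.jpg', '.gif', '.jpeg', '.js', '.css',
                'tickerServlet', 'nagios', 'statusservlet');

# Copy lines from $in_path to $out_path, skipping any line that
# contains an unwanted substring. Returns (lines kept, lines read).
sub filter_log {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";
    my ($kept, $total) = (0, 0);
    LINE: while (my $line = <$in>) {
        $total++;
        for my $pattern (@unwanted) {
            # index returns -1 when the substring is absent
            next LINE if index($line, $pattern) >= 0;
        }
        print {$out} $line;
        $kept++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return ($kept, $total);
}

# Small demonstration with a made-up four-line log in a temporary directory.
my $dir      = tempdir(CLEANUP => 1);
my $in_path  = "$dir/access.log";
my $out_path = "$dir/filtered.log";
open(my $fh, '>', $in_path) or die "Cannot write $in_path: $!";
print {$fh} "GET /index.html\n", "GET /logo.gif\n",
            "GET /style.css\n",  "GET /about.html\n";
close $fh;

my ($kept, $total) = filter_log($in_path, $out_path);
print "kept $kept of $total lines\n";
```

The input handle and the output handle are independent of each other, so both can stay open for the whole of the read loop.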

Quote:
The "next-if" structure did not work.
Please advise.

Thanks in advance


That's a tricky one; my examples were written and tested against my log files.   If you can tell me how it failed for you (compile error, didn't work at runtime, something else), and perhaps post a line or two of the code that failed, I'll be much better able to offer specific advice.



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.


© WELL HOUSE CONSULTANTS LTD., 2014: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho