For 2023 (and 2024 ...) - we are now fully retired from IT training.
We have made many, many friends over 25 years of teaching about Python, Tcl, Perl, PHP, Lua, Java, C and C++ - and MySQL, Linux and Solaris/SunOS too. Our training notes are now very much out of date, but due to upward compatibility most of our examples remain operational and even relevant, and you are welcome to make use of them "as seen" and at your own risk.

Lisa and I (Graham) now live in what was our training centre in Melksham - happy to meet with former delegates here - but do check ahead before coming round. We are far from inactive - rather, enjoying the times that we are retired but still healthy enough in mind and body to be active!

I am also active in many other areas and still look after a lot of web sites - you can find an index ((here))
Reduce the time taken for Huge Log files

Posted by pr19939 (pr19939), 18 March 2005
Hi,
   Can you please help me in optimising the code given below. The code iterates through a single log file of about 3 GB in size, which contains many lines matching any one of the businesses specified in the array. For each business in turn, it runs through the log file, picks up the lines containing that particular business and writes them into a new file named after the business. I will then use the consolidated per-business files to calculate the number of hits on each site.

Code :

#!/usr/bin/perl
my @businesses = (  
                   ["\"cfeurope.home.ge.com\"","new_cfeurope_home_ge_com.log"],
                   ["\"marketing.ge.com\"","new_marketing_ge_com.log"]                        
               );
my $rows = scalar(@businesses);
#---------------Code to get todays date ---------------------------------------
my $today = time();
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($today);  
$year += 1900;
$mon++;
$d = "$mday";
if ($d <10) {
       $d = "0$d";
}
$m ="$mon";
if ($m<10) {
       $m = "0$m";
}
$today = "$year" . "$m" . "$d";
#---------------Code to get yesterdays date ---------------------------------------
my $yesterday = time() - ( 24 * 60 * 60 );
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($yesterday);
$year += 1900;
$mon++;
$d = "$mday";
if ($d <10) {
       $d = "0$d";
}
$m ="$mon";
if ($m<10) {
       $m = "0$m";
}
$yesterday = "$year" . "$m" . "$d";
#-------------------------------------------------------------------------------
$outfile="consolidatedlog.txt";
my $count = 1;
open (OUT,">$outfile") or die ("Could not open $outfile");
opendir(DIR, "/inside29/urchin/test/logfiles") or die "couldn't open igelogs";
 while ( defined ($filename = readdir(DIR)) ) {
    $index = index($filename, "2005030");      # only pick log files whose names contain this date prefix
    if ($index > -1) {
       my $date = localtime();
       print "$count The log $filename started at $date.\n";
       open(OLD, "/inside29/urchin/test/logfiles/$filename") || die ("Could not open $filename");
       while (<OLD>) {
          print OUT $_;
       }
       close OLD;
       $date = localtime();
       print "$count The log $filename ended at $date.\n";
       $count = $count + 1;
    }
 }
closedir (DIR);
close OUT;
#-------------------------------------------------------------
$ct = 0;
while ($ct < $rows) {
my $outfile = "/inside29/urchin/test/newfeed/20050307-monday-$businesses[$ct][1]";
my $newigebusiness = "$businesses[$ct][0]";
my $date = localtime();
print  "$ct log started for $newigebusiness at $date\n";
open(OUT,">>$outfile") || die("Could not open out file!$outfile");
open(OLD,"consolidatedlog.txt") || die ("Could not open consolidatedlog.txt");
  while ( <OLD>) {
   
    if ((index($_,$newigebusiness))> -1)
     {
     print OUT $_;
     }
     
  }
 close OLD;
  close OUT;
  $date = localtime();
  print "$ct log created for $newigebusiness at $date\n";
  $ct = $ct + 1;
}  

End of Code.


Sample Log file :

Line 1:
[16/Jan/2005:00:00:40 -0500] "GET /ge/ige/1/1/4/common/cms_portletview2.html HTTP/1.1" 200 1702 0 "http://erc.home.ge.com/portal/site/insurance/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-" "erc.home.ge.com"

Line 2:
[16/Jan/2005:00:00:40 -0500] "GET /portal/site/transportation/menuitem.8c65c5d7b286411eb198ed10b424aa30/ HTTP/1.1" 200 7596 0 "http://geae.home.ge.com/portal/site/transportation/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" "-" "geae.home.ge.com"

Line 3:
[16/Jan/2005:00:00:41 -0500] "GET /ge/ige/26/83/409/common/cms_portletview.html HTTP/1.1" 200 7240 0 "http://erc.home.ge.com/portal/site/insurance/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-" "erc.home.ge.com"

There will be around a million lines in the log file.
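
As an aside, the per-business matching in this thread keys off the trailing quoted virtual-host field at the end of each line. A minimal sketch of extracting that field from one such line (not part of the original post, and assuming every line ends with a quoted host name as in the samples above):

  # Sketch: pull the trailing quoted virtual-host field from a line in the sample format.
  my $line = '[16/Jan/2005:00:00:40 -0500] "GET /x HTTP/1.1" 200 1702 0 "http://erc.home.ge.com/" "Mozilla/4.0" "-" "erc.home.ge.com"';
  my ($vhost) = $line =~ /"([^"]+)"\s*$/;    # last quoted field on the line
  print "$vhost\n";                          # prints erc.home.ge.com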

Please help.

Thanks


Posted by admin (Graham Ellis), 18 March 2005
Hi ... I have some ideas but it's not a quick answer ... I'll post up a fuller reply and some sample code in the next 24 hours.

Graham

P.S.  Might be a good idea if you trimmed your post - just leave the first few and last few lines of the declaration of "@businesses". Left as it is, you may be providing a database of systems that you really don't want to provide in this public place. Just a thought ....

Posted by admin (Graham Ellis), 18 March 2005
Thanks for trimming that list.

Now - how to answer?   I think I would tackle the application very differently; I'm very concerned at all the date manipulation and reading and re-reading of data - best to do it all at once.
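
As a small aside on the date handling: each of the two blocks of manual zero-padding in the original can be replaced by a single strftime call from the core POSIX module. A minimal sketch, not taken from the original code:

  use POSIX qw(strftime);
  my $today     = strftime("%Y%m%d", localtime());                   # e.g. 20050318
  my $yesterday = strftime("%Y%m%d", localtime(time() - 24*60*60));  # the previous day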

I have a 300Mb set of log files for February, and to try out my ideas I wrote a script to read them all in and "spray" them out to a number of different files based on the last part of the host name or IP address ... and that turned out to be 410 output files.

You can't have such a lot of files open all the time (you would run out of file handles), so I collect records in memory (in a hash of lists), building up each list as I read the files.  Whenever any of the hash elements gets to 5000 elements, I open the appropriate output file for append, write the 5000 records out, and clear the list.

While doing this, by the way, I kept a total count of the number of records being written to each file ... which I think was your final objective??  

After handling the last file, a loop through all the elements of the hash flushed out the final information to all the output files.

Here's the code:

Code:
$start = localtime();
opendir(DH,"net");
while ($file = readdir DH) {
       next if ($file =~ /^\./);       # skip . and .. and other dot files
       print "Handling $file\n";
       open (FH,"net/$file");
       while (<FH>) {
                ($host) = /^\S+\.(\S+)\s/;      # last part of the leading host name or IP address
               $counter{$host}++;
               push @{$table{$host}},$_;
                if ($counter{$host} % 5000 == 0) {      # flush this host's buffer every 5000 records
                       open (FHO,">>netsep/$host");
                       print FHO @{$table{$host}};
                       @{$table{$host}} = ();
                       close FHO;
                       }
               }
       }
foreach $host (keys %table) {
       open (FHO,">>netsep/$host");
       print FHO @{$table{$host}};
       close FHO;
       }
$end = localtime();
print "Started $start and ended $end\n";
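
The %counter hash above already holds the per-file record totals mentioned earlier; a couple of extra lines after the final flush would report them - a sketch, not part of the original code:

foreach $host (sort keys %counter) {
       print "$host: $counter{$host} records\n";
       }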


The elapsed time (run on my laptop) was 22 seconds. You might want to try adjusting the 5000 figure - with it set to 1000, the run time increased to 35 seconds, but increasing it to 10000 made no difference (and probably caused me to use a lot more memory).

Please feel free to adopt / adapt this approach - it's not identical to yours, but the metrics aren't much different.   I would guess that your code probably took a very long time indeed to run?  You haven't told me what would count as an improvement, so I hope I have achieved one!
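
To adapt the same buffering idea to the original per-business split and hit counts, a rough sketch (using the business strings and output file names from the first post and the consolidatedlog.txt intermediate file; an illustration only, not tested against the real data) might look like:

my @businesses = (
       ['"cfeurope.home.ge.com"', 'new_cfeurope_home_ge_com.log'],
       ['"marketing.ge.com"',     'new_marketing_ge_com.log'],
       );
open (FH, "consolidatedlog.txt") or die "Could not open consolidatedlog.txt";
while (<FH>) {
       foreach $b (@businesses) {
               next unless index($_, $b->[0]) > -1;    # line not for this business
               $hits{$b->[1]}++;                       # count hits per output file
               push @{$table{$b->[1]}}, $_;            # buffer the record in memory
               if ($hits{$b->[1]} % 5000 == 0) {       # flush every 5000 records
                       open (FHO, ">>$b->[1]") or die "Could not open $b->[1]";
                       print FHO @{$table{$b->[1]}};
                       @{$table{$b->[1]}} = ();
                       close FHO;
                       }
               last;                                   # assume a line matches at most one business
               }
       }
close FH;
foreach $name (keys %table) {                           # final flush, then report the counts
       open (FHO, ">>$name") or die "Could not open $name";
       print FHO @{$table{$name}};
       close FHO;
       print "$name: $hits{$name} hits\n";
       }

With only a handful of businesses you could equally keep one filehandle open per business for the whole run; the buffering only really matters when the number of output files gets large.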



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.

