Reduce the time taken for Huge Log files

Posted by pr19939 (pr19939), 18 March 2005

Hi, can you please help me in optimizing the code given below? The code iterates through a single log file of size 3 GB, which has many lines containing any one of the businesses specified in the array. As it iterates with each business against the log file, it picks up the lines containing that particular business and writes them into a new file named after the business. Basically I will be using the consolidated business files to calculate the number of hits on each site.

Code :

#!/usr/bin/perl
use strict;
use warnings;

my @businesses = (
    ["\"cfeurope.home.ge.com\"", "new_cfeurope_home_ge_com.log"],
    ["\"marketing.ge.com\"",     "new_marketing_ge_com.log"],
);
my $rows = scalar(@businesses);

# --------------- Code to get today's date (YYYYMMDD) ---------------
my ($sec, $min, $hour, $mday, $mon, $year) = localtime(time());
my $today = sprintf("%04d%02d%02d", $year + 1900, $mon + 1, $mday);

# --------------- Code to get yesterday's date (YYYYMMDD) -----------
($sec, $min, $hour, $mday, $mon, $year) = localtime(time() - 24 * 60 * 60);
my $yesterday = sprintf("%04d%02d%02d", $year + 1900, $mon + 1, $mday);

# -------------------------------------------------------------------
my $outfile = "consolidatedlog.txt";
my $count   = 1;
open(OUT, ">", $outfile) or die("Could not open $outfile");
opendir(DIR, "/inside29/urchin/test/logfiles") or die "couldn't open igelogs";
while (defined(my $filename = readdir(DIR))) {
    # the closing characters of this call were eaten by a forum
    # smiley in the original post; "20050308" is the most likely
    # original date stamp
    my $index = index($filename, "20050308");
    if ($index > -1) {
        my $date = localtime();
        print "$count The log $filename started at $date.\n";
        open(OLD, "<", "/inside29/urchin/test/logfiles/$filename")
            or die("Could not open $filename");
        while (<OLD>) {
            print OUT $_;
        }
        close OLD;
        $date = localtime();
        print "$count The log $filename ended at $date.\n";
        $count++;
    }
}
closedir(DIR);
close OUT;

# -------------------------------------------------------------------
my $ct = 0;
while ($ct < $rows) {
    my $outfile = "/inside29/urchin/test/newfeed/20050307-monday-$businesses[$ct][1]";
    my $newigebusiness = $businesses[$ct][0];
    my $date = localtime();
    print "$ct log started for $newigebusiness at $date\n";
    open(OUT, ">>", $outfile) or die("Could not open out file $outfile");
    open(OLD, "<", "consolidatedlog.txt") or die("Could not open consolidatedlog.txt");
    while (<OLD>) {
        if (index($_, $newigebusiness) > -1) {
            print OUT $_;
        }
    }
    close OLD;
    close OUT;
    $date = localtime();
    print "$ct log created for $newigebusiness at $date\n";
    $ct++;
}

End of Code.
Sample Log file :

Line 1: [16/Jan/2005:00:00:40 -0500] "GET /ge/ige/1/1/4/common/cms_portletview2.html HTTP/1.1" 200 1702 0 "http://erc.home.ge.com/portal/site/insurance/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-" "erc.home.ge.com"
Line 2: [16/Jan/2005:00:00:40 -0500] "GET /portal/site/transportation/menuitem.8c65c5d7b286411eb198ed10b424aa30/ HTTP/1.1" 200 7596 0 "http://geae.home.ge.com/portal/site/transportation/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" "-" "geae.home.ge.com"
Line 3: [16/Jan/2005:00:00:41 -0500] "GET /ge/ige/26/83/409/common/cms_portletview.html HTTP/1.1" 200 7240 0 "http://erc.home.ge.com/portal/site/insurance/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-" "erc.home.ge.com"

Around a million lines will be there in the log file. Please help. Thanks

Posted by admin (Graham Ellis), 18 March 2005

Hi ... I have some ideas but it's not a quick answer ... I'll post up a fuller reply and some sample code in the next 24 hours.

Graham

P.S. Might be a good idea if you trimmed your post - just leave the first few and last few lines of the declaration of "@businesses". Left as it is, you may be providing a database of systems that you really don't want to provide in this public place. Just a thought ....

Posted by admin (Graham Ellis), 18 March 2005

Thanks for trimming that list.

Now - how to answer? I think I would tackle the application very differently; I'm very concerned at all the date manipulation and the reading and re-reading of data - best to do it all at once. I have a 300Mb set of log files for February, and to try out my ideas I wrote a script to read them all in and "spray" them out to a number of different files based on the last part of the host name or IP address ... and that turned out to be 410 output files. You can't have such a lot of files open all the time (you would run out of file handles), so I collect records in memory (in a hash of lists), building up each list as I read the file.
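A note on the log format: the site name that the hit counts are keyed on always appears as the last double-quoted field of each record, so a single regular expression can pull it out without any splitting. A minimal sketch against one of the sample lines above (abbreviated):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One record in the shape of the sample log above (abbreviated);
# the site name is the last double-quoted field on the line.
my $line = '[16/Jan/2005:00:00:40 -0500] "GET /ge/ige/1/1/4/common/cms_portletview2.html HTTP/1.1" 200 1702 0 '
         . '"http://erc.home.ge.com/portal/site/insurance/" '
         . '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-" "erc.home.ge.com"';

# capture the final quoted field on the line
my ($host) = $line =~ /"([^"]+)"\s*$/;
print "$host\n";   # erc.home.ge.com
```

Anchoring the match at the end of the line ($) is what keeps the earlier quoted fields (the request and the referrer URL) from matching.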
Whenever any of the hash elements gets to 5000 entries, I open the appropriate output file for append, write the 5000 records out, and clear the list. While doing this, by the way, I kept a total count of the number of records being written to each file ... which I think was your final objective?? After handling the last file, a loop through all the elements of the hash flushed out the final information to all the output files. Here's the code:

Code:
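The listing itself is missing from this archived copy. The approach described above - buffer records per host in a hash of lists, flush a list to its append-mode output file once it reaches 5000 entries, keep a running count as you flush, and empty everything after the last input file - might be sketched like this. The directory handling, the helper names, and the tiny in-memory sample are illustrative assumptions, not Graham's original code:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $outdir   = tempdir();   # stand-in for the real output directory
my $flush_at = 5000;        # the "5000 figure": flush threshold per host
my (%buffer, %counts);

# Append a host's buffered records to its own file, count them,
# and clear the buffer.
sub flush_host {
    my ($host) = @_;
    my $pending = $buffer{$host} or return;
    return unless @$pending;
    open my $out, '>>', "$outdir/$host.log" or die "cannot append $host.log: $!";
    print $out map { "$_\n" } @$pending;
    close $out;
    $counts{$host} += @$pending;
    $buffer{$host}  = [];
}

# File a record under its host name; flush that host's list if full.
sub add_record {
    my ($line) = @_;
    # the site name is the last double-quoted field of the record
    return unless $line =~ /"([^"]+)"\s*$/;
    push @{ $buffer{$1} }, $line;
    flush_host($1) if @{ $buffer{$1} } >= $flush_at;
}

# Stand-in input; the real script would read these lines from the
# 3 GB of log files, one file after another.
my @sample = (
    '... "-" "erc.home.ge.com"',
    '... "-" "geae.home.ge.com"',
    '... "-" "erc.home.ge.com"',
);
add_record($_) for @sample;

# After the last input file, flush whatever is still held in memory.
flush_host($_) for keys %buffer;
print "$_: $counts{$_} hits\n" for sort keys %counts;
```

With the three sample records nothing reaches the 5000-entry threshold, so everything is written by the final flush, and the count report shows two hits for erc.home.ge.com and one for geae.home.ge.com.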
The elapsed time (run on my laptop) was 22 seconds. You might want to try adjusting the 5000 figure - with it set to 1000, the run time increased to 35 seconds, but increasing it to 10000 made no difference (and probably caused me to use a lot more memory). Please feel free to adopt / adapt this approach - it's not identical to yours, but the metrics aren't much different. I would guess that your code probably took a very long time indeed to run? You haven't told me what would count as an improvement, so I hope I have achieved one!

This page is a thread posted to the opentalk forum
at www.opentalk.org.uk and
archived here for reference. To jump to the archive index please
follow this link.
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho |