What are the biggest 10 files in or below this directory?
What are the 20 'worst' spams I have received in the last month?
What are the five top scores recorded for a popular game on my web site?
It's a very common requirement indeed to provide a program to answer questions like these, and if you've only got a handful of files /spams / score records, it's easy to write a program to read them all into an array (PHP) or list (Perl, Python), sort that array or list when you've read them all, and print out the first however-many. But that approach becomes impractically slow and memory greedy if you have a big log file ... as for example the quarter of a million records I have in my spam record file at the moment.
Here's the technique you can use to find
the top 20 records from several million in a log file - quickly, and efficiently ....
1) Set up an empty list to contain the top 20 (so far) as you discover them.
2) pass through the record one by one ...
2a) Work out the comparsion factor (size, score) for the record just read
2b) If you have already read and stored 20 records, and the new record is below the 20th one stored, reject it OTHERWISE ...
2c) Step through the records retained so far and insert the new one at the appropriate place in the list
2d) If the list now contains more that 20 records, truncate it to 20
3) Print out your results from the list you now have.
You can see this algorithm implemented in PHP
here and you can run it
here. It's not the simplest of code, but it should aways work no matter how large or how small the cutoff between the 20th and 21st record is (as opposed to alternative algorithms that set a threshhold), and it should always work quite fast even on a data set that's pretty huge; most of the data will be rejected summarily and won't need to be stored at all.
You
might suggest that my data should be stored in a MySQL database and not a plain text file ... that's not the problem I was given, and is worthy of an entry here another day!
(written 2006-08-20 01:58:48)
Associated topics are indexed under
H115 - Designing PHP-Based Solutions: Best Practice [2430] Not just a PHP program - a good web application - (2009-09-29)
[2221] Adding a newsfeed for your users to a multipage PHP application - (2009-06-06)
[2199] Improving the structure of your early PHP programs - (2009-05-25)
[1794] Refactoring - a PHP demo becomes a production page - (2008-09-12)
[1694] Defensive coding techniques in PHP? - (2008-07-02)
[1623] PHP Techniques - a workshop - (2008-04-26)
[1533] Short and sweet and sticky - PHP form input - (2008-02-06)
[1490] Software to record day to day events and keep an action list - (2007-12-31)
[1487] Efficient PHP applications - framework and example - (2007-12-28)
[1482] A story about benchmarking PHP - (2007-12-23)
[1391] Ordnance Survey Grid Reference to Latitude / Longitude - (2007-10-14)
[1390] Converting from postal address to latitude / longitude - (2007-10-13)
[1389] Controlling and labelling Google maps via PHP - (2007-10-13)
[1381] Using a MySQL database to control mod_rewrite via PHP - (2007-10-06)
[1323] Easy handling of errors in PHP - (2007-08-27)
[1321] Resetting session based tests in PHP - (2007-08-26)
[1194] Drawing hands on a clock face - PHP - (2007-05-19)
[1182] Painting a masterpiece in PHP - (2007-05-10)
[1181] Good Programming practise - where to initialise variables - (2007-05-09)
[1166] Back button - ensuring order are not submitted twice (PHP) - (2007-04-28)
[1052] Learning to write secure, maintainable PHP - (2007-01-25)
[1047] Maintainable code - some positive advice - (2007-01-21)
[945] Code quality counts - (2006-11-26)
[936] Global, Superglobal, Session variables - scope and persistance in PHP - (2006-11-21)
[896] PHP - good coding practise and sticky radio buttons - (2006-10-17)
[572] Giving the researcher power over database analysis - (2006-01-22)
[563] Merging pictures using PHP and GD - (2006-01-13)
[426] Robust checking of data entered by users - (2005-08-27)
[394] A year on - should we offer certified PHP courses - (2005-07-28)
[340] Code and code maintainance efficiency - (2005-06-08)
[261] Putting a form online - (2005-03-29)
[237] Crossfertilisation, PHP to Python - (2005-03-06)
[123] Short underground journeys and a PHP book - (2004-11-19)
H999 - Additional PHP Material [2215] If nothing, make it nothing. - (2009-06-02)
[2073] Extra PHP Examples - (2009-03-09)
[1519] Flipping images on your web page - (2008-01-26)
[1505] Script to present commonly used images - PHP - (2008-01-13)
[1485] Copyright and theft of images, bandwidth and members. - (2007-12-26)
[1451] More PHP sample and demonstration programs - (2007-12-01)
[1270] PHP Standalone - keyboard to screen - (2007-07-18)
[1104] Drawing dynamic graphs in PHP - (2007-03-09)
[1053] Sorting people by name in PHP - (2007-01-26)
[1020] Parallel processing in PHP - (2007-01-03)
[1010] Dates, times, clickable diarys in PHP - (2006-12-28)
[937] Display an image from a MySQL database in a web page via PHP - (2006-11-22)
[917] Syntax checking in PHP - (2006-11-07)
[822] PHP - a team member leaves - (2006-08-04)
[806] Check your user is human. Have him retype a word in a graphic - (2006-07-17)
[789] Hot answers in PHP - (2006-07-02)
[687] Presentation, Business and Persistence layers in Perl and PHP - (2006-04-17)
[665] PHP Image viewing application - (2006-04-01)
[603] PHP - setting sort order with an associative array - (2006-02-13)
[493] Running a Perl script within a PHP page - (2005-11-12)
[483] Double Dollars in PHP - (2005-11-02)
[468] Stand alone PHP programs - (2005-10-18)
[372] Time calculation in PHP - (2005-07-08)
[337] the array returned by preg_match_all - (2005-06-06)
[322] More maps - (2005-05-23)
[320] Ordnance Survey - using a 'Get a map' - (2005-05-22)
[239] What and why for the epoch - (2005-03-08)
[54] PHP and natural sorting - (2004-09-19)
P602 - Perl - Advanced File and Directory Handling [1861] Reactive (dynamic) formatting in Perl - (2008-10-31)
[1832] Processing all files in a directory - Perl - (2008-10-11)
[1709] There is more that one way - Perl - (2008-07-14)
[1225] Perl - functions for directory handling - (2007-06-09)
[975] Answering ALL the delegate's Perl questions - (2006-12-09)
Some other Articles
To join an organisation?Dramatic Skys at LongleatForum help - a push in the right directionComputers, Brides and Cream TeasReporting on the 10 largest files or 10 top scoresTalking about other training companies.Tomcat - Shutdown portBuild on what you already have with OOPython - when to use the in operatorPython makes University Challenge