Training, Open Source computer languages

PerlPHPPythonMySQLhttpd / TomcatTclRubyJavaC and C++LinuxCSS

Search our site for:
Home Accessibility Courses Diary The Mouth Forum Resources Site Map About Us Contact
Reporting on the 10 largest files or 10 top scores
What are the biggest 10 files in or below this directory?

What are the 20 'worst' spams I have received in the last month?

What are the five top scores recorded for a popular game on my web site?

It's a very common requirement indeed to provide a program to answer questions like these, and if you've only got a handful of files /spams / score records, it's easy to write a program to read them all into an array (PHP) or list (Perl, Python), sort that array or list when you've read them all, and print out the first however-many. But that approach becomes impractically slow and memory greedy if you have a big log file ... as for example the quarter of a million records I have in my spam record file at the moment.

Here's the technique you can use to find the top 20 records from several million in a log file - quickly, and efficiently ....

1) Set up an empty list to contain the top 20 (so far) as you discover them.

2) pass through the record one by one ...
2a) Work out the comparsion factor (size, score) for the record just read
2b) If you have already read and stored 20 records, and the new record is below the 20th one stored, reject it OTHERWISE ...
2c) Step through the records retained so far and insert the new one at the appropriate place in the list
2d) If the list now contains more that 20 records, truncate it to 20

3) Print out your results from the list you now have.

You can see this algorithm implemented in PHP here and you can run it here. It's not the simplest of code, but it should aways work no matter how large or how small the cutoff between the 20th and 21st record is (as opposed to alternative algorithms that set a threshhold), and it should always work quite fast even on a data set that's pretty huge; most of the data will be rejected summarily and won't need to be stored at all.

You might suggest that my data should be stored in a MySQL database and not a plain text file ... that's not the problem I was given, and is worthy of an entry here another day!
(written 2006-08-20 01:58:48)

 
Associated topics are indexed under
H999 - Additional PHP Material
P602 - Perl - Advanced File and Directory Handling
H115 - Designing PHP-Based Solutions: Best Practice

Back to
Talking about other training companies.
Previous and next
or
Horse's mouth home
Forward to
Computers, Brides and Cream Teas

Some other Articles
To join an organisation?
Dramatic Skys at Longleat
Forum help - a push in the right direction
Computers, Brides and Cream Teas
Reporting on the 10 largest files or 10 top scores
Talking about other training companies.
Tomcat - Shutdown port
Build on what you already have with OO
Python - when to use the in operator
Python makes University Challenge
1637 posts, page by page
Link to page ... 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33 at 50 posts per page


This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.

Link to Ezine home page (for reading).
Link to Blogging home page (to add comments).

© WELL HOUSE CONSULTANTS LTD., 2008: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 707126 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho