For 2023 (and 2024 ...) - we are now fully retired from IT training.
We have made many, many friends over 25 years of teaching about Python, Tcl, Perl, PHP, Lua, Java, C and C++ - and MySQL, Linux and Solaris/SunOS too. Our training notes are now very much out of date, but due to upward compatibility most of our examples remain operational and even relevant, and you are welcome to make use of them "as seen" and at your own risk.

Lisa and I (Graham) now live in what was our training centre in Melksham - happy to meet with former delegates here - but do check ahead before coming round. We are far from inactive - rather, enjoying the times that we are retired but still healthy enough in mind and body to be active!

I am also active in many other areas and still look after a lot of web sites - you can find an index ((here))
Another large text file example

Posted by admin (Graham Ellis), 15 November 2002
Looks like large text files are "flavour of the week".

I was set the task of extracting all the records from a file containing some 13 million records that matched any one of some 200 different character strings in a certain field in the record.   It's not an uncommon sort of requirement, but it's not something that appears to be trivial when you first see it.  Do you sort the incoming file?  No way!  Do you traverse the incoming file 200 times? Preferably not!  Do you keep a list of all the strings you want to find, and loop through that list for each input line from the file?  No - that would be too slow too.

Solution - set up a hash which has keys for each of the strings you want to find - set the values to any true value that you like.   You can then parse your incoming data file and simply check whether a hash element exists to see if you want a record to be echoed on the output.

Here's a sample program that I wrote for filtering web site addresses and reporting on those hosted on certain shared servers:
Code:
#!/usr/bin/perl
use strict;
use warnings;

# Set up a hash describing the records we want

my %wanted = (
       "members.aol.com"    => 1,
       "www.geocities.com"  => 1,
       "groups.yahoo.com"   => 1,
       "home.earthlink.net" => 1,
       );

# Parse the incoming data looking for matches

while (my $line = <>) {
       # Pull out the server name - the text between // and the next /
       my ($server) = ($line =~ m!//(.*?)/!);
       next unless defined $server;    # skip lines with no URL on them
       print $line if $wanted{$server};
       }


To give you an idea, filtering a file with 3 million URLs in it (113 Mbytes), this program took 2 minutes on one of my training laptops to produce an output file of 5 Mbytes containing around 113000 URLs.
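In the real job the 200-odd wanted strings came from a list rather than being typed into the source. A sketch of the same technique with the keys loaded from a one-per-line file (the file name wanted.txt here is just an illustration, not what I actually used) would look something like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the lookup hash from a file of wanted server names,
# one name per line (file name is just an example)
my %wanted;
open my $fh, '<', 'wanted.txt' or die "Cannot open wanted.txt: $!";
while (my $key = <$fh>) {
        chomp $key;
        $wanted{$key} = 1 if length $key;
        }
close $fh;

# Filter standard input just as in the main example
while (my $line = <>) {
        my ($server) = ($line =~ m!//(.*?)/!);
        next unless defined $server;
        print $line if $wanted{$server};
        }
```

However many keys you load, each input line still costs just one hash lookup, which is why this scales to 200 strings where a loop over a list would not.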




This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.


© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho