Practical Extraction and Reporting - using Python and Extreme Programming

"We seem to be getting a lot of signups from Germany" - so said my fellow administrator on the First Great Western Coffee Shop forum. At first glance that is something of a surprise, as this forum is "provided by a First Great Western Customer, for First Great Western customers" and First Great Western run train services from Paddington to the West of England and South Wales, with a secondary main line from Portsmouth to Cardiff, and regional, local suburban and rural trains on other lines within the same territory, with occasional services venturing as far "off piste" as Brighton. Nowhere near Germany. So why the interest?

Forums provide an opportunity for people to express their views, add their comments to those of others, and post up their information. And as such they can provide a wonderful opportunity for people to get off topic messages onto publicly readable forums on the Internet. My mailbox contains adverts for pharmaceutical products, get-rich-quick schemes, books on Steve Jobs (this week), overseas graduate programs, crocuses, home security systems, dating services, airline tickets and more ... and given half a chance, these same people who, unsolicited, pester me by email would love to advertise on the forum and pester people there too. To keep the wood visible amongst the trees, we limit signups on "The Coffeeshop" to those people who have a genuine interest, and who will post about the issues for which the forum exists. We still get plenty of requests for signup, but our vetting process is such that very few of the "spammers", or rather wannabe spammers, actually manage to get as far as posting. But it's wasteful of our time, and we're always looking to improve our tools to help us spot the spammers quickly. Recently, I added extra logging of signup requests to help us look at them in a "pageview" mode, and we've now come to the reporting requirement: to look at the data that's building up, to help keep us even better informed for the future.

So ... the specification for the program, and of the requirement, looks a bit woolly. And I decided to apply some of the techniques of "Extreme Programming" to the task - writing a short story as to what we wanted - "We would like to be able to count up how many spammers come from where so that we can tell which places are the worst / most likely" and then tackling it through a spike solution where I wrote experimental code to see how an answer would look. I selected Python for the task (an excellent language for the job, and the language I've been teaching this week) ... and off I headed.

As I start coding, the story turns out to be one of converting data such as:
  1 LV Haus finanzieren andrahartwick@gmail.com 91.224.246.15 Thu, 13 Oct 2011 06:26:34 +0100
  1 CN cabinet519 zhaominyu15@163.com 113.231.181.142 Thu, 13 Oct 2011 06:26:44 +0100 Shenyang

into results like:
  RU 41 Russian Federation
  CN 38 China
  DE 34 Germany
  US 17 United States
  UA 16 Ukraine
  PL 9 Poland
  LV 8 Latvia
  etc
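As a rough illustration of that first story (not the code linked from this article), here is a minimal sketch in Python 3. The record layout in the pattern is an assumption inferred from the two sample lines above, and the names SIGNUP, count_by_country and report are hypothetical. Country names are passed in as an ordinary dictionary rather than looked up online.

```python
import re

# Assumed layout, inferred from the two sample lines: a leading number, a
# two-letter country code, a user name (which may contain spaces), an email
# address, an IPv4 address, a date and an optional city at the end.
SIGNUP = re.compile(
    r"^\s*\d+\s+(?P<country>[A-Z]{2})\s+(?P<user>.*?)\s+(?P<email>\S+@\S+)\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<date>\w{3}, \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4})"
    r"\s*(?P<city>.*)$"
)


def count_by_country(lines):
    """Count signup records per country code, skipping lines that do not match."""
    counts = {}
    for line in lines:
        match = SIGNUP.match(line)
        if match is None:
            continue
        code = match.group("country")
        counts[code] = counts.get(code, 0) + 1
    return counts


def report(counts, names):
    """Return 'CODE count Name' lines, highest count first, then by code."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ["%s %d %s" % (code, n, names.get(code, "[unknown]"))
            for code, n in ordered]
```

Feeding the lines of a log file to count_by_country and printing each line returned by report gives a table in the shape shown above. Ties are broken alphabetically by country code here, which is a choice made for this sketch.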


The report then expands that, if necessary (in fact a separate "story"), by zone:
  CN 38 China
      Beijing 18
      [unknown] 4
      Guangzhou 4
      Putian 3
      Shenyang 2
      Shanghai 2
      Jinan 2
      Nanjing 1
      Wuhan 1
      Qingdao 1
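The breakdown by zone can be held as a dictionary of dictionaries. The sketch below is one way of doing that in Python 3, assuming each record has already been reduced to a (country code, city) pair, with an empty city where none was logged. The function names are hypothetical.

```python
def count_by_city(records):
    """Build {country: {city: count}} from (country, city) pairs."""
    by_country = {}
    for country, city in records:
        # setdefault creates the inner dictionary the first time a country is seen
        cities = by_country.setdefault(country, {})
        name = city.strip() or "[unknown]"
        cities[name] = cities.get(name, 0) + 1
    return by_country


def city_lines(cities):
    """Return indented 'City count' lines, highest count first, then by name."""
    ordered = sorted(cities.items(), key=lambda item: (-item[1], item[0]))
    return ["    %s %d" % (name, n) for name, n in ordered]
```

The alphabetical tie-break is a choice made for this sketch and will not necessarily reproduce the order of equal counts in the report above.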


Now that I have got to that point in my exploration of the data, if I needed more I would be refactoring - taking what I have learned and recoding it to make it maintainable. You can see the code [here], with some quite notable comments pointing out its shortcomings, ready for the refactoring exercise if that ever comes (and if you want to run the program yourself, there's a data sample [here]).

I'm sharing this example on our web site under our "Data Munging in Python" heading - for even in its raw form it's a good example of some of the techniques commonly used ... in the source, you'll find coding samples of:
• Regular Expressions (to match patterns in data and extract from them)
• Command Line handling (we've used a -v option to select the verbose / by city report)
• Dictionaries (to keep count by countries as we read the data file)
• The urllib2 module (to read a web page from a remote server - the ISO country code lookup!)
• Checking whether a file exists (via os.path.exists)
• Routing non-data output to stderr (via sys.stderr)
• lambda (to provide single line functions)
• read (to slurp an entire file into a variable)
• title (to take a country name that's SHOUTED AT YOU and reduce it to more manageable speech!)
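Several of the smaller techniques in that list can be sketched together. This is an illustration in Python 3, not the program linked above. urllib2 is a Python 2 module, and its urlopen function lives in urllib.request in Python 3. The 'NAME;CODE' layout of the lookup text, the cache file and all the function names are assumptions made for the example.

```python
import os
import sys
import urllib.request  # Python 3 home of what urllib2 offered in Python 2


def parse_args(argv):
    """Split a sys.argv style list into (verbose flag, list of file names)."""
    args = argv[1:]
    verbose = "-v" in args
    files = [arg for arg in args if arg != "-v"]
    return verbose, files


def slurp(path):
    """Read an entire file into a single string."""
    with open(path) as handle:
        return handle.read()


def country_names(text):
    """Turn assumed 'NAME;CODE' lines into a {code: Title Case name} dict."""
    tidy = lambda name: name.strip().title()  # single line function
    names = {}
    for line in text.splitlines():
        if ";" not in line:
            continue
        name, code = line.rsplit(";", 1)
        names[code.strip()] = tidy(name)
    return names


def load_lookup(cache_path, url):
    """Return the lookup text from a local copy, fetching it only if missing."""
    if os.path.exists(cache_path):
        return slurp(cache_path)
    # Progress messages go to stderr so they stay out of the report on stdout
    sys.stderr.write("Fetching %s\n" % url)
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    with open(cache_path, "w") as handle:
        handle.write(text)
    return text
```

Keeping a local copy of the lookup means the remote server is contacted only on the first run, and sending the "Fetching" message to stderr means the report can still be redirected cleanly to a file.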

Truly, so much of the power of any language comes not from the power of individual features, but rather from the power of using them in combination, and from researching, refactoring and reusing the code that uses those features.

(written 2011-10-14)

 



This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.


© WELL HOUSE CONSULTANTS LTD., 2014: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/mouth/3479_Pra ... mming.html • PAGE BUILT: Thu Sep 18 15:30:25 2014 • BUILD SYSTEM: WomanWithCat