 
Log file analysis - short Python example

HEAVY DATA EXTRACTION IN PYTHON

Python's a great language for writing "testbed" applications: things which start small, with a few lines of experimental code, and then grow. Here's an example script that I wrote to answer some specific questions about access to the course description directory on our web server. We get a new log file several megabytes long each day, and it can be hard to see the data you're looking for.

Here are three sample lines from our logs. That's out of 700000 lines in January 2005.

cleveland.directrouter.com - - [01/Jan/2005:23:57:05 -0800] "GET /fotos/lenew.jpg HTTP/1.0" 200 1241 "-" "-"
cr02.bloglines.com - - [07/Jan/2005:00:22:56 -0800] "GET /horse/index.rdf HTTP/1.1" 304 0 "-" "Bloglines/2.0 (http://www.bloglines.com; 1 subscriber)"
lc4.learn.ac.lk - - [07/Jan/2005:00:23:04 -0800] "GET /course/mqfull.html HTTP/1.0" 200 30843 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

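I'm interested in knowing how many lines relate to each group of hosts, and which particular page(s) they asked for in the /course directory. Here's part of the results. Each line shows the host group, the course pages it requested, and the total number of log lines from that group.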

hinet.net alfull alfull alfull ajfull 441
hispeed.ch alfull ctc alfull 438
on.net lafull alfull alfull alfull jxfull 434
auna.net ajfull yp atfull tkfull tbfull 422
netcabo.net atfull mqfull pofull pofull 420
online.no pl atfull 415
af.mil atfull mqfull yp alfull tbfull tbfull tb index linux 414
(etc)
ac.lk lbfull mq otc otc mqfull 16

In that example, the "mqfull" on the last line was generated as a result of the last of the sample records listed above.
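
As a quick check of that claim, here is a minimal sketch that runs the third sample record through the same split and regular expression that the script uses. The variable names are just for illustration.

```python
import re

# Third sample record from the log excerpt above
line = ('lc4.learn.ac.lk - - [07/Jan/2005:00:23:04 -0800] '
        '"GET /course/mqfull.html HTTP/1.0" 200 30843 "-" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"')

# The requesting host is the first space-separated field
host = line.split(" ")[0]

# Capture the lower-case page name that follows /course/
lookfor = re.compile(r"GET\s/course/([a-z]+)")
pages = lookfor.findall(line)

print(host)   # lc4.learn.ac.lk
print(pages)  # ['mqfull']
```

The other two sample records ask for /fotos and /horse, so the pattern finds nothing in them and they only add to the hit count for their host group.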

Enter Python. The whole thing can be done with a simple script, written in a few minutes:

#!/usr/bin/env python3

"""Course page lister. Looks at every file whose name starts with ac and
treats it as a standard web log file. Reports on all domains that
have visited a course description page."""

import os
import re
from hname import hname

shorttable = {}   # total log lines per host group
ctable = {}       # course pages requested per host group
lookfor = re.compile(r"GET\s/course/([a-z]+)")

for filename in os.listdir("."):
   if not filename.startswith("ac"):
      continue # eliminate files that are not logfiles
   # errors="replace" stops odd bytes in a log from aborting the run
   with open(filename, errors="replace") as logfile:
      for line in logfile:
         parts = line.split(" ")
         # How much of the name / IP address do we report?
         host = hname(parts[0])
         dpsumm = host.getShort()
         # Count the host usage and log any course files
         shorttable[dpsumm] = 1 + shorttable.get(dpsumm,0)
         gotten = lookfor.findall(line)
         if not gotten:
            continue
         ctable[dpsumm] = ctable.get(dpsumm,"") + " " + gotten[0]

# Host groups that asked for a course page, busiest first
visitors = sorted(ctable, key=shorttable.get, reverse=True)

for browser in visitors:
   print(browser,ctable[browser],shorttable[browser])
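
The tallying in the script above can also be sketched with the standard library's collections module. This is an alternative, not what the original script does: the records list is made-up data standing in for parsed log lines, and the names hits, pages and report are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical pre-parsed records: (host group, course page or None)
records = [
   ("ac.lk", "mqfull"),
   ("hinet.net", "alfull"),
   ("hinet.net", None),       # a hit outside /course
   ("ac.lk", None),
   ("hinet.net", "ajfull"),
]

hits = Counter()              # every request, per host group
pages = defaultdict(list)     # course pages only, per host group

for group, page in records:
   hits[group] += 1
   if page is not None:
      pages[group].append(page)

# most_common() orders by count, highest first
report = [(group, pages[group], count)
          for group, count in hits.most_common() if group in pages]

for group, seen, count in report:
   print(group, " ".join(seen), count)
# hinet.net alfull ajfull 3
# ac.lk mqfull 2
```

Keeping the pages in a list rather than building up a string makes it easy to count or de-duplicate them later, at the cost of a join when printing.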


Now that's not the whole story. Chances are that I'll want to reuse, across several programs, the code that works out how much of a computer name / IP address to display (in other words, how to group the hits). So I've put that code into a separate file, hname.py, which I can import into this application (and into others). I'm using a class to avoid repetitive coding and to make more efficient use of my coding time. Here's the class:

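class hname:
   """Splits a host name or IP address so it can be reported in short form."""
   def __init__(self,text):
      self.dparts = text.split(".")
      # A numeric last field means an IP address rather than a name
      self.havename = not self.dparts[-1].isdigit()
   def getShort(self):
      if self.havename:
         if self.dparts[-1] == "uk":
            # Keep three levels for .uk names ...
            dpp = self.dparts[-3:]
            if dpp[0] == "demon":
               # ... and four for demon hosts
               dpp = self.dparts[-4:]
         else:
            # Everything else is grouped on its last two levels
            dpp = self.dparts[-2:]
      else:
         # IP address - report it in full
         dpp = self.dparts[:]
      return ".".join(dpp)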
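
To see the grouping rules in action without setting up the module, here is a rough function-form sketch of the same logic. The name short_host is hypothetical (the article's code uses the hname class), and the .uk hosts and the IP address in the sample list are made-up examples; the other two hosts come from the log excerpt.

```python
def short_host(text):
   """Hypothetical function form of hname(text).getShort()."""
   parts = text.split(".")
   if parts[-1].isdigit():
      return text                       # IP address: report in full
   if parts[-1] == "uk":
      # Three levels for .uk names, four when the third is "demon"
      keep = 4 if parts[-3:-2] == ["demon"] else 3
   else:
      keep = 2
   return ".".join(parts[-keep:])

samples = [
   "cleveland.directrouter.com",   # from the log excerpt
   "lc4.learn.ac.lk",              # from the log excerpt
   "www.example.co.uk",            # made-up host
   "host.demon.co.uk",             # made-up host
   "192.0.2.10",                   # made-up IP address
]

for name in samples:
   print(name, "->", short_host(name))
# cleveland.directrouter.com -> directrouter.com
# lc4.learn.ac.lk -> ac.lk
# www.example.co.uk -> example.co.uk
# host.demon.co.uk -> host.demon.co.uk
# 192.0.2.10 -> 192.0.2.10
```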



Please note that articles in this section of our web site were current and correct to the best of our ability when published, but by the nature of our business may go out of date quite quickly. The quoting of a price, contract term or any other information in this area of our website is NOT an offer to supply now on those terms - please check back via our main web site


At Well House Consultants, we provide training courses on subjects such as Ruby, Lua, Perl, Python, Linux, C, C++, Tcl/Tk, Tomcat, PHP and MySQL. We're asked (and answer) many questions, and answers to those which are of general interest are published in this area of our site.


© WELL HOUSE CONSULTANTS LTD., 2019: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01225 708225 • FAX: 01225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/solutions/python-l ... ample.html • PAGE BUILT: Wed Mar 28 07:47:11 2012 • BUILD SYSTEM: wizard