Log file analysis - short Python example

HEAVY DATA EXTRACTION IN PYTHON

Python's a great language for writing "testbed" applications - things which start small with a few lines of experimental code and then grow. Here's an example script that I wrote to answer some specific questions about access to the course description directory on our web server. We get a new log file, several megabytes long, each day, and it can be hard to pick out the data you're looking for.

Here are three sample lines from our logs, out of 700,000 lines in January 2005:

cleveland.directrouter.com - - [01/Jan/2005:23:57:05 -0800] "GET /fotos/lenew.jpg HTTP/1.0" 200 1241 "-" "-"
cr02.bloglines.com - - [07/Jan/2005:00:22:56 -0800] "GET /horse/index.rdf HTTP/1.1" 304 0 "-" "Bloglines/2.0 (http://www.bloglines.com; 1 subscriber)"
lc4.learn.ac.lk - - [07/Jan/2005:00:23:04 -0800] "GET /course/mqfull.html HTTP/1.0" 200 30843 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
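
To make the layout of those lines concrete, here is a minimal sketch that picks one of them apart field by field. It assumes the logs follow the Apache "combined" layout that the three samples appear to use; the pattern and the name parse_line are illustrative only and are not part of the script described on this page.

```python
import re

# Assumed layout: host ident user [timestamp] "request" status size ...
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# The third sample line from the log extract above
sample = ('lc4.learn.ac.lk - - [07/Jan/2005:00:23:04 -0800] '
          '"GET /course/mqfull.html HTTP/1.0" 200 30843 "-" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"')

fields = parse_line(sample)
print(fields["host"], fields["request"], fields["status"])
```

For the questions asked here only the host and the requested page matter, so a full parse like this is more than is strictly needed.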

I want to know how many lines relate to each group of hosts, and which page(s) in the /course directory each group asked for. Here's part of the results; each line shows the host group, the course pages it requested, and the total number of hits from that group:

hinet.net alfull alfull alfull ajfull 441
hispeed.ch alfull ctc alfull 438
on.net lafull alfull alfull alfull jxfull 434
auna.net ajfull yp atfull tkfull tbfull 422
netcabo.net atfull mqfull pofull pofull 420
online.no pl atfull 415
af.mil atfull mqfull yp alfull tbfull tbfull tb index linux 414
(etc)
ac.lk lbfull mq otc otc mqfull 16

In that example, the "mqfull" on the last line was generated as a result of the last of the sample records listed above.

Enter Python. The whole thing can be done with a simple script written in a few minutes:

#!/usr/bin/env python3

"""Course page lister. Looks at every file whose name starts with "ac" and
treats it as a standard web log file. Reports on all domains that
have visited a course description page."""

import os
import re
from hname import hname

shorttable = {}   # hits per host group
ctable = {}       # course pages requested per host group
lookfor = re.compile(r"GET\s/course/([a-z]+)")

for filename in os.listdir("."):
    if not filename.startswith("ac"):
        continue  # eliminate files that are not logfiles
    with open(filename, errors="replace") as logfile:
        for line in logfile:
            parts = line.split(" ")
            # How much of the name / IP address do we report?
            host = hname(parts[0])
            dpsumm = host.getShort()
            # Count the host usage and log any course files
            shorttable[dpsumm] = 1 + shorttable.get(dpsumm, 0)
            gotten = lookfor.findall(line)
            if not gotten:
                continue
            ctable[dpsumm] = ctable.get(dpsumm, "") + " " + gotten[0]

# Busiest host groups first
visitors = sorted(ctable, key=shorttable.get, reverse=True)

for browser in visitors:
    print(browser, ctable[browser], shorttable[browser])
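
To see how the two dictionaries fill up, here is a small trace of the same loop over the three sample lines shown earlier, written for Python 3. It is only a sketch: the function short_name is a simplified stand-in (an assumption for this illustration) that keeps the last two parts of the host name, and the lines are held in a list rather than read from log files.

```python
import re

lookfor = re.compile(r"GET\s/course/([a-z]+)")

def short_name(host):
    """Simplified stand-in for the host grouping: keep the last two parts."""
    return ".".join(host.split(".")[-2:])

lines = [
    'cleveland.directrouter.com - - [01/Jan/2005:23:57:05 -0800] '
    '"GET /fotos/lenew.jpg HTTP/1.0" 200 1241 "-" "-"',
    'cr02.bloglines.com - - [07/Jan/2005:00:22:56 -0800] '
    '"GET /horse/index.rdf HTTP/1.1" 304 0 "-" "Bloglines/2.0"',
    'lc4.learn.ac.lk - - [07/Jan/2005:00:23:04 -0800] '
    '"GET /course/mqfull.html HTTP/1.0" 200 30843 "-" "Mozilla/4.0"',
]

shorttable = {}  # hits per host group
ctable = {}      # course pages requested per host group

for line in lines:
    group = short_name(line.split(" ")[0])
    shorttable[group] = 1 + shorttable.get(group, 0)
    gotten = lookfor.findall(line)
    if gotten:
        ctable[group] = ctable.get(group, "") + " " + gotten[0]

print(shorttable)
print(ctable)
```

Every line adds to the hit count for its host group, but only the request under /course adds a page name, so just one of the three groups ends up in ctable and would appear in the report.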


Now that's not the whole story. Chances are that I'll want to reuse, across several programs, the code that works out how much of a computer name / IP address to display - in other words, how to group the hits. So I've put that code into a separate file (hname.py, to match the import line in the script) that I can read in from this application and from others. I'm using a class to avoid repetitive coding and to make more efficient use of my coding time. Here's the class:

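class hname:
    """Splits a host name or IP address and chooses how much of it to report."""

    def __init__(self, text):
        self.dparts = text.split(".")
        # A numeric last part means we were given an IP address, not a name
        self.havename = not self.dparts[-1].isdigit()

    def getShort(self):
        if self.havename:
            if self.dparts[-1] == "uk":
                # UK names: keep three parts (e.g. name.co.uk) ...
                dpp = self.dparts[-3:]
                if dpp[0] == "demon":
                    # ... or four, so that demon.co.uk hosts stay distinct
                    dpp = self.dparts[-4:]
            else:
                dpp = self.dparts[-2:]
        else:
            # IP address: report it in full
            dpp = self.dparts[:]
        return ".".join(dpp)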
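
As a quick check of the grouping rules, the sketch below restates getShort as a plain function so that it can be run on its own, and applies it to a few hosts. The name short_host is hypothetical, and apart from the two names taken from the sample log lines the hosts are made up for illustration.

```python
def short_host(text):
    """Function-style restatement of hname(text).getShort()."""
    parts = text.split(".")
    if parts[-1].isdigit():
        return text                  # IP address: keep it whole
    if parts[-1] == "uk":
        kept = parts[-3:]            # e.g. example.co.uk
        if kept[0] == "demon":
            kept = parts[-4:]        # e.g. host1.demon.co.uk
    else:
        kept = parts[-2:]            # e.g. directrouter.com
    return ".".join(kept)

hosts = [
    "lc4.learn.ac.lk",               # from the sample log lines
    "cleveland.directrouter.com",    # from the sample log lines
    "www.example.co.uk",             # made-up host
    "host1.demon.co.uk",             # made-up host
    "192.0.2.17",                    # made-up address
]

for host in hosts:
    print(host, "->", short_host(host))
```

The first host is reported as ac.lk, which is the group shown on the last line of the results earlier on this page.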


Please note that articles in this section of our web site were current and correct to the best of our ability when published, but by the nature of our business may go out of date quite quickly. The quoting of a price, contract term or any other information in this area of our website is NOT an offer to supply now on those terms - please check back via our main web site

At Well House Consultants, we provide training courses on subjects such as Ruby, Lua, Perl, Python, Linux, C, C++, Tcl/Tk, Tomcat, PHP and MySQL. We're asked (and answer) many questions, and answers to those which are of general interest are published in this area of our site.

© WELL HOUSE CONSULTANTS LTD., 2022: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/solutions/python-l ... ample.html • PAGE BUILT: Wed Mar 28 07:47:11 2012 • BUILD SYSTEM: wizard