HEAVY DATA EXTRACTION IN PYTHON
Python's a great language for writing "testbed" applications - programs that start out as a few lines of experimental code and then grow. Here's an example script that I wrote to answer some specific questions about access to the course description directory on our web server. We get a new log file, several megabytes long, each day, and it can be hard to pick out the data you're looking for.
Here are three sample lines from our logs, out of some 700000 lines for January 2005:
cleveland.directrouter.com - - [01/Jan/2005:23:57:05 -0800] "GET /fotos/lenew.jpg HTTP/1.0" 200 1241 "-" "-"
cr02.bloglines.com - - [07/Jan/2005:00:22:56 -0800] "GET /horse/index.rdf HTTP/1.1" 304 0 "-" "Bloglines/2.0 (http://www.bloglines.com; 1 subscriber)"
lc4.learn.ac.lk - - [07/Jan/2005:00:23:04 -0800] "GET /course/mqfull.html HTTP/1.0" 200 30843 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
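I'm interested in knowing how many log lines came from each group of hosts, and which page(s) in the /course directory each group asked for. Here's part of the results; each row shows the host group, the course pages it requested, and the total number of log lines from that group: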
hinet.net alfull alfull alfull ajfull 441
hispeed.ch alfull ctc alfull 438
on.net lafull alfull alfull alfull jxfull 434
auna.net ajfull yp atfull tkfull tbfull 422
netcabo.net atfull mqfull pofull pofull 420
online.no pl atfull 415
af.mil atfull mqfull yp alfull tbfull tbfull tb index linux 414
(etc)
ac.lk lbfull mq otc otc mqfull 16
In that example, the "mqfull" on the last line was generated as a result of the last of the sample records listed above.
Enter Python. The whole thing can be done with a simple script, written in a few minutes:
#!/usr/bin/python
""" Course page lister. Looks at all files starting with ac and
treats each as a standard web log file. Reports on all domains that
have visited a course description page"""

import os
import re
from hname import *

shorttable = {}
ctable = {}
lookfor = re.compile(r"GET\s/course/([a-z]+)")

for filename in os.listdir("."):
    if not filename.startswith("ac"):
        continue # eliminate files that are not logfiles
    for line in open(filename).xreadlines():
        parts = line.split(" ")
        # How much of the name / IP address do we report?
        host = hname(parts[0])
        dpsumm = host.getShort()
        # Count the host usage and log any course files
        shorttable[dpsumm] = 1 + shorttable.get(dpsumm,0)
        gotten = lookfor.findall(line)
        if (not gotten): continue
        ctable[dpsumm] = ctable.get(dpsumm,"") + " " + gotten[0]

def byhits(one,two):
    global shorttable
    return shorttable[two].__cmp__(shorttable[one])

visitors = ctable.keys()
visitors.sort(byhits)

for browser in visitors:
    print browser,ctable[browser],shorttable[browser]
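The script above uses Python 2 idioms (the print statement, xreadlines(), and a comparison function passed to sort()), none of which run under Python 3. As a rough sketch only, and not the author's code, the same count-and-rank logic might look like this in Python 3. The names short_name and summarise are invented for this illustration, and short_name is a simplified stand-in for the hname class (it always keeps the last two parts of the name). The sample data is the three log lines quoted earlier.

```python
import re
from collections import defaultdict

# Same pattern as the original script, written as a raw string
LOOKFOR = re.compile(r"GET\s/course/([a-z]+)")

def short_name(host):
    """Hypothetical stand-in for hname: keep the last two dotted parts."""
    return ".".join(host.split(".")[-2:])

def summarise(lines):
    """Return (group, pages, hits) tuples, busiest host group first."""
    hits = defaultdict(int)      # all log lines seen per host group
    pages = defaultdict(list)    # course pages requested per host group
    for line in lines:
        group = short_name(line.split(" ")[0])
        hits[group] += 1
        found = LOOKFOR.search(line)
        if found:
            pages[group].append(found.group(1))
    # key= with reverse=True takes the place of the cmp-style byhits()
    ranked = sorted(pages, key=lambda g: hits[g], reverse=True)
    return [(g, pages[g], hits[g]) for g in ranked]

sample = [
    'cleveland.directrouter.com - - [01/Jan/2005:23:57:05 -0800] "GET /fotos/lenew.jpg HTTP/1.0" 200 1241 "-" "-"',
    'cr02.bloglines.com - - [07/Jan/2005:00:22:56 -0800] "GET /horse/index.rdf HTTP/1.1" 304 0 "-" "Bloglines/2.0 (http://www.bloglines.com; 1 subscriber)"',
    'lc4.learn.ac.lk - - [07/Jan/2005:00:23:04 -0800] "GET /course/mqfull.html HTTP/1.0" 200 30843 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"',
]

for group, wanted, count in summarise(sample):
    print(group, " ".join(wanted), count)   # prints: ac.lk mqfull 1
```

Only the third sample line asks for something under /course, so only its host group appears in the report, just as groups with no course requests are left out of the original output.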
That's not the whole story, though. Chances are that I'll want to reuse, across several programs, the code that works out how much of a computer name / IP address to display - in other words, how to group the hits. So I've put that code into a separate file (the hname module imported above) that I can read in within this application and also within others. I'm using a class to avoid repetitive coding and to make more efficient use of my coding time. Here's the class:
class hname:

    def __init__(self,text):
        self.dparts = text.split(".")
        if self.dparts[-1].isdigit():
            self.havename = 0
        else:
            self.havename = 1

    def getShort(self):
        if self.havename:
            if self.dparts[-1] == "uk":
                dpp = self.dparts[-3:]
                if dpp[0] == "demon":
                    dpp = self.dparts[-4:]
            else:
                dpp = self.dparts[-2:]
        else:
            dpp = self.dparts[:]
        return ".".join(dpp)
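To see what that grouping rule does to particular hosts, here is a small sketch that restates getShort() as a plain function and runs it on a few names. The function name group_for is invented for this illustration. The first two hosts come from the sample log lines; the .uk names and the IP address are made-up examples (192.0.2.17 is from a range reserved for documentation).

```python
def group_for(host):
    """Function form of hname.getShort(); the name group_for is hypothetical."""
    parts = host.split(".")
    if parts[-1].isdigit():
        # Dotted IP address: no name to shorten, so report it in full
        return host
    if parts[-1] == "uk":
        # .uk names have a second-level label (co.uk, ac.uk), so keep three parts
        kept = parts[-3:]
        if kept[0] == "demon":
            # demon.co.uk customers get a subdomain each, so keep one more part
            kept = parts[-4:]
    else:
        kept = parts[-2:]
    return ".".join(kept)

for host in ("lc4.learn.ac.lk",
             "cleveland.directrouter.com",
             "www.example.co.uk",
             "pc1.example.demon.co.uk",
             "192.0.2.17"):
    print(host, "->", group_for(host))
# lc4.learn.ac.lk -> ac.lk
# cleveland.directrouter.com -> directrouter.com
# www.example.co.uk -> example.co.uk
# pc1.example.demon.co.uk -> example.demon.co.uk
# 192.0.2.17 -> 192.0.2.17
```

Note that only "uk" gets the three-part treatment, so other countries with second-level labels are grouped more coarsely: every host under ac.lk, for example, lands in the single group "ac.lk" seen in the results.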
At Well House Consultants, we provide
training courses on
subjects such as Ruby, Lua, Perl, Python, Linux, C, C++,
Tcl/Tk, Tomcat, PHP and MySQL. We're asked (and answer)
many questions, and answers to those which are of general
interest are published in this area of our site.