Log file analysis - short Python example
HEAVY DATA EXTRACTION IN PYTHON

Python's a great language for writing "testbed" applications - things which start small with a few lines of experimental code and then grow. Here's an example script that I wrote to answer some specific questions concerning access to our course description directory on our web server, where we get a new log file several megabytes long each day and it can be hard to see the data you might be looking for.

Here are three sample lines from our logs ... that's out of 700000 lines in January [2005]

    cleveland.directrouter.com - - [01/Jan/2005:23:57:05 -0800] "GET /fotos/lenew.jpg HTTP/1.0" 200 1241 "-" "-"
    cr02.bloglines.com - - [07/Jan/2005:00:22:56 -0800] "GET /horse/index.rdf HTTP/1.1" 304 0 "-" "Bloglines/2.0 (http://www.bloglines.com; 1 subscriber)"
    lc4.learn.ac.lk - - [07/Jan/2005:00:23:04 -0800] "GET /course/mqfull.html HTTP/1.0" 200 30843 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

And I'm interested in knowing how many lines relate to each group of hosts, and which particular page(s) they asked for in the /course directory. Here are some of the results; each line shows the host group, the course pages it asked for, and the total number of log lines for that group

    hinet.net alfull alfull alfull ajfull 441
    hispeed.ch alfull ctc alfull 438
    on.net lafull alfull alfull alfull jxfull 434
    auna.net ajfull yp atfull tkfull tbfull 422
    netcabo.net atfull mqfull pofull pofull 420
    online.no pl atfull 415
    af.mil atfull mqfull yp alfull tbfull tbfull tb index linux 414
    (etc)
    ac.lk lbfull mq otc otc mqfull 16

In that example, the "mqfull" on the last line was generated as a result of the last of the sample records listed above.

Enter Python. The whole thing can be done with a simple script written in a few minutes

    #!/usr/bin/env python3

    """ Course page lister. Looks at every file whose name starts with "ac"
    and treats it as a standard web log file.
    Reports on all domains that have visited a course description page"""

    import os
    import re
    from hname import hname

    shorttable = {}   # host group -> total number of log lines
    ctable = {}       # host group -> course pages asked for, space separated

    lookfor = re.compile(r"GET\s/course/([a-z]+)")

    for filename in os.listdir("."):
        if not filename.startswith("ac"):
            continue   # eliminate files that are not logfiles
        # errors="replace" stops odd bytes in a log from halting the run
        with open(filename, errors="replace") as logfile:
            for line in logfile:
                parts = line.split(" ")

                # How much of the name / IP address do we report?
                host = hname(parts[0])
                dpsumm = host.getShort()

                # Count the host usage and log any course files
                shorttable[dpsumm] = 1 + shorttable.get(dpsumm, 0)
                gotten = lookfor.findall(line)
                if not gotten:
                    continue
                ctable[dpsumm] = ctable.get(dpsumm, "") + " " + gotten[0]

    # Busiest host groups first
    visitors = sorted(ctable, key=shorttable.get, reverse=True)

    for browser in visitors:
        print(browser, ctable[browser], shorttable[browser])

Now that's not the whole story. Chances are that I'll want to reuse the code that works out how much of a computer name / IP address to display - how to group the hits - across several programs. So I've put that code into a separate file that I can read in within this application (and also within others). In other words, I'm using a class to avoid repetitive coding and to make more efficient use of my coding time. Here's the class

    class hname:
        def __init__(self, text):
            self.dparts = text.split(".")
            # A numeric last part means we were given an IP address
            if self.dparts[-1].isdigit():
                self.havename = 0
            else:
                self.havename = 1

        def getShort(self):
            if self.havename:
                if self.dparts[-1] == "uk":
                    # Keep three parts for .uk names, four for demon hosts
                    dpp = self.dparts[-3:]
                    if dpp[0] == "demon":
                        dpp = self.dparts[-4:]
                else:
                    dpp = self.dparts[-2:]
            else:
                # IP addresses are reported in full
                dpp = self.dparts[:]
            return ".".join(dpp)

See also Python programming course

Please note that articles in this section of our
web site were current and correct to the best of our ability when published,
but by the nature of our business may go out of date quite quickly. The
quoting of a price, contract term or any other information in this area of
our website is NOT an offer to supply now on those terms - please check
back via our main web site
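
The grouping and extraction steps described in the article can be tried out without any log files on disk. What follows is a minimal Python 3 sketch, not the script used to produce the results above: it runs the same two steps (shorten the host name, then pick out the /course page) over the three sample log lines quoted in the article. The names SAMPLE_LINES, COURSE_PAGE, short_host and summarise are assumptions made for this illustration, as is the use of collections.Counter and defaultdict in place of plain dictionaries; the grouping rule itself is copied from the getShort method.

```python
import re
from collections import Counter, defaultdict

# The three sample log lines quoted in the article
SAMPLE_LINES = [
    'cleveland.directrouter.com - - [01/Jan/2005:23:57:05 -0800] '
    '"GET /fotos/lenew.jpg HTTP/1.0" 200 1241 "-" "-"',
    'cr02.bloglines.com - - [07/Jan/2005:00:22:56 -0800] '
    '"GET /horse/index.rdf HTTP/1.1" 304 0 "-" '
    '"Bloglines/2.0 (http://www.bloglines.com; 1 subscriber)"',
    'lc4.learn.ac.lk - - [07/Jan/2005:00:23:04 -0800] '
    '"GET /course/mqfull.html HTTP/1.0" 200 30843 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"',
]

# Same pattern as the article's script, written as a raw string
COURSE_PAGE = re.compile(r"GET\s/course/([a-z]+)")


def short_host(host):
    """Hypothetical helper applying the same rule as hname.getShort."""
    parts = host.split(".")
    if parts[-1].isdigit():
        return host                      # IP address: report in full
    if parts[-1] == "uk":
        # Three parts for .uk names, four when the third from last is demon
        keep = 4 if parts[-3:][0] == "demon" else 3
    else:
        keep = 2
    return ".".join(parts[-keep:])


def summarise(lines):
    """Return (log lines per host group, course pages per host group)."""
    hits = Counter()
    pages = defaultdict(list)
    for line in lines:
        group = short_host(line.split(" ", 1)[0])
        hits[group] += 1                 # every line counts towards the total
        found = COURSE_PAGE.search(line)
        if found:
            pages[group].append(found.group(1))
    return hits, pages


hits, pages = summarise(SAMPLE_LINES)

# Busiest host groups first, listing only groups that asked for a course page
for group in sorted(pages, key=hits.get, reverse=True):
    print(group, " ".join(pages[group]), hits[group])
# prints: ac.lk mqfull 1
```

Only the third sample line asks for a page under /course, so only ac.lk is reported, which matches the "mqfull" entry on the ac.lk line of the results. Keeping the pages in a list rather than a growing string is a design choice made here so that they are easy to count or de-duplicate later.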
PH: 01144 1225 708225 • FAX: 01144 1225 707126 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho