Practical Extraction and Reporting

The week just gone, I gave a Perl course in Edinburgh to a dozen very bright scientists working in the bioinformatics field, where they're concerned with analysing a veritable flood of data. A great deal of that analysis is fairly standard and so will be done using standard tools - some written in C for sheer speed of operation, some as Perl modules, and some using R - [link]. But there are also strong layers of experimental data extraction - producing tables of filtered information from incoming flows - and glueware work where there are a variety of flows. Inputs can be plain files, SQL feeds, web pages, and files saved from a spreadsheet such as Excel.

As a computer scientist, I can flip between different data sets easily, shrugging off the differences with "it's just data" - but the newcomer needs to learn to apply the same principle very widely. So I used a variety of data sets during the course. Where can you find them?

http://www.wellho.net/data/access_log.xyz - Web server access log records
http://www.wellho.net/data/requests.xyz - Staff and skills data
http://www.wellho.net/data/cpg.xyz - Bioinformatic data
http://www.wellho.net/data/refflat.xyz - Bioinformatic data
http://www.wellho.net/data/railstats.xyz - Railway station location and use

Some of these have space characters between the fields, others have tabs. Some of them then have other data within a single field that is split by commas. Provided you can work out what the data means (and that's where you may need a reference manual for the data, and a knowledge of what the data's about and what is a sensible thing to try), you're then in a good position to engineer it into variables in your Perl program for analysis.
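
As a minimal sketch of that engineering step, here is how a record with tabs between the fields and commas inside one field might be pulled apart. The record layout and the names parse_record, name, chrom and starts are made up for illustration - they are not the format of any of the files listed above.

```perl
use strict;
use warnings;

# Hypothetical record layout (an assumption, not one of the course files):
#   name <TAB> chromosome <TAB> comma-separated start positions
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my ($name, $chrom, $starts) = split /\t/, $line;   # outer fields: tabs
    my @starts = split /,/, $starts;                   # inner field: commas
    return { name => $name, chrom => $chrom, starts => \@starts };
}

my $rec = parse_record("geneA\tchr1\t100,250,400\n");
print "$rec->{name} on $rec->{chrom} has ",
      scalar @{ $rec->{starts} }, " start positions\n";
```

For space-separated files the outer split can become split ' ', $line, which splits on any run of whitespace and ignores leading whitespace.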

Reading from a file this week I used open and <>. Reading from a database, I used DBI -> connect, then prepare, execute and a loop of fetchrow_array (or similar) calls. Reading from a website, I used LWP::Simple and get. Those are just examples of how it can be done, of course; as the Perl motto has it, "There's more than one way to do it". And once the "top bit" of the code has been written to get the data, the source type becomes irrelevant and we can go on to processing the data ...
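
Here is a rough sketch of such a "top bit", assuming the source is either a local file name or an http(s) URL. The sub name fetch_lines is my own invention; the database route via DBI is left out to keep it short, and LWP::Simple is only loaded if a URL is actually given.

```perl
use strict;
use warnings;

# Return the lines of a data source, whatever kind of source it is.
# fetch_lines is a hypothetical helper name, not part of any module.
sub fetch_lines {
    my ($source) = @_;
    if ($source =~ m{^https?://}) {
        require LWP::Simple;                        # loaded only when needed
        my $content = LWP::Simple::get($source);    # undef if the fetch fails
        die "Could not fetch $source\n" unless defined $content;
        return split /\n/, $content;
    }
    open my $fh, '<', $source or die "Cannot open $source: $!\n";
    chomp(my @lines = <$fh>);                       # read all lines, strip newlines
    close $fh;
    return @lines;
}

# Only runs when a file name or URL is given on the command line.
if (@ARGV) {
    print "$_\n" for fetch_lines($ARGV[0]);
}
```

Everything after the call to fetch_lines sees a plain list of lines and need not know where they came from.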

Many typical programs in this data flow / processing world involve sucking the data in - it can be done all at once for a small data set, or "drip, drip" - line by line - if the incoming set is huge. If the program takes the form of a filter, then information / results can be output as the data is being read in, but if the data is to be output sorted in some way, then it needs to be retained. You'll use collection variables in Perl to do this - either lists (which start with an @ symbol and have numbered positions) or hashes (which start with a % symbol and have named positions). Once you've read all the data, you can sort your lists prior to output. You cannot sort hashes ... but you can (and often do) sort a list of the keys.
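
The retain-then-sort pattern can be sketched as follows. The data is made up, the sub name sorted_report is hypothetical, and an in-memory filehandle stands in for what would normally be a file.

```perl
use strict;
use warnings;

# Read "name score" lines one at a time, keep them in a hash, and
# return them as report lines sorted by name.
sub sorted_report {
    my ($text) = @_;
    # An in-memory filehandle stands in for a real (possibly huge) file.
    open my $fh, '<', \$text or die "Cannot read in-memory data: $!\n";
    my %score_of;                        # hash: named positions
    while (my $line = <$fh>) {           # "drip, drip" - one line per pass
        my ($name, $score) = split ' ', $line;
        next unless defined $score;      # skip blank or incomplete lines
        $score_of{$name} = $score;       # retained for sorted output later
    }
    close $fh;
    # A hash has no order of its own, so sort the list of its keys.
    return map { "$_ $score_of{$_}" } sort keys %score_of;
}

# Made-up sample data, not from any of the course files.
my @report = sorted_report("kerry 7\nivan 3\njenny 9\n");
print "$_\n" for @report;
```

For a small data set, the while loop could be replaced by reading everything at once into a list with my @all = <$fh>; and processing that list afterwards.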

And when you're done ... you can output to the screen. Or to a file. Or to a(nother) database. If you're running on a web server, you can send the output to your user's browser too. We looked very briefly at CGI this week, but there are other ways too.
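
One way to keep the output destination flexible is to hand the reporting code a filehandle, so the same code serves the screen and a file. This is only a sketch - write_report is a name I've made up, and the report lines are samples.

```perl
use strict;
use warnings;

# Print each report line to whichever filehandle is passed in.
# Returns the number of lines written.
sub write_report {
    my ($out, @lines) = @_;
    print {$out} "$_\n" for @lines;
    return scalar @lines;
}

my @lines = ("Java - graham rupert", "PHP - harry hazel");   # sample lines

# With a file name on the command line, write there; otherwise to the screen.
my $target = shift @ARGV;
if (defined $target) {
    open my $fh, '>', $target or die "Cannot write $target: $!\n";
    write_report($fh, @lines);
    close $fh;
}
else {
    write_report(\*STDOUT, @lines);
}
```

Under CGI the output also goes to STDOUT, preceded by a header such as "Content-type: text/plain" and a blank line.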

Let's see a short example - and this is a general one, using data which most people should be able to identify with. I have a data file containing a list of all my staff members - one per line. And after each of their names, on the same line, is a list of the subjects they would like to learn. So:
  ivan Ruby Java Perl Tcl/Tk MySQL
  nigel PHP Python Java Perl
  jenny XML Perl Ruby ASP
  kerry Perl Tcl/Tk Ruby MySQL

And I want to produce a list of subjects, and against each subject a list of the people whose top (first) choice it is to learn - thus:
  Java - graham rupert ulsyees venus xena
  MySQL - ethel fred olivia orpheus uva zachary
  PHP - harry hazel john leane nigel peter rita xavier

and so on.

The Perl code turns out to be short ... I can read the data in with the following:

  open FH,"requests.xyz" or die "Input data file not available\n";
  while ($line = <FH>) {
    @flds = split(/\s+/,$line);
    # $flds[0] is the person's name, $flds[1] their first-choice subject
    push @{$skill{$flds[1]}},$flds[0];
  }

And I can produce my output like this:

  @skilllist = sort keys(%skill);
  foreach $sk (@skilllist) {
    @ordered_names = sort(@{$skill{$sk}});
    print "$sk - @ordered_names\n";
  }
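
For anyone who wants to try this without downloading the data file, here is a self-contained variant that runs under use strict and use warnings. It is a sketch in a slightly different style rather than the course code: the sub names first_choices and report_lines are my own, split ' ' replaces split(/\s+/, ...) so that leading spaces are ignored, and the four sample lines shown earlier are read through an in-memory filehandle.

```perl
use strict;
use warnings;

# Group names by first-choice subject; returns a reference to a hash of lists.
sub first_choices {
    my ($fh) = @_;
    my %skill;
    while (my $line = <$fh>) {
        my ($name, $first) = split ' ', $line;
        next unless defined $first;          # skip blank or name-only lines
        push @{ $skill{$first} }, $name;
    }
    return \%skill;
}

# One "subject - names" line per subject, with subjects and names both sorted.
sub report_lines {
    my ($skill) = @_;
    return map { "$_ - " . join(' ', sort @{ $skill->{$_} }) }
           sort keys %$skill;
}

# The four sample lines from the text, standing in for requests.xyz.
my $sample = <<'END';
ivan Ruby Java Perl Tcl/Tk MySQL
nigel PHP Python Java Perl
jenny XML Perl Ruby ASP
kerry Perl Tcl/Tk Ruby MySQL
END

open my $fh, '<', \$sample or die "Cannot open sample data: $!\n";
print "$_\n" for report_lines(first_choices($fh));
close $fh;
```

To run it against the real file, open requests.xyz in place of the in-memory sample and pass that filehandle to first_choices.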

With code that's this short, a skilled Perl programmer can work very quickly indeed - hidden within the terse instructions is a lot of cleverness, and hidden within Perl are further levels of cleverness which leave the language to do all the hard work internal to sorting the answers, and to allowing the program to work and have enough memory available no matter how many staff are involved. My sample data file had 52 people and about 10 skills in it. But it would work with 500,000 people and 100,000 skills just as well.

But quick to code doesn't necessarily mean quick and easy to learn and indeed there's something of a tradeoff. Other languages (and other styles of Perl coding, come to that) may be easier to learn, but will from then onwards be somewhat slower in code development. So this is why our Perl Courses are a little longer than our other programming courses, and yet the Perl language continues to be so very popular.

Illustration - travelling to work in Edinburgh
(written 2011-06-26)

