PROCESSING LARGE QUANTITIES OF DATA
"Data Munging" is a term that has come to be used for processing quantities of data - reformatting, extraction and so on. It's really what Perl's ALL about - the language has a number of features which make it especially good for the purpose. In this module, we highlight one or two of the more specialist of these features.
ITERATING OVER DATA IN PERL
So you want to do the same thing to every element of an array?
In traditional languages which are not so full-featured as Perl, you'll use a loop, with a keyword such as "while" or "for", and a variable that steps up from 0 or 1 to the length of the array. You can do the same sort of thing in Perl as well if you wish:
#!/usr/bin/perl
# Using loops to pass through an "array"
$tab[0] = $tab[1] = 1;
for ($k = 2; $k < 20; $k++) {
    $tab[$k] = $tab[$k-1] + $tab[$k-2];
}
for ($k = 0; $k < @tab; $k++) {
    printf("%4d\n", $tab[$k]);
}
Which gives:
$ ./oldfash
1
1
2
3
5
8
13
21
34
55
89
144
233
377
610
987
1597
2584
4181
6765
$
Although you can use the traditional approach in Perl, there are other approaches too, which are often easier to code and more efficient at run time. Remember that languages like C and Fortran had ARRAYS, which were basic containers for a whole series of variables all of the same type, whereas Perl uses LISTS, which are much more flexible structures on which operations can be performed in their own right.
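For instance, because a list is a value in its own right, you can feed the result of one whole-list operation straight into another without writing a loop at all. A small sketch (the data here is just for illustration):

```perl
#!/usr/bin/perl
# Whole-list operations - no explicit loop needed
@words = ("pelican", "emu", "wren", "albatross");
# sort returns a new list; reverse flips it - both consume and produce whole lists
@longest_first = reverse sort { length($a) <=> length($b) } @words;
print "@longest_first\n";   # albatross pelican wren emu
```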
Have a look at these two, both of which behave exactly like the example above:
#!/usr/bin/perl
# A better iteration through a list
$tab[0] = $tab[1] = 1;
for ($k = 2; $k < 20; $k++) {
    $tab[$k] = $tab[$k-1] + $tab[$k-2];
}
printf("%4d\n", $_) for (@tab);
#!/usr/bin/perl
# Another iteration through a list
$tab[0] = $tab[1] = 1;
for ($k = 2; $k < 20; $k++) {
    $tab[$k] = $tab[$k-1] + $tab[$k-2];
}
map { printf("%4d\n", $_) } @tab;
= for (or foreach) can be used as an iterator to pass through each element of a list, performing a statement or block on each.
= map iterates through each element of a list, performing a statement on each and returning the result of each statement in a new list.
= grep iterates through each element of a list, performing a test on each and returning a list of all the elements for which the test returned a true value.
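Before we apply these to real data, here's a minimal sketch of map and grep each returning a new list (the numbers are just illustrative):

```perl
#!/usr/bin/perl
# map and grep both return new lists
@numbers = (1 .. 8);
@squares = map { $_ * $_ } @numbers;       # transform every element
@evens   = grep { $_ % 2 == 0 } @numbers;  # keep only matching elements
print "squares: @squares\n";   # squares: 1 4 9 16 25 36 49 64
print "evens: @evens\n";       # evens: 2 4 6 8
```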
Here's an example program that generates a list (of file names) and then uses map, grep and for to modify, select and iterate through that list and its derivatives:
#!/usr/bin/perl
# Read all file names in current directory
opendir (DH, ".") or die "Cannot open directory: $!";
@indir = readdir(DH);
closedir DH;
# Sizes of files starting with "q" ...
# Get the file names
@qfiles = grep(/^q/, @indir);
# Get the sizes of those files
@fsizes = map {-s} @qfiles;
print "q file sizes: @fsizes\n";
# Largest 10 files in the directory
@ftable = map {[$_, -s]} @indir;
@fts = sort {$$b[1] - $$a[1]} @ftable;
@ft10 = @fts[0..9];
printf ("%8d %s\n", $$_[1], $$_[0]) for (@ft10);
In operation:
$ filedata
q file sizes: 376 375 450 506
4964352 URL.txt
605455 access_log
353968 std.list.w.html
313472 stdrandom
291408 wac
206662 words
175426 317l18contigs6899.txt
140366 phone.list.w.html
114224 postcodes.html
114112 postcodes
$
To help you understand what's happening, we wrote that example to create a number of temporary lists, using a single map or grep in each Perl statement. The code can be shortened, and will run more efficiently:
#!/usr/bin/perl
opendir (DH, ".") or die "Cannot open directory: $!";
# Sizes of files starting with "q" ...
@fsizes = map {-s} grep /^q/, (@indir = readdir DH);
print "q file sizes: @fsizes\n";
# Largest 10 files in the directory
printf ("%8d %s\n", $$_[1], $$_[0]) for
    (sort {$$b[1] - $$a[1]} map {[$_, -s]} @indir)[0..9];
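That map-then-sort-then-extract pipeline is a well-known Perl idiom (often called the Schwartzian transform): pair each value with a precomputed key, sort on the key, then pull the values back out. Here's the same idea applied to plain strings rather than file names (the example data is ours):

```perl
#!/usr/bin/perl
# Sort strings by length, computing each length only once
@words = ("merlin", "owl", "buzzard", "kite");
@by_length = map { $$_[0] }              # step 3: recover the original strings
             sort { $$a[1] <=> $$b[1] }  # step 2: sort on the stored lengths
             map { [$_, length] }        # step 1: pair each string with its length
             @words;
print "@by_length\n";   # owl kite merlin buzzard
```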
PROCESSING DATA THROUGH REGULAR EXPRESSIONS
You may be used to using Perl's regular expressions to match patterns and extract matches from a line of text, perhaps stepping through all the lines of a file. Have you ever thought of using it on the whole contents of a file at one go? You can do so, provided you're sure that the file won't be so large that you'll fill your computer's memory / swap space.
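If you simply want the whole file in one scalar, one idiomatic way (an alternative to the read / -s approach used below) is to clear the input record separator $/ so that the diamond operator reads to end of file - a sketch, using the same postcodes file:

```perl
#!/usr/bin/perl
# Slurp an entire file into one scalar by clearing the input record separator
open (FH, "postcodes") or die "Cannot open postcodes: $!";
$full = do { local $/; <FH> };   # undefined $/ makes <FH> read everything
close FH;
print length($full), " bytes read\n";
```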
Here's an example that reads a UK postcode file and prints out, in reverse order, the names of all postal towns that come under the main Aberdeen office:
#!/usr/bin/perl
# read in and locate appropriate postcodes
# version 1 - conventional programming techniques
open (FH, "postcodes") or die "Cannot open postcodes: $!";
while ($line = <FH>) {
    if ($line =~ /Aberdeen$/) {
        push @aber, $line;
    }
}
for ($k = $#aber; $k >= 0; $k--) {
    print $aber[$k];
}
#!/usr/bin/perl
# read in and locate appropriate postcodes
# version 2 - selection with grep
open (FH, "postcodes") or die "Cannot open postcodes: $!";
@aber = grep(/Aberdeen$/, <FH>);
print reverse @aber;
#!/usr/bin/perl
# read in and locate appropriate postcodes
# version 3 - using regular expressions
open (FH, "postcodes") or die "Cannot open postcodes: $!";
read (FH, $full, -s "postcodes");
@aber = ($full =~ /.*Aberdeen$/mg);
print join("\n", reverse @aber), "\n";
In all cases, the results look like this:
$ pc1 (or pc2 or pc3)
Turriff Aberdeenshire AB3 Aberdeen
Strathdon Aberdeenshire AB3 Aberdeen
Stonehaven Aberdeenshire AB3 Aberdeen
Skene Aberdeenshire AB3 Aberdeen
Peterhead Aberdeenshire AB4 Aberdeen
Macduff Banffshire AB4 Aberdeen
Laurencekirk Kincardineshire AB3 Aberdeen
Keith Banffshire AB5 Aberdeen
Inverurie Aberdeenshire AB5 Aberdeen
Insch Aberdeenshire AB5 Aberdeen
Huntly Aberdeenshire AB5 Aberdeen
Fraserburgh Aberdeenshire AB4 Aberdeen
Ellon Aberdeenshire AB4 Aberdeen
Craigellachie Banffshire AB3 Aberdeen
Buckie Banffshire AB5 Aberdeen
Braemar Aberdeenshire AB3 Aberdeen
Banff Banffshire AB4 Aberdeen
Banchory Kincardineshire AB3 Aberdeen
Ballindalloch Aberdeenshire AB3 Aberdeen
Ballater Aberdeenshire AB3 Aberdeen
Alford Aberdeenshire AB3 Aberdeen
Aboyne Aberdeenshire AB3 Aberdeen
Aberlour Banffshire AB3 Aberdeen
ABERDEEN Aberdeenshire AB1,2 Aberdeen
$
See also
Perl for Larger Projects