PROCESSING LARGE QUANTITIES OF DATA
"Data Munging" is a term that has come to be used for processing quantities of data: reformatting, extraction and so on. It's really what Perl is all about; the language has a number of features which make it especially good for the purpose. In this module, we highlight one or two of the more specialist of these features.
ITERATING OVER DATA IN PERL
So you want to do the same thing to every element of an array?
In traditional languages which are not so full-featured as Perl, you'll use a loop, with a keyword such as "while" or "for", and a variable that steps up from 0 or 1 to the length of the array. You can do the same sort of thing in Perl as well if you wish:
#!/usr/bin/perl
# Using loops to pass through an "array"
$tab[0] = $tab[1] = 1;
for ($k = 2; $k<20; $k++) {
$tab[$k] = $tab[$k-1]+$tab[$k-2];
}
for ($k=0; $k<@tab; $k++) {
printf("%4d\n",$tab[$k]);
}
Which gives:
$ ./oldfash
1
1
2
3
5
8
13
21
34
55
89
144
233
377
610
987
1597
2584
4181
6765
$
Although you can use the traditional approach in Perl, there are other approaches too, which are often easier to code and more efficient at run time. You should remember that languages like C and Fortran had ARRAYS, which were basic containers for a whole lot of variables all of the same type, whereas Perl uses LISTS, which are much more flexible structures upon which operations can be performed in their own right.
Have a look at these two programs, both of which produce exactly the same output as the first example:
#!/usr/bin/perl
# A better iteration through a list
$tab[0] = $tab[1] = 1;
for ($k = 2; $k<20; $k++) {
$tab[$k] = $tab[$k-1]+$tab[$k-2];
}
printf("%4d\n",$_) for (@tab);
#!/usr/bin/perl
# Another iteration through a list
$tab[0] = $tab[1] = 1;
for ($k = 2; $k<20; $k++) {
$tab[$k] = $tab[$k-1]+$tab[$k-2];
}
map {printf("%4d\n",$_) } @tab;
= for (or foreach) can be used as an iterator to pass through each element of a list, performing a statement or block on each.
= map iterates through each element of a list, performing a statement or block on each and returning the results as a new list.
= grep iterates through each element of a list, performing a test on each and returning a list of all the elements for which the test gave a true value.
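As a minimal sketch of the three operators just described (the word list and the variable names here are invented for illustration; they are not part of the course examples):

```perl
#!/usr/bin/perl
# Sketch only: @words and the other variable names are invented for illustration
use strict;
use warnings;

my @words = ("quartz", "apple", "quince", "fig");

# map builds a new list from the result of the block for each element
my @lengths = map { length($_) } @words;

# grep keeps only the elements for which the test is true
my @qwords = grep { /^q/ } @words;

# for simply visits each element in turn; $_ is the current element
print "$_\n" for @qwords;

print "lengths: @lengths\n";
```

In each case $_ stands for the element currently being looked at, so none of the three needs an index variable.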
Here's an example program that generates a list (of file names) and then uses map, grep and for to modify, select and iterate through that list and its derivatives:
#!/usr/bin/perl
# Read all file names in current directory
opendir (DH,".");
@indir = readdir(DH);
# Sizes of files starting with "q" ...
# Get the file names
@qfiles = grep(/^q/,@indir);
# Get the sizes of those files
@fsizes = map {-s} @qfiles;
print "q file sizes: @fsizes\n";
# Largest 10 files in the directory
@ftable = map {[$_,-s]} @indir;
@fts = sort {$$b[1]-$$a[1]} @ftable;
@ft10 = @fts[0..9];
printf ("%8d %s\n",$$_[1],$$_[0]) for (@ft10);
In operation:
$ filedata
q file sizes: 376 375 450 506
4964352 URL.txt
605455 access_log
353968 std.list.w.html
313472 stdrandom
291408 wac
206662 words
175426 317l18contigs6899.txt
140366 phone.list.w.html
114224 postcodes.html
114112 postcodes
$
To help you understand what's happening, we wrote that example to create a number of temporary lists, using a single map or grep in each Perl statement. The code could be shortened, and should then run more efficiently because fewer intermediate lists are stored:
#!/usr/bin/perl
opendir (DH,".");
# Sizes of files starting with "q" ...
@fsizes = map {-s} grep /^q/,(@indir = readdir DH);
print "q file sizes: @fsizes\n";
# Largest 10 files in the directory
printf ("%8d %s\n",$$_[1],$$_[0]) for
(sort {$$b[1]-$$a[1]} map {[$_,-s]} @indir)[0..9];
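One thing to watch with the [0..9] slice: if there are fewer than ten entries to report, the result can contain undefined elements. A minimal sketch of one way to guard against that, using an invented table of names and sizes rather than a real directory (the sample data, $want and the other variable names are assumptions for illustration):

```perl
#!/usr/bin/perl
# Sketch only: the names and sizes below are invented sample data
use strict;
use warnings;

my @ftable = (["words", 206662], ["wac", 291408], ["notes", 512]);
my $want   = 10;    # how many entries we would like to report

# Sort largest first; <=> is Perl's numeric comparison operator
my @sorted = sort { $b->[1] <=> $a->[1] } @ftable;

# Never slice past the end of the list
my $last = $#sorted < $want - 1 ? $#sorted : $want - 1;
my @top  = @sorted[0 .. $last];

printf("%8d %s\n", $_->[1], $_->[0]) for @top;
```

With only three entries in the table, three lines are printed rather than ten. The $_->[1] notation used here is just another way of writing $$_[1].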
PROCESSING DATA THROUGH REGULAR EXPRESSIONS
You may be used to using Perl's regular expressions to match patterns and extract matches from a line of text, perhaps stepping through all the lines of a file. Have you ever thought of using them on the whole contents of a file in one go? You can do so, provided you're sure that the file won't be so large that you'll fill your computer's memory / swap space.
Here's an example, written three ways, that reads a UK postcode file and prints out, in reverse order, the names of all postal towns that come under the main Aberdeen office:
#!/usr/bin/perl
# read in and locate appropriate postcodes
# version 1 - conventional programming techniques
open (FH,"postcodes") ;
while ($line = <FH>) {
if ($line =~ /Aberdeen$/) {
push @aber,$line;
}
}
# $#aber is the index of the last element of @aber
for ($k=$#aber;$k>=0;$k--) {
print $aber[$k];
}
#!/usr/bin/perl
# read in and locate appropriate postcodes
# version 2 - selection with grep
open (FH,"postcodes") ;
@aber = grep(/Aberdeen$/,<FH>);
print reverse @aber;
#!/usr/bin/perl
# read in and locate appropriate postcodes
# version 3 - using regular expressions
open (FH,"postcodes") ;
read (FH,$full, -s "postcodes");
@aber = ($full =~ /.*Aberdeen$/mg);
print (join("\n",reverse @aber),"\n");
In all cases, the results look like this:
$ pc1 (or pc2 or pc3)
Turriff Aberdeenshire AB3 Aberdeen
Strathdon Aberdeenshire AB3 Aberdeen
Stonehaven Aberdeenshire AB3 Aberdeen
Skene Aberdeenshire AB3 Aberdeen
Peterhead Aberdeenshire AB4 Aberdeen
Macduff Banffshire AB4 Aberdeen
Laurencekirk Kincardineshire AB3 Aberdeen
Keith Banffshire AB5 Aberdeen
Inverurie Aberdeenshire AB5 Aberdeen
Insch Aberdeenshire AB5 Aberdeen
Huntly Aberdeenshire AB5 Aberdeen
Fraserburgh Aberdeenshire AB4 Aberdeen
Ellon Aberdeenshire AB4 Aberdeen
Craigellachie Banffshire AB3 Aberdeen
Buckie Banffshire AB5 Aberdeen
Braemar Aberdeenshire AB3 Aberdeen
Banff Banffshire AB4 Aberdeen
Banchory Kincardineshire AB3 Aberdeen
Ballindalloch Aberdeenshire AB3 Aberdeen
Ballater Aberdeenshire AB3 Aberdeen
Alford Aberdeenshire AB3 Aberdeen
Aboyne Aberdeenshire AB3 Aberdeen
Aberlour Banffshire AB3 Aberdeen
ABERDEEN Aberdeenshire AB1,2 Aberdeen
$
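Version 3 relies on -s to tell read how many bytes to fetch. Another common way to pull a whole file into one string is to undefine the input record separator $/ for the duration of the read. A minimal sketch, using an in-memory filehandle with a few sample lines in place of the real postcodes file (the sample data and variable names are assumptions for illustration):

```perl
#!/usr/bin/perl
# Sketch only: $data stands in for the contents of the postcodes file
use strict;
use warnings;

my $data = "Alford Aberdeenshire AB3 Aberdeen\n"
         . "Bath Somerset BA1 Bath\n"
         . "Banff Banffshire AB4 Aberdeen\n";

# Open a filehandle on the string; a real program would open the file instead
open(my $fh, "<", \$data) or die "Cannot open in-memory data: $!";

my $full = do {
    local $/;      # no record separator, so <$fh> returns everything at once
    <$fh>;
};
close($fh);

# With /m, $ matches at the end of every line; with /g, all matches are returned
my @aber = ($full =~ /.*Aberdeen$/mg);
print join("\n", reverse @aber), "\n";
```

Because $/ is changed with local inside the do block, it goes back to its normal value as soon as the block ends, so any later line-by-line reads are not affected.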