Data Monging

PROCESSING LARGE QUANTITIES OF DATA

"Data Monging" (more often spelled "data munging") is a term that has come to be used for processing quantities of data: reformatting, extraction and so on. It's really what Perl's ALL about; the language has a number of features that make it especially good for the purpose. In this module, we highlight one or two of the more specialist of these features.

ITERATING OVER DATA IN PERL

So you want to do the same thing to every element of an array?

In traditional languages which are not so full-featured as Perl, you'll use a loop, with a keyword such as "while" or "for", and a variable that steps up from 0 or 1 to the length of the array. You can do the same sort of thing in Perl as well if you wish:

#!/usr/bin/perl

# Using loops to pass through an "array"

$tab[0] = $tab[1] = 1;

for ($k = 2; $k<20; $k++) {
 $tab[$k] = $tab[$k-1]+$tab[$k-2];
 }

for ($k=0; $k<@tab; $k++) {
 printf("%4d\n",$tab[$k]);
 }

Which gives:

$ ./oldfash
   1
   1
   2
   3
   5
   8
  13
  21
  34
  55
  89
 144
 233
 377
 610
 987
1597
2584
4181
6765
$

Although you can use the traditional approach in Perl, there are other approaches too, which are often easier to code and more efficient at run time. You should remember that languages like C and Fortran have ARRAYS, which are basic containers for a whole lot of variables all of the same type, whereas Perl uses LISTS, which are much more flexible structures upon which operations can be performed in their own right.

Have a look at these two programs, both of which produce exactly the same output as the loop example above:

#!/usr/bin/perl

# A better iteration through a list

$tab[0] = $tab[1] = 1;

for ($k = 2; $k<20; $k++) {
 $tab[$k] = $tab[$k-1]+$tab[$k-2];
 }

printf("%4d\n",$_) for (@tab);

#!/usr/bin/perl

# Another iteration through a list

$tab[0] = $tab[1] = 1;

for ($k = 2; $k<20; $k++) {
 $tab[$k] = $tab[$k-1]+$tab[$k-2];
 }

map {printf("%4d\n",$_) } @tab;

= for (or foreach) can be used as an iterator to pass through each element of a list, performing a statement or block on each.

= map iterates through each element of a list, performing a statement on each and returning the result of each statement into a new list.

= grep iterates through each element of a list, performing a test on each and returning a list of all the elements for which the test gave a true value.
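
As a quick illustration of those three descriptions, here is a minimal sketch. The list @nums and its values are made up for this example and do not come from the programs on this page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data, invented for this sketch
my @nums = (3, 8, 15, 4, 23);

# map: build a new list by transforming each element
my @doubled = map { $_ * 2 } @nums;

# grep: keep only the elements for which the test is true
my @big = grep { $_ > 5 } @nums;

# for: run a statement once per element
print "$_\n" for @big;

print "doubled: @doubled\n";
```

This prints 8, 15 and 23 on separate lines, followed by "doubled: 6 16 30 8 46". Note that @nums itself is left unchanged by both map and grep.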

Here's an example program that generates a list (of file names) and then uses map, grep and for to modify, select and iterate through that list and its derivatives:

#!/usr/bin/perl

# Read all file names in current directory
opendir (DH,".") or die "Cannot open current directory: $!";
@indir = readdir(DH);

# Sizes of files starting with "q" ...

# Get the file names
@qfiles = grep(/^q/,@indir);
# Get the sizes of those files
@fsizes = map {-s} @qfiles;
print "q file sizes: @fsizes\n";

# Largest 10 files in the directory

@ftable = map {[$_,-s]} @indir;
@fts = sort {$$b[1]-$$a[1]} @ftable;
@ft10 = @fts[0..9];
printf ("%8d %s\n",$$_[1],$$_[0]) for (@ft10);

In operation:

$ filedata
q file sizes: 376 375 450 506
 4964352 URL.txt
  605455 access_log
  353968 std.list.w.html
  313472 stdrandom
  291408 wac
  206662 words
  175426 317l18contigs6899.txt
  140366 phone.list.w.html
  114224 postcodes.html
  114112 postcodes
$

To help you understand what's happening, we wrote that example to create a number of temporary lists, using a single map or grep in each Perl statement. The code can be shortened by chaining the operations, and will then run more efficiently:

#!/usr/bin/perl

opendir (DH,".") or die "Cannot open current directory: $!";

# Sizes of files starting with "q" ...

@fsizes = map {-s} grep /^q/,(@indir = readdir DH);
print "q file sizes: @fsizes\n";

# Largest 10 files in the directory

printf ("%8d %s\n",$$_[1],$$_[0]) for
(sort {$$b[1]-$$a[1]} map {[$_,-s]} @indir)[0..9];
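
The pair-up-and-sort pattern used above can be tried without touching the file system. The following is only a sketch under stated assumptions: the %size hash and its contents are hypothetical stand-ins for real file sizes, and the limit of 3 replaces the 10 used above. It also guards the slice, since a slice such as [0..9] can yield undefined entries when the list holds fewer elements than requested.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical name => size data standing in for a real directory
my %size = (
    "words"     => 206662,
    "wac"       => 291408,
    "postcodes" => 114112,
    "notes.txt" => 512,
);

# Pair each name with its size, then sort the pairs largest first
my @sorted = sort { $b->[1] <=> $a->[1] }
             map  { [ $_, $size{$_} ] } keys %size;

# Take at most 3 entries, without running off the end of a short list
my $last = $#sorted < 2 ? $#sorted : 2;
my @top  = @sorted[0 .. $last];

printf("%8d %s\n", $_->[1], $_->[0]) for @top;
```

With this sample data the report lists wac, words and postcodes, in that order. The arrow form $_->[1] means the same as the $$_[1] used above, and <=> is the numeric comparison operator, which gives the same ordering as subtraction for these values.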

PROCESSING DATA THROUGH REGULAR EXPRESSIONS

You may be used to using Perl's regular expressions to match patterns and extract matches from a line of text, perhaps stepping through all the lines of a file. Have you ever thought of using them on the whole contents of a file in one go? You can do so, provided you're sure that the file won't be so large that you'll fill your computer's memory / swap space.

Here's an example that reads a UK postcode file, and prints out in reverse order the names of all postal towns that come under the main Aberdeen office:

#!/usr/bin/perl

# read in and locate appropriate postcodes

# version 1 - conventional programming techniques

open (FH,"postcodes") or die "Cannot open postcodes: $!";

while ($line = <FH>) {
 if ($line =~ /Aberdeen$/) {
  push @aber,$line;
  }
 }
# Start at the last index ($#aber); @aber alone would be one past the end
for ($k=$#aber;$k>=0;$k--) {
 print $aber[$k];
 }

#!/usr/bin/perl

# read in and locate appropriate postcodes

# version 2 - selection with grep

open (FH,"postcodes") or die "Cannot open postcodes: $!";

@aber = grep(/Aberdeen$/,<FH>);

print reverse @aber;

#!/usr/bin/perl

# read in and locate appropriate postcodes

# version 3 - using regular expressions

open (FH,"postcodes") or die "Cannot open postcodes: $!";
read (FH,$full, -s "postcodes");

@aber = ($full =~ /.*Aberdeen$/mg);

print (join("\n",reverse @aber),"\n");

In all cases, the results look like this:

$ pc1 (or pc2 or pc3)
    Turriff Aberdeenshire AB3 Aberdeen
    Strathdon Aberdeenshire AB3 Aberdeen
    Stonehaven Aberdeenshire AB3 Aberdeen
    Skene Aberdeenshire AB3 Aberdeen
    Peterhead Aberdeenshire AB4 Aberdeen
    Macduff Banffshire AB4 Aberdeen
    Laurencekirk Kincardineshire AB3 Aberdeen
    Keith Banffshire AB5 Aberdeen
    Inverurie Aberdeenshire AB5 Aberdeen
    Insch Aberdeenshire AB5 Aberdeen
    Huntly Aberdeenshire AB5 Aberdeen
    Fraserburgh Aberdeenshire AB4 Aberdeen
    Ellon Aberdeenshire AB4 Aberdeen
    Craigellachie Banffshire AB3 Aberdeen
    Buckie Banffshire AB5 Aberdeen
    Braemar Aberdeenshire AB3 Aberdeen
    Banff Banffshire AB4 Aberdeen
    Banchory Kincardineshire AB3 Aberdeen
    Ballindalloch Aberdeenshire AB3 Aberdeen
    Ballater Aberdeenshire AB3 Aberdeen
    Alford Aberdeenshire AB3 Aberdeen
    Aboyne Aberdeenshire AB3 Aberdeen
    Aberlour Banffshire AB3 Aberdeen
    ABERDEEN Aberdeenshire AB1,2 Aberdeen
$
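
Version 3 reads the whole file with read and -s. Another common way to pull a complete file into one string is to undefine the input record separator $/ for a single read. The sketch below shows that alternative; it is not the method used in the programs above. To keep it self-contained it reads from an in-memory string, and the three sample records in $data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample records standing in for the postcodes file
my $data = join "",
    "Aboyne Aberdeenshire AB3 Aberdeen\n",
    "Bath Somerset BA1 Bath\n",
    "Banff Banffshire AB4 Aberdeen\n";

# Open the string as an in-memory file so the sketch needs no real file
open(my $fh, "<", \$data) or die "Cannot open sample data: $!";

# With $/ undefined, a single read returns the whole contents
my $full = do { local $/; <$fh> };
close $fh;

# /m lets $ match at each line end; /g collects every match
my @aber = ($full =~ /^.*Aberdeen$/mg);

print join("\n", reverse @aber), "\n";
```

With this sample data the Banff line is printed first and the Aboyne line second; the Bath line is not selected. The same caution about memory applies: the whole input is held in $full at once.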



© WELL HOUSE CONSULTANTS LTD., 2019: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01225 708225 • FAX: 01225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/solutions/perl-data-monging.html • PAGE BUILT: Wed Mar 28 07:47:11 2012 • BUILD SYSTEM: wizard