PROCESSING LARGE QUANTITIES OF DATA
"Data Munging" is a term that has come to be used for processing quantities of data - reformatting, extraction and so on. It's really what Perl's ALL about - the language has a number of features which make it especially good for the purpose. In this module, we highlight one or two of the more specialist of these features.
ITERATING OVER DATA IN PERL
So you want to do the same thing to every element of an array?
In traditional languages which are not so full-featured as Perl, you'll use a loop, with a keyword such as "while" or "for", and a variable that steps up from 0 or 1 to the length of the array. You can do the same sort of thing in Perl as well if you wish:
#!/usr/bin/perl
# Using loops to pass through an "array"
$tab[0] = $tab[1] = 1;
for ($k = 2; $k < 20; $k++) {
    $tab[$k] = $tab[$k-1] + $tab[$k-2];
}
for ($k = 0; $k < @tab; $k++) {
    printf("%4d\n", $tab[$k]);
}
Which gives:
$ ./oldfash
1
1
2
3
5
8
13
21
34
55
89
144
233
377
610
987
1597
2584
4181
6765
$
Although you can use the traditional approach in Perl, there are other approaches too, which are often easier to code and more efficient at run time. Remember that languages like C and Fortran had ARRAYS, which were basic containers for a whole series of variables all of the same type, whereas Perl uses LISTS, which are much more flexible structures on which operations can be performed in their own right.
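For instance, because a list is a value in its own right, you can feed the result of one whole-list operation straight into another without writing a loop at all. A small sketch (the data here is just for illustration):

```perl
#!/usr/bin/perl
# Whole-list operations - no explicit loop needed
@words = ("pelican", "emu", "wren", "albatross");
# sort returns a new list; reverse flips it - both consume and produce whole lists
@longest_first = reverse sort { length($a) <=> length($b) } @words;
print "@longest_first\n";   # albatross pelican wren emu
```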
Have a look at these two, both of which behave exactly like the example above:
#!/usr/bin/perl
# A better iteration through a list
$tab[0] = $tab[1] = 1;
for ($k = 2; $k < 20; $k++) {
    $tab[$k] = $tab[$k-1] + $tab[$k-2];
}
printf("%4d\n", $_) for (@tab);
#!/usr/bin/perl
# Another iteration through a list
$tab[0] = $tab[1] = 1;
for ($k = 2; $k < 20; $k++) {
    $tab[$k] = $tab[$k-1] + $tab[$k-2];
}
map { printf("%4d\n", $_) } @tab;
= for (or foreach) can be used as an iterator to pass through each element of a list, performing a statement or block on each.
= map iterates through each element of a list, performing a statement on each and returning the result of each statement in a new list.
= grep iterates through each element of a list, performing a test on each and returning a list of all the elements for which the test returned a true value.
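Before we apply these to real data, here's a minimal sketch of map and grep each returning a new list (the numbers are just illustrative):

```perl
#!/usr/bin/perl
# map and grep both return new lists
@numbers = (1 .. 8);
@squares = map { $_ * $_ } @numbers;       # transform every element
@evens   = grep { $_ % 2 == 0 } @numbers;  # keep only matching elements
print "squares: @squares\n";   # squares: 1 4 9 16 25 36 49 64
print "evens: @evens\n";       # evens: 2 4 6 8
```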
Here's an example program that generates a list (of file names) and then uses map, grep and for to modify, select and iterate through that list and its derivatives:
#!/usr/bin/perl
# Read all file names in current directory
opendir (DH, ".") or die "Cannot open directory: $!";
@indir = readdir(DH);
closedir DH;
# Sizes of files starting with "q" ...
# Get the file names
@qfiles = grep(/^q/, @indir);
# Get the sizes of those files
@fsizes = map {-s} @qfiles;
print "q file sizes: @fsizes\n";
# Largest 10 files in the directory
@ftable = map {[$_, -s]} @indir;
@fts = sort {$$b[1] - $$a[1]} @ftable;
@ft10 = @fts[0..9];
printf ("%8d %s\n", $$_[1], $$_[0]) for (@ft10);
In operation:
$ filedata
q file sizes: 376 375 450 506
4964352 URL.txt
605455 access_log
353968 std.list.w.html
313472 stdrandom
291408 wac
206662 words
175426 317l18contigs6899.txt
140366 phone.list.w.html
114224 postcodes.html
114112 postcodes
$
To help you understand what's happening, we wrote that example to create a number of temporary lists, using a single map or grep in each Perl statement. The code can be shortened, and will run more efficiently:
#!/usr/bin/perl
opendir (DH, ".") or die "Cannot open directory: $!";
# Sizes of files starting with "q" ...
@fsizes = map {-s} grep /^q/, (@indir = readdir DH);
print "q file sizes: @fsizes\n";
# Largest 10 files in the directory
printf ("%8d %s\n", $$_[1], $$_[0]) for
    (sort {$$b[1] - $$a[1]} map {[$_, -s]} @indir)[0..9];
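That map-then-sort-then-extract pipeline is a well-known Perl idiom (often called the Schwartzian transform): pair each value with a precomputed key, sort on the key, then pull the values back out. Here's the same idea applied to plain strings rather than file names (the example data is ours):

```perl
#!/usr/bin/perl
# Sort strings by length, computing each length only once
@words = ("merlin", "owl", "buzzard", "kite");
@by_length = map { $$_[0] }              # step 3: recover the original strings
             sort { $$a[1] <=> $$b[1] }  # step 2: sort on the stored lengths
             map { [$_, length] }        # step 1: pair each string with its length
             @words;
print "@by_length\n";   # owl kite merlin buzzard
```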
PROCESSING DATA THROUGH REGULAR EXPRESSIONS
You may be used to using Perl's regular expressions to match patterns and extract matches from a line of text, perhaps stepping through all the lines of a file. Have you ever thought of using it on the whole contents of a file at one go? You can do so, provided you're sure that the file won't be so large that you'll fill your computer's memory / swap space.
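If you simply want the whole file in one scalar, one idiomatic way (an alternative to the read / -s approach used below) is to clear the input record separator $/ so that the diamond operator reads to end of file - a sketch, using the same postcodes file:

```perl
#!/usr/bin/perl
# Slurp an entire file into one scalar by clearing the input record separator
open (FH, "postcodes") or die "Cannot open postcodes: $!";
$full = do { local $/; <FH> };   # undefined $/ makes <FH> read everything
close FH;
print length($full), " bytes read\n";
```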
Here's an example that reads a UK postcode file and prints out, in reverse order, the names of all postal towns that come under the main Aberdeen office:
#!/usr/bin/perl
# read in and locate appropriate postcodes
# version 1 - conventional programming techniques
open (FH, "postcodes") or die "Cannot open postcodes: $!";
while ($line = <FH>) {
    if ($line =~ /Aberdeen$/) {
        push @aber, $line;
    }
}
for ($k = $#aber; $k >= 0; $k--) {
    print $aber[$k];
}
#!/usr/bin/perl
# read in and locate appropriate postcodes
# version 2 - selection with grep
open (FH, "postcodes") or die "Cannot open postcodes: $!";
@aber = grep(/Aberdeen$/, <FH>);
print reverse @aber;
#!/usr/bin/perl
# read in and locate appropriate postcodes
# version 3 - using regular expressions
open (FH, "postcodes") or die "Cannot open postcodes: $!";
read (FH, $full, -s "postcodes");
@aber = ($full =~ /.*Aberdeen$/mg);
print join("\n", reverse @aber), "\n";
In all cases, the results look like this:
$ pc1 (or pc2 or pc3)
Turriff Aberdeenshire AB3 Aberdeen
Strathdon Aberdeenshire AB3 Aberdeen
Stonehaven Aberdeenshire AB3 Aberdeen
Skene Aberdeenshire AB3 Aberdeen
Peterhead Aberdeenshire AB4 Aberdeen
Macduff Banffshire AB4 Aberdeen
Laurencekirk Kincardineshire AB3 Aberdeen
Keith Banffshire AB5 Aberdeen
Inverurie Aberdeenshire AB5 Aberdeen
Insch Aberdeenshire AB5 Aberdeen
Huntly Aberdeenshire AB5 Aberdeen
Fraserburgh Aberdeenshire AB4 Aberdeen
Ellon Aberdeenshire AB4 Aberdeen
Craigellachie Banffshire AB3 Aberdeen
Buckie Banffshire AB5 Aberdeen
Braemar Aberdeenshire AB3 Aberdeen
Banff Banffshire AB4 Aberdeen
Banchory Kincardineshire AB3 Aberdeen
Ballindalloch Aberdeenshire AB3 Aberdeen
Ballater Aberdeenshire AB3 Aberdeen
Alford Aberdeenshire AB3 Aberdeen
Aboyne Aberdeenshire AB3 Aberdeen
Aberlour Banffshire AB3 Aberdeen
ABERDEEN Aberdeenshire AB1,2 Aberdeen
$
See also
Perl for Larger Projects