Exercises, examples and other material relating to training module P667. This topic is presented on our public course
Perl for Larger Projects
If you have so much data that it won't all fit into memory at once, conventional programming techniques may not be enough to complete your task. We call a data set such as this "huge data"; it's impossible to handle in some languages, but very practical in Perl. This module doesn't introduce many new language features; instead, it shows you how to use what you already know to handle huge data practically.
Related technical and longer articles
Data Munging
Articles and tips on this subject | updated |
3375 | How to interact with a Perl program while it is processing data If you have a long running program, how do you monitor its progress?
You could use tail -f on an output file ... but there are other options too.
a) You can output a progress line; there's an example of this from the recent Perl for Larger Projects course ... [here].
b) You can trap ^C ... and even ... | 2011-07-31 |
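The progress line mentioned in (a) can be sketched in a few lines of Perl. This is a minimal illustration, not the course example itself: the record count and the "work" inside the loop are invented, and a real program would be reading from a file or database.

```perl
#!/usr/bin/perl
# Sketch: an in-place progress line for a long-running loop.
use strict;
use warnings;

$| = 1;                          # unbuffer STDOUT so each update appears at once
my $total   = 1000;              # assumed number of records for the illustration
my $percent = 0;
for my $done (1 .. $total) {
    # ... process one record here ...
    next if $done % 100;         # only update the display now and then
    $percent = int(100 * $done / $total);
    print "\rProcessed $done of $total records ($percent%)";
}
print "\n";
```

The `\r` (carriage return without line feed) rewrites the same screen line each time, which is why STDOUT must be unbuffered with `$| = 1`.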
3374 | Speeding up your Perl code On Friday morning - on our Perl for Larger Projects course - I was looking at coding efficiency / run speed with delegates. As an example, we took a data file from our web server logs - some 23 Mbytes of data, comprising about 121,000 lines, of which 1099 contained the word "melksham" in lower case.
The ... | 2011-07-30 |
2834 | Teaching examples in Perl - third and final part Three part article ... this is part 3. Jump back to part [1] [2]
Following on from two earlier posts, here is the final third of the new examples that I wrote during last week's Perl course, and to which I have added extra documentation over the last couple of days.
P212 More on Character Strings
"Does ... | 2010-06-27 (longer) |
2805 | How are you getting on? Have you ever asked someone to do something for you ... a long task, and you would like a progress report? "How are you getting on?" you'll ask ... and they'll give you an update - "I'm 75% of the way through" they'll say or - perhaps even more helpfully - "I'm nearly there, and I have some good results ... | 2010-06-18 |
2806 | Macho matching - do not do it! There's something vaguely macho about doing a grand regular expression match to do all your filtering in a single line of code - but being macho may be less than efficient. It may be far better to do two shorter matches, with the first quickly rejecting records which don't need to be handled in detail, ... | 2010-06-18 |
2376 | Long job - progress bar techniques (Perl) Here's a "Perl for Larger Projects" example - for use in illustrating the "advanced file and directory handling" and "handling huge data set" modules.
Scenario ... I want to go through all the files and directories on a big drive, and find the largest file(s). It will take a while, so I want progress ... | 2009-08-26 |
1920 | Progress Bar Techniques - Perl Have you ever sat there and wondered "is this program nearly done ... is it still running ... how is it getting on" and wished you had a progress bar. But then have you ever watched a jerky progress bar and felt that it's more fiction than fact?
We were discussing these aspects on today's private Perl ... | 2008-12-10 |
1924 | Preventing ^C stopping / killing a program - Perl Here's a demonstration - in Perl - that shows you how to avoid a ^C (Control C) dropping you straight out of a program.
Have you ever accidentally hit ^C in the wrong window and terminated a long-running process just before it finished ... well, by setting $SIG{INT} to the address of a sub you want ... | 2008-12-07 |
1397 | Perl - progress bar, suppressing ^C and coping with huge data flows If you're handling a huge amount of data (gigabytes!) in a Perl program, memory won't allow you to slurp it all into a list, so you'll traverse the data with a loop from file or from database. And because of the sheer volume of data, it may take a while to process. During such processing, you may wish ... | 2007-10-23 |
975 | Answering ALL the delegates' Perl questions During courses, questions arise. "I'll get back to that" could make people feel that I'm brushing something off ... except that I explain, early on, that some questions require a great deal of background knowledge to be answered sensibly. And I keep a list of topics that I'll be getting back to ... | 2006-12-09 |
762 | Huge data files - what happened earlier? When I'm programming a log file analysis in Perl, I'll often "slurp" the whole file into a list which I can then traverse efficiently as many times as I need. If I need to look backwards from some interesting event to see what happened in the immediate lead up to it, I can do so simply by looking at ... | 2006-06-15 |
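The slurp-and-look-behind idea can be sketched as follows. The sample log records here are invented; a real program would fill `@lines` from a file with `my @lines = <$fh>;`. Once the whole file is in a list, the lead-up to an interesting event is just an array slice.

```perl
#!/usr/bin/perl
# Sketch: with the whole file held in a list, looking backwards from
# an interesting line is a simple slice.
use strict;
use warnings;

my @lines = ("ok 1\n", "ok 2\n", "load rising\n", "disk slow\n", "ERROR crash\n");

for my $i (0 .. $#lines) {
    next unless $lines[$i] =~ /^ERROR/;
    my $from = $i >= 3 ? $i - 3 : 0;        # up to three lines of lead-up
    print "Lead-up to line $i:\n", @lines[$from .. $i];
    last;
}
```

This is exactly what memory constraints rule out for truly huge files - which is why the later examples in this module index the file and `seek` instead.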
639 | Progress bars and other dynamic reports If you've got a program that runs for a long time, your users will wish to be kept informed of progress and how much longer there is to go. Now that's not always easy to predict (and I'm sure that most of you have made fun of such forecasts in the past) but its's much much much better than sitting ... | 2006-06-05 |
Examples from our training material
behind | looking behind in huge data files |
big.start | Finding largest file, with intermediate status reports |
huge1 | A program to test handling a small part of a huge data set |
huge2 | Providing user feedback while handling huge data |
huge3 | Asking a long running application for intermediate reports |
huge3.pid | Example of the huge.pid file |
hugehunter | Long log file analysis, with progress and intermediate reporting |
makedirs | Preprocessing a huge data file to set up indexes |
makeindex | Generating a list of markers to a huge sorted data set |
mtx | Merging two huge files |
opt2 | Sorting and data filtering efficiency |
opt3 | Improving sort efficiency |
opt4 | Improving sort efficiency further - caching record analysis |
optim | Optimising code to avoid repeating calculations |
out.txt | Example of search results written to file |
paws | Progress Bar Techniques |
readtime | Efficiency - reading a file in large blocks |
reg_opt | Regular expression match - inefficient example |
reg_opt1 | Regular expression match - don't save $' $` and $& |
reg_opt2 | Regular expression match - use of "o" modifier |
reg_opt3 | Regular expression match - more specific and faster |
reg_opt4 | Regular expression match - a start assertion speeds it up! |
rt2 | Handling data in chunks - chunk overlap issue solved |
site.pm | Class used in other examples in this module |
slurp | slurping and sampling |
useindex | Grab first ten sites on a topic area - QUICKLY via index |
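The idea behind the makeindex and useindex examples above can be sketched as follows. This is a simplified illustration, not the course code: it records plain line offsets with `tell()` on a first pass, then jumps straight to a chosen record with `seek()`, where the real examples index a sorted data set by key. The sample data file is generated on the fly so the sketch is self-contained.

```perl
#!/usr/bin/perl
# Sketch: index a file by byte offset once, then seek to any record
# later without rereading the whole file.
use strict;
use warnings;

# build a small sample "huge" file
my $file = "sample.dat";
open my $out, '>', $file or die "Cannot write $file: $!";
print $out "record $_: some data\n" for 1 .. 10;
close $out;

# pass 1: record the byte offset of the start of each line
open my $in, '<', $file or die "Cannot read $file: $!";
my @offset;
while (1) {
    push @offset, tell $in;
    defined(my $line = <$in>) or last;
}
pop @offset;        # discard the offset recorded at end of file
close $in;

# later: jump straight to record 7 (index 6) without rereading
open $in, '<', $file or die "Cannot read $file: $!";
seek $in, $offset[6], 0;
my $record = <$in>;
print $record;            # record 7: some data
close $in;
unlink $file;
```

For a genuinely huge file the index itself can be kept small (one marker per block of records rather than per line), which is the trade-off the module discusses.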
Pictures
A happy trainee
Background information
Some modules are available for download as a sample of our material, or under an Open Training Notes License, for free download from [here].
Topics covered in this module
Planning.
General techniques for large and huge data.
Code Optimization.
Regular Expressions.
Sorting.
Large Data.
Avoid loops.
Store data in memory.
Huge Data.
Hello HUGE world.
User feedback.
Controlling a long-running process.
Reading the data.
Arranging and storing the data.
Using a directory structure.
Indexing.
For Reference.
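The "Reading the data" topic - pulling the file in large blocks rather than line by line, and coping with a line split across two blocks (the issue the readtime and rt2 examples address) - can be sketched as follows. The thousand-line sample file is generated on the fly for the illustration.

```perl
#!/usr/bin/perl
# Sketch: read a file in 64K blocks, carrying any partial final line
# over to the next block so no line is ever counted in two pieces.
use strict;
use warnings;

my $file = "blocks.dat";
open my $out, '>', $file or die "Cannot write $file: $!";
print $out "line $_\n" for 1 .. 1000;
close $out;

open my $in, '<', $file or die "Cannot read $file: $!";
my ($tail, $count) = ('', 0);
while (read $in, my $block, 65536) {
    $block = $tail . $block;                 # prepend leftover from last block
    $block =~ s/([^\n]*)\z//;                # strip (possibly empty) part line
    $tail  = $1;                             # ... and carry it forward
    $count += $block =~ tr/\n//;             # count complete lines cheaply
}
$count++ if length $tail;                    # final line with no newline
close $in;
unlink $file;
print "$count lines\n";
```

Reading in big blocks avoids the per-line overhead of `<$in>`, and `tr///` in a counting context is one of the cheapest ways to tally characters in Perl.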
Complete learning
If you are looking for a complete course, and not just information on a single subject, visit our
Listing and schedule page.
Well House Consultants specialise in training courses in
Ruby,
Lua,
Python,
Perl,
PHP, and
MySQL. We run
Private Courses throughout the UK (and beyond for longer courses), and
Public Courses at our training centre in Melksham, Wiltshire, England.
It's surprisingly cost effective to come on our public courses - even if you live in a different country or continent to us.
We have a technical library of over 700 books on the subjects on which we teach.
These books are available for reference at our training centre.