Exercises, examples and other material relating to training module P667. This topic is presented on our public course
Perl for Larger Projects
If you have so much data that it won't all fit into memory at once, conventional programming techniques may not be enough to complete your task. We call a data set such as this "huge data"; it's impossible to handle in some languages, but very practical in Perl. This module doesn't introduce many new language features; instead, it shows you how to use what you already know to handle huge data practically.
Related technical and longer articles
Data Munging
Articles and tips on this subject | updated |
3375 | How to interact with a Perl program while it is processing data If you have a long running program, how do you monitor its progress?
You could use tail -f on an output file ... but there are other options too.
a) You can output a progress line; there's an example of this from the recent Perl for Larger Projects course ... [here].
b) You can trap ^C ... and even ... | 2011-07-31 |
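The progress line mentioned in (a) can be sketched in a few lines of Perl. This is a minimal illustration, not the course example itself: the record count and the "work" inside the loop are invented, and a real program would be reading from a file or database.

```perl
#!/usr/bin/perl
# Sketch: an in-place progress line for a long-running loop.
use strict;
use warnings;

$| = 1;                          # unbuffer STDOUT so each update appears at once
my $total   = 1000;              # assumed number of records for the illustration
my $percent = 0;
for my $done (1 .. $total) {
    # ... process one record here ...
    next if $done % 100;         # only update the display now and then
    $percent = int(100 * $done / $total);
    print "\rProcessed $done of $total records ($percent%)";
}
print "\n";
```

The `\r` (carriage return without line feed) rewrites the same screen line each time, which is why STDOUT must be unbuffered with `$| = 1`.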
3374 | Speeding up your Perl code On Friday morning - on our Perl for Larger Projects course - I was looking at coding efficiency / run speed with delegates. As an example, we took a data file from our web server logs - some 23 Mbytes of data, comprising about 121,000 lines, of which 1099 contained the word "melksham" in lower case.
The ... | 2011-07-30 |
2834 | Teaching examples in Perl - third and final part Three part article ... this is part 3. Jump back to part [1] [2]
Following on from two earlier posts, here is the final third of the new examples that I wrote during last week's Perl course, and to which I have added extra documentation over the last couple of days.
P212 More on Character Strings
"Does ... | 2010-06-27 (longer) |
2805 | How are you getting on? Have you ever asked someone to do something for you ... a long task, and you would like a progress report? "How are you getting on?" you'll ask ... and they'll give you an update - "I'm 75% of the way through" they'll say or - perhaps even more helpfully - "I'm nearly there, and I have some good results ... | 2010-06-18 |
2806 | Macho matching - do not do it! There's something vaguely macho about doing a grand regular expression match to do all your filtering in a single line of code - but being macho may be less than efficient. It may be far better to do two shorter matches, with the first quickly rejecting records which don't need to be handled in detail, ... | 2010-06-18 |
2376 | Long job - progress bar techniques (Perl) Here's a "Perl for Larger Projects" example - for use in illustrating the "advanced file and directory handling" and "handling huge data set" modules.
Scenario ... I want to go through all the files and directories on a big drive, and find the largest file(s). It will take a while, so I want progress ... | 2009-08-26 |
1920 | Progress Bar Techniques - Perl Have you ever sat there and wondered "is this program nearly done ... is it still running ... how is it getting on" and wished you had a progress bar. But then have you ever watched a jerky progress bar and felt that it's more fiction than fact?
We were discussing these aspects on today's private Perl ... | 2008-12-10 |
1924 | Preventing ^C stopping / killing a program - Perl Here's a demonstration - in Perl - that shows you how to avoid a ^C (Control C) dropping you straight out of a program.
Have you ever accidentally hit ^C in the wrong window and terminated a long-running process just before it finished ... well, by setting $SIG{INT} to the address of a sub you want ... | 2008-12-07 |
1397 | Perl - progress bar, suppressing ^C and coping with huge data flows If you're handling a huge amount of data (gigabytes!) in a Perl program, memory won't allow you to slurp it all into a list, so you'll traverse the data with a loop from file or from database. And because of the sheer volume of data, it may take a while to process. During such processing, you may wish ... | 2007-10-23 |
975 | Answering ALL the delegates' Perl questions During courses, questions arise. "I'll get back to that" could make people feel that I'm brushing something off ... except that I explain, early on, that some questions require a great deal of background knowledge to be answered sensibly. And I keep a list of topics that I'll be getting back to ... | 2006-12-09 |
762 | Huge data files - what happened earlier? When I'm programming a log file analysis in Perl, I'll often "slurp" the whole file into a list which I can then traverse efficiently as many times as I need. If I need to look backwards from some interesting event to see what happened in the immediate lead up to it, I can do so simply by looking at ... | 2006-06-15 |
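The slurp-and-look-behind idea can be sketched as follows. The sample log records here are invented; a real program would fill `@lines` from a file with `my @lines = <$fh>;`. Once the whole file is in a list, the lead-up to an interesting event is just an array slice.

```perl
#!/usr/bin/perl
# Sketch: with the whole file held in a list, looking backwards from
# an interesting line is a simple slice.
use strict;
use warnings;

my @lines = ("ok 1\n", "ok 2\n", "load rising\n", "disk slow\n", "ERROR crash\n");

for my $i (0 .. $#lines) {
    next unless $lines[$i] =~ /^ERROR/;
    my $from = $i >= 3 ? $i - 3 : 0;        # up to three lines of lead-up
    print "Lead-up to line $i:\n", @lines[$from .. $i];
    last;
}
```

This is exactly what memory constraints rule out for truly huge files - which is why the later examples in this module index the file and `seek` instead.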
639 | Progress bars and other dynamic reports If you've got a program that runs for a long time, your users will wish to be kept informed of progress and how much longer there is to go. Now that's not always easy to predict (and I'm sure that most of you have made fun of such forecasts in the past) but its's much much much better than sitting ... | 2006-06-05 |
Examples from our training material
behind | looking behind in huge data files |
big.start | Finding largest file, with intermediate status reports |
huge1 | A program to test handling a small part of a huge data set |
huge2 | Providing user feedback while handling huge data |
huge3 | Asking a long running application for intermediate reports |
huge3.pid | Example of the huge.pid file |
hugehunter | Long log file analysis, with progress and intermediate reporting |
makedirs | Preprocessing a huge data file to set up indexes |
makeindex | Generating a list of markers to a huge sorted data set |
mtx | Merging two huge files |
opt2 | Sorting and data filtering efficiency |
opt3 | Improving sort efficiency |
opt4 | Improving sort efficiency further - caching record analysis |
optim | Optimising code to avoid repeating calculations |
out.txt | Example of search results written to file |
paws | Progress Bar Techniques |
readtime | Efficiency - reading a file in large blocks |
reg_opt | Regular expression match - inefficient example |
reg_opt1 | Regular expression match - don't save $' $` and $& |
reg_opt2 | Regular expression match - use of "o" modifier |
reg_opt3 | Regular expression match - more specific and faster |
reg_opt4 | Regular expression match - a start assertion speeds it up! |
rt2 | Handling data in chunks - chunk overlap issue solved |
site.pm | Class used in other examples in this module |
slurp | slurping and sampling |
useindex | Grab first ten sites on a topic area - QUICKLY via index |
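The idea behind the makeindex and useindex examples above can be sketched as follows. This is a simplified illustration, not the course code: it records plain line offsets with `tell()` on a first pass, then jumps straight to a chosen record with `seek()`, where the real examples index a sorted data set by key. The sample data file is generated on the fly so the sketch is self-contained.

```perl
#!/usr/bin/perl
# Sketch: index a file by byte offset once, then seek to any record
# later without rereading the whole file.
use strict;
use warnings;

# build a small sample "huge" file
my $file = "sample.dat";
open my $out, '>', $file or die "Cannot write $file: $!";
print $out "record $_: some data\n" for 1 .. 10;
close $out;

# pass 1: record the byte offset of the start of each line
open my $in, '<', $file or die "Cannot read $file: $!";
my @offset;
while (1) {
    push @offset, tell $in;
    defined(my $line = <$in>) or last;
}
pop @offset;        # discard the offset recorded at end of file
close $in;

# later: jump straight to record 7 (index 6) without rereading
open $in, '<', $file or die "Cannot read $file: $!";
seek $in, $offset[6], 0;
my $record = <$in>;
print $record;            # record 7: some data
close $in;
unlink $file;
```

For a genuinely huge file the index itself can be kept small (one marker per block of records rather than per line), which is the trade-off the module discusses.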
Pictures
A happy trainee
Background information
Some modules are available for download as a sample of our material, or under an Open Training Notes License, for free download from [here].
Topics covered in this module
Planning.
General techniques for large and huge data.
Code Optimization.
Regular Expressions.
Sorting.
Large Data.
Avoid loops.
Store data in memory.
Huge Data.
Hello HUGE world.
User feedback.
Controlling a long-running process.
Reading the data.
Arranging and storing the data.
Using a directory structure.
Indexing.
For Reference.
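The "Reading the data" topic - pulling the file in large blocks rather than line by line, and coping with a line split across two blocks (the issue the readtime and rt2 examples address) - can be sketched as follows. The thousand-line sample file is generated on the fly for the illustration.

```perl
#!/usr/bin/perl
# Sketch: read a file in 64K blocks, carrying any partial final line
# over to the next block so no line is ever counted in two pieces.
use strict;
use warnings;

my $file = "blocks.dat";
open my $out, '>', $file or die "Cannot write $file: $!";
print $out "line $_\n" for 1 .. 1000;
close $out;

open my $in, '<', $file or die "Cannot read $file: $!";
my ($tail, $count) = ('', 0);
while (read $in, my $block, 65536) {
    $block = $tail . $block;                 # prepend leftover from last block
    $block =~ s/([^\n]*)\z//;                # strip (possibly empty) part line
    $tail  = $1;                             # ... and carry it forward
    $count += $block =~ tr/\n//;             # count complete lines cheaply
}
$count++ if length $tail;                    # final line with no newline
close $in;
unlink $file;
print "$count lines\n";
```

Reading in big blocks avoids the per-line overhead of `<$in>`, and `tr///` in a counting context is one of the cheapest ways to tally characters in Perl.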
Complete learning
If you are looking for a complete course, and not just information on a single subject, visit our
Listing and schedule page.
Well House Consultants specialise in training courses in
Ruby,
Lua,
Python,
Perl,
PHP, and
MySQL. We run
Private Courses throughout the UK (and beyond for longer courses), and
Public Courses at our training centre in Melksham, Wiltshire, England.
It's surprisingly cost effective to come on our public courses - even if you live in a different country or continent to us.
We have a technical library of over 700 books on the subjects on which we teach.
These books are available for reference at our training centre.