Last week, I found myself teaching a Multi-vendor Advanced Unix Data Tools and Techniques course as a guest presenter. The tools that the course 'majored' on are grep, sed, awk and Perl. Being an advanced course, it assumed some knowledge ahead of time, so other utilities were referred to on the assumption that delegates already knew the fundamentals (cut, sort, and head came up amongst others), and discussion in class also extended to Python. Shells covered were Bash and Ksh.
The objective of the course was to familiarise delegates with the advanced features of the data handling tools, so that they could make maximum use of those tools thereafter in their work - typically in the handling and filtering of large flows of data.
And the $64,000 question that umbrellas over the whole course is "which tool should we use when?"
Let's go through the tools, one at a time, and see how each fits into a pattern ...
Grep is an excellent data filter. Taking incoming flows of information (typically records which are newline delimited), grep gives you the ability to look within each line for a pattern, in isolation from other lines. Results can simply be output, or a count can be output (-c option), or non-matching lines can be output (-v option). Line numbers can be reported, multiple input files can be handled, etc. It's a flexible tool that can perform literal matches (fgrep), basic regular expression matches (grep) and extended regular expression matches (egrep), but it cannot itself perform inline edits or substitutions, or select by line number ranges or across multiple lines.
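As a minimal sketch of those options, here is grep run against a few made-up station records (the sample lines and variable names are my own illustration, not the course data):

```shell
# Made-up sample records (station code, then name); not the real railstats.xyz
data='TLK The Lakes
LAK Lakenheath
MKM Melksham
LKE Lake'

# Output each line that matches a basic regular expression
matches=$(printf '%s\n' "$data" | grep 'Lake')

# -c: output a count of matching lines instead of the lines themselves
count=$(printf '%s\n' "$data" | grep -c 'Lake')

# -v: output the lines that do NOT match
others=$(printf '%s\n' "$data" | grep -v 'Lake')

# -n: report line numbers; -F: literal match (what fgrep does)
numbered=$(printf '%s\n' "$data" | grep -n -F 'Lake')

# -E: extended regular expression match (what egrep does)
extended=$(printf '%s\n' "$data" | grep -E '^(TLK|MKM) ')

printf '%s\n' "$matches" "count: $count" "$others" "$numbered" "$extended"
```

Every one of these looks at a single line at a time, which is exactly the limit described above.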
That's where you move on to ...
Sed. A stream editor - in other words, a tool which takes an incoming flow of data, just like grep, but has rather more capability in what it can do with each line. Like grep, sed can select lines based on them matching a regular expression, but it can also select based on line numbers, and on ranges of lines.
sed -n 3,/AL/p railstats.xyz
says "output only from the third line to the first line thereafter that contains AL". And sed, while it's running, can look ahead to the following line (if you have data with continuation lines, for example), and it can even store data into and recover it from a "hold buffer". As well as just outputting lines / records, sed can edit lines, substituting the matching section of a line with something else, perhaps a literal string or perhaps something that's based on the incoming match via a backreference. Sed even has a labelling, looping and conditional check capability within its commands. But is it the best tool for more complex edits?
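A small sketch of the range selection and the backreference substitution, run against invented data (the five sample lines and the variable names are assumptions for illustration only):

```shell
# Invented sample data: a header line, then "code number" records
data='header
AB 10
CD 20
AL 30
EF 40'

# Range: from line 2 to the first later line containing AL
# (the same address form as 3,/AL/p, just starting one line earlier)
range=$(printf '%s\n' "$data" | sed -n '2,/AL/p')

# Substitution with backreferences: \1 and \2 recall the two bracketed
# groups, so each matching record is output with its fields swapped
swapped=$(printf '%s\n' "$data" | sed -n 's/^\([A-Z][A-Z]\) \([0-9][0-9]*\)$/\2 \1/p')

printf '%s\n' "$range" "$swapped"
```

The header line is left out of the second result because it does not match the pattern, and -n with the p flag outputs only lines on which a substitution was made.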
That's where you move on to ...
Awk. Like sed, awk takes an incoming flow of data and looks at each line to see if it matches patterns or falls at given line numbers, and in the case of awk the selection can be on all sorts of other criteria too. Each line that matches an awk pattern can have a whole block of code run on it, with conditionals, loops, and variables in the classic "programming" way; the "K" in the name awk is the initial of Brian Kernighan, co-author with Dennis Ritchie of the book "The C Programming Language", so it's no coincidence that awk has similarities to (and many of the powers of) C, and with C being very much a mainstream language it makes awk's syntax look very familiar indeed to most programmers. In awk programs, you can even have arrays and user defined functions. Here's an example of awk in use - finding all lines with "Lake" in the 7th field of a data file, and reporting a number of the fields:
wizard:graham graham$ awk 'BEGIN{count=0;FS="\t";OFS="|";};
$7 ~ /Lake/{count++;print NR,count,$2,$3,$12,$7};
END{OFS=" ";print "Matched",count,"of",NR}' railstats.xyz
86|1|TLK|B94 5SE|10884|The Lakes
1026|2|LAK|IP27 9AD|536|Lakenheath
1238|3|OXN|LA9 7HG|350292|Oxenholme Lake District
1792|4|LKE|PO36 8PJ|67162|Lake
Matched 4 of 2539
wizard:graham graham$
Awk programs can be stored in files so that you don't have to type them onto the command line every time. But they are inherently based on line by line processing, and lack mainstream facilities to pull in central library code that you want to use in lots of different scripts.
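To illustrate a stored awk program that uses an array and a user defined function, here is a sketch with made-up tab-separated data. The regions, the numbers, the label function and the 100 threshold are all hypothetical, chosen only to keep the example short:

```shell
# Made-up tab-separated records: station code, region, passenger count
data=$(printf 'TLK\tMidlands\t100\nLAK\tEast\t50\nOXN\tNorth\t300\nLKE\tEast\t25\n')

# Store the awk program in a file so it can be reused with awk -f
prog=$(mktemp "${TMPDIR:-/tmp}/regions.XXXXXX")
cat > "$prog" <<'EOF'
# User defined function (hypothetical): describe a total as busy or quiet
function label(n) { return (n >= 100) ? "busy" : "quiet" }
BEGIN { FS = "\t" }
{ total[$2] += $3 }     # associative array keyed on the region field
END {
    # Use a fixed list of keys, because "for (r in total)" visits
    # the array in an unspecified order
    n = split("East Midlands North", regions, " ")
    for (i = 1; i <= n; i++)
        print regions[i], total[regions[i]], label(total[regions[i]])
}
EOF

summary=$(printf '%s\n' "$data" | awk -f "$prog")
printf '%s\n' "$summary"
rm -f "$prog"
```

The program still sees its input one line at a time; the array is what lets it carry totals across lines until the END block.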
That's where you move on to ...
Perl. The "Practical Extraction and Report Language". Perl's a very feature-rich programming language indeed, with a very easy to run interface - in other words, you can simply put your script into a file and say "go run this Perl program" without worrying about compile and load cycles or anything like that. You can put your Perl program on the command line (-e option), but you'll rarely do so. Code is very commonly shared between Perl scripts / Perl programs via use statements, with a very complete library structure allowing access to common code that's distributed with Perl, your own code that you want to share between your Perl programs, and a central resource (the CPAN) of code modules that other people have written and shared. However, Perl is so feature rich that it's not easy to learn, and it's often very hard to read code that's been written by others - especially by others for whom maintainable programming isn't a passion. You can do just about anything in computing / data terms with Perl. But if you want to work in a team, with each member maintaining the others' code, and / or on longer term projects on the same data, you may want to go for something that's "object oriented" through and through.
That's where you move on to ...
Python. The "Advanced Unix Tools" course had only an appendix in the notes on Python, and that's fair enough because learning Python requires a complete course; we've moved so far from grep after all. Python's a superb data tool, and much more. If you're working with the same data but manipulating it in different ways, or if you're working in a team, then it should be a very serious candidate to be your "tool of choice". You'll notice that this paragraph is not ending with "that's where you move on". You don't move on - for the most complex of data manipulation tasks, Python is my tool of choice.
(written 2012-10-22, updated 2012-10-23)