Home Accessibility Courses Diary The Mouth Forum Resources Site Map About Us Contact
 
This week, we're updating our course layouts and descriptions. Presentation and materials always gently change over time, but just occasionally there's a need to make a step change to clear out some of the old and roll in the new. That's now happening - but over a long and complex site it's not instant and you'll see sections of the site changing up to and including 19th September.

See also [here] for status update
 
Analysing incoming data lines

Multiple attibute user data

So often, data is read into a program line by line or record by record into a text string - I've put this in the general area of our solutions centre as it applies equally to Perl, PHP, Python, Tcl/Tk and Java even though the names of the functions / methods used to break it down into attributes differ. The principle's the same.

Here are some rules I try to apply when sorting out data, and some techniques I use.

RULE 1 - SEPARATE OUT THE ATTRIBUTES EARLY

Data is formatted into a string for storage which arteficially adds syntax, characters and properties which are part of the presentation and not part of the data. You don't want to be handling all these extras through your program, so as a general rule strip them out ans transfer your data into an approrpiate memory structure fully and early as you've read it. That may be a list, it may be a hash, it may be an object, or it may be a composite stucture - choose something appropriate but convert once and early!

RULE 2 - USE A NAMED BLOCK OF CODE TO SEPARATE OUT THE ATTRIBUTES

Your data may be received in format "x" today but you might want to handle it from format "y" tomorrow. Write your string-to-internal-structure convertor in the form of a named block of code (a sub, method, proc, function depending on the language you're using). That makes for really easy changes and maintainance later on.

RULE 3 - USE REVERSE ENGINEERING TO SEPARATE OUT THE DATA

With more complex data, it may not be trivial to convert from a string. I'll list some techniques below, but overall you're best advised to look at the data and how it was put together and then reverse that process to extract it into your internal format. That makes for more robust results and fewer special cases that you need to look at. Elsewhere, I've described this as "type 1", "type 2" and "type 3" data handling; "Type 1" is doing enough to handle the specific data set you have. "Type 2" handles most cases but will get caught by some special cases. "Type 3" takes whatever the user throws at it, and is acheived by reversing the procedure that's use to create the data strings in the first place.

For web based applications, or for other where user input is involved, you should strive for "type 3" data handling. It should be much more secure against attack and it should get it right for correct inputs every time; think of the saving that'll make in irritating support / change requests if you can achieve it.

TECHNIQUE 1 - SPLITTING AT A KNOWN DELIMITER

If an input line has the same (or same pattern) delimiter throughout, your best technique for data extraction will be a split (or whatever it's called in the language that you're using). With a split function, you specify a delimiter character or pattern (regular expression) that's used as a separator and you burst the line into a list (array) of fields based on the separator. Common separators are
  A single space
  A comma (perhaps with spaces too)
  One or more white space characters (tab or spaces in any combo)
  A single tab (often an Excel saved file)

harry PHP Python Java
ivan Ruby Java Perl Tcl/Tk MySQL
john PHP XML Java Perl
ken Tcl/Tk Python Java Perl

URL encoded strings should be decoded using a double splitting technique - firstly split at & then each element you've got back at =.

TECHNIQUE 2 - REGULAR EXPRESSION EXTRACTION

Where an input line is a composite, then a regular expression extraction may be most approrpriate. Look at this data (from a web server log file):

flipper - - [22/Jul/2004:06:08:17 -0400] "GET /wedding/893525782_swanage.jpg HTTP/1.0" 404 -
magnet - - [22/Jul/2004:06:08:19 -0400] "GET /wedding/889799488_oxford.jpg HTTP/1.0" 404 -
aviemore - - [22/Jul/2004:06:08:20 -0400] "GET /wedding/index.html HTTP/1.0" 200 765
seal - - [22/Jul/2004:06:09:03 -0400] "GET /wedding/index.html HTTP/1.0" 200 765
walrus - - [22/Jul/2004:06:09:05 -0400] "GET /wedding/888142130_doniford.jpg HTTP/1.0" 404 -
flipper - - [22/Jul/2004:06:09:06 -0400] "GET /wedding/1.html HTTP/1.0" 200 504

At first, it looks like you can split at white space but it's not that simple; there's an illogical white space in the square bracketed area which really should be one parameter, and the number of spaces in the request area can vary sometimes. If you've got an extended log file format, then there's more awkward spaces too:

h5n5c1o254.bredband.skanova.com - - [31/Aug/2003:00:56:40 -0700] "GET /net/pricepolicy.html HTTP/1.1" 200 6647 "http://search.yahoo.com/search?p=How+to+calculate+your+pricing+policy&sub=Search&ei=UTF-8&fr=fp-top&b=21" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

Best approach here? A series of regular expression matches and / or splits to break the line down into the required components.

TECHNIQUE 3 - PARSING

Where data towards the end of your record depends on what's found earlier on in that record, you'll be best to use a parsing technique which adapts the data handling as it rusn to what it has already found. The classic example of an type of input that needs parsing is a computer programming language, but the technique also applies (and over multiple lines) to technologies like XML.

TECHNIQUE 4 - CSV

Data sometimes comes with a separator that may also occur within an individual field, and then the fields may be surrounded by an optional "protection" character to indicate where a character that looks like a delimiter really isn't.

This is normally known as a CSV (comma separated variable) format, but you'll find such data files written with tabs and spaces too. Here's a file example

"Hemlock, Western",N America,pale brown,500,4,n,"construction, joinery"
"Larch, Japanese",Europe,reddish brown,560,3,n,"stakes, construction"
Parana Pine,S America,golden brown and red streaks,550,4,y,"interior joinery,plywood"
"Pine, Corsican", Europe,light yellow-brown,510,4,n,"joinery, construction"
"Pine, maritime",Europe, pale brown to yellow,510,3,n,"pallets,packaging"
"Pine, pitch",South USA,yellow-brown to red-brown,670,3,n,heavy construction

Handling of CSV files can be a bit messy; in PHP there are special built in routines to help you, in Perl look on the CPAN or "roll your own". It gets all the more complex if you need to support if the protection character you're using can occur within the protected strings ....

TESTING

Test, test and test again ;-)

a) Simple cases

To check the mechanism try some simple cases

b) gentle data

A few lines of data with nothing "nasty" in there, graduating through to lines that include things that might be a bit awkward - alternative but allowed delimers and data errors for example.

c) Special / awkward cases

Especially if your input is public facing / provided look for all the special cases and awkward cusses you can see and try them out. At best, every last one should work. At worst, errors should be "safe" and not cause any security problems. Awkward inputs might include special cases with missing fields, strings that include back and forward quotes (beware injection attacks) and inputs that might be use internally to make up file names that the user could specify as ../../../windows/cmd.exe or similar!

d) Whole big files of data

Throw in whole big files of data - whatever you've got there from the past - and see that they don't find anything unexpected; whilst you cannot check every single rerord read manually, you could use something like a we command or a grep command at opertaing system level to generate you some statistics about your inputs then ensure that the same thing applies to the data that you've got within your program


See also Perl for larger projects

Please note that articles in this section of our web site were current and correct to the best of our ability when published, but by the nature of our business may go out of date quite quickly. The quoting of a price, contract term or any other information in this area of our website is NOT an offer to supply now on those terms - please check back via our main web site

Related Material

Perl - More on Character Strings
  [3927] First match or all matches? Perl Regular Expressions - (2012-11-19)
  [3707] Converting codons via Amino Acids to Proteins in Perl - (2012-04-25)
  [3650] Possessive Regular Expression Matching - Perl, Objective C and some other languages - (2012-03-12)
  [3630] Serialsing and unserialising data for storage and transfer in Perl - (2012-02-28)
  [3546] The difference between dot (a.k.a. full stop, period) and comma in Perl - (2011-12-09)
  [3411] Single and double quotes strings in Perl - what is the difference? - (2011-08-30)
  [3332] DNA to Amino Acid - a sample Perl script - (2011-06-24)
  [3322] How much has Perl (and other languages) changed? - (2011-06-10)
  [3100] Looking ahead and behind in Regular Expressions - double matching - (2010-12-23)
  [3059] Object Orientation in an hour and other Perl Lectures - (2010-11-18)
  [2993] Arrays v Lists - what is the difference, why use one or the other - (2010-10-10)
  [2877] Further more advanced Perl examples - (2010-07-19)
  [2874] Unpacking a Perl string into a list - (2010-07-16)
  [2834] Teaching examples in Perl - third and final part - (2010-06-27)
  [2801] Binary data handling with unpack in Perl - (2010-06-10)
  [2657] Want to do a big batch edit? Nothing beats Perl! - (2010-03-01)
  [2379] Making variables persistant, pretending a database is a variable and other Perl tricks - (2009-08-27)
  [2230] Running a piece of code is like drinking a pint of beer - (2009-06-11)
  [1947] Perl substitute - the e modifier - (2008-12-16)
  [1735] Finding words and work boundaries (MySQL, Perl, PHP) - (2008-08-03)
  [1727] Equality and looks like tests - Perl - (2008-07-29)
  [1510] Handling Binary data (.gif file example) in Perl - (2008-01-17)
  [1336] Ignore case in Regular Expression - (2007-09-08)
  [1305] Regular expressions made easy - building from components - (2007-08-16)
  [1251] Substitute operator / modifiers in Perl - (2007-06-28)
  [1230] Commenting a Perl Regular Expression - (2007-06-12)
  [1222] Perl, the substitute operator s - (2007-06-08)
  [943] Matching within multiline strings, and ignoring case in regular expressions - (2006-11-25)
  [928] C++ and Perl - why did they do it THAT way? - (2006-11-16)
  [737] Coloured text in a terminal from Perl - (2006-05-29)
  [608] Don't expose your regular expressions - (2006-02-15)
  [597] Storing a regular expression in a perl variable - (2006-02-09)
  [586] Perl Regular Expressions - finding the position and length of the match - (2006-02-02)
  [583] Remember to process blank lines - (2006-01-31)
  [453] Commenting Perl regular expressions - (2005-09-30)

String Handling in PHP
  [4072] Splitting the difference with PHP - (2013-04-27)
  [4071] Setting up strings in PHP - (2013-04-27)
  [3790] Solution looking for a problem? Lookahead and Lookbehind - (2012-06-30)
  [3789] More than just matching with a regular expression in PHP - (2012-06-30)
  [3788] Getting more than a yes / no answer from a regular expression pattern match - (2012-06-30)
  [3534] Learning to program in PHP - Regular Expression and Associative Array examples - (2011-12-01)
  [3516] Regular Expression modifiers in PHP - summary table - (2011-11-12)
  [3515] PHP - moving from ereg to preg for regular expressions - (2011-11-11)
  [3424] Divide 10000 by 17. Do you get 588.235294117647, 588.24 or 588? - Ruby and PHP - (2011-09-08)
  [3020] Handling (expanding) tabs in PHP - (2010-10-29)
  [2629] Curly braces within double quoted strings in PHP - (2010-02-09)
  [2238] Handling nasty characters - Perl, PHP, Python, Tcl, Lua - (2009-06-14)
  [2165] Making Regular Expressions easy to read and maintain - (2009-05-10)
  [2046] Finding variations on a surname - (2009-02-17)
  [1799] Regular Expressions in PHP - (2008-09-16)
  [1613] Regular expression for 6 digits OR 25 digits - (2008-04-16)
  [1603] Do not SHOUT and do not whisper - (2008-04-06)
  [1533] Short and sweet and sticky - PHP form input - (2008-02-06)
  [1372] A taster PHP expression ... - (2007-09-30)
  [1336] Ignore case in Regular Expression - (2007-09-08)
  [1195] Regular Express Primer - (2007-05-20)
  [1058] PHP Regular expression to extrtact link and text - (2007-01-31)
  [1008] Date conversion - PHP - (2006-12-26)
  [728] Looking ahead and behind in a Regular Expression - (2006-05-22)
  [716] Evaluating arithmetic expressions in configuration files - (2006-05-10)
  [642] How similar are two words - (2006-03-11)
  [608] Don't expose your regular expressions - (2006-02-15)
  [589] Robust PHP user inputs - (2006-02-03)
  [574] PHP - dividing a string up into pieces - (2006-01-23)
  [560] The fencepost problem - (2006-01-10)
  [558] Converting between acres and hectares - (2006-01-08)
  [493] Running a Perl script within a PHP page - (2005-11-12)
  [463] Splitting the difference - (2005-10-13)
  [422] PHP Magic Quotes - (2005-08-22)
  [337] the array returned by preg_match_all - (2005-06-06)
  [54] PHP and natural sorting - (2004-09-19)
  [31] Here documents - (2004-08-28)

Python - String Handling
  [4213] Formatting options in Python - (2013-11-16)
  [4152] Why are bus fares so high? - (2013-08-18)
  [4027] Collections in Python - list tuple dict and string. - (2013-03-04)
  [3886] Formatting output - why we need to, and first Python example - (2012-10-09)
  [3796] Backquote, backtic, str and repr in Python - conversion object to string - (2012-07-05)
  [3469] Teaching dilemma - old tricks and techniques, or recent enhancements? - (2011-10-08)
  [3468] Python string formatting - the move from % to str.format - (2011-10-08)
  [3349] Formatting output in Python through str.format - (2011-07-07)
  [3218] Matching a license plate or product code - Regular Expressions - (2011-03-28)
  [3090] Matching to a string - what if it matches in many possible ways? - (2010-12-17)
  [2814] Python - splitting and joining strings - (2010-06-16)
  [2780] Formatted Printing in Python - (2010-05-25)
  [2765] Running operating system commands from your Python program - (2010-05-14)
  [2721] Regular Expressions in Python - (2010-04-14)
  [2692] Flexible search and replace in Python - (2010-03-25)
  [2406] Pound Sign in Python Program - (2009-09-15)
  [2284] Strings as collections in Python - (2009-07-12)
  [1876] Python Regular Expressions - (2008-11-08)
  [1608] Underlining in Perl and Python - the x and * operator in use - (2008-04-12)
  [1517] Python - formatting objects - (2008-01-24)
  [1195] Regular Express Primer - (2007-05-20)
  [1110] Python - two different splits - (2007-03-15)
  [970] String duplication - x in Perl, * in Python and Ruby - (2006-12-07)
  [954] Splitting Pythons in Bradford - (2006-11-29)
  [943] Matching within multiline strings, and ignoring case in regular expressions - (2006-11-25)
  [903] Pieces of Python - (2006-10-23)
  [773] Breaking bread - (2006-06-22)
  [560] The fencepost problem - (2006-01-10)
  [496] Python printf - (2005-11-15)
  [463] Splitting the difference - (2005-10-13)
  [324] The backtick operator in Python and Perl - (2005-05-25)

Tcl/Tk - String Handling in Tcl
  [4205] Regular Expression Substitution - Tcl - (2013-11-12)
  [3576] Tcl - apparently odd behaviour of string trimleft - (2012-01-13)
  [3285] Extracting data from a string / line from file - Tcl - (2011-05-10)
  [3192] Tcl - Some example of HOW TO in handling data files and formats - (2011-03-04)
  [2472] split and join in tcl and expect - (2009-10-23)
  [1601] Replacing the last comma with an and - (2008-04-04)
  [1410] Tcl / regsub - changing a string and using interesting bits - (2007-10-27)
  [1403] Square Bracket protection in Tcl - (2007-10-23)
  [943] Matching within multiline strings, and ignoring case in regular expressions - (2006-11-25)
  [779] The fragility of pancakes - and better structures - (2006-06-26)
  [404] How to check that a string contains a number in Tcl - (2005-08-06)

Java - Strings
  [3048] String handling - from first steps to practical examples - (2010-11-13)
  [2649] Length, size or capacity in Java? - (2010-02-24)
  [1557] Trying out our Java examples on our web site - (2008-02-27)
  [1446] An answer to a student asking 'Help' - (2007-11-27)

resource index - Deployment
Solutions centre home page

You'll find shorter technical items at The Horse's Mouth and delegate's questions answered at the Opentalk forum.

At Well House Consultants, we provide training courses on subjects such as Ruby, Lua, Perl, Python, Linux, C, C++, Tcl/Tk, Tomcat, PHP and MySQL. We're asked (and answer) many questions, and answers to those which are of general interest are published in this area of our site.

You can Add a comment or ranking to this page

© WELL HOUSE CONSULTANTS LTD., 2014: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/solutions/general- ... lines.html • PAGE BUILT: Wed Mar 28 07:47:11 2012 • BUILD SYSTEM: wizard