Analysing incoming data lines

For 2023 (and 2024 ...) - we are now fully retired from IT training.
We have made many, many friends over 25 years of teaching about Python, Tcl, Perl, PHP, Lua, Java, C and C++ - and MySQL, Linux and Solaris/SunOS too. Our training notes are now very much out of date, but due to upward compatability most of our examples remain operational and even relevant ad you are welcome to make us if them "as seen" and at your own risk.

Lisa and I (Graham) now live in what was our training centre in Melksham - happy to meet with former delegates here - but do check ahead before coming round. We are far from inactive - rather, enjoying the times that we are retired but still healthy enough in mind and body to be active!

I am also active in many other area and still look after a lot of web sites - you can find an index ((here))

Multiple attibute user data

So often, data is read into a program line by line or record by record into a text string - I've put this in the general area of our solutions centre as it applies equally to Perl, PHP, Python, Tcl/Tk and Java even though the names of the functions / methods used to break it down into attributes differ. The principle's the same.

Here are some rules I try to apply when sorting out data, and some techniques I use.

RULE 1 - SEPARATE OUT THE ATTRIBUTES EARLY

Data is formatted into a string for storage which arteficially adds syntax, characters and properties which are part of the presentation and not part of the data. You don't want to be handling all these extras through your program, so as a general rule strip them out ans transfer your data into an approrpiate memory structure fully and early as you've read it. That may be a list, it may be a hash, it may be an object, or it may be a composite stucture - choose something appropriate but convert once and early!

RULE 2 - USE A NAMED BLOCK OF CODE TO SEPARATE OUT THE ATTRIBUTES

Your data may be received in format "x" today but you might want to handle it from format "y" tomorrow. Write your string-to-internal-structure convertor in the form of a named block of code (a sub, method, proc, function depending on the language you're using). That makes for really easy changes and maintainance later on.

RULE 3 - USE REVERSE ENGINEERING TO SEPARATE OUT THE DATA

With more complex data, it may not be trivial to convert from a string. I'll list some techniques below, but overall you're best advised to look at the data and how it was put together and then reverse that process to extract it into your internal format. That makes for more robust results and fewer special cases that you need to look at. Elsewhere, I've described this as "type 1", "type 2" and "type 3" data handling; "Type 1" is doing enough to handle the specific data set you have. "Type 2" handles most cases but will get caught by some special cases. "Type 3" takes whatever the user throws at it, and is acheived by reversing the procedure that's use to create the data strings in the first place.

For web based applications, or for other where user input is involved, you should strive for "type 3" data handling. It should be much more secure against attack and it should get it right for correct inputs every time; think of the saving that'll make in irritating support / change requests if you can achieve it.

TECHNIQUE 1 - SPLITTING AT A KNOWN DELIMITER

If an input line has the same (or same pattern) delimiter throughout, your best technique for data extraction will be a split (or whatever it's called in the language that you're using). With a split function, you specify a delimiter character or pattern (regular expression) that's used as a separator and you burst the line into a list (array) of fields based on the separator. Common separators are
  A single space
  A comma (perhaps with spaces too)
  One or more white space characters (tab or spaces in any combo)
  A single tab (often an Excel saved file)

harry PHP Python Java
ivan Ruby Java Perl Tcl/Tk MySQL
john PHP XML Java Perl
ken Tcl/Tk Python Java Perl

URL encoded strings should be decoded using a double splitting technique - firstly split at & then each element you've got back at =.

TECHNIQUE 2 - REGULAR EXPRESSION EXTRACTION

Where an input line is a composite, then a regular expression extraction may be most approrpriate. Look at this data (from a web server log file):

flipper - - [22/Jul/2004:06:08:17 -0400] "GET /wedding/893525782_swanage.jpg HTTP/1.0" 404 -
magnet - - [22/Jul/2004:06:08:19 -0400] "GET /wedding/889799488_oxford.jpg HTTP/1.0" 404 -
aviemore - - [22/Jul/2004:06:08:20 -0400] "GET /wedding/index.html HTTP/1.0" 200 765
seal - - [22/Jul/2004:06:09:03 -0400] "GET /wedding/index.html HTTP/1.0" 200 765
walrus - - [22/Jul/2004:06:09:05 -0400] "GET /wedding/888142130_doniford.jpg HTTP/1.0" 404 -
flipper - - [22/Jul/2004:06:09:06 -0400] "GET /wedding/1.html HTTP/1.0" 200 504

At first, it looks like you can split at white space but it's not that simple; there's an illogical white space in the square bracketed area which really should be one parameter, and the number of spaces in the request area can vary sometimes. If you've got an extended log file format, then there's more awkward spaces too:

h5n5c1o254.bredband.skanova.com - - [31/Aug/2003:00:56:40 -0700] "GET /net/pricepolicy.html HTTP/1.1" 200 6647 "http://search.yahoo.com/search?p=How+to+calculate+your+pricing+policy&sub=Search&ei=UTF-8&fr=fp-top&b=21" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

Best approach here? A series of regular expression matches and / or splits to break the line down into the required components.

TECHNIQUE 3 - PARSING

Where data towards the end of your record depends on what's found earlier on in that record, you'll be best to use a parsing technique which adapts the data handling as it rusn to what it has already found. The classic example of an type of input that needs parsing is a computer programming language, but the technique also applies (and over multiple lines) to technologies like XML.

TECHNIQUE 4 - CSV

Data sometimes comes with a separator that may also occur within an individual field, and then the fields may be surrounded by an optional "protection" character to indicate where a character that looks like a delimiter really isn't.

This is normally known as a CSV (comma separated variable) format, but you'll find such data files written with tabs and spaces too. Here's a file example

"Hemlock, Western",N America,pale brown,500,4,n,"construction, joinery"
"Larch, Japanese",Europe,reddish brown,560,3,n,"stakes, construction"
Parana Pine,S America,golden brown and red streaks,550,4,y,"interior joinery,plywood"
"Pine, Corsican", Europe,light yellow-brown,510,4,n,"joinery, construction"
"Pine, maritime",Europe, pale brown to yellow,510,3,n,"pallets,packaging"
"Pine, pitch",South USA,yellow-brown to red-brown,670,3,n,heavy construction

Handling of CSV files can be a bit messy; in PHP there are special built in routines to help you, in Perl look on the CPAN or "roll your own". It gets all the more complex if you need to support if the protection character you're using can occur within the protected strings ....

TESTING

Test, test and test again ;-)

a) Simple cases

To check the mechanism try some simple cases

b) gentle data

A few lines of data with nothing "nasty" in there, graduating through to lines that include things that might be a bit awkward - alternative but allowed delimers and data errors for example.

c) Special / awkward cases

Especially if your input is public facing / provided look for all the special cases and awkward cusses you can see and try them out. At best, every last one should work. At worst, errors should be "safe" and not cause any security problems. Awkward inputs might include special cases with missing fields, strings that include back and forward quotes (beware injection attacks) and inputs that might be use internally to make up file names that the user could specify as ../../../windows/cmd.exe or similar!

d) Whole big files of data

Throw in whole big files of data - whatever you've got there from the past - and see that they don't find anything unexpected; whilst you cannot check every single rerord read manually, you could use something like a we command or a grep command at opertaing system level to generate you some statistics about your inputs then ensure that the same thing applies to the data that you've got within your program

See also Perl for larger projects

Please note that articles in this section of our web site were current and correct to the best of our ability when published, but by the nature of our business may go out of date quite quickly. The quoting of a price, contract term or any other information in this area of our website is NOT an offer to supply now on those terms - please check back via our main web site

Related Material

Perl - More on Character Strings
  [453] - ()
  [583] - ()
  [586] - ()
  [597] - ()
  [608] - ()
  [737] - ()
  [928] - ()
  [943] - ()
  [1222] - ()
  [1230] - ()
  [1251] - ()
  [1305] - ()
  [1336] - ()
  [1510] - ()
  [1727] - ()
  [1735] - ()
  [1947] - ()
  [2230] - ()
  [2379] - ()
  [2657] - ()
  [2801] - ()
  [2834] - ()
  [2874] - ()
  [2877] - ()
  [2993] - ()
  [3059] - ()
  [3100] - ()
  [3322] - ()
  [3332] - ()
  [3411] - ()
  [3546] - ()
  [3630] - ()
  [3650] - ()
  [3707] - ()
  [3927] - ()
  [4452] - ()

String Handling in PHP
  [31] - ()
  [54] - ()
  [337] - ()
  [422] - ()
  [463] - ()
  [493] - ()
  [558] - ()
  [560] - ()
  [574] - ()
  [589] - ()
  [608] - ()
  [642] - ()
  [716] - ()
  [728] - ()
  [1008] - ()
  [1058] - ()
  [1195] - ()
  [1336] - ()
  [1372] - ()
  [1533] - ()
  [1603] - ()
  [1613] - ()
  [1799] - ()
  [2046] - ()
  [2165] - ()
  [2238] - ()
  [2629] - ()
  [3020] - ()
  [3424] - ()
  [3515] - ()
  [3516] - ()
  [3534] - ()
  [3788] - ()
  [3789] - ()
  [3790] - ()
  [4071] - ()
  [4072] - ()

Python - String Handling
  [324] - ()
  [463] - ()
  [496] - ()
  [560] - ()
  [773] - ()
  [903] - ()
  [943] - ()
  [954] - ()
  [970] - ()
  [1110] - ()
  [1195] - ()
  [1517] - ()
  [1608] - ()
  [1876] - ()
  [2284] - ()
  [2406] - ()
  [2692] - ()
  [2721] - ()
  [2765] - ()
  [2780] - ()
  [2814] - ()
  [3090] - ()
  [3218] - ()
  [3349] - ()
  [3468] - ()
  [3469] - ()
  [3796] - ()
  [3886] - ()
  [4027] - ()
  [4152] - ()
  [4213] - ()
  [4307] - ()
  [4360] - ()
  [4593] - ()
  [4595] - ()
  [4659] - ()

Tcl/Tk - String Handling in Tcl
  [404] - ()
  [779] - ()
  [943] - ()
  [1403] - ()
  [1410] - ()
  [1601] - ()
  [2472] - ()
  [3192] - ()
  [3285] - ()
  [3576] - ()
  [4205] - ()
  [4454] - ()
  [4455] - ()

Java - Strings
  [1446] - ()
  [1557] - ()
  [2649] - ()
  [3048] - ()
  [4393] - ()
  [4414] - ()

resource index - Deployment
Solutions centre home page

You'll find shorter technical items at The Horse's Mouth and delegate's questions answered at the Opentalk forum.

At Well House Consultants, we provide training courses on subjects such as Ruby, Lua, Perl, Python, Linux, C, C++, Tcl/Tk, Tomcat, PHP and MySQL. We're asked (and answer) many questions, and answers to those which are of general interest are published in this area of our site.

You can Add a comment or ranking to this page

Public Training Courses

Running regularly at our UK training Centre.
[Schedule] - [About] - [Book]

Deployment solutions

What are NIS, LDAP, DNS and BIND

Firewall fundamentals - Linux

Multiple domains on a single web server

The Author of the Apache httpd and Tomcat Course

Running CGI scripts in Apache Tomcat

Running SSI (Server Side Include) scripts in Apache Tomcat

awk - pattern scanning and formatting

Linux and httpd file system layout - LAMP server

Using current exchange rates on a web page

Open Source - here to stay, or a flash in the pan?

Web Forensics

Handling XML in Java

Overview of Ant - a Java based build tool

Korn Shell - Basics

Korn Shell - History, storing to file, commands

Korn Shell Scripting - an introduction

Web applications using LAMP technologies

A Web interface for your Linux admin tasks

Windows clients to a Linux Server

Tiny guide to vi basics

Using LWP to write Web Clients

Load balancing and robustness - httpd, tomcat, PHP, Servlets, MySQL

Making all your .html files run a Perl script

New to programming. Portable code. Perl or Java?

Log file analysis - short Python example

Why Partition a disc drive?

The proxy jServ, jk, jk2, warp and jk story

Load Balancing and Clustering

Windows - running Apache and MySQL from a CD

Linux and XP - loading a dual boot system

Solution Centre - all article listing

© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/solutions/general- ... lines.html • PAGE BUILT: Wed Mar 28 07:47:11 2012 • BUILD SYSTEM: wizard