Multiple attibute user data
So often, data is read into a program line by line or record by record into a text string - I've put this in the general area of our solutions centre as it applies equally to Perl, PHP, Python, Tcl/Tk and Java even though the names of the functions / methods used to break it down into attributes differ. The principle's the same.
Here are some rules I try to apply when sorting out data, and some techniques I use.
RULE 1 - SEPARATE OUT THE ATTRIBUTES EARLY
Data is formatted into a string for storage which arteficially adds syntax, characters and properties which are part of the presentation and not part of the data. You don't want to be handling all these extras through your program, so as a general rule strip them out ans transfer your data into an approrpiate memory structure fully and early as you've read it. That may be a list, it may be a hash, it may be an object, or it may be a composite stucture - choose something appropriate but convert once and early!
RULE 2 - USE A NAMED BLOCK OF CODE TO SEPARATE OUT THE ATTRIBUTES
Your data may be received in format "x" today but you might want to handle it from format "y" tomorrow. Write your string-to-internal-structure convertor in the form of a named block of code (a sub, method, proc, function depending on the language you're using). That makes for really easy changes and maintainance later on.
RULE 3 - USE REVERSE ENGINEERING TO SEPARATE OUT THE DATA
With more complex data, it may not be trivial to convert from a string. I'll list some techniques below, but overall you're best advised to look at the data and how it was put together and then reverse that process to extract it into your internal format. That makes for more robust results and fewer special cases that you need to look at. Elsewhere, I've described this as "type 1", "type 2" and "type 3" data handling; "Type 1" is doing enough to handle the specific data set you have. "Type 2" handles most cases but will get caught by some special cases. "Type 3" takes whatever the user throws at it, and is acheived by reversing the procedure that's use to create the data strings in the first place.
For web based applications, or for other where user input is involved, you should strive for "type 3" data handling. It should be much more secure against attack and it should get it right for correct inputs every time; think of the saving that'll make in irritating support / change requests if you can achieve it.
TECHNIQUE 1 - SPLITTING AT A KNOWN DELIMITER
If an input line has the same (or same pattern) delimiter throughout, your best technique for data extraction will be a split (or whatever it's called in the language that you're using). With a split function, you specify a delimiter character or pattern (regular expression) that's used as a separator and you burst the line into a list (array) of fields based on the separator. Common separators are
A single space
A comma (perhaps with spaces too)
One or more white space characters (tab or spaces in any combo)
A single tab (often an Excel saved file)
harry PHP Python Java
ivan Ruby Java Perl Tcl/Tk MySQL
john PHP XML Java Perl
ken Tcl/Tk Python Java Perl
URL encoded strings should be decoded using a double splitting technique - firstly split at & then each element you've got back at =.
TECHNIQUE 2 - REGULAR EXPRESSION EXTRACTION
Where an input line is a composite, then a regular expression extraction may be most approrpriate. Look at this data (from a web server log file):
flipper - - [22/Jul/2004:06:08:17 -0400] "GET /wedding/893525782_swanage.jpg HTTP/1.0" 404 -
magnet - - [22/Jul/2004:06:08:19 -0400] "GET /wedding/889799488_oxford.jpg HTTP/1.0" 404 -
aviemore - - [22/Jul/2004:06:08:20 -0400] "GET /wedding/index.html HTTP/1.0" 200 765
seal - - [22/Jul/2004:06:09:03 -0400] "GET /wedding/index.html HTTP/1.0" 200 765
walrus - - [22/Jul/2004:06:09:05 -0400] "GET /wedding/888142130_doniford.jpg HTTP/1.0" 404 -
flipper - - [22/Jul/2004:06:09:06 -0400] "GET /wedding/1.html HTTP/1.0" 200 504
At first, it looks like you can split at white space but it's not that simple; there's an illogical white space in the square bracketed area which really should be one parameter, and the number of spaces in the request area can vary sometimes. If you've got an extended log file format, then there's more awkward spaces too:
h5n5c1o254.bredband.skanova.com - - [31/Aug/2003:00:56:40 -0700] "GET /net/pricepolicy.html HTTP/1.1" 200 6647 "http://search.yahoo.com/search?p=How+to+calculate+your+pricing+policy&sub=Search&ei=UTF-8&fr=fp-top&b=21" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
Best approach here? A series of regular expression matches and / or splits to break the line down into the required components.
TECHNIQUE 3 - PARSING
Where data towards the end of your record depends on what's found earlier on in that record, you'll be best to use a parsing technique which adapts the data handling as it rusn to what it has already found. The classic example of an type of input that needs parsing is a computer programming language, but the technique also applies (and over multiple lines) to technologies like XML.
TECHNIQUE 4 - CSV
Data sometimes comes with a separator that may also occur within an individual field, and then the fields may be surrounded by an optional "protection" character to indicate where a character that looks like a delimiter really isn't.
This is normally known as a CSV (comma separated variable) format, but you'll find such data files written with tabs and spaces too. Here's a file example
"Hemlock, Western",N America,pale brown,500,4,n,"construction, joinery"
"Larch, Japanese",Europe,reddish brown,560,3,n,"stakes, construction"
Parana Pine,S America,golden brown and red streaks,550,4,y,"interior joinery,plywood"
"Pine, Corsican", Europe,light yellow-brown,510,4,n,"joinery, construction"
"Pine, maritime",Europe, pale brown to yellow,510,3,n,"pallets,packaging"
"Pine, pitch",South USA,yellow-brown to red-brown,670,3,n,heavy construction
Handling of CSV files can be a bit messy; in PHP there are special built in routines to help you, in Perl look on the CPAN or "roll your own". It gets all the more complex if you need to support if the protection character you're using can occur within the protected strings ....
TESTING
Test, test and test again ;-)
a) Simple cases
To check the mechanism try some simple cases
b) gentle data
A few lines of data with nothing "nasty" in there, graduating through to lines that include things that might be a bit awkward - alternative but allowed delimers and data errors for example.
c) Special / awkward cases
Especially if your input is public facing / provided look for all the special cases and awkward cusses you can see and try them out. At best, every last one should work. At worst, errors should be "safe" and not cause any security problems. Awkward inputs might include special cases with missing fields, strings that include back and forward quotes (beware injection attacks) and inputs that might be use internally to make up file names that the user could specify as ../../../windows/cmd.exe or similar!
d) Whole big files of data
Throw in whole big files of data - whatever you've got there from the past - and see that they don't find anything unexpected; whilst you cannot check every single rerord read manually, you could use something like a we command or a grep command at opertaing system level to generate you some statistics about your inputs then ensure that the same thing applies to the data that you've got within your program
See also
Perl for larger projects