HOW TO HANDLE ANYTHING THE USER THROWS AT YOU
Read a form from your web site, and you're giving your user the ability to enter practically any text that he or she chooses. You'll get correct answers and incorrect ones, short entries and long ones, ones that comprise just letters and digits and others that include all the colours of the special character rainbow - from < and " through # and & to pound signs.
The sort of form you might be providing would be:
<form action=/cgi-bin/demo/echodemo.pl>
Please enter your name <input name=person><br>
and your email address<input name=email><br>
and your comment <textarea name=says rows=10 cols=40>
</textarea><br>
and then <input type=submit>
</form>
The input data from this form will appear in the QUERY_STRING environment variable in your program (I'm using Perl for this example, since a question about Perl prompted this article), subject to two conditions:
a) There's a limit on the amount of data that can be passed in this way and held in an environment variable - we used to say 1K but it may be up to 2K on some systems. Long entries will therefore be truncated a little short of the limit (allowing for the rest of the URL).
b) The input string is "URL encoded" as it's passed back to your program, so that any special characters the user enters such as & = + (and quite a few others) are replaced by a three character sequence that comprises a % character followed by two hex digits to represent the character. Spaces are replaced by + characters.
Here's a program to catch the inputs from the form above ... and you'll find as predicted that it "cuts off" around the 2k of text mark, and encodes a wide range of special characters.
#!/usr/bin/perl
# Echo the raw (still encoded) form data back to the browser
print "Content-type: text/html\n\n";
print <<"HEADER";
<html>
<head><title>Echo demo - for Opentalk answer</title></head>
<body>
<h2>Echo test - demonstration of form entry</h2>
HEADER
print "Raw echo gives ",$ENV{QUERY_STRING};
print <<"FOOTER";
<hr>
Example by <a href=http://www.wellho.net>Well House Consultants</a>
</body>
</html>
FOOTER
I entered
Graham Ellis
graham@wellho.net
Why *does* this give a problem?
and I got the response
Raw echo gives person=Graham+Ellis&email=graham%40wellho.net&says=Why+*does*+this+give+a+problem%3F
The encoding is done automatically by the browser and it's a necessary evil - since the browser sends character strings back, some form of delimiter is needed and the & and = characters were chosen in the early days of the web ... thus for those two characters (at least) some form of encoding had to be devised.
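To see where a response like the one above comes from, here is a rough sketch of the encoding step in Perl. The url_encode name is my own, invented for illustration, and the list of characters left untouched is an assumption; real browsers differ a little in which characters they leave alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original example): encode one
# form value roughly the way a browser does for a form submission.
sub url_encode {
    my ($text) = @_;
    # Replace everything except letters, digits, spaces and a few safe
    # punctuation characters with % followed by two hex digits.
    $text =~ s/([^A-Za-z0-9 *._-])/sprintf("%%%02X", ord($1))/eg;
    # Spaces are sent as + signs.
    $text =~ tr/ /+/;
    return $text;
}

print url_encode('graham@wellho.net'), "\n";
print url_encode('Why *does* this give a problem?'), "\n";
```

Run from the command line, this prints graham%40wellho.net and Why+*does*+this+give+a+problem%3F, which matches the email and says values in the response shown earlier.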
DECODING AN ENCODED QUERY_STRING
Scheme ...
* Split the incoming string at & characters (to give a series of name / value pairs)
* Split each name / value pair at the = character
* Replace all + characters in the value with spaces
* Replace each three character sequence starting with % with the special character it represents
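The order of the last two steps matters: a + that the user really typed arrives as %2B, so the + characters must be turned into spaces before the % sequences are decoded. A minimal sketch of the four steps follows; the decode_value name and the sample query string are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper for illustration: decode a single encoded value.
sub decode_value {
    my ($value) = @_;
    $value =~ tr/+/ /;                               # step 3: + becomes a space
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # step 4: %XX becomes one character
    return $value;
}

# Steps 1 and 2: split into pairs at &, then each pair at the first =.
my $query = 'person=Graham+Ellis&sum=1%2B1+%3D+2';
foreach my $pair (split /&/, $query) {
    my ($name, $value) = split /=/, $pair, 2;
    print "$name: ", decode_value($value), "\n";
}
```

This prints "person: Graham Ellis" and "sum: 1+1 = 2". Doing the two replacements the other way round would turn the user's + into a space.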
REMOVING THE 2K LIMIT
As well as the default (GET) method, forms can be submitted to a server using the POST method ... you simply add method=POST as an attribute in your opening form tag ... then modify your receiving script
* To check which input method was used
* and to read from STDIN if it was the POST method
If you're not using a form (i.e. the data forms a part of the URL after a ? character), you are stuck with the GET limit, I'm afraid. Also, you can't bookmark or provide an "href" type link to a completed form with the POST method.
SAMPLE CODE
Here's a sample piece of code that deals with the decoding and with the GET and POST methods as described above ... it's needed in virtually every CGI script you'll write, and it's best provided as a separate sub that you load into all your applications from a standard file through use or require.
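sub collect_form {
    my (%form, $buffer);
    if ($ENV{"REQUEST_METHOD"} eq "POST") {
        # POST: the encoded data arrives on standard input
        read(STDIN, $buffer, $ENV{"CONTENT_LENGTH"});
        $form{"method"} = "POST";
    } else {
        # GET: the encoded data is in the environment
        $buffer = $ENV{"QUERY_STRING"};
        $form{"method"} = "GET";
    }
    foreach my $field (split(/&/, $buffer)) {
        my ($name, $value) = split(/=/, $field, 2);
        $value = "" unless defined $value;
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9]{2})/pack("C",hex($1))/eg;
        if (defined $form{$name}) {
            # same field name seen before: join the values with newlines
            $form{$name} .= "\n$value";
        } else {
            $form{$name} = $value;
        }
    }
    return %form;
}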
Notes:
Technically, field names are also encoded so you might also want to apply the tr and s lines to the $name variable if you're being pedantic
It is possible for a form to be written with several fields of the same name - indeed, if you use the MULTIPLE attribute on a SELECT input, you are bound to get such an input. That's why the script example checks for the same element occurring several times.
We've added an extra response in the form hash to tell the caller of the routine which method was used ... it's a service he may want.
The tr and s lines look complex - but you can just copy those and not worry too much about how they work ;-)
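A caller can get the separate values of a repeated field back by splitting at the newline. The sketch below assumes a hash shaped like the one collect_form returns; the colour field and its values are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed result of a form with a multiple SELECT called "colour"
# (hypothetical data, standing in for the hash collect_form returns).
my %form = (method => "GET", colour => "red\ngreen\nblue");

# Each selected value was joined with a newline, so split them apart again.
my @colours = split /\n/, $form{colour};
print scalar(@colours), " colours chosen: @colours\n";
```

Only do this for fields you know can repeat: a textarea entry that contains line breaks will have newlines in its value too.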
FIVE STINGS IN THE TAIL
So you've got your information, correctly, into variables in your Perl program? Good, but you're not done with the nasty characters yet!
If you're saving your data into a database, you'll need to add in extra \ characters to protect characters such as quotes from the SQL interpreter (or use your database interface's placeholders, which do the job for you).
If you're echoing the information entered back to the user, you need to encode any characters that are special to HTML before you do so ... if you don't, a user who enters <h1> (for example) will have the rest of his echo page come back in headline size 1.
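A minimal sketch of that HTML encoding, covering the four characters that most often cause trouble. The html_escape name is my own for illustration; a tested library routine is usually the better choice in production.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: make user text safe to echo inside an HTML page.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first, or the & in the entities below gets encoded again
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print html_escape('<h1>Fish & "chips"'), "\n";
```

This prints &lt;h1&gt;Fish &amp; &quot;chips&quot; so the browser displays the user's text rather than treating it as markup.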
If you're going to save the input to a file of your own format, again you'll need to consider which delimiter your file uses, and how to ensure that the user doesn't cause a problem by including it in his text.
If you allow your user to specify a file name, you'll need to check that malicious users don't precede the name with ../../ (or something like that) to try and access files outside the file system area that you're intending that they use.
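One way to make that check is to accept only names built from a short list of known-safe characters, rather than trying to spot every dangerous pattern. This is a stricter approach than just looking for ../ and the safe_file_name name and the allowed character list are my assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the name is a plain file name, 0 otherwise.
sub safe_file_name {
    my ($name) = @_;
    return 0 unless defined $name;
    # Letters, digits, underscore, hyphen and dot only; no slashes at all,
    # and the name may not start with a dot or a hyphen.
    return $name =~ /\A[A-Za-z0-9_][A-Za-z0-9_.-]*\z/ ? 1 : 0;
}

foreach my $try ('notes.txt', '../../etc/passwd', '.htaccess') {
    print "$try: ", (safe_file_name($try) ? "accepted" : "rejected"), "\n";
}
```

Here notes.txt is accepted and the other two names are rejected, because a slash can never appear in an accepted name.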
If your user's entry forms part of an ongoing resource - for example if it's a contribution to a forum - you'll need to check it against your AUP (acceptable use policy). Remember that using just letters of the alphabet it's possible to write offensive text, incite violence and break the Official Secrets Act.