Nasty Characters in Web Applications


Read a form from your web site, and you're giving your user the ability to enter practically any text that he or she chooses. You'll get correct answers and incorrect ones, short entries and long ones, ones that comprise just letters and digits and others that include all the colours of the special character rainbow - from < and " through # and & to pound signs.

The sort of form you might be providing would be:

<form action=/cgi-bin/demo/echodemo.pl>
Please enter your name <input name=person><br>
and your email address <input name=email><br>
and your comment <textarea name=says rows=10 cols=40></textarea><br>
and then <input type=submit>
</form>

The input data from this form will appear in the QUERY_STRING environment variable in your program (and I'm going to use Perl for this example, as a question on Perl prompted the writing of this article) ... but subject to two constraints / conditions:

a) There's a limit - we used to say 1K but it may be up to 2K on some systems - on the amount of data that can be passed in this way and held in an environment variable. So long entries will be truncated a little short of the limit (allowing for the rest of the URL).

b) The input string is "URL encoded" as it's passed back to your program so that any user entries of special characters such as & = + (and quite a few others) are replaced by a three character sequence that comprises a % character followed by two hex digits to represent the character.
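
To make the encoding concrete, here is a minimal Perl sketch of roughly what the browser does to each value before it sends it. The sub name url_encode is hypothetical (it is not part of the original example), and the set of characters left alone is an assumption based on common browser behaviour for form data; real browsers differ a little in the details.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one form value roughly the way a
# browser does when it submits a form.
sub url_encode {
    my ($text) = @_;
    # Replace everything except letters, digits and a few safe
    # characters with a % followed by two hex digits.
    $text =~ s/([^A-Za-z0-9*._ -])/sprintf("%%%02X", ord($1))/eg;
    # Spaces are sent as + signs.
    $text =~ tr/ /+/;
    return $text;
}

print url_encode("Why *does* this give a problem?"), "\n";
print url_encode('graham@wellho.net'), "\n";
```

The ? becomes %3F and the @ becomes %40, because 3F and 40 are the hex codes of those two characters.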

Here's a program to catch the inputs from the form above ... and you'll find as predicted that it "cuts off" around the 2k of text mark, and encodes a wide range of special characters.


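print "Content-type: text/html\n\n";

# Page header; the here document ends at the line containing just HEADER
print <<"HEADER";
<head><title>Echo demo - for Opentalk answer</title></head>
<h2>Echo test - demonstration of form entry</h2>
HEADER

# QUERY_STRING holds the form data exactly as the browser encoded it
print "Raw echo gives ",$ENV{QUERY_STRING};

print <<"FOOTER";
Example by <a href=http://www.wellho.net>Well House Consultants</a>
FOOTER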

I entered

Graham Ellis
graham@wellho.net
Why *does* this give a problem?

and I got the response

Raw echo gives person=Graham+Ellis&email=graham%40wellho.net&says=Why+*does*+this+give+a+problem%3F

The encoding is done automatically by the browser and it's a necessary evil - since the browser sends character strings back, some form of delimiter is needed and the & and = characters were chosen in the early days of the web ... thus for those two characters (at least) some form of encoding had to be devised.


The scheme for decoding the string in your program is:
 * Split the incoming string at & characters (to give a series of name / value pairs)
 * Split each name / value pair at the = character
 * Replace all + characters in the value with spaces
 * Replace each 3 character sequence starting with % with the appropriate special character
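
The four steps above can be sketched in a few lines of Perl. This is a minimal illustration only: the sub name decode_query and the choice to return a list of name / value pairs are assumptions made for this sketch, and the sample string is the response shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a URL encoded query string into a
# list of [name, value] pairs, following the four steps above.
sub decode_query {
    my ($query) = @_;
    my @pairs;
    foreach my $field (split /&/, $query) {              # step 1
        my ($name, $value) = split /=/, $field, 2;       # step 2
        $value = "" unless defined $value;
        $value =~ tr/+/ /;                               # step 3
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # step 4
        push @pairs, [$name, $value];
    }
    return @pairs;
}

my $sample = 'person=Graham+Ellis&email=graham%40wellho.net&says=Why+*does*+this+give+a+problem%3F';
foreach my $pair (decode_query($sample)) {
    print "$pair->[0] is '$pair->[1]'\n";
}
```

The + to space step has to come before the % step, otherwise a genuine + sign that the user typed (sent as %2B) would be turned into a space.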


As well as the default (GET) method, forms can be submitted to a server using the POST method ... you simply add method=POST as an attribute in your opening form tag ... then modify your receiving script
 * To check which input method was used
 * and to read from STDIN if it was the POST method
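
As a rough sketch of those two changes: the standard CGI environment variables REQUEST_METHOD and CONTENT_LENGTH tell the script which method was used and how many bytes of POST data are waiting. The sub name raw_form_data is hypothetical, and passing the environment and the input filehandle in as parameters is a choice made here so that the example can be run outside a web server.

```perl
use strict;
use warnings;

# Hypothetical helper: return the method and the raw (still encoded)
# form data. $env is a reference to a hash such as %ENV; $in is the
# filehandle that carries the POST data (STDIN under CGI).
sub raw_form_data {
    my ($env, $in) = @_;
    my $method = $env->{REQUEST_METHOD} || "GET";
    if ($method eq "POST") {
        my $buffer = "";
        # CONTENT_LENGTH says how many bytes of data to read
        read($in, $buffer, $env->{CONTENT_LENGTH} || 0);
        return ($method, $buffer);
    }
    my $query = $env->{QUERY_STRING};
    return ($method, defined $query ? $query : "");
}

# In a real CGI script you would call raw_form_data(\%ENV, \*STDIN).
# Here a POST is simulated with an in-memory filehandle.
my $body = "person=Graham+Ellis";
open(my $fake_stdin, "<", \$body) or die "Cannot open in-memory file: $!";
my ($method, $data) = raw_form_data(
    { REQUEST_METHOD => "POST", CONTENT_LENGTH => length($body) },
    $fake_stdin);
print "$method: $data\n";
```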

If you're not using a form (i.e. the data forms a part of the URL after a ? character), you are stuck with the GET limit, I'm afraid. Also, you can't bookmark or provide an "href" type link to a completed form with the POST method.


Here's a sample piece of code that deals with the decoding and with the GET and POST methods as described above ... it's needed in virtually every CGI script you'll write, and it's best provided as a separate sub that you load into all your applications from a standard file through use or require.

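sub collect_form {
        # POST data arrives on STDIN; CONTENT_LENGTH says how many bytes to read
        if ($ENV{REQUEST_METHOD} eq "POST") {
                read(STDIN,$buffer,$ENV{CONTENT_LENGTH});
                $form{"method"} = "POST";
        } else {
                # GET data arrives in the QUERY_STRING environment variable
                $buffer = $ENV{QUERY_STRING};
                $form{"method"} = "GET";
        }

        # Split into name / value pairs, then decode each value
        @fof = split(/&/,$buffer);
        foreach $field(@fof) {
                ($name,$value) = split(/=/,$field);
                $value =~ tr/+/ /;
                $value =~ s/%([a-fA-F0-9]{2})/pack("C",hex($1))/eg;
                # A repeated field name has its values joined by newlines
                if (defined $form{$name}) {
                        $form{$name} .= "\n$value";
                } else {
                        $form{$name} = $value;
                }
        }
        return %form;
}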



Technically, field names are also encoded, so you might also want to apply the tr and s lines to the $name variable if you're being pedantic.

It is possible for a form to be written with several fields of the same name - indeed, if you use the MULTIPLE attribute on a SELECT input, you are bound to get such an input. That's why the script example checks for the same element occurring several times.

We've added an extra response in the form hash to tell the caller of the routine which method was used ... it's a service he may want.

The tr and s lines look complex - but you can just copy those and not worry too much about how they work ;-)


So you've got your information, correctly, into variables in your Perl program? Good, but you're not done with the nasty characters yet!

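If you're saving your data into a database, you'll need to protect characters such as quotes from the SQL interpreter - either by adding extra \ characters yourself or, better, by letting your database interface do the job for you (Perl's DBI module provides a quote method and placeholders for this).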

If you're echoing the information entered back to the user, you need to encode any characters that are special to HTML before you do so ... if you don't, a user who enters <h1> (for example) will have the rest of his echo page come back in headline size 1.
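
A minimal sketch of that encoding in Perl follows. The sub name html_escape is hypothetical; the five characters it replaces are the ones that commonly cause trouble in HTML text and attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: make text safe to echo inside an HTML page.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first, or the entities added below get re-encoded
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&#39;/g;
    return $text;
}

print html_escape('<h1>Tom & "Jerry"'), "\n";
```

With this in place, the user who enters <h1> sees the characters < h 1 > echoed back rather than a page in headline size.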

If you're going to save the input to a file of your own format, again you'll need to consider what delimiter you use in your file, and how to ensure that the user doesn't cause a problem if he uses it in his text.
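
One possible approach, sketched below, is to escape the delimiter before writing and to reverse the escaping on reading. Everything here is an assumption made for illustration: the file format (one record per line, fields separated by |), the escape sequences, and the four sub names.

```perl
use strict;
use warnings;

# Hypothetical file format: one record per line, fields separated by "|".
# A backslash introduces an escape, so user text can never contain a
# raw delimiter or a raw newline.
sub escape_field {
    my ($text) = @_;
    $text =~ s/\\/\\\\/g;      # backslash becomes two backslashes
    $text =~ s/\|/\\p/g;       # pipe becomes backslash p
    $text =~ s/\r?\n/\\n/g;    # newline becomes backslash n
    return $text;
}

sub unescape_field {
    my ($text) = @_;
    my %plain = ("\\" => "\\", "p" => "|", "n" => "\n");
    # One left to right pass, so an escaped backslash is never misread
    $text =~ s/\\([\\pn])/$plain{$1}/g;
    return $text;
}

sub make_record  { return join("|", map { escape_field($_) } @_); }
sub parse_record { return map { unescape_field($_) } split(/\|/, $_[0], -1); }

my $record = make_record("Graham Ellis", "either|or", "two\nlines");
print "$record\n";
print join(" / ", map { length } parse_record($record)), "\n";
```

Because the escaped pipe is written as backslash p rather than backslash pipe, the reading code can split on | without having to look for escapes first.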

If you allow your user to specify a file name, you'll need to check that malicious users don't precede the name with ../../ (or something like that) to try to access files outside the file system area that you're intending that they use.
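
A cautious way to do that check is to accept only names that match a pattern you know to be safe, rather than trying to list everything that is dangerous. The sketch below assumes that a plain name of letters, digits, underscore, hyphen and dot is all your application needs; the sub name safe_filename and the 64 character limit are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical check: accept only a plain file name, never a path.
# Returns the name if it is acceptable, or undef if it is not.
sub safe_filename {
    my ($name) = @_;
    return undef unless defined $name;
    # Must start with a letter, digit or underscore; no slashes allowed
    return undef unless $name =~ /\A[A-Za-z0-9_][A-Za-z0-9_.-]{0,63}\z/;
    # Refuse .. anywhere in the name as an extra precaution
    return undef if $name =~ /\.\./;
    return $name;
}

foreach my $try ("notes.txt", "../../etc/passwd", ".htaccess") {
    my $ok = safe_filename($try);
    print "${try}: ", (defined $ok ? "accepted" : "rejected"), "\n";
}
```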

If your user's entry forms part of an ongoing resource - for example if it's a contribution to a forum - you'll need to check it against your AUP (acceptable use policy). Remember that using just letters of the alphabet it's possible to write offensive text, incite violence and break the Official Secrets Act.



© WELL HOUSE CONSULTANTS LTD., 2022: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/solutions/general- ... tions.html • PAGE BUILT: Wed Mar 28 07:47:11 2012 • BUILD SYSTEM: wizard