Nasty Characters in Web Applications


Read a form from your web site, and you're giving your user the ability to enter practically any text that he or she chooses. You'll get correct answers and incorrect ones, short entries and long ones, ones that comprise just letters and digits and others that include all the colours of the special character rainbow - from < and " through # and & to pound signs.

The sort of form you might be providing would be:

<form action="/cgi-bin/demo/echodemo.pl">
Please enter your name <input name="person"><br>
and your email address <input name="email"><br>
and your comment <textarea name="says" rows="10" cols="40"></textarea><br>
and then <input type="submit">
</form>

The input data from this form will appear in the QUERY_STRING environment variable in your program (I'm going to use Perl for this example, as a question on Perl prompted this article) ... but subject to two conditions:

a) There's a limit - we used to say 1K but it may be up to 2K on some systems - on the amount of data that can be passed in this way and held in an environment variable. So long entries will be truncated a little short of the limit (allowing for the rest of the URL).

b) The input string is "URL encoded" as it's passed back to your program, so that any user entries of special characters such as & = + (and quite a few others) are replaced by a three character sequence that comprises a % character followed by two hex digits to represent the character. Spaces are the exception: each one is sent as a single + character.
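To make the encoding rule concrete, here is a minimal sketch of what the browser does to each value before sending it. The helper name form_encode is made up for this illustration, and the sketch assumes single-byte characters only; it is not the browser's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one form value roughly as a browser does
# for a GET form. Assumes single-byte characters only.
sub form_encode {
    my ($text) = @_;
    # Everything except letters, digits, space and * - . _ becomes %XX
    $text =~ s/([^A-Za-z0-9*\-._ ])/sprintf("%%%02X", ord($1))/eg;
    # Spaces are sent as + characters
    $text =~ tr/ /+/;
    return $text;
}

print form_encode('graham@wellho.net'), "\n";
print form_encode('Why *does* this give a problem?'), "\n";
```

The two calls print graham%40wellho.net and Why+*does*+this+give+a+problem%3F, which is the form the values take in the raw echo further down this article.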

Here's a program to catch the inputs from the form above ... and you'll find as predicted that it "cuts off" around the 2k of text mark, and encodes a wide range of special characters.


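# Every CGI response starts with a header block and a blank line
print "Content-type: text/html\n\n";

print <<"HEADER";
<head><title>Echo demo - for Opentalk answer</title></head>
<h2>Echo test - demonstration of form entry</h2>
HEADER

# The form data, still URL encoded, exactly as the server received it
print "Raw echo gives ",$ENV{QUERY_STRING};

print <<"FOOTER";
Example by <a href=http://www.wellho.net>Well House Consultants</a>
FOOTER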

I entered

Graham Ellis
graham@wellho.net
Why *does* this give a problem?

and I got the response

Raw echo gives person=Graham+Ellis&email=graham%40wellho.net&says=Why+*does*+this+give+a+problem%3F

The encoding is done automatically by the browser and it's a necessary evil - since the browser sends character strings back, some form of delimiter is needed and the & and = characters were chosen in the early days of the web ... thus for those two characters (at least) some form of encoding had to be devised.


Once the data has arrived, your program needs to decode it. The scheme is ...
 * Split the incoming string at & characters (to give a series of name / value pairs)
 * Split each name / value pair at the = character
 * Replace all + characters in the value with spaces
 * Replace each three character sequence starting with % with the special character it represents
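As a rough illustration, the four steps can be applied to the raw echo string shown earlier. This is only a sketch: the variable names are made up for the example, and it ignores repeated field names and the POST method.

```perl
use strict;
use warnings;

# The raw echo output from earlier in this article
my $query = 'person=Graham+Ellis&email=graham%40wellho.net'
          . '&says=Why+*does*+this+give+a+problem%3F';

my %decoded;
foreach my $pair (split /&/, $query) {              # step 1: split at &
    my ($name, $value) = split /=/, $pair, 2;       # step 2: split at =
    $value = '' unless defined $value;
    $value =~ tr/+/ /;                              # step 3: + becomes space
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;  # step 4: %XX becomes a character
    $decoded{$name} = $value;
}

print "$_ => $decoded{$_}\n" for sort keys %decoded;
```

This prints the email, person and says fields with the original @, spaces and ? restored.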


As well as the default (GET) method, forms can be submitted to a server using the POST method ... you simply add method=POST as an attribute in your opening form tag ... then modify your receiving script
 * To check which input method was used (the REQUEST_METHOD environment variable tells you)
 * and to read from STDIN if it was the POST method

If you're not using a form (i.e. the data forms a part of the URL after a ? character), you are stuck with the GET limit, I'm afraid. Also, you can't bookmark or provide an "href" type link to a completed form with the POST method.


Here's a sample piece of code that deals with the decoding and with the GET and POST methods as described above .... it's needed in virtually every CGI script you'll write, and it's best provided as a separate sub that you load into all your applications from a standard file through use or require.

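sub collect_form {
        my ($buffer, %form);
        if (($ENV{REQUEST_METHOD} || "") eq "POST") {
                # POST: the data arrives on STDIN, CONTENT_LENGTH bytes of it
                read(STDIN, $buffer, $ENV{CONTENT_LENGTH} || 0);
                $form{"method"} = "POST";
        } else {
                # GET: the data arrives in the QUERY_STRING environment variable
                $buffer = $ENV{QUERY_STRING};
                $form{"method"} = "GET";
        }
        $buffer = "" unless defined $buffer;

        my @fof = split(/&/,$buffer);
        foreach my $field (@fof) {
                my ($name,$value) = split(/=/,$field,2);
                $value = "" unless defined $value;
                $value =~ tr/+/ /;                                  # + means space
                $value =~ s/%([a-fA-F0-9]{2})/pack("C",hex($1))/eg; # %XX to character
                if (exists $form{$name}) {
                        # Repeated field name: join the values with newlines
                        $form{$name} .= "\n$value";
                } else {
                        $form{$name} = $value;
                }
        }
        return %form;
}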



Technically, field names are also encoded, so you might also want to apply the tr and s lines to the $name variable if you're being pedantic.

It is possible for a form to be written with several fields of the same name - indeed, if you use the MULTIPLE attribute on a SELECT input, you are bound to get such an input. That's why the script example checks for the same element occurring several times, and joins the values together with newline characters when it does.
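Here is a small sketch of how a caller might pick a repeated field apart again. The helper parse_pairs is a made-up, string-driven stand-in for collect_form so that the example runs outside a web server, and the input string is invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a query string into a hash, joining
# repeated field names with newlines as collect_form does.
sub parse_pairs {
    my ($buffer) = @_;
    my %form;
    foreach my $field (split /&/, $buffer) {
        my ($name, $value) = split /=/, $field, 2;
        $value = '' unless defined $value;
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9]{2})/pack("C",hex($1))/eg;
        $form{$name} = exists $form{$name} ? "$form{$name}\n$value" : $value;
    }
    return %form;
}

# Made-up input, as a SELECT MULTIPLE named "colour" might send it
my %form = parse_pairs('colour=red&colour=dark+blue&size=L');

# Split the joined value back into the individual selections
my @colours = split /\n/, $form{colour};
print scalar(@colours), " colours chosen: @colours\n";
```

Joining with a newline works well for single line inputs; if a repeated field could itself contain newlines (a textarea, say) you would need a different separator or a list of values.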

We've added an extra response in the form hash to tell the caller of the routine which method was used ... it's a service he may want.

The tr and s lines look complex - but you can just copy those and not worry too much about how they work ;-)


So you've got your information, correctly, into variables in your Perl program? Good, but you're not done with the nasty characters yet!

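If you're saving your data into a database, you'll need to add in extra \ characters to protect characters such as the double quote from the SQL interpreter (or, if you're using Perl's DBI module, let its quote method or its placeholders do that job for you).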

If you're echoing the information entered back to the user, you need to encode any characters that are special to HTML before you do so ... if you don't, a user who enters <h1> (for example) will have the rest of his echo page come back in headline size 1 ....
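A minimal sketch of that encoding step follows. The sub name escape_html is made up for this example, and it only handles the four characters that most often cause trouble.

```perl
use strict;
use warnings;

# Hypothetical helper: make user text safe to echo inside an HTML page
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or it would re-encode the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print escape_html('<h1>Fish & "chips"'), "\n";
```

With this in place the <h1> appears on the echo page as text rather than being acted on as a tag. The CPAN module HTML::Entities provides encode_entities if you would rather not maintain your own.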

If you're going to save the input to a file of your own format, again you'll need to consider which delimiter you use in your file, and how to ensure that the user doesn't cause a problem if he uses it in his text.
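One possible approach, sketched here on the assumption of a file with one record per line and tab separated fields, is to escape the delimiter characters on the way out and reverse that on the way in. The sub names and the choice of backslash as the escape character are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helpers for a tab separated, one record per line file
sub escape_field {
    my ($text) = @_;
    $text =~ s/\\/\\\\/g;    # protect existing backslashes first
    $text =~ s/\t/\\t/g;     # a real tab becomes backslash, t
    $text =~ s/\n/\\n/g;     # a real newline becomes backslash, n
    return $text;
}

sub unescape_field {
    my ($text) = @_;
    # Read each backslash pair left to right and turn it back
    $text =~ s/\\(.)/$1 eq 't' ? "\t" : $1 eq 'n' ? "\n" : $1/ge;
    return $text;
}

my $comment = "line one\nline two\twith a tab and a \\ backslash";
my $stored  = escape_field($comment);
print "Round trip OK\n" if unescape_field($stored) eq $comment;
```

After escaping, the stored text contains no real tabs or newlines, so it cannot break the record layout whatever the user typed.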

If you allow your user to specify a file name, you'll need to check that malicious users don't precede the name with ../../ (or something like that) to try to access files outside the file system area that you're intending them to use.
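A simple way to make that check is to accept only names built from a known set of safe characters, rather than trying to list every dangerous one. The sketch below assumes you want plain file names with no directory part at all; the sub name safe_filename and the exact character set are choices made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only a plain file name, no directories
sub safe_filename {
    my ($name) = @_;
    return 0 unless defined $name && length $name;
    return 0 unless $name =~ /\A[A-Za-z0-9_.-]+\z/;  # no slashes or oddities
    return 0 if $name =~ /\A\./;   # no leading dot: rejects ".", ".." and hidden files
    return 1;
}

for my $try ('notes.txt', '../../etc/passwd', '..') {
    printf "%-20s %s\n", $try, safe_filename($try) ? 'accepted' : 'rejected';
}
```

Only notes.txt is accepted here; the other two are rejected before they get anywhere near an open call.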

If your user's entry forms part of an ongoing resource - for example if it's a contribution to a forum - you'll need to check it against your AUP (acceptable use policy). Remember that using just letters of the alphabet it's possible to write offensive text, incite violence and break the Official Secrets Act.

See also More about the Common Gateway Interface

© WELL HOUSE CONSULTANTS LTD., 2019: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01225 708225 • FAX: 01225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/solutions/general- ... tions.html • PAGE BUILT: Wed Mar 28 07:47:11 2012 • BUILD SYSTEM: wizard