We had a problem when our web site access logs doubled in size in one day. It wasn't good selling on our part, or a massive return to work after a public holiday; all the extra traffic came from a single location - in our case, in Slovakia.
HOW WE NOW PREVENT SUCH AGGRESSIVE BROWSING USING PHP
All of our pages use PHP and pull in a standard file of web helper routines (that's a chosen design feature that lets us make global changes very easily), so all I've had to do is to add a short section of code in there:
# Log current request
$nowsec = time();
$rip = $_SERVER['REMOTE_ADDR'];
# Escape the URI before it's used in the SQL below
$wanted = mysql_real_escape_string($_SERVER['REQUEST_URI']);
# Our database connection is already open ...
$logit = "INSERT INTO recent (tstamp, remoteaddr, calledfor) ".
"values (".
$nowsec . ", ".
"\"$rip\", ".
"\"$wanted\") ";
@mysql_query($logit);
# How many requests in the test time period
$keeptime = 120;
$hurdle = 50;
$nn = $nowsec - $keeptime;
$q = @mysql_query("select count(tstamp) from recent ".
"where remoteaddr = '$rip' and tstamp > $nn");
$res = @mysql_fetch_row($q);
$balloon = $res[0];
# $hurdle pages per $keeptime seconds - aggressive!
if ($balloon > $hurdle) {
@mysql_query("INSERT INTO warned (tstamp, remoteaddr, ".
"calledfor) values (".
$nowsec . ", ".
"\"$rip\", ".
"\"$wanted\") ");
sleep(10); # Keep 'em waiting!
# $response = file($_SERVER['DOCUMENT_ROOT']."/dos.html");
# print (join(" ",$response));
# exit();
}
$q = @mysql_query("delete from recent where tstamp < $nn");
# Keep database creation commands here so that we can
have them to hand for when we port the software
/*
$q = @mysql_query("create table recent ( tstamp bigint,".
" remoteaddr text, calledfor text, ".
" rid bigint primary key not null auto_increment)");
$q = @mysql_query("create table warned ( tstamp bigint,".
" remoteaddr text, calledfor text, ".
" rid int primary key not null auto_increment)"); */
AN EXPLANATION
Our database connection is always opened when you call up a page on our web site, so the first thing we do is add a note of where every single request has come from. Normally we have to take huge care in web programming (and PHP is a huge help here) to keep one request from being linked with any others; but for things like auction sites, and for dynamic traffic monitoring too, that isolation isn't what we want, so here we deliberately use a common database table shared across all requests.
Once the access has been logged, we check to see how many other accesses have come from the same location (IP address) in a chosen time period, and we see if a limit has been reached. We've chosen, for testing, a limit of 50 pages in 2 minutes - see below.
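The same sliding-window check can be sketched without the database, just to make the logic clear; the function name and parameters here are our own invention, not part of the live code:

```php
<?php
# Sketch of the sliding-window check - $hits is a list of request
# timestamps for one IP address (in the live code this comes from
# the "recent" table instead).
function tooManyRequests(array $hits, $now, $keeptime, $hurdle) {
    $recent = 0;
    foreach ($hits as $tstamp) {
        if ($tstamp > $now - $keeptime) {
            $recent++;          # hit falls inside the window
        }
    }
    return $recent > $hurdle;   # true means "aggressive"
}

# 51 hits in the last two minutes trips a hurdle of 50 ...
$hits = range(1000, 1050);
var_dump(tooManyRequests($hits, 1060, 120, 50));  # bool(true)
# ... but the same hits seen an hour later do not
var_dump(tooManyRequests($hits, 4700, 120, 50));  # bool(false)
?>
```

Because the window slides with each request, a burst that has aged past $keeptime seconds stops counting against the visitor automatically - which is also why the delete at the end of the live code is safe.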
When our chosen threshold is reached, we log the "violation" to a separate database table and take our immediate action to deal with the heavy traffic.
Finally (in all cases), we delete records older than our threshold from the table so that we don't end up with a monstrously growing database table that needs major work every so often.
HOW TO CHOOSE AND SET THE LIMITS
A tricky one! You want to catch people before they do too much harm, but be sure that you won't stop an important but slightly aggressive robot. The figures we've chosen are slightly above the hit level we've been experiencing from the Google and MSN crawlers; both of these tend to come through in fits and starts, so a limit of 30 hits per minute was occasionally triggered, but a lower rate (25 per minute) spread over 2 minutes seems OK.
Our visitor from Slovakia who provoked the writing of the code and article was grabbing a page every second for hours on end, and would clearly be trapped.
We also need to be aware that high traffic levels *can* be legitimate. Browsing our live web site from our own training centre, all requests appear to come from a single IP address; with a maximum class size of 7 trainees, plus three staff members, all browsing our site at the same time, the limiter would be hit if the average user called up more than one page every 24 seconds, consistently for 2 minutes.
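The arithmetic behind that 24-second figure can be sketched in a few lines; the variable names follow the code above, and the user count is the training-centre example:

```php
<?php
# Back-of-envelope check on the training centre figures above.
$hurdle   = 50;     # pages allowed per window
$keeptime = 120;    # window length in seconds
$users    = 10;     # 7 trainees plus 3 staff behind one IP address

$pages_each = $hurdle / $users;   # 5 pages per user in the window
$gap = $keeptime / $pages_each;   # seconds between pages, per user
echo "One page every $gap seconds each\n";  # One page every 24 seconds each
?>
```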
ACTION TO TAKE WHEN LIMIT'S HIT
In our example above, we've simply put a 10 second delay into the code - a minimalist response that's intended to slow down the high traffic generator without affecting the content that they see.
Alternative code, commented out in our example, generates a response page that warns the user that he's triggered our limits. "What use is that to a spider?" you may ask - well, if the spidering is following links, it somewhat trims down the number of different links to follow and provides an element of traffic control.
Other actions (not shown in the sample code) include:
= emailing the server admin when the limit is hit
(but beware the possibility of spamming yourself)
= "blacklisting" warned IP addresses so that they
can't quickly step their hits back up
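The blacklisting idea can be sketched as a plain function so the logic is clear without a database; in practice the $bans list would come from a table alongside "recent" and "warned" (the table and the one-hour ban period here are our own invention):

```php
<?php
# Sketch of blacklisting - each $bans entry is array(tstamp, remoteaddr),
# recording when an address tripped the hurdle. The one-hour ban
# period is illustrative only.
function isBlacklisted(array $bans, $ip, $now, $banperiod) {
    foreach ($bans as $ban) {
        list($tstamp, $remoteaddr) = $ban;
        if ($remoteaddr == $ip && $tstamp > $now - $banperiod) {
            return true;    # banned recently enough to still apply
        }
    }
    return false;
}

$bans = array(array(5000, "192.0.2.7"));
var_dump(isBlacklisted($bans, "192.0.2.7", 5100, 3600));  # bool(true)
var_dump(isBlacklisted($bans, "192.0.2.7", 9999, 3600));  # bool(false)
?>
```

A banned address would then get the warning page (or a flat refusal) straight away, rather than being allowed to work its hit count back up as soon as the "recent" records expire.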
We could also check the user agent - in other words the program that's being used to call up all the pages - and respond differently to known and welcomed crawlers that are getting a bit over-enthusiastic. It might even be a good idea to log who's checked the robots.txt file ...
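A user agent check might look like the sketch below - the list of "welcome" crawler strings is illustrative, not exhaustive, and you'd want to remember that user agent strings are trivially forged:

```php
<?php
# Sketch of a user agent check against known, welcomed crawlers.
function isWelcomeCrawler($useragent) {
    $welcome = array("Googlebot", "msnbot");
    foreach ($welcome as $name) {
        if (stripos($useragent, $name) !== false) {
            return true;    # known crawler - treat it more gently
        }
    }
    return false;
}

var_dump(isWelcomeCrawler("Mozilla/5.0 (compatible; Googlebot/2.1)"));  # bool(true)
var_dump(isWelcomeCrawler("Wget/1.9.1"));  # bool(false)
?>
```

In the live code you would pass in $_SERVER['HTTP_USER_AGENT'] and, say, raise the hurdle rather than apply the sleep() for crawlers you recognise.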
BACK TO THE BEGINNING
So - what *was* the problem that triggered all this careful investigation and filtering?
Looking through our logs, it appears to be a user in a University in Slovakia who's using the Wget utility to grab a web page and everything it calls up too; I'm inclined to think that he / she found our site useful and decided to put a local copy on his / her own laptop for later use, but I may be wrong and it might be more sinister. In any case, he / she hasn't visited our robots.txt to see whether automata are welcome and where they may go, so I don't feel too bad about capping.
Did I try asking what was going on? Yes, but from web logs you can't identify the user, so I had to use the rather blunter tool of writing to the sys admin at the site. As their web site isn't in my native tongue (and clicking on the word "English" on their front page gives a 404 error), I'm not holding out much hope that they'll even receive and understand my message.
Our immediate user will simply have found our pages a bit slow and erratic if he continues to spider next week (I think he's off for the weekend as I write this on Sunday). If I choose to enable the alternative document that's commented out in my test code, he'll get:
<html>
<head><title>Well House Consultants -
probable aggressive spidering notice</title></head>
<body bgcolor="#FF9999" text="black">
<h1>Your IP address has requested more than 50 pages in
two minutes</h1>
<b>We welcome spiders to index our site, but request
that they are "polite" - that they check our robots.txt
file, and that they crawl gently so as not to burn up all
our bandwidth and deny access to others.<br><br>
You have been sent this page because over 50 pages were
requested from our server within 120 seconds from the
same IP address. If you're spidering us, please adjust
your spider so that it's more gentle in its actions. This
"incident" has been logged, and we'll be taking a closer
look at our records.<br><br>
If you feel that you should not have received this
message, please email me (graham@wellho.net). Thanks!</b>
</body>
</html>
See also
Using MySQL from PHP