Be careful of misreading server statistics
Here's a mystery for you.
Background
Over the past weekend, I was "fighting" server outages on another computer where - about once an hour - the httpd daemon appeared to run away, either down some sort of hole or under a denial of service attack. It was a tricky one to find, as the temporary fix I had in place was a "heartbeat" script that killed all existing connections and freshened up the server. And when the server was busy, everything was so much "treacle" that I couldn't run any Linux commands from a shell to see what was going on.
Mystery
I was aware from my heartbeat log of a total of around 20 seconds per hour during which the server was not accepting requests - that's just over 0.5% of the time. Yet I had a user who was telling me that in his experience, downtime was around 10%. Wow - that's some scary figure, isn't it?
Any ideas?
Turns out to be a case of how you gather your statistics!
Solution
My heartbeat script kicks in at the start of every minute, and if there's a problem it tidies up, which takes 5 seconds. Having kicked in once, it then does a further precautionary clean at the top of the following minute and, if it's not sure that load levels are dropping as they should, again at the top of the minute after that. So in a bad hour, 4 outages of 5 seconds = 20 seconds.
It turns out that my user was running an automated script to check our server, again at the top of the minute. So he had synchronised his tests to our server in such a way that he always saw it during that brief clean up. Looking at his log activity later, I noticed that if he got a failure he had programmed in a second hit straight away to confirm it - so he was seeing either 4 failures in 60 checks, or 8 failures in 64 checks once the confirming hits are counted. That's about 6.7% or 12.5% to report.
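The arithmetic can be checked with a short Python sketch. The figures (5 second clean ups, 4 per bad hour, one check per minute) come from the story above; the function names, the choice of which minutes are "bad", and the assumption that the confirming hit always lands inside the same clean up are mine, for illustration only.

```python
# A minimal sketch of the sampling effect described above.
# Figures are from the article; names and the outage layout are assumptions.

OUTAGE_SECONDS = 5       # each clean up takes 5 seconds
OUTAGES_PER_HOUR = 4     # clean ups in a bad hour
PROBES_PER_HOUR = 60     # the user's script checks once a minute


def true_downtime_fraction():
    """Fraction of the hour the server really was unavailable."""
    return OUTAGES_PER_HOUR * OUTAGE_SECONDS / 3600


def observed_failure_fraction(retry_on_failure):
    """Failure rate seen by a probe that always hits the clean up window."""
    failures = OUTAGES_PER_HOUR
    probes = PROBES_PER_HOUR
    if retry_on_failure:
        # Assumption: the immediate confirming hit fails as well,
        # because it still lands inside the 5 second clean up.
        failures += OUTAGES_PER_HOUR
        probes += OUTAGES_PER_HOUR
    return failures / probes


def probe_failures(offset_seconds, outage_minutes=(0, 1, 2, 3)):
    """Failed probes in one hour for a probe sent offset_seconds past
    each minute. Clean ups are assumed to occupy the first 5 seconds
    of the (hypothetical) minutes listed in outage_minutes."""
    failures = 0
    for minute in range(60):
        if minute in outage_minutes and offset_seconds < OUTAGE_SECONDS:
            failures += 1
    return failures


if __name__ == "__main__":
    print(f"true downtime:        {true_downtime_fraction():.2%}")
    print(f"seen, no retry:       {observed_failure_fraction(False):.2%}")
    print(f"seen, with retry:     {observed_failure_fraction(True):.2%}")
    # Averaged over every possible offset within the minute, the probe
    # would have reported the true figure.
    total = sum(probe_failures(s) for s in range(60))
    print(f"averaged over offsets: {total / (60 * 60):.2%}")
```

A probe at the top of the minute sees every clean up, one at 30 seconds past sees none, and only the average over all offsets matches the real downtime. Had the user checked at random moments rather than on the minute, his figure would have drifted towards mine.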
Lies, damned lies and statistics
This is an "object lesson" in being careful with statistics - at best, they're helpful and at worst they can give a totally incorrect picture. But I have to say that this example really took the biscuit!
Footnote - server issue solved. Availability now over 99.8% and the remaining outages in the last couple of days relate to me testing. (written 2008-05-28 06:47:30)
Associated topics are indexed under A606 - Web Application Deployment - Apache httpd - log files and log tools
This is a page archived from The Horse's Mouth at
http://www.wellho.net/horse/ -
the diary and writings of Graham Ellis.