Refactoring - a PHP demo becomes a production page

"Refactoring" is a term that I've come across in Extreme Programming, but it's also a relevant topic to consider through the life cycle of any software. Perhaps I had better give a definition ....

Refactoring - the updating / alteration of software or systems, usually done in order to take into account a changing requirement.

A couple of years ago, I wrote a little demonstration during a course that took our daily web site log file, analysed it, and reported on the most popular pages on our web site. In those days, the daily log files were around 2 Mbytes each, but that has now risen dramatically - it's been over 30 Mbytes per day for the last 3 days - and that means that the techniques I used in my initial demonstration, which were quick and easy to write but relatively slow to run, are no longer totally appropriate. At the same time as the data has grown, I've extended the program's output from a demonstration that reported the most popular pages into a much more thorough analysis of web server accesses - looking at accesses by country, and at what proportion of our traffic is from robots. All of which has meant refactoring the code as it has progressed - an ongoing process .... Today, it's a page that provides us with a whole lot of information about our most visited pages.

What are some of the aspects involved?

a) Moving from a "recalculate everything each time" type operation to one where elements of caching are involved. This is at two levels.

Firstly, within each analysis - once we have identified a visiting IP to be from a particular country and calculated whether or not it's a spider, we retain that information through the rest of the file analysis on the basis that the IP address cannot move country against our fixed lookup scheme, and that it's very improbable that the same IP would be used for both a regular visitor and a spider.

Secondly (and not yet implemented as I write), there is little point in repeating the analysis many times each day for a log file that turns over in the middle of each night. Better to store the results of analysing the huge file, and read those stored results to produce the report, than to re-analyse every time. Do note, though, that we can't simply save a static report file, as our script allows a variety of parameters to be passed to it to tailor the report.
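As a rough illustration of that second level of caching, here is a minimal sketch rather than the code from our page (which, as noted, doesn't do this yet). The function name cached_analysis, the cache file and the $analyse callback are all hypothetical. It stores the analysis results, not the finished report, so that the report can still be tailored by parameters on each request.

```php
<?php
// Sketch only: cached_analysis() and its arguments are hypothetical names.

/**
 * Return the analysis for $logfile, calling $analyse only when there is
 * no stored copy or the log is newer than the stored copy.
 */
function cached_analysis($logfile, $cachefile, $analyse) {
    // Reuse the stored results if they are at least as new as the log
    if (is_file($cachefile) && filemtime($cachefile) >= filemtime($logfile)) {
        $stored = unserialize(file_get_contents($cachefile));
        if ($stored !== false) return $stored;
    }
    // Otherwise do the expensive pass and store what it produced
    $results = $analyse($logfile);
    file_put_contents($cachefile, serialize($results), LOCK_EX);
    return $results;
}
```

Comparing modification times suits a log that turns over once a night; a log that is appended to all day would invalidate the stored copy on almost every request.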

b) Breaking out data to include files. Early on in the life of our script, we added a few lines of code to test the browser - to see if the user agent string contained something like "MSIE", in which case we could identify the visitor as using Microsoft Internet Explorer. That same logic is also shared by our recent visitors page.

By moving the table of browsers out into a separate file, we can now include an ever expanding and changing table of browser strings from a single source in both applications - and can easily update it to provide further browser data without having to change each of the files it used to be buried in.
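To show the sort of thing a shared table makes possible, here is a minimal sketch of a lookup that either application could call. The function name identify_browser and the filename in the comment are hypothetical, and the three-entry table is a stand-in so that the example runs on its own.

```php
<?php
// In real use the table would come from the shared include file, e.g.
//   include "browsers.inc";   // hypothetical filename
// A three-entry stand-in keeps this sketch self-contained.
$browsers = array(
    "firefox"   => "Firefox",
    "googlebot" => "Google Spider",
    "msie"      => "Internet Explorer",
);

/**
 * Return the display name of the first table entry whose key appears
 * (case-insensitively) in the user agent string, or "Unknown".
 */
function identify_browser($agent, $browsers) {
    foreach ($browsers as $needle => $name) {
        if (stripos($agent, $needle) !== false) return $name;
    }
    return "Unknown";
}

// prints "Internet Explorer"
echo identify_browser("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)", $browsers), "\n";
```

The first matching entry wins, so the order of the table matters where one user agent string could match more than one key.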

c) Moving from efficiency of coding to efficiency of running. For the analysis of a small data file, a simple set of regular expression matches to work out which user agent is a robot, and which is a real user, sufficed. But that gets very slow - especially where there's likely to be a very large number of different strings. The code has been modified to use a much faster strpos to identify certain common browsers without the need for a regular expression at all ... all meaning that the work can be done within the time the user would expect to be taken for a web page refresh.

Here's an example - showing both caching and efficiency changes - from within our script:

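if (isset($spip[$line_els[0]])) {
    // Cache hit - this IP address has already been classified
    $isspider = $spip[$line_els[0]];
} else {
    $isspider = 1;    // 1 = regular visitor, 2 = spider
    while (1) {
        // Cheap substring tests rule out the most common browsers first
        if (strpos($line,'MSIE') !== false) break;
        if (strpos($line,'Firefox') !== false) break;
        if (strpos($line,'Safari') !== false) break;
        // eregi() was removed in PHP 7 - preg_match() with the i modifier replaces it
        if (eregi($spider_reg,$line)) $isspider = 2;
        break;
    }
    $spip[$line_els[0]] = $isspider;
}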

You'll note that we use the array $spip as a cache of data about which IP addresses are used by spiders - taking data from that cache if it's available in preference to doing a more complex analysis. When we do the analysis, we use strpos calls to rapidly eliminate the most common browsers before we go on and match to a (quite complex) regular expression that we have made up from the contents of a browser include file. Here is the include file ...

<?php # Browser Identity Strings - Spot the Spider!
$browsers = array (
"firefox" => "Firefox",
"iceweasel" => "Iceweasel",
"safari" => "Safari",
"netscape" => "Netscape",
"konqueror" => "Konqueror",
"opera" => "Opera",
"NutchCVS" => "Nutch Spider",
"wget" => "Wget",
"msnbot" => "MSN Spider",
"googlebot" => "Google Spider",
"us/ysearch/slurp" => "Yahoo Spider",
"WISEnutbot" => "Looksmart Spider",
"Ask Jeeves/Teoma" => "Ask Jeeves Spider",
"Naverbot" => "NaverBot Spider",
"www.almaden.ibm.com" => "IBM Almaden Spider",
"findlinks" => "Findlinks Spider",
"SocietyRobot" => "E Society Spider",
"ia_archiver" => "ia_archiver Spider",
"Accoona-AI-Agen" => "Accoona Spider",
"psbot" => "psbot Spider",
"seekbot" => "seekbot Spider",
"aipbot" => "aipbot Spider",
"rssimagesbot" => "rssimagesbot Spider",
"happyfunbot" => "happyfunbot Spider",
"msie" => "Internet Explorer",
"Twiceler" => "Twiceler Scraper / Spider",
"Xerka WebBot" => "Xerka WebBot / Spider",
"Yanga WorldSearch Bot" => "Yanga WorldSearch Spider",
"ShopWiki" => "Shop Wiki Spider",
"MJ12bot" => "Majestic 12 Spider",
"Gigabot" => "Gigabot Spider");

... please feel free to use these user agents which I have found amongst those on our site!
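For anyone wanting to build a single spider-matching pattern from a table like this, here is a minimal sketch. It is not the code from our script. The function name build_spider_regex is hypothetical, and the rule "an entry is a spider if its display name contains the word Spider" is an assumption made purely for illustration. It also uses preg_match style patterns rather than eregi, since eregi is no longer available from PHP 7 onwards.

```php
<?php
// Stand-in for the included table, so the sketch runs on its own
$browsers = array(
    "firefox"          => "Firefox",
    "googlebot"        => "Google Spider",
    "us/ysearch/slurp" => "Yahoo Spider",
    "msie"             => "Internet Explorer",
);

/**
 * Build one case-insensitive pattern that matches any spider key.
 * Assumption: an entry is a spider if its display name contains "Spider".
 */
function build_spider_regex($browsers) {
    $parts = array();
    foreach ($browsers as $needle => $name) {
        if (strpos($name, "Spider") === false) continue;
        // Escape regex metacharacters and the "~" delimiter used below
        $parts[] = preg_quote($needle, "~");
    }
    // With no spiders in the table, return a pattern that can never match
    if (!$parts) return "~(?!)~";
    return "~" . implode("|", $parts) . "~i";
}

$spider_reg = build_spider_regex($browsers);
$agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
echo preg_match($spider_reg, $agent) ? "spider" : "visitor", "\n";
```

Using "~" as the delimiter means keys that contain a slash, such as the Yahoo one, need no special handling.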
(written 2008-09-12, updated 2008-09-15)

This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.
