Refactoring - a PHP demo becomes a production page

"Refactoring" is a term that I've come across in Extreme Programming, but it's also a relevant topic to consider through the life cycle of any software. Perhaps I had better give a definition ....

Refactoring - the updating / alteration of software or systems, usually done in order to take into account a changing requirement.

A couple of years ago, I wrote a little demonstration during a course that took our daily web site log file, analysed it, and reported on the most popular pages on our web site. In those days, the daily log files were around 2 Mbytes each, but that has now risen dramatically - it's been over 30 Mbytes per day for the last 3 days - and that means that the techniques I used in my initial demonstration - quick and easy to write, but relatively slow to run - are no longer totally appropriate. At the same time as the data has been increasing, I've extended the program's output from a demonstration that reports the most popular pages into a much more thorough analysis of web server accesses - looking at accesses by country, and also at what proportion of our traffic is from robots. All of which has meant refactoring the code as it has progressed - an ongoing process .... Today, it's a page that provides us with a whole lot of information about our most visited pages.

What are some of the aspects involved?

a) Moving from a "recalculate everything each time" type operation to one where elements of caching are involved. This is at two levels.

Firstly, within each analysis - once we have identified a visiting IP to be from a particular country and calculated whether or not it's a spider, we retain that information through the rest of the file analysis on the basis that the IP address cannot move country against our fixed lookup scheme, and that it's very improbable that the same IP would be used for both a regular visitor and a spider.

Secondly (and not yet implemented as I write), there is little point in repeating the analysis many times each day for a log file that turns over in the middle of each night. Better to store the results of analysing the huge file, and read those stored results to produce the report, than to re-analyse every time. Do note, though, that we can't simply save a static report file, as our script allows a variety of parameters to be passed to it to tailor the report.
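As this second level isn't in the script yet, what follows is only a sketch of one shape it might take. The function names (analyse_log, cached_analysis, top_pages), the file names passed in, and the assumption that the requested page is the seventh space-separated field of each log line are all made up for illustration. The raw results are stored, and tailoring to the request parameters happens after they are read back.

```php
<?php
// Sketch only: every function and file name here is a hypothetical stand-in.

// Stand-in for the expensive pass over the log: counts hits per page,
// assuming the requested page is the 7th space-separated field.
function analyse_log($logfile) {
    $hits = array();
    foreach (file($logfile, FILE_IGNORE_NEW_LINES) as $line) {
        $els = explode(' ', $line);
        if (!isset($els[6])) continue;
        $page = $els[6];
        $hits[$page] = isset($hits[$page]) ? $hits[$page] + 1 : 1;
    }
    arsort($hits);          // most visited pages first
    return $hits;
}

// Return the analysis, re-running it only if the log is newer than the cache.
function cached_analysis($logfile, $cachefile) {
    if (is_file($cachefile) && filemtime($cachefile) >= filemtime($logfile)) {
        return unserialize(file_get_contents($cachefile));
    }
    $results = analyse_log($logfile);
    file_put_contents($cachefile, serialize($results));
    return $results;
}

// Tailoring by request parameter is applied to the cached raw results.
function top_pages($results, $howmany) {
    return array_slice($results, 0, $howmany, true);
}
```

A report page could then call something like top_pages(cached_analysis($log, $cache), 10), with the 10 coming from a validated request parameter, so that only the first request after the nightly turnover pays for the full analysis.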

b) Breaking out data to include files. Early on in the life of our script, we added a few lines of code to test the browser - to see if the user agent string contained something like "MSIE", in which case we could identify the visitor as using Microsoft Internet Explorer. That same logic is also shared by our recent visitors page.

By moving the table of browsers out into a separate file, we can now include an ever expanding and changing table of browser strings from a single source in both applications - and can indeed easily update it to provide further browser data without having to change several files that it's hidden in the middle of.
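As a rough illustration of how two pages might share one table, here is a minimal sketch. The include file name in the comment, the identify_browser function and the "Unknown" fallback label are assumptions of mine rather than anything taken from the live script, and only a short excerpt of the table is inlined so that the sketch runs on its own.

```php
<?php
// In both pages this table would arrive via something like
//   require 'browser_table.inc';   // hypothetical file name
// A short excerpt is inlined here so that the sketch runs on its own.
$browsers = array(
    "firefox"   => "Firefox",
    "googlebot" => "Google Spider",
    "msie"      => "Internet Explorer");

// Return the label of the first table entry found in the user agent
// string; the match is case-insensitive as the keys are in mixed case.
function identify_browser($agent, $browsers) {
    foreach ($browsers as $needle => $label) {
        if (stripos($agent, $needle) !== false) return $label;
    }
    return "Unknown";   // fallback label is an assumption
}

echo identify_browser("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)", $browsers), "\n";
```

Because the first matching entry wins, the order of the table matters once a user agent string contains more than one of the keys.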

c) Moving from efficiency of coding to efficiency of running. For the analysis of a small data file, a simple set of regular expression matches to work out which user agent is a robot, and which is a real user, sufficed. But that gets very slow - especially where there's likely to be a very large number of different strings. The code has been modified to use a much faster strpos to identify certain common browsers without the need for a regular expression at all ... all meaning that the work can be done within the time the user would expect to be taken for a web page refresh.
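If you want to check the difference on your own server, a small timing harness along these lines may help. It is a sketch and not a rigorous benchmark: the user agent string, the loop count and the pattern are illustrative choices of mine, and preg_match stands in for whichever regular expression function the script really uses.

```php
<?php
// Sketch: time a plain substring test against a regular expression match
// on the same illustrative user agent string. Figures vary by machine.
$agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1";
$loops = 100000;

// Plain substring search; compare with false as a match at offset 0 is valid
$start = microtime(true);
$found_strpos = 0;
for ($i = 0; $i < $loops; $i++) {
    if (strpos($agent, 'Firefox') !== false) $found_strpos++;
}
$time_strpos = microtime(true) - $start;

// Case-insensitive regular expression with several alternatives
$start = microtime(true);
$found_regex = 0;
for ($i = 0; $i < $loops; $i++) {
    if (preg_match('/googlebot|msnbot|slurp|firefox/i', $agent)) $found_regex++;
}
$time_regex = microtime(true) - $start;

printf("strpos: %.4fs  preg_match: %.4fs\n", $time_strpos, $time_regex);
```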

Here's an example - showing both caching and efficiency changes - from within our script:

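if (isset($spip[$line_els[0]])) {
    $isspider = $spip[$line_els[0]];   // cached verdict for this IP
} else {
    $isspider = 1;                     // 1 = regular visitor, 2 = spider
    while (1) {                        // runs once; break leaves early
        // strpos returns 0 for a match at offset 0, so compare with false
        if (strpos($line,'MSIE') !== false) break;
        if (strpos($line,'Firefox') !== false) break;
        if (strpos($line,'Safari') !== false) break;
        // eregi was removed in PHP 7; preg_match with the i modifier replaces it
        if (eregi($spider_reg,$line)) $isspider = 2;
        break;
    }
    $spip[$line_els[0]] = $isspider;   // remember the result for this IP
}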

You'll note that we use the array $spip as a cache of data about which IP addresses are used by spiders - taking data from that cache if it's available in preference to doing a more complex analysis. When we do the analysis, we use strpos calls to rapidly eliminate the most common browsers before we go on and match to a (quite complex) regular expression that we have made up from the contents of a browser include file. Here is the include file ...

<?php # Browser Identity Strings - Spot the Spider!
$browsers = array (
"firefox" => "Firefox",
"iceweasel" => "Iceweasel",
"safari" => "Safari",
"netscape" => "Netscape",
"konqueror" => "Konqueror",
"opera" => "Opera",
"NutchCVS" => "Nutch Spider",
"wget" => "Wget",
"msnbot" => "MSN Spider",
"googlebot" => "Google Spider",
"us/ysearch/slurp" => "Yahoo Spider",
"WISEnutbot" => "Looksmart Spider",
"Ask Jeeves/Teoma" => "Ask Jeeves Spider",
"Naverbot" => "NaverBot Spider",
"www.almaden.ibm.com" => "IBM Almaden Spider",
"findlinks" => "Findlinks Spider",
"SocietyRobot" => "E Society Spider",
"ia_archiver" => "ia_archiver Spider",
"Accoona-AI-Agen" => "Accoona Spider",
"psbot" => "psbot Spider",
"seekbot" => "seekbot Spider",
"aipbot" => "aipbot Spider",
"rssimagesbot" => "rssimagesbot Spider",
"happyfunbot" => "happyfunbot Spider",
"msie" => "Internet Explorer",
"Twiceler" => "Twiceler Scraper / Spider",
"Xerka WebBot" => "Xerka WebBot / Spider",
"Yanga WorldSearch Bot" => "Yanga WorldSearch Spider",
"ShopWiki" => "Shop Wiki Spider",
"MJ12bot" => "Majestic 12 Spider",
"Gigabot" => "Gigabot Spider");

... please feel free to use these user agents which I have found amongst those on our site!
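The regular expression built from this table isn't shown above, so here is one way it might be put together, offered as a sketch only. I have assumed that an entry counts as a spider when its label contains the word "Spider", the function name build_spider_regex is made up, and the pattern is written for preg_match rather than for the eregi call in the script extract. A short excerpt of the table is inlined so that the sketch runs on its own.

```php
<?php
// Sketch: build one case-insensitive pattern from the table keys.
// Assumption: an entry counts as a spider if its label contains "Spider".
$browsers = array(
    "firefox"          => "Firefox",
    "msnbot"           => "MSN Spider",
    "us/ysearch/slurp" => "Yahoo Spider",
    "Ask Jeeves/Teoma" => "Ask Jeeves Spider");

function build_spider_regex($browsers) {
    $parts = array();
    foreach ($browsers as $needle => $label) {
        if (strpos($label, 'Spider') === false) continue;
        // preg_quote escapes regex specials and the '/' delimiter
        $parts[] = preg_quote($needle, '/');
    }
    return '/' . implode('|', $parts) . '/i';
}

$spider_reg = build_spider_regex($browsers);
echo $spider_reg, "\n";
```

Building the pattern once per run, rather than testing each key in turn for every log line, keeps the per-line work down to a single match.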
(written 2008-09-12, updated 2008-09-15)



This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.


© WELL HOUSE CONSULTANTS LTD., 2024: 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/mouth/1794_.html • PAGE BUILT: Sun Oct 11 16:07:41 2020 • BUILD SYSTEM: JelliaJamb