The robots.txt file - which well-behaved automata check to see whether they are welcome on a web site - has two directives in its base specification: User-Agent and Disallow. You will find some other directives in use, and you will find some sites whose robots.txt file has blank lines after the User-Agent line, even though (in the specification) the block for a user agent ends at a blank line. These rules, and webmasters' lack of knowledge of the detail, mean that on some sites the robots exclusion file is not as effective as the owners would wish.
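To make the blank-line rule concrete, here is a minimal sketch in Python 3 of how a strict reader might split a robots.txt file into blocks. It is an illustration only, not the program from the course; the function name split_records and the sample text are assumptions made for this example, and only the two base directives are handled.

```python
def split_records(text):
    """Split robots.txt text into records; a blank line ends a record.

    Returns a list of (user_agents, disallows) tuples. Only the two
    directives of the base specification are kept; others are ignored.
    """
    records = []
    agents, disallows = [], []
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            # A blank line closes the current record, if one is open
            if agents or disallows:
                records.append((agents, disallows))
                agents, disallows = [], []
            continue
        # Remove any comment; a comment-only line does not end a record
        line = stripped.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    # Close the final record if the file did not end with a blank line
    if agents or disallows:
        records.append((agents, disallows))
    return records


# Sample with a stray blank line after the User-agent line
sample = """User-agent: *

Disallow: /private/
"""
for agents, disallows in split_records(sample):
    print(agents, disallows)
```

With the stray blank line, the sample splits into two records: the first names a user agent but disallows nothing (`['*'] []`), and the second has a Disallow line that belongs to no user agent (`[] ['/private/']`). A robot that follows the specification to the letter may therefore treat /private/ as open to it.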
I have written a very short Python example here which reads a robots.txt file over HTTP and analyses it to report on the active User-Agent and Disallow lines - not only as a sample program on today's Python Course, but also to allow me to do a quick sanity check of robots.txt files.
Features of this Python example include ...
• Checking the number of command line parameters
• Connecting to a remote web resource and reading it as if it were a file
• Use of exceptions
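The three features listed above can be sketched roughly as follows. This is not the program referred to in the post, which may well differ; it is a minimal Python 3 illustration using the standard urllib.request module. The names read_robots and main, the opener parameter, and the example URL in the usage message are all assumptions made for this sketch.

```python
import sys
from urllib.request import urlopen


def read_robots(url, opener=urlopen):
    """Return the lines of the resource at url, or None if it cannot be read.

    opener is a hypothetical hook (defaulting to urlopen) so that the
    function can be tried out without a network connection.
    """
    try:
        with opener(url) as handle:
            # The response is iterated line by line, just like a file
            return [line.decode("utf-8", "replace").rstrip("\r\n")
                    for line in handle]
    except (OSError, ValueError) as err:
        # OSError covers URLError and HTTPError; ValueError covers a
        # string that is not a usable URL at all
        print("Could not read", url, "-", err, file=sys.stderr)
        return None


def main(argv):
    # Check the number of command line parameters
    if len(argv) != 2:
        print("Usage:", argv[0], "http://www.example.com/robots.txt",
              file=sys.stderr)
        return 1
    lines = read_robots(argv[1])
    if lines is None:
        return 2
    # Report only the two directives of the base specification
    for line in lines:
        if line.lower().startswith(("user-agent:", "disallow:")):
            print(line)
    return 0

# To run this as a script, finish with: sys.exit(main(sys.argv))
```

Returning None from read_robots, rather than letting the exception escape, keeps the error handling in one place and lets main decide on the exit status.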
(written 2009-07-12)