The robots.txt file, which well-behaved automata check to see whether they are welcome on a web site, has two directives in its base specification: User-agent and Disallow. You will find other directives in use, and you will find some sites whose robots.txt file has blank lines after the User-agent line even though, in the specification, the block for a user agent ends at a blank line. These rules, and webmasters' lack of knowledge of the detail, mean that some sites' robots exclusion files are not as effective as their owners would wish.
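To see how a stray blank line can cut a Disallow off from its User-agent, here is a minimal sketch of a parser that treats a blank line as the end of a block, as described above. The function name `parse_robots`, the sample text and the agent name `ExampleBot` are my own for illustration. Real crawlers may well be more forgiving than this strict reading.

```python
def parse_robots(text):
    """Split robots.txt text into records, ending each record at a blank line.

    Returns (records, orphans):
      records - list of (agents, disallows) pairs
      orphans - Disallow values that sit outside any User-agent block
    """
    records, orphans = [], []
    agents, disallows = [], []

    def close_record():
        # Store the block being built (if any) and start afresh
        if agents:
            records.append((list(agents), list(disallows)))
        agents.clear()
        disallows.clear()

    for raw in text.splitlines():
        if not raw.strip():
            close_record()                      # a blank line ends the block
            continue
        line = raw.split("#", 1)[0].strip()     # drop any comment
        if ":" not in line:
            continue                            # comment-only or malformed line
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if disallows:
                close_record()                  # a new agent after rules starts a new block
            agents.append(value)
        elif field == "disallow":
            if agents:
                disallows.append(value)
            else:
                orphans.append(value)           # no agent in force: the rule is ineffective
    close_record()
    return records, orphans


# Hypothetical file with the common mistake: a blank line after User-agent
sample = """User-agent: *

Disallow: /private/

User-agent: ExampleBot
Disallow: /tmp/
"""

records, orphans = parse_robots(sample)
print(records)
print(orphans)
```

In the sample, the Disallow for /private/ ends up in `orphans` rather than attached to the `*` agent, which is exactly the sort of surprise a quick sanity check should flag.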
I have written a very short Python example here, which reads a robots.txt file over HTTP and analyses it to report on the active User-agent and Disallow lines. It serves not only as a sample program on today's Python Course, but also lets me do a quick sanity check of robots.txt files.
Features of this Python example include ...
• Checking the number of command line parameters
• Connecting to a remote web resource and reading it as if it were a file
• Use of exceptions
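
The three features listed above might fit together roughly as follows. This is a sketch in current Python 3 rather than the course example itself. The script name `robocheck.py`, the function names `report` and `main`, and the ten-second timeout are all assumptions of mine.

```python
import sys
import urllib.request


def report(stream):
    """Print and return the active User-agent and Disallow lines.

    'stream' is any file-like object that yields lines of bytes.
    """
    found = []
    for raw in stream:                          # loop just as with a local file
        line = raw.decode("utf-8", "replace").strip()
        field, colon, _ = line.partition(":")
        if colon and field.strip().lower() in ("user-agent", "disallow"):
            found.append(line)
            print(line)
    return found


def main(argv):
    # Check the number of command line parameters
    if len(argv) != 2:
        print("Usage: robocheck.py http://www.example.com", file=sys.stderr)
        return 1
    url = argv[1].rstrip("/") + "/robots.txt"
    try:
        # urlopen gives back an object that can be read like a file
        with urllib.request.urlopen(url, timeout=10) as stream:
            report(stream)
    except (OSError, ValueError) as problem:
        # URLError and HTTPError are subclasses of OSError;
        # ValueError covers a malformed URL
        print(f"Could not read {url}: {problem}", file=sys.stderr)
        return 2
    return 0


if __name__ == "__main__":
    # main() returns a status code; wrap the call in sys.exit() if the shell needs it
    main(sys.argv)
```

Keeping `report` separate from the network code means it can be tried out on any file-like object, such as an ordinary file opened in binary mode, without going near a web server.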
(written 2009-07-12)