Checking robots.txt from Python

The robots.txt file - which well behaved automata check to see whether they are welcome on a web site - has two directives in its base specification: User-agent and Disallow. You will find some other directives in use, and you will find some sites whose robots.txt file has blank lines after the User-agent line, even though (in the specification) the block for a user agent ends at a blank line. That rule, and webmasters' lack of knowledge of the detail, mean that some sites' robots exclusion files are not as effective as their owners would wish.
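A minimal sketch of that blank-line rule, assuming a strict reading of the base specification. The function name parse_robots and the "orphans" list are made up for this illustration, and directives other than User-agent and Disallow are simply skipped:

```python
def parse_robots(text):
    """Split robots.txt text into records, ending each record at a blank line.

    Returns (records, orphans). Each record is an (agents, disallows) pair;
    orphans holds Disallow values that follow no User-agent line.
    """
    records, orphans = [], []
    agents, disallows = [], []

    def close_record():
        nonlocal agents, disallows
        if agents:
            records.append((agents, disallows))
        agents, disallows = [], []

    for raw in text.splitlines():
        if not raw.strip():
            close_record()              # a truly blank line ends the record
            continue
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue                    # comment-only line: not a boundary
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if disallows:               # new User-agent after rules: new record
                close_record()
            agents.append(value)
        elif field == "disallow":
            if agents:
                disallows.append(value)
            else:
                orphans.append(value)   # rule with no user agent to apply to
    close_record()
    return records, orphans
```

Fed the text "User-agent: *", then a blank line, then "Disallow: /private/", this sketch returns one record for "*" with no rules at all, and reports "/private/" as an orphan. That is exactly the case where a site is less protected than its webmaster intended.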

I have written a very short Python example here which reads a robots.txt file over HTTP and analyses it to report on the active User-agent and Disallow lines - not only as a sample program for today's Python course, but also to let me do a quick sanity check of robots.txt files.
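For a second opinion during that kind of sanity check, Python 3's standard library has its own parser in urllib.robotparser. The sketch below is separate from the example program mentioned above; the helper name allowed is invented here, and it works on text you already hold so that it needs no network access:

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_text, path, agent="*"):
    """Ask the standard library parser whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, path)


# The same rules, written correctly and with a stray blank line
good = "User-agent: *\nDisallow: /private/\n"
broken = "User-agent: *\n\nDisallow: /private/\n"

print(allowed(good, "/private/page.html"))
print(allowed(broken, "/private/page.html"))
```

With the CPython parser I would expect the first call to refuse the fetch and the second to allow it, because the blank line closes the record before the Disallow line is seen. Other robots may be more forgiving, so treat the result as one parser's view and not a guarantee.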

Features of this Python example include ...
• Checking the number of command line parameters
• Connecting to a remote web resource and reading it as if it were a file
• Use of exceptions
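
Those three features can be sketched together as follows. This is a Python 3 illustration written for this page, not the original example program; the names fetch_lines and main, the ten second timeout and the exit codes are all assumptions:

```python
import sys
import urllib.request


def fetch_lines(url):
    """Read a remote resource line by line, much as if it were a local file."""
    with urllib.request.urlopen(url, timeout=10) as handle:
        return [raw.decode("utf-8", errors="replace").rstrip("\r\n")
                for raw in handle]


def main(argv):
    # Check the number of command line parameters
    if len(argv) != 2:
        print(f"Usage: {argv[0]} http://host/robots.txt", file=sys.stderr)
        return 2

    # Use exceptions to cope with a bad URL or an unreachable server.
    # urllib's URLError and HTTPError are both subclasses of OSError.
    try:
        lines = fetch_lines(argv[1])
    except (OSError, ValueError) as problem:
        print(f"Could not read {argv[1]}: {problem}", file=sys.stderr)
        return 1

    # Report only the User-agent and Disallow lines
    for line in lines:
        if line.strip().lower().startswith(("user-agent:", "disallow:")):
            print(line.strip())
    return 0

# To run this as a script, finish the file with: sys.exit(main(sys.argv))
```

Returning a status from main, and leaving the sys.exit call to the last line of the script, keeps the logic easy to try out from an interactive session without the interpreter exiting.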
(written 2009-07-12)

 




This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.


© WELL HOUSE CONSULTANTS LTD., 2012: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho

PAGE: http://www.wellho.net/mouth/2282_Che ... ython.html • PAGE BUILT: Fri Feb 3 14:16:04 2012 • BUILD SYSTEM: wizard