The robots.txt file is designed to provide spiders and crawlers with a list of places they should NOT go - it's described as the "robot exclusion standard" file, and its intent is to let the webmaster divide a site into indexable and non-indexable areas.
But because it lists directories to be excluded, robots.txt is often an excellent source of links that people don't want found. I have seen numerous examples (which I will NOT reproduce here!) where directories that are not for public consumption are listed. And - in theory - I'm perfectly at liberty to read a site's robots.txt with a regular browser, then manually step through the places from which robots are excluded to see what's there.
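To show how little effort that takes, here is a minimal sketch in Python that pulls the Disallow entries out of the text of a robots.txt file. The function name `disallowed_paths` and the paths in the sample are made up for illustration - they don't come from any real site.

```python
def disallowed_paths(robots_txt):
    """Return the Disallow paths found in robots.txt text, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        # Strip trailing comments and surrounding whitespace
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        # An empty Disallow value means "nothing is excluded", so skip it
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt content, invented for this example
sample = """\
User-agent: *
Disallow: /private-reports/   # hypothetical path
Disallow: /old-admin/
Disallow:
"""

print(disallowed_paths(sample))
```

Anyone - robot or human - can do the same with a file fetched from a live site, which is exactly why the list should never include anything you need to keep secret.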
If you want to protect areas of your site from prying eyes / accidental discovery, do NOT rely on robots.txt - use password protection or some other form of authentication.
Our robots.txt file - which I'll happily reproduce here - lists URLs that I don't mind people finding - I just don't want them indexed. So even if they're looking with malicious intent - which I doubt - they won't "get" anywhere.
#
# robots.txt file for www.wellho.net and www.wellho.co.uk
#
# we encourage robots to visit and index almost ALL documents
# but not any executable scripts.
#
User-agent: *
Disallow: /cgi-bin/
Disallow: /net/unique.html
So all robots are allowed anywhere EXCEPT the CGI scripts, which we don't want indexed. On our site, all such scripts produce reports that change regularly and depend on the information entered, so it would be misleading to encourage indexers to list them.
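As a rough check of how a well-behaved robot would read those rules, here is a small sketch using `urllib.robotparser` from Python's standard library. The rules are the ones listed above; the robot name "ExampleBot" and the paths `/index.html` and `/cgi-bin/search.pl` are assumptions made up for the test.

```python
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt file shown above
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /net/unique.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "http://www.wellho.net"


def allowed(path, agent="ExampleBot"):
    """Report whether the (hypothetical) robot may fetch this path."""
    return parser.can_fetch(agent, base + path)


# /index.html and /cgi-bin/search.pl are illustrative paths only
for path in ("/index.html", "/cgi-bin/search.pl", "/net/unique.html"):
    print(path, allowed(path))
```

Remember that this only tells you what a polite robot will do - nothing in robots.txt stops a robot, or a person, from ignoring it.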
The /net/unique.html page is sort of internal. It's generated by one of our site scripts and lists words that occur only once on the rest of the site. Purpose? To help us find spelling mistakes! I don't mind anyone seeing the page - and indeed I've just provided you with a link to it in this article - but people REALLY won't want to land there when they do a search!
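The idea behind such a page can be sketched in a few lines of Python. This is NOT the script that runs on our site - just an illustration of the technique, with a made-up function name `words_used_once` and made-up page texts.

```python
import re
from collections import Counter


def words_used_once(pages):
    """Return, sorted, the words that appear exactly once across all pages."""
    counts = Counter()
    for text in pages:
        # Lower-case the text and count runs of letters / apostrophes
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)


# Hypothetical page texts; "fiel" is a deliberate typo for "file"
pages = [
    "robots read the file",
    "robots read the fiel",
    "the file",
]

print(words_used_once(pages))
```

A word used only once isn't necessarily wrong, of course - the list is a set of candidates for a human to glance through, not a spelling checker.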
(written 2007-01-13, updated 2007-01-17)