Search Engines. Getting the right pages seen.

"I've been visited by a robot - what do I do about it?" I chuckled at the question when I came across it yesterday when looking up the Robots Exclusion Standard ("also known as the Robots Exclusion Protocol"). The answer given was "do nothing" and I would agree - almost. Or I might suggest "Celebrate".

Why do we want search engines and other robots to visit us?

It's no good having the best web site in the world ... the most authoritative information on Druids Lodge and Ratfyn Junction ... the most easily understood explanation of a left join (as opposed to a join) ... the best value Linux courses in Melksham ... your excellent picture of a sliced orange or unique illustration of Balloons at Beach ... if no-one can find it!

You put links in, of course, as you write and update your site but - let's face it - the web is a huge place, and the chance of someone who's looking for "mood" pictures of California stumbling across your blog and finding this page is pretty slim ... unless they're helped. And that's where the search engine robots come in.

The Search Engine Robots are automated programs that read a page from your web site, and store in their indexes the keywords and other data from that page. Then they follow the links on that page, and do the same with the pages that they find there. And then the next level of pages. And so on, building up a map of your site as they go. [Technical notes - they have special mechanisms in place to avoid re-checking the same pages too often, and to avoid making so many requests all at once that they affect your web server's performance].
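To make the "read a page, index its words, follow its links" cycle concrete, here is a minimal Python sketch. It is an illustration only, not how any real search engine is written: the three-page SITE dictionary, the crawl function and the LinkAndWordParser class are all made up for this example, and the pages are held in memory so that nothing is fetched over the network. The "seen" set is what stops the same page being read twice; a real robot would also pause between requests to spare your server.

```python
from collections import deque
from html.parser import HTMLParser


class LinkAndWordParser(HTMLParser):
    """Collects link targets and lower-cased words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(word.lower() for word in data.split())


def crawl(site, start):
    """Breadth-first crawl of `site` (a dict of path -> HTML text)."""
    index = {}              # word -> set of pages that contain it
    seen = {start}          # pages already queued, so none is read twice
    queue = deque([start])
    order = []              # pages in the order they were visited
    while queue:
        page = queue.popleft()
        if page not in site:
            continue        # a broken link: nothing to read
        order.append(page)
        parser = LinkAndWordParser()
        parser.feed(site[page])
        parser.close()
        for word in parser.words:
            index.setdefault(word, set()).add(page)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order, index


# Hypothetical site, invented for this sketch
SITE = {
    "/index.html": '<a href="/courses.html">Linux courses</a> <a href="/pix.html">pictures</a>',
    "/courses.html": '<p>Linux courses in Melksham</p> <a href="/index.html">home</a>',
    "/pix.html": '<p>sliced orange</p> <a href="/courses.html">courses</a>',
}

order, index = crawl(SITE, "/index.html")
print(order)                      # ['/index.html', '/courses.html', '/pix.html']
print(sorted(index["melksham"]))  # ['/courses.html']
```

Note that /courses.html is linked from both of the other pages but is only visited once.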

Once a search engine has started to map your pages, it will start to offer them in search results where it sees that they are appropriate. And there's your gain from having the search engine robots visit lots of your pages, even to the extent that they're a significant part of your traffic (our robot traffic is around 40%; I have heard others quote as high as 90%!).

How do I tell search engines where to go?

Remember my "do nothing but celebrate" statement? Well - there's something of the same answer for you here. Provided that your pages are all linked in, on a straightforward web site you'll find that the search engines do a pretty good job of finding the pages, and since it is in their interests as well as yours to present relevant results, you and they are working to the same goal - automatically.

But you can help them along somewhat if you want.

Firstly, there may be pages that you do NOT want them to index - they shouldn't follow that link to your private staff area (which in any case requires a login to see staff-critical stuff, right?), and there's little point in them following all the colour and font size changes that you have on every page. You can instruct well behaved robots to avoid certain places through the robots exclusion protocol - provide a URL (almost inevitably a plain file) on your site with the name /robots.txt for this purpose. Here is the "meat" of our file:

User-agent: *
Disallow: /cgi-bin/
Disallow: /net/unique.html
Disallow: /happens/
Disallow: /resources/mywellho.html
Disallow: /net/search.php4

And it excludes all well behaved robots / search bots / crawlers / spiders from certain places. I have placed a commented version of our robots.txt file in our Web Site Structure resources index. You can read more about the file format and use here, and there's a file checker here which lets you validate yours.
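If you would rather check a set of rules yourself, Python's standard library includes urllib.robotparser, which applies robots.txt rules the way a well behaved robot should. The sketch below feeds it the rules quoted above as a string rather than fetching them from a live site; the robot name "ExampleBot" and the helper function allowed are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the article, held as a string for this sketch
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /net/unique.html
Disallow: /happens/
Disallow: /resources/mywellho.html
Disallow: /net/search.php4
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """True if a robot calling itself `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


print(allowed("/cgi-bin/test.pl"))      # False - under a disallowed directory
print(allowed("/resources/P668.html"))  # True  - no rule matches
```

Against a live site you would call set_url() and read() on the parser instead of parse(), so that the file is fetched for you.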

A second way to tell the search engines where they may go is to provide them with a sitemap file. Google accepts a number of sitemap formats - the easiest is probably a plain text file, but others such as the sitemaps protocol allow you to provide some extra information, and are reusable across a number of search engines. Here are some sample records from a sitemap file:

<url>
  <loc>http://www.wellho.net/resources/P668.html</loc>
  <lastmod>2008-12-25</lastmod>
  <priority>0.808</priority>
  <changefreq>weekly</changefreq>
</url>
<url>
  <loc>http://www.wellho.net/pix/mcem02.jpg</loc>
  <lastmod>2008-08-15</lastmod>
  <priority>0.383</priority>
  <changefreq>yearly</changefreq>
</url>

Google - the largest search engine but by no means the only one - state that they do not use the sitemap to prioritise your site over someone else's but they DO use it as a helpful crawling map, and they do take note of your priority values when selecting which of your pages to offer to visitors.

There are tools to help you generate your own site map, and there's a sitemap checking tool available too. As I want to skew priorities on our own site and have an adaptive system that provides much extra information, I've chosen to experiment away from the tools. You'll find the full sitemap easily enough if you look in the usual places - or you can have a play with the human readable demo form that I've been using in testing here.
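If you do want to roll your own, generating records in the sitemaps protocol format takes only a few lines. What follows is a minimal sketch using Python's standard xml.etree.ElementTree module, and not the adaptive system described above: the build_sitemap function and the PAGES list are invented for the example, with the two sample records from earlier used as test data.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML text for a list of page dictionaries."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        ET.SubElement(url, "lastmod").text = page["lastmod"]
        ET.SubElement(url, "changefreq").text = page["changefreq"]
        # Priority is a number from 0.0 to 1.0; three decimals as in the samples
        ET.SubElement(url, "priority").text = f'{page["priority"]:.3f}'
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Test data: the two sample records shown earlier in the article
PAGES = [
    {"loc": "http://www.wellho.net/resources/P668.html",
     "lastmod": "2008-12-25", "changefreq": "weekly", "priority": 0.808},
    {"loc": "http://www.wellho.net/pix/mcem02.jpg",
     "lastmod": "2008-08-15", "changefreq": "yearly", "priority": 0.383},
]

xml_text = build_sitemap(PAGES)
print(xml_text)
```

Building the file through an XML library rather than by pasting strings together means that awkward characters in a URL, such as "&", are escaped for you.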

Search Engine Optimisation

This is another huge subject - and it's one that a great deal of time and money is thrown at. The sitemap and robots.txt files simply give the search engines guidance within your own site - but you'll want to have them offer your site in preference to others.

Some suggestions / ideas that work:
a) Provide plenty of good content
b) Write your site in good, clean HTML
c) Have lots of links internally and externally
d) Keep content changing, but not *too* often
e) Get lots of people to link in to you too
f) Get listed in lots of good places
g) Use keywords in the URLs and have a good site name

Some things that you should avoid:
a) having the same page appear under lots of different URLs
b) adding "search engine fodder" text in white on white
c) providing different pages to the search engine than to your user
d) having useful content that can only be accessed via cookies
e) copying text of other people's sites to bolster your content

It's not just a question of "how high can I get my rank", but "how can I get my rank high for the appropriate visitor". As an example of what I mean, our site provides a huge resource range that's useful the world over BUT I want my resources to be especially highly ranked for visitors for whom a trip to Melksham is feasible. With a top level domain of ".net" because we're NOT purely a UK company, and a service hosted in California, this wasn't working too well this time last year. A server move to a UK based host, though, has left our traffic levels static but increased the number of UK based IP addresses visiting us, at the expense of other countries.

What about other robots?

I started this article talking about robots in general, and then moved on to talking about search engines specifically. What other bots are there?

A lot of them are specialised search engines - ranging from specific industry tools to others such as the Turnitin bot, which crawls the web indexing texts so that the company running it can sell its services to academic establishments who want to check for plagiarism.

With some of these more obscure search "bots", there are questions as to whether it's really worth the bandwidth to let them through - and indeed if a "bot" such as Turnitin grabs 4000 pages of your site in one day, it could be at the expense of performance, and at cost to your pocket, just to help those commercial people at Turnitin make a bit more money. See further discussion on this and add something like

User-agent: TurnitinBot
Disallow: /

to your robots.txt to ban a specific robot (this only works for a robot that is well behaved enough to read and obey the file).
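Before banning a robot, it's worth finding out how many pages it really does grab in a day. Here is a rough Python sketch that counts requests per day and per user agent from web server log lines. It assumes the common "combined" log format, in which the user agent is the last quoted field on each line; your server may be configured differently. The sample log lines, IP addresses and user agent strings are invented for the example.

```python
import re
from collections import Counter

# Assumed "combined" log layout:
# host ident user [day:time zone] "request" status bytes "referer" "user agent"
LINE_RE = re.compile(
    r'\[(?P<day>[^:\]]+):[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def hits_per_day_and_agent(lines):
    """Count requests for each (day, user agent) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines in any other layout are skipped
            counts[(match.group("day"), match.group("agent"))] += 1
    return counts


# Invented sample data
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2009:10:00:01 +0000] "GET /resources/P668.html HTTP/1.1" 200 5120 "-" "TurnitinBot/2.1"',
    '192.0.2.10 - - [01/Jan/2009:10:00:02 +0000] "GET /net/index.html HTTP/1.1" 200 2048 "-" "TurnitinBot/2.1"',
    '198.51.100.7 - - [01/Jan/2009:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (sample browser)"',
    '192.0.2.10 - - [01/Jan/2009:10:00:09 +0000] "GET /pix/mcem02.jpg HTTP/1.1" 200 9000 "-" "TurnitinBot/2.1"',
]

counts = hits_per_day_and_agent(SAMPLE_LOG)
for (day, agent), hits in counts.most_common():
    print(day, agent, hits)
```

Bear in mind that the user agent string is whatever the visitor chooses to send, so a count like this only identifies robots that are honest about who they are.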

There are some very nasty spiders / robots around too. The same technology can be used to crawl the web and harvest email addresses for spam mailing lists, and to look for holes in your web site. See more about those nasty programs and how you can sanitise your applications against them. Regrettably, by their very nature they won't respect your robots.txt and may even make use of it to find out which directories you would rather not have crawled!
(written 2009-01-01, updated 2009-01-02)

