Exercises, examples and other material relating to training module P608. This topic is presented on the public courses Using Perl on the Web and Perl Extra.
Although most users of the web will be seated at a browser and will call up pages one by one, there's also a requirement for automated browsing tools. For example, a search engine such as Google will methodically visit a site page by page, indexing the entire content for its customers, and a web site validation program will visit each page in turn in order to find any broken links before live users do. Perl is an excellent language for writing automated browsing tools such as these, and in this module we study the techniques you may wish to use, and also the etiquette involved in writing socially acceptable automata.
Articles and tips on this subject:
2402: Automated Browsing in Perl
I was reminded on today's Perl course of just how powerful some of the modules are, and how much you can do in so little code.
LWP::UserAgent turns your Perl program into an automated browser; the following four lines read the robots.txt off my web site:
    use LWP::UserAgent;
    $connex = LWP::UserAgent->new("agent" => "demobot");             # give your robot an honest name ("demobot" is an example)
    $response = $connex->get("http://www.example.com/robots.txt");   # placeholder URL - substitute your own site
    print $response->content if $response->is_success;              # show the file we fetched
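If you want your robot to follow the rules automatically, the related LWP::RobotUA module is a drop-in replacement for LWP::UserAgent which fetches each site's robots.txt for itself and declines requests for pages the site has excluded.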
2229: Do not re-invent the wheel - use a Perl module
"If you think 'surely someone has done this before', you're probably right ... and in Perl, you'll find the resource you need available as a module on your system, or if it's not quite to common, on the CPAN". I was reminded of this advise today, when I got involved with web site checking ... and rather ...
2045: Does robots.txt actually work?
If you put an entry into your robots.txt file to ask the various robots to disallow (cease crawling) certain files and directories, do they actually take note of your request ... considering that it's a purely voluntary standard ...
Three or four days back, I excluded some old map pages which were being ...
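For illustration, the kind of entry described above looks like this in robots.txt (the paths here are invented for the example):

    User-agent: *
    Disallow: /maps/
    Disallow: /old/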
1031: robots.txt - a clue to hidden pages?
The robots.txt file is designed to provide spiders and crawlers with a list of places they should NOT go - it's described as the "robot exclusion standard" file, and its intent is to allow the webmaster to segregate his site into indexable and non-indexable areas.
But because it lists directories to be excluded, ...
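A well-behaved spider checks those exclusions before it fetches anything. Here is a minimal sketch using the WWW::RobotRules module from the libwww-perl distribution; the agent name and URLs are placeholders for the example:

    use strict;
    use LWP::UserAgent;
    use WWW::RobotRules;

    my $agent = "demobot/1.0";                           # example robot name
    my $site  = "http://www.example.com";                # placeholder site

    # Fetch the site's robots.txt and hand it to the rules parser
    my $ua = LWP::UserAgent->new(agent => $agent);
    my $robots = $ua->get("$site/robots.txt");
    my $rules = WWW::RobotRules->new($agent);
    $rules->parse("$site/robots.txt", $robots->content) if $robots->is_success;

    # Consult the rules before fetching a page
    my $target = "$site/maps/old.html";                  # hypothetical page
    if ($rules->allowed($target)) {
        print "May fetch $target\n";
    } else {
        print "robots.txt asks us to stay away from $target\n";
    }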
Some modules are available as a sample of our material, or under an Open Training Notes License, for free download from [here].
Topics covered in this module
Checking a page, links and sites.
Checking a single page.
Checking links and included files (see the sketch after this list).
Checking a site.
Things to do with a pet spider.
The robots exclusion standard.
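As a taste of the checking topics above, here is a sketch of a single-page link checker built from LWP::UserAgent and HTML::LinkExtor; the start URL and agent name are invented for the example, and a production checker would add the politeness measures covered under the robots exclusion standard:

    use strict;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $base = "http://www.example.com/index.html";          # placeholder start page
    my $ua = LWP::UserAgent->new(agent => "democheck/1.0");  # example robot name

    # Collect every link and included file (href, src, ...) from the page
    my @links;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, values %attr;
    });
    my $page = $ua->get($base);
    die "Cannot fetch $base: ", $page->status_line, "\n" unless $page->is_success;
    $extor->parse($page->content);

    # Try a HEAD request on each link, resolving relative URLs first
    for my $link (@links) {
        my $url = URI->new_abs($link, $base);
        next unless $url->scheme =~ /^https?$/;              # skip mailto:, ftp: and friends
        my $check = $ua->head($url);
        printf "%-4s %s\n", $check->is_success ? "OK" : "BAD", $url;
    }

Extending the sketch to a whole site is mostly a matter of queueing the HTML pages it finds and keeping a record of URLs already visited, so that each page is checked once.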
If you are looking for a complete course and not just information on a single subject, visit our Listing and schedule.
Well House Consultants specialise in training courses. We run courses throughout the UK (and beyond for longer courses), and at our training centre in Melksham, Wiltshire, England. It's surprisingly cost effective to come on our public courses, even if you live in a different country or continent to us.
We have a technical library of over 700 books on the subjects on which we teach.
These books are available for reference at our training centre.