Exercises, examples and other material relating to training module P608. This topic is presented on public courses
Using Perl on the Web,
Perl Extra
Although most users of the web will be seated at a browser and will call up pages one by one, there's also a requirement for automated browsing tools. For example, a search engine such as Google will methodically visit a site page by page, indexing the entire content for its customers, and a web site validation program will visit each page in turn in order to find any broken links before live users do. Perl is an excellent language for writing automated browsing tools such as these, and in this module we study the techniques you may wish to use, and also the etiquette involved in writing socially acceptable automata.
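A site-checking robot of the kind described above first has to pull the links out of each page it visits. A minimal sketch in core Perl follows; the HTML sample is invented for illustration, and a real checker would use a proper parser such as HTML::LinkExtor from the LWP bundle rather than a regular expression:

```perl
use strict;
use warnings;

# Invented sample page - in a real checker this would come from an HTTP fetch
my $html = '<p><a href="/index.html">Home</a> and '
         . '<a href="http://www.example.com/about.html">About</a></p>';

# Naive extraction of href attributes; fine for a sketch, but use
# HTML::LinkExtor or HTML::Parser on real-world pages
my @links = $html =~ /<a\s+[^>]*href\s*=\s*"([^"]+)"/gi;

print "$_\n" for @links;   # each link found, one per line
```

Each extracted link would then be queued for checking, after converting relative URLs to absolute ones (the URI module handles that).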
Articles and tips on this subject | updated |
2402 | Automated Browsing in Perl I'm reminded on today's Perl course just how powerful some of the modules are, and how much you can do in so little code.
LWP::UserAgent turns your Perl program into an automated browser .. the following four lines read the robots.txt file off my web site.
use LWP::UserAgent;
$connex = new LWP::UserAgent("agent" ... | 2009-09-11 (short) |
2229 | Do not re-invent the wheel - use a Perl module "If you think 'surely someone has done this before', you're probably right ... and in Perl, you'll find the resource you need available as a module on your system, or if it's not quite so common, on the CPAN". I was reminded of this advice today, when I got involved with web site checking ... and rather ... | 2009-06-12 |
2045 | Does robots.txt actually work? If you put an entry into your robots.txt file to ask the various robots to disallow (cease crawling) certain files and directories, do they actually take note of your request ... considering that it's a purely voluntary standard ...
Three or four days back, I excluded some old map pages which were being ... | 2009-02-17 |
1031 | robots.txt - a clue to hidden pages? The robots.txt file is designed to provide spiders and crawlers with a list of places they should NOT go - it's described as the "robot exclusion standard" file and its intent is to allow the webmaster to segregate his site into indexable and non-indexable areas.
But because it lists directories to be excluded, ... | 2007-01-17 |
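The first article above quotes only the opening lines of the robots.txt fetch; a complete version of that idea might look like the sketch below. The agent string and the example.com URL are invented for illustration, and LWP::UserAgent comes from the libwww-perl bundle:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Identify the robot politely - the agent string is an invented example
my $ua = LWP::UserAgent->new(
    agent   => 'wellho-demo-robot/0.1',
    timeout => 10,
);

# example.com stands in for whatever site you are checking
my $response = $ua->get('http://www.example.com/robots.txt');

if ($response->is_success) {
    print $response->decoded_content;     # the robots.txt text
} else {
    warn 'Could not fetch robots.txt: ' . $response->status_line . "\n";
}
```

Setting a descriptive agent string matters: it lets the site's webmaster see who is crawling, which is part of the etiquette this module covers.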
Background information
Some modules are available as a sample of our material, or for free download under an Open Training Notes License, from [here].
Topics covered in this module
Definitions.
Cautions.
Checking a page, links and sites.
Checking a single page.
Checking links and included files.
Checking a site.
Things to do with a pet spider.
Being considerate.
The robots exclusion standard.
Bandwidth.
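The robots exclusion standard listed above is simple enough that a basic check can be sketched in a few lines of core Perl. The robots.txt content below is invented, and the sketch ignores per-robot User-agent sections; in production you should use the WWW::RobotRules module, which implements the standard properly:

```perl
use strict;
use warnings;

# Invented robots.txt content - in practice, fetch this from the site
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
Disallow: /oldmaps/
END

# Collect the Disallow prefixes (naively treating them as applying to us)
my @disallowed;
for my $line (split /\n/, $robots_txt) {
    push @disallowed, $1 if $line =~ /^Disallow:\s*(\S+)/i;
}

# A considerate spider checks each candidate path before fetching it
sub allowed {
    my ($path) = @_;
    for my $prefix (@disallowed) {
        return 0 if index($path, $prefix) == 0;   # path starts with prefix
    }
    return 1;
}

print allowed('/index.html')        ? "fetch /index.html\n"        : "skip /index.html\n";
print allowed('/private/list.html') ? "fetch /private/list.html\n" : "skip /private/list.html\n";
```

Remember that the standard is purely voluntary, as the articles above note: honouring it is a matter of good manners, not enforcement.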
Complete learning
If you are looking for a complete course and not just information on a single subject, visit our
Listing and schedule page.
Well House Consultants specialise in training courses in
Ruby,
Lua,
Python,
Perl,
PHP, and
MySQL. We run
Private Courses throughout the UK (and beyond for longer courses), and
Public Courses at our training centre in Melksham, Wiltshire, England.
It's surprisingly cost-effective to come on our public courses -
even if you live in a different country or continent from us.
We have a technical library of over 700 books on the subjects on which we teach.
These books are available for reference at our training centre.