"If you think 'surely someone has done this before', you're probably right ... and in Perl, you'll find the resource you need available as a module on your system, or if it's not quite so common, on the CPAN". I was reminded of this advice today, when I got involved with web site checking ... and rather than writing my own robotic browser in Perl, I used the LWP module ("Library for WWW in Perl", in case you wondered!)
What can I do with LWP? Well - I have several new examples to show you.
Reporting all the internal and external links from a page - this uses LWP::Simple, standard on my Perl and easy to use
A short example that grabs a page and echoes its content and status, using a minimal series of calls to the more complete LWP module
A script that grabs a web page, then checks all the links from it - a prototype example which needs some more work, but it's already found a broken link to an external site from one of our pages - and such things are very time-consuming to monitor by hand!
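The first of those examples might look something like the sketch below. It is not the original script: the subroutine name classify_links and the simple pattern used to find the links are my own assumptions, and only get from LWP::Simple and the URI module's new, new_abs, scheme and host calls are standard.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use URI;

# Split the <a href="..."> targets in $html into internal and external
# lists, judged by whether their host matches the host of $base.
# (classify_links is a hypothetical name chosen for this sketch.)
sub classify_links {
    my ($html, $base) = @_;
    my $home = lc(URI->new($base)->host);
    my (%internal, %external);

    # Crude pattern match, good enough for a sketch; a real HTML parser
    # such as HTML::LinkExtor copes with more markup variations.
    while ($html =~ /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi) {
        my $uri    = URI->new_abs($1, $base);    # resolve relative links
        my $scheme = $uri->scheme || '';
        next unless $scheme eq 'http' || $scheme eq 'https';   # skip mailto: etc.
        if (lc($uri->host) eq $home) { $internal{"$uri"} = 1 }
        else                         { $external{"$uri"} = 1 }
    }
    return ([sort keys %internal], [sort keys %external]);
}

# Fetch and report only when a URL is given on the command line
if (my $url = shift @ARGV) {
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    my ($internal, $external) = classify_links($html, $url);
    print "Internal: $_\n" for @$internal;
    print "External: $_\n" for @$external;
}
```

LWP::Simple's get returns undef on failure rather than a status code, which is why it suits a quick report like this but not a job where you need to know why a page failed.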
Here's an example of the sort of output you can get from the link-checking script:
Dorothy-2:perl grahamellis$ perl goodlinks http://www.wellhousemanor.co.uk/
Status from http://www.wellhousemanor.co.uk/whm.css is 200
Status from https://lightning.he.net/~wellho/hotel/reservation.php is 500
Status from http://www.wellhousemanor.co.uk/rooms.html is 200
Status from http://www.wellho.net/happens/rooms.php is 200
Status from http://www.wellhousemanor.co.uk/amenities.html is 200
Status from http://www.wellhousemanor.co.uk/events.html is 200
Status from http://www.wellhousemanor.co.uk/contact.html is 200
Status from http://www.westwiltshire.gov.uk/index/env/env-health-service/food-hygiene/scores-on-doors.htm is 404
Status from http://www.wellho.net is 200
Status from http://www.wiltshirebusinessoftheyear.co.uk/ is 200
Status from http://www.aguafabrics.com/default.asp is 200
Status from http://www.hoteldesigns.net/industrynews/news_2745.html is 200
Status from http://www.macformat.co.uk is 200
Status from http://www.wellhousemanor.co.uk/art.html is 200
Status from http://www.tripadvisor.co.uk/ is 200
Status from http://www.tripadvisor.co.uk/Hotel_Review-g528775-d645951-Reviews-Well_House_Manor-Melksham_Wiltshire_England.html is 200
Status from http://www.freeindex.co.uk/profile(Well-House-Consultants-Ltd)_44477.htm is 200
Status from http://validator.w3.org/check is 200
Dorothy-2:perl grahamellis$
(written 2009-06-11, updated 2009-06-12)
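For anyone who wants to try a checker of this kind, here is a minimal sketch that prints lines in the same "Status from ... is ..." form. It is an illustration under my own assumptions rather than the goodlinks script itself: the subroutine name extract_links, the 15 second timeout and the use of HEAD requests are all choices made for this sketch, built on the standard LWP::UserAgent, HTML::LinkExtor and URI modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Return the unique absolute http(s) URLs referenced from $html, in order
# of first appearance. (extract_links is a hypothetical helper name.)
sub extract_links {
    my ($html, $base) = @_;
    my (%seen, @links);
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;               # called once per linking tag
        for my $value (values %attr) {
            my $uri    = URI->new_abs($value, $base);
            my $scheme = $uri->scheme || '';
            next unless $scheme eq 'http' || $scheme eq 'https';
            $uri->fragment(undef);           # page#top is the same page
            push @links, "$uri" unless $seen{"$uri"}++;
        }
    });
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Check each link only when a start URL is given on the command line
if (my $start = shift @ARGV) {
    my $ua   = LWP::UserAgent->new(timeout => 15);   # timeout is an assumption
    my $page = $ua->get($start);
    die "Cannot fetch $start: ", $page->status_line, "\n"
        unless $page->is_success;

    for my $link (extract_links($page->decoded_content, $start)) {
        # HEAD saves downloading each page; a few servers answer HEAD
        # differently from GET, so swap in $ua->get if results look odd.
        my $response = $ua->head($link);
        print "Status from $link is ", $response->code, "\n";
    }
}
```

Bear in mind that LWP also reports a 500 when it cannot complete a request itself (a failed connection, for example), so a 500 in the list is worth checking by hand before you blame the remote server.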