Perl Regular Expressions - finding the position and length of the match
If you want to find the position of a match in an incoming string, check the length of $` (that's $PREMATCH if you've chosen to "use English;") to find where the match starts, and add the length of $& (that's $MATCH) to find where it ends.
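
As a minimal sketch of that arithmetic, the snippet below matches against a made-up sample string (the variable $text and its contents are illustrative assumptions, not taken from the page discussed here) and derives the start and end offsets from the lengths of $` and $&.

```perl
use strict;
use warnings;

# Hypothetical sample text, purely for illustration.
my $text = 'See http://www.wellho.net/ for details';

my ($start, $end);
if ($text =~ m!https?://[^ >"]+!) {
    $start = length($`);            # number of characters before the match
    $end   = $start + length($&);   # offset just past the last matched character
    print "Match runs from offset $start up to offset $end\n";
}
```

Offsets count from zero, so $end is the position of the first character after the match.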

Let's say I want to find all the URLs referred to in a web page that's loaded into the variable $html. I could write:


push @section,[length($`),length($&),$1]
while ($html =~ m!(https?://[^ >"]+)!g);


and that will give me a list of 3-element lists containing start point, length and actual string matched. Here's the code to display that list:


foreach $element(@section) {
print (join(", ",@$element),"\n");
}


and here are some of the results from the source of our resources index page:


5979, 36, http://www.wellho.net/forum/top.html
6967, 36, http://www.wellho.net/net/mouth.html
7059, 42, http://www.wellho.net/downloads/index.html
8369, 67, http://www.wellho.net/mouth/387_Training-course-plans-for-2006.html
9365, 43, http://www.trainingcenter.co.uk/travel.html
9516, 45, http://reiseauskunft.bahn.de/bin/query.exe/en
9599, 59, http://www.livedepartureboards.co.uk/ldb/summary.aspx?T=MKM
9861, 48, https://lightning.he.net/~wellho/net/secure.html


P.S. I loaded my whole web page into a single variable using the code

open (FH,"/Library/WebServer/live_html/resources/index.html");
undef $/;
$html = <FH>;

which is a nice little demo of changing (or removing) the delimiter character for reading from a file handle, via the $/ variable. Once $/ has been undef-fed, reading into a scalar slurps from the current pointer in the file right through to the end of file.
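
Here is a small sketch of that behaviour which needs nothing on disk. It reads from an in-memory string rather than a real file, and the name $data and its contents are assumptions made up for the demonstration. The first read happens with the default separator and the second after $/ has been undefined, so the second read picks up everything from the current position to end of file.

```perl
use strict;
use warnings;

# Hypothetical in-memory "file" so the sketch needs nothing on disk.
my $data = "line one\nline two\nline three\n";

open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my $first = <$fh>;   # default $/ is "\n", so this reads a single line
undef $/;            # no separator any more
my $rest  = <$fh>;   # reads from the current position to end of file
close($fh);

print "First read:  $first";
print "Second read: $rest";
```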
(written 2006-02-02 04:30:52)

Commentator says ...
Dave Cross: You should warn people that using $`, $& and $' is a potential performance hit as any use of one of those variables in a program means that Perl has to track all of those variables for every match in your program. You can get the same information without the performance implications by using @- and @+.

And, I know this is just a demonstration, but encouraging people to parse HTML using regexes is a really bad idea. It's a much better idea to use something like HTML::Parser (or one of its subclasses like, in this case, HTML::LinkExtor).
(comment added 2006-02-02 06:52:23)
Graham Ellis: Thanks, Dave. Totally agree with your comments. However, there can be so many "if"s and "but"s added to any example that it becomes hard to see the wood for the trees.

Yes, there are FAR better ways of parsing HTML but it was a nice example and, yes, $` and friends can be inefficient. So if you want to say where in a string a regular expression match is to be found, what would you use as a more efficient alternative?
(comment added 2006-02-02 07:53:53)
Dave Cross: As I mentioned in my first comment, you can get the information using @- and @+.

push @section,[$-[0],$+[0] - $-[0],$1]
while ($html =~ m!(https?://[^ >"]+)!g);
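
A self-contained sketch of the same idea follows, run against a made-up fragment of HTML (the contents of $html here, including the example.com address, are assumptions for illustration only). $-[0] is the offset where the whole match starts and $+[0] is the offset just past its end, so subtracting one from the other gives the length.

```perl
use strict;
use warnings;

# Hypothetical sample HTML, purely for illustration.
my $html = '<a href="http://www.wellho.net/">home</a> '
         . '<a href="https://example.com/x">x</a>';

my @section;
while ($html =~ m!(https?://[^ >"]+)!g) {
    my $start  = $-[0];           # offset where the whole match begins
    my $length = $+[0] - $-[0];   # end offset minus start offset
    push @section, [$start, $length, $1];
}

# One line per match: start, length, matched text.
print join(", ", @$_), "\n" for @section;
```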

One other point I forgot to mention earlier.

Special variables like $/ should only ever be changed using 'local' in a block - so that they regain their former value once you exit that block. You don't want to leave interesting values in those variables which might break the rest of your program.

So I'd write your example as:

open (FH,"/Library/WebServer/live_html/resources/index.html");

my $html;
{
local $/ = undef;

$html = <FH>;
}

Or, more idiomatically:

open (FH,"/Library/WebServer/live_html/resources/index.html");

my $html = do { local $/; <FH> };
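
To see that 'local' really does put $/ back, here is a rough sketch using an in-memory string instead of a real file (the name $data and its contents are invented for the demonstration). The first read slurps everything inside the do block, and the second read, made after the block has ended, gets just one line again.

```perl
use strict;
use warnings;

# Hypothetical in-memory "file" so the sketch needs nothing on disk.
my $data = "alpha\nbeta\ngamma\n";

open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my $whole = do { local $/; <$fh> };   # $/ is undef only inside this block
close($fh);

open($fh, '<', \$data) or die "Cannot reopen in-memory file: $!";
my $line = <$fh>;                     # $/ is "\n" again, so one line only
close($fh);

print "Slurped ", length($whole), " characters\n";
print "Then read a single line: $line";
```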

(comment added 2006-02-02 10:42:31)
Graham Ellis: Don't you just love the way there are always half a dozen ways to do things in Perl. Truly a great language, but one that's biased toward being fantastic to use for the practitioner who's really deep into it.

Dave - many thanks for all the inputs / alternatives / caveats. I agree with 'em all ... (and note your @+ and @- comments that I overlooked yesterday). I hope we haven't frightened off the newcomer who asked what he felt was going to be answered by a single simple line!
(comment added 2006-02-03 07:23:34)

This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.


© WELL HOUSE CONSULTANTS LTD., 2008: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 707126 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho