Perl Regular Expressions - finding the position and length of the match
If you want to find the position of a match in an incoming string, check the length of $` (that's $PREMATCH if you've chosen to use English;) to find where it starts, and add the length of $& (that's $MATCH) to find where it ends.
Let's say I want to find all the URLs referred to in a web page that's loaded into the variable $html. I could write:
push @section,[length($`),length($&),$1]
while ($html =~ m!(https?://[^ >"]+)!g);
and that will give me a list of 3-element lists containing start point, length and actual string matched. Here's the code to display that list:
foreach $element(@section) {
print (join(", ",@$element),"\n");
}
and here are some of the results from the source of our resources index page:
5979, 36, http://www.wellho.net/forum/top.html
6967, 36, http://www.wellho.net/net/mouth.html
7059, 42, http://www.wellho.net/downloads/index.html
8369, 67, http://www.wellho.net/mouth/387_Training-course-plans-for-2006.html
9365, 43, http://www.trainingcenter.co.uk/travel.html
9516, 45, http://reiseauskunft.bahn.de/bin/query.exe/en
9599, 59, http://www.livedepartureboards.co.uk/ldb/summary.aspx?T=MKM
9861, 48, https://lightning.he.net/~wellho/net/secure.html
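As a minimal, self-contained sketch of the same technique, the following runs the match against a short made-up string and uses substr to confirm that each recorded offset and length really does pick out the matched URL. The sample text and the names $text and @found are assumptions for illustration, not part of the original page.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for a loaded web page.
my $text = 'See <a href="http://www.wellho.net/net/mouth.html">this</a> and https://example.com/x now';

my @found;
# $` is the text before the match, so its length is the start offset;
# $& is the matched text, so its length is the match length.
push @found, [length($`), length($&), $1]
    while $text =~ m!(https?://[^ >"]+)!g;

foreach my $element (@found) {
    my ($start, $len, $url) = @$element;
    # substr with the recorded offset and length should give back the URL.
    my $check = substr($text, $start, $len) eq $url ? "ok" : "MISMATCH";
    print join(", ", $start, $len, $url, $check), "\n";
}
```

Because the capture group spans the whole pattern here, $1 and $& hold the same text; with a narrower capture group they would differ, and the offsets would still describe the whole match.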
P.S. I loaded my whole web page into a single variable using the code
open (FH,"/Library/WebServer/live_html/resources/index.html");
undef $/;
$html = <FH>;
which is a nice little demo of changing (or removing) the delimiter character for reading from a file handle, via the $/ variable. Once $/ has been undef-fed, reading into a scalar slurps from the current pointer in the file right through to the end of file. (written 2006-02-02 04:30:52)
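A minimal sketch of that behaviour, assuming an in-memory filehandle so that it runs without any file on disk. The sample data and the names $data, $fh, $first and $rest are illustrative assumptions. It reads one line with the default delimiter, then slurps the remainder, and confines the change to $/ to a block so that line-at-a-time reading comes back afterwards.

```perl
use strict;
use warnings;

# Hypothetical three-line input, read through an in-memory filehandle.
my $data = "line one\nline two\nline three\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

# With the default $/ (newline), a scalar read returns a single line.
my $first = <$fh>;

my $rest;
{
    # With $/ undefined, a scalar read returns everything from the
    # current position to end of file. 'local' restores $/ at block exit.
    local $/;
    $rest = <$fh>;
}
close $fh;

print "first: $first";
print "rest:  $rest";
```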
Comments
Dave Cross: You should warn people that using $`, $& and $' is a potential performance hit, as any use of one of those variables in a program means that Perl has to track all of those variables for every match in your program. You can get the same information without the performance implications by using @- and @+.
And, I know this is just a demonstration, but encouraging people to parse HTML using regexes is a really bad idea. It's a much better idea to use something like HTML::Parser (or one of its subclasses like, in this case, HTML::LinkExtor). (comment added 2006-02-02 06:52:23)
Graham Ellis: Thanks, Dave. I totally agree with your comments. However, there can be so many "if"s and "but"s added to any example that it becomes hard to see the wood for the trees.
Yes, there are FAR better ways of parsing HTML but it was a nice example and, yes, $` and friends can be inefficient. So if you want to say where in a string a regular expression match is to be found, what would you use as a more efficient alternative? (comment added 2006-02-02 07:53:53)
Dave Cross: As I mentioned in my first comment, you can get the information using @- and @+.
push @section,[$-[0],$+[0] - $-[0],$1]
while ($html =~ m!(https?://[^ >"]+)!g);
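To make the offset arrays concrete, here is a small sketch on a made-up string (the sample text, the pattern's narrower capture group and the name @spans are assumptions for illustration). $-[0] and $+[0] are the start and end offsets of the whole match, while $-[1] and $+[1] are those of the first capture group; the pattern below captures only the host name so that the two pairs differ.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = 'go to http://www.wellho.net/ today';

my @spans;
while ($text =~ m!https?://([^/ >"]+)!g) {
    push @spans, {
        match_start => $-[0],          # offset where the whole match begins
        match_len   => $+[0] - $-[0],  # end offset minus start offset
        host_start  => $-[1],          # offset where capture group 1 begins
        host_len    => $+[1] - $-[1],
        host        => $1,
    };
}

for my $s (@spans) {
    print "match at $s->{match_start}, length $s->{match_len}\n";
    print "host  at $s->{host_start}, length $s->{host_len}: $s->{host}\n";
}
```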
One other point I forgot to mention earlier.
Special variables like $/ should only ever be changed using 'local' in a block - so that they regain their former value once you exit that block. You don't want to leave interesting values in those variables which might break the rest of your program.
So I'd write your example as:
open (FH,"/Library/WebServer/live_html/resources/index.html");
my $html;
{
local $/ = undef;
$html = <FH>;
}
Or, more idiomatically:
open (FH,"/Library/WebServer/live_html/resources/index.html");
my $html = do { local $/; <FH> };
(comment added 2006-02-02 10:42:31)
Graham Ellis: Don't you just love the way there are always half a dozen ways to do things in Perl? Truly a great language, but one that's biased toward being fantastic to use for the practitioner who's really deep into it.
Dave - many thanks for all the inputs / alternatives / caveats. I agree with 'em all ... (and note your @+ and @- comments that I overlooked yesterday). I hope we haven't frightened off the newcomer who asked what he felt was going to be answered by a single simple line! (comment added 2006-02-03 07:23:34)
Associated topics are indexed under P212 - Perl - More on Character Strings
This is a page archived from The Horse's Mouth at
http://www.wellho.net/horse/ -
the diary and writings of Graham Ellis.
Every attempt was made to provide current information at the time the
page was written, but things do move forward in our business - new software
releases, price changes, new techniques. Please check back via
our main site for current courses,
prices, versions, etc - any mention of a price in "The Horse's Mouth"
cannot be taken as an offer to supply at that price.