Finding all matches to a Regular Expression

Posted by enquirer (enquirer), 25 November 2002
I have a regular expression to find a special substring within a string. The substring consists of 2 chars (capital) followed by 11 digits. The substring can be located anywhere within the string. I'm using the following code to extract the substring:

($number) = $line =~ m{ ( [A-Z]{2}\d{9} ) }ix;

My problem now is that there can be multiple instances of the substring within the string, but my code only gets the first instance. Does anyone know how to extend the code to grep all the instances of the substring into an array, so that $number would become @number containing one instance per element?

Posted by admin (Graham Ellis), 25 November 2002
@number = $line =~ m{ ([A-Z]{2}\d{9}) }gx;

The g switch requests a global match and, because the result is assigned to an array (list context), returns a list of all matches within $line. I have removed your i switch, which is "ignore case", since you say that you need two capital letters.
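A minimal sketch of the difference, assuming a made-up sample line (the contents of $line here are invented for illustration and are not the enquirer's data):

```perl
use strict;
use warnings;

# Hypothetical sample line holding three codes: two capitals plus nine digits
my $line = "ref AB123456789, then CD987654321 and finally EF000000001";

# Without /g, the list assignment captures only the first match
my ($first) = $line =~ m{ ([A-Z]{2}\d{9}) }x;

# With /g in list context, every match of the capture group is returned
my @number = $line =~ m{ ([A-Z]{2}\d{9}) }gx;

print "first: $first\n";     # first: AB123456789
print "all:   @number\n";    # all:   AB123456789 CD987654321 EF000000001
```

Note that the x switch makes the spaces inside the pattern insignificant, so they are not required to appear in $line.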



Posted by John_Moylan (jfp), 26 November 2002
I've just been working on something related so I suppose this may help someone.

Needed to change all links in my webpages from (html / htm) to jsp.
However we only need to change relative links as we assume absolute links go to external sites, and we only want to do it within <a> tags.

After googling and adapting bits this seems to work

s/html?/jsp/gi if /<a\s{1}(:?href|HREF)="[^http:][^>]+>/i;

Though looking at it now I don't think I need to group the href|HREF as I've used the i switch.

Also assumes that href is the first attribute after the <a
Might need more work I suppose.

Any comments from the regex experts out there?

jfp

Posted by admin (Graham Ellis), 27 November 2002
Quote:
After googling and adapting bits this seems to work
                                   
s/html?/jsp/gi if /<a\s{1}(:?href|HREF)="[^http:][^>]+>/i;
                                   
Though looking at it now I don't think I need to group the href|HREF as I've used the i switch.


If it does what you need, that's great - Perl is a PRACTICAL language after all,
even if you appear to be putting the competition (Java) into all your web pages

There are three "levels" of coding (IMHO) ...

Level 1 - Code that is required to work on a single instance of a data set.  It may
not be a correct solution for all possible data sets, but it works in the needed
instance.   A couple of years ago, I had to extract 800 addresses from HTML web
pages that had been set up by hand and names, postcode, counties etc were "all over
the place".  Certainly level 1 code, and indeed I had to manually adjust a few results;
I remember one county of "Isle of Skye" and confusion between S (Sheffield) and SO
(Southampton) postcodes - especially in cases where the original data was S-zero.

Level 2 - Code designed for use by its author on multiple data sets. It may not be
a correct solution for everyone / according to the full specification of the input,
but it does work for all the data that's thrown at it.   Special cases (such as, in
my example, someone with the surname "Wiltshire" specified without a "Mr" or initial)
could cause problems.

Level 3 - Robust code that is written according to the specification of the input.
My "Wiltshire" example would work, a file that someone has called "html.html" would
be correctly changed to "html.jsp" in jfp's example, etc.

Level 3 coding can require some considerable investment in writing and isn't always
necessary, even if the computer purists will point you in its direction.  You should
aim to achieve this level if you're using it on a public facing web server; otherwise,
there's a chance of your users finding something really nasty they can enter - a.k.a.
a security hole ....

I don't think you'll ever fully achieve level 3 when matching a spoken language, or
if you're doing character or voice recognition, or translations.  And I don't think
it was achievable either in THAT address data set I described.  

Coming back to the question ... (you caught me in expansive mood this morning!) ...

The first solution posted looks distinctly level 1 to me.  The [^http:] means "not
an h, a t, a p or a colon" and I don't think that's at all what was intended, and the
format is very tightly tied to the data - where and how many spaces there are is very
significant, which might be the case in jfp's data stream, but not in the general
case.   I should point out that {1} is pointless as it means "exactly one of the
element in front" and that's the default anyhow.  It will only work with HTML where
the references are quoted (I know that's how you're supposed to write HTML).  Perhaps
all the original HTML was generated by DreamWeaver or something similar?   If so, the
solution may be fine as is ... excepting "level 2" issues such as files called html.html
and the like!
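To make those points concrete, here is a minimal sketch that runs the original guard and substitution over three invented sample links. The helper name jfp_convert and the sample HTML are assumptions for illustration only; they are not taken from jfp's pages.

```perl
use strict;
use warnings;

# jfp's guard and substitution, copied unchanged from the earlier post and
# wrapped in a hypothetical helper so it can be applied to sample strings.
sub jfp_convert {
    my ($html) = @_;
    $html =~ s/html?/jsp/gi if $html =~ /<a\s{1}(:?href|HREF)="[^http:][^>]+>/i;
    return $html;
}

# Relative link starting with "i": the guard passes and the link is changed
my $starts_with_i = jfp_convert('<a href="index.html">');

# Relative link starting with "p": [^http:] rejects the p, so nothing happens
my $starts_with_p = jfp_convert('<a href="page.html">');

# File called html.html: the unanchored s/html?/jsp/ rewrites the name too
my $nasty_name = jfp_convert('<a href="docs/html.html">');

print "$starts_with_i\n";   # <a href="index.jsp">
print "$starts_with_p\n";   # <a href="page.html">
print "$nasty_name\n";      # <a href="docs/jsp.jsp">
```

So any relative link whose first character is h, t or p is silently skipped, and any "htm" or "html" on a matching line is replaced, not just the file extension.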

Here's my "first cut" at an alternative and more rigorous solution to your
requirement:

Code:
s/(<A\s+href\s*=\s*"?(?!http:)[^">?#]+)\.html?([">?#\s]|$)/$1.jsp$2/gi;


Still not level 3 - won't cope with other protocols such as links to FTP, won't cope with
A tags with other attributes if they come before the HREF, etc.  Will cope with
quoted or unquoted URLs, and pages with nasty names like the html.html I keep
quoting.  Will cope with multiple URLs on one line, and with tags that have other
attributes after the HREF.  Also copes with tags with GET method data and links
within the page after the URL, even if the URL contains nasty text.

Beware - I really should have tested this before posting, but I'm leaving that for
the moment - have paid work to do .... please do post a "yes that worked" or
"and to correct you, Graham" response.


Posted by John_Moylan (jfp), 27 November 2002
>>please do post a "yes that worked" or  "and to correct you"
Come on now, you were just toying with us with the above remark.
It worked like a charm!

My first attempt used lookarounds but this was causing me problems, purely down to a lack of knowledge, but I think I learnt a bit about lookbehind and lookahead from it so that's no bad thing.

jfp

Posted by admin (Graham Ellis), 27 November 2002
No, I wasn't toying with you; it's so easy to post something unchecked and find out later that it doesn't work - I've done so on a number of occasions - so where I go out on a limb / take a gamble and post without testing, I clearly say so.

Glad it worked;  lookahead is pretty rare in practice, so much so that we don't really cover it in training.  Mind - this one's a fun example and I may just include an adaptation of it on the advanced Perl course!
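For anyone wanting to check the substitution from this thread for themselves, here is a minimal sketch that applies it to a handful of sample lines. The helper name to_jsp and all of the sample HTML are invented for illustration; the last two samples show the limitations already described in the thread (HREF not being the first attribute, and protocols other than http).

```perl
use strict;
use warnings;

# The substitution posted earlier in this thread, wrapped in a
# hypothetical helper so it can be applied to sample strings.
sub to_jsp {
    my ($html) = @_;
    $html =~ s/(<A\s+href\s*=\s*"?(?!http:)[^">?#]+)\.html?([">?#\s]|$)/$1.jsp$2/gi;
    return $html;
}

# Invented sample lines; none come from the original pages.
my @samples = (
    '<a href="html.html">',                          # nasty file name
    '<a href=page.htm>',                             # unquoted URL
    '<a href="list.html?id=3#top" class="x">',       # GET data, anchor, later attribute
    '<a href="a.html">A</a> <a href="b.htm">B</a>',  # two links on one line
    '<a href="http://example.com/a.html">',          # absolute: skipped by (?!http:)
    '<a class="x" href="c.html">',                   # href not first: not handled
    '<a href="ftp://example.com/a.html">',           # other protocol: rewritten anyway
);

print to_jsp($_), "\n" for @samples;
```

The negative lookahead (?!http:) checks that the URL does not start with "http:" without consuming any characters, which is why the rest of the pattern can still capture the URL from its first character.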



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.


© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho