regular expression help

Posted by pketans (pketans), 16 March 2005
Hi, I am looking for some regular expression help.
I am trying to get all the attributes of the anchor and image tags in an HTML page.

1) I want to get all the attributes of each anchor link. This means the link name, href, title and text for every anchor link on an HTML page, stored in an array.
2) I want to get all the image tag attributes. This means the src and alt for every image tag on an HTML page, stored in an array.

Please let me know if I am not clear. Thanks in advance for the help.



Posted by admin (Graham Ellis), 17 March 2005
HTML and XML tags form a structure that isn't ideally matched using regular expressions - they were originally designed to be read by an efficient tag parser such as Expat, written in C or some such.  However, you can usually achieve a fairly good match with regular expressions ...

Here's a piece of code:

Code:
<?php

# read the whole page into a single string
$text = file_get_contents("iii.html");

# find all tags
preg_match_all('/<[^>]+>/s',$text,$tags);

foreach (array("a","img") as $starter) {
       foreach ($tags[0] as $link) {
               # \b stops "a" from also matching <abbr>, <area> and so on;
               # the s modifier lets a tag run over more than one line
               $gotten = preg_match('/^<\s*'.$starter.'\b\s*(.*)>/is',$link,$alist);
               if ($gotten) {
                       print "<b>".htmlspecialchars($alist[1])."</b><br>";
                       # close up any spaces on either side of the = signs
                       $cleaned = preg_replace('/\s*=\s*/','=',$alist[1]);
                       # quoted values first, then unquoted values
                       preg_match_all('/(?:^|\s)(\w+)="([^">]+)"/',$cleaned,$qatts);
                       preg_match_all('/(?:^|\s)(\w+)=([^"\s>]+)/',$cleaned,$patts);
                       $allatts = array_merge($patts[1],$qatts[1]);
                       $allvals = array_merge($patts[2],$qatts[2]);
                       for ($k=0; $k<count($allatts); $k++) {
                               print (htmlspecialchars("$allatts[$k] ... $allvals[$k]")."<br>");
                       }
               }
       }
}

?>


It copes with quoted and unquoted attributes. It falls over if you try to put an escaped quote character in the attribute, and if you have attributes without values - however, you don't mention that you're looking for any attributes of this format ... the one I commonly come across is CHECKED on a radio box.

I have run this against our home page and it seems to work ... all you need to do is change the innermost loop to put the results into an array rather than print them out, and you have something close to what you're looking for.

Some sample output:

href="/net/ofcourse.html"
href ... /net/ofcourse.html
href="http://www.wellho.net/cgi-bin/opentalk/YaBB.pl" class="sidelinks" target="forum"
href ... http://www.wellho.net/cgi-bin/opentalk/YaBB.pl
class ... sidelinks
target ... forum
href=/course/tcl-tk.html
href ... /course/tcl-tk.html
src="/wellimg/WHhead2.gif" alt="Well House Consultants Ltd" width="680" height="100" longdesc="http://www.wellho.net" border="0"
src ... /wellimg/WHhead2.gif
alt ... Well House Consultants Ltd
width ... 680
height ... 100
longdesc ... http://www.wellho.net
border ... 0


P.S. You would probably want to replace the " in the matches with ['"] ... HTML can have its attribute values protected by single as well as double quotes.
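
As a rough sketch of those two changes together (results stored in an array, and either kind of quote accepted), something along these lines could be a starting point. The function names tag_attributes and collect_tags and the sample $html string are made up for illustration; like the code above, it still assumes no > character inside an attribute value and it skips attributes that have no value.

```php
<?php
// Hypothetical helper: return the attributes of one tag as name => value pairs.
function tag_attributes($tag) {
    $atts = array();
    // (?| ... ) is a PCRE branch reset group, so whichever alternative matches
    // ("double quoted", 'single quoted' or unquoted) the value lands in group 2.
    $pattern = '/([\w:-]+)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/';
    preg_match_all($pattern, $tag, $found, PREG_SET_ORDER);
    foreach ($found as $f) {
        // Attribute names are case insensitive in HTML, so store them in lower case
        $atts[strtolower($f[1])] = $f[2];
    }
    return $atts;
}

// Hypothetical helper: one attribute array for every <$name ...> tag in $html.
function collect_tags($html, $name) {
    $result = array();
    // \b keeps "a" from also matching <abbr>, <area> and so on
    preg_match_all('/<\s*' . preg_quote($name, '/') . '\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        $result[] = tag_attributes($tag);
    }
    return $result;
}

// Made-up sample input, for illustration only
$html = '<p><a href="/net/ofcourse.html" title=\'Courses\'>Courses</a> '
      . '<abbr title="x">y</abbr> '
      . '<img src=/wellimg/WHhead2.gif alt="Well House Consultants Ltd"></p>';

print_r(collect_tags($html, 'a'));
print_r(collect_tags($html, 'img'));
```

Each tag gives one associative array, so the href of the first link is collect_tags($html, 'a')[0]['href'] on a PHP version that allows indexing a function result (otherwise assign the result to a variable first).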

Posted by pketans (pketans), 18 March 2005
Hi Graham,
It works great. I appreciate your quick reply.
I need help with one last regular expression.

Below is my sentence.

London Hotels - Book Hotels in London, London UK.

I want to know how many times the keyword "London Hotels" appears in the above sentence.

"Appears" means either an exact match or a fragmented match.

The exact match is the first 2 words of the sentence, so it would return 1 exact match and highlight it.

For the fragmented match, you will see that "Hotels" comes first and then "London", so that is 1 fragmented match found, and it should be highlighted too.

So in the above sentence we find 2 matches for the keyword "London Hotels".


Posted by admin (Graham Ellis), 18 March 2005
Not sure if it's exactly what you're looking for but this code finds all occurrences of both words straight after one another, then all times where they occur with anything in between.

Code:
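<?php
if (!empty($_GET['first']) and !empty($_GET['second'])) {
       $line = isset($_GET['line']) ? $_GET['line'] : '';
       # preg_quote stops characters such as . or / in the input being read as regex syntax
       $first = preg_quote($_GET['first'], '/');
       $second = preg_quote($_GET['second'], '/');
       # both words straight after one another; preg_match_all returns the count
       $full = preg_match_all("/$first\\s+$second/i",$line,$matches);
       $result = "Full matches - " . ($full ? $full : "none") . "<br>";
       # both words, in the same order, with anything in between;
       # this count includes the full matches, so take those off
       $apart = preg_match_all("/$first.*?$second/i",$line,$matches);
       $partial = $apart - $full;
       $result .= "additional  matches with words apart - " . ($partial ? $partial : "none") . "<br>";
} else {
       $result = "Please fill in two words";
}
?>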
<html><head><title>Regular Expression Demo</title></head><body>
Result: <?= $result ?><br>
<form>Line: <input name=line size=50><br>
Please give first word <input name=first> and second word <input name=second><br>
<input type=submit></form>
</body></html>


There's more on the subject of matching (site searching) in The Horse's Mouth, written on 1st January.

Posted by pketans (pketans), 18 March 2005
Hi Graham,
The problem with this script is that you need 2 words. I need a way to check for a keyword of anything from 1 word to any number of words.
The goal is to get an exact match or a fragmented match of the keyword. You are going in the right direction.

Thanks

Posted by admin (Graham Ellis), 18 March 2005
on 03/18/05 at 19:26:50, pketans wrote:
... You are going in the right direction ...


Good. Please feel free to extend my example to meet your needs.

Posted by pketans (pketans), 19 March 2005
Hi Graham,
I have been trying to figure out a way to fix it for the past couple of hours, but no luck. Also, the fragment search is not working in the script you gave me. Please help me if you can.
thanks,
ketan

Posted by admin (Graham Ellis), 19 March 2005
I wish I had all the time in the world to write complete scripts and solutions for everyone ... but alas my time is limited.  I'm very happy to help sort out coding and design issues and make suggestions, but there's a limit to how far I can go.

You are very welcome indeed to post up your own code and ask for advice on it / suggestions as to how to improve it or fix problems.  That's what I'm here for. There's a posting FAQ on our "Assistance" board to give you guidelines as to how to get the best out of this free service.

I'm guessing that you're not very familiar with regular expressions ... that guess is made because I see that you're not quoting back sample code of your own.   You would also be very welcome, then, to book on and attend our regular expression course which I suspect would be a great help to you, and where you would be buying a lot more of my time / advice ... you would go away with either an excellent solution, or an excellent route forward to a solution.   If you're not familiar with other aspects of PHP either, you may prefer to look at our complete PHP programming course - we still have vacancies on the public course that's coming up in early April.

My understanding of your specification was incomplete at first - I think I understand it better now (but there are likely to be things I haven't yet realised ...).   On current spec, I would be tempted to write a loop that checked with a simple regular expression for each word of interest in turn, with an additional check looking for all of the words (if more than one) in order.  Do have a look back too at some of my comments on our own search engine, and the links, earlier in this thread.
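
A minimal sketch of that kind of loop might look like the following. It is only a starting point: the function name keyword_matches and the shape of the array it returns are invented for this example, and how the per-word counts should be turned into a "fragmented match" figure depends on the final spec, so that step is left out.

```php
<?php
// Hypothetical helper: count the whole keyword (all words, in order) and each word separately.
function keyword_matches($keyword, $line) {
    // Split the keyword into its words; works for 1 word or for many
    $words = preg_split('/\s+/', trim($keyword), -1, PREG_SPLIT_NO_EMPTY);
    $report = array('exact' => 0, 'words' => array());
    if (count($words) == 0) {
        return $report;
    }
    // Escape each word so that input such as "C++" is matched literally
    $quoted = array();
    foreach ($words as $word) {
        $quoted[] = preg_quote($word, '/');
    }
    // Additional check: all of the words, in order, with only white space between them.
    // (?<!\w) and (?!\w) stop "Hotel" from matching inside "Hotels".
    $phrase = '/(?<!\w)' . implode('\s+', $quoted) . '(?!\w)/i';
    $report['exact'] = preg_match_all($phrase, $line, $m);
    // Simple check for each word of interest in turn
    foreach ($quoted as $k => $q) {
        $report['words'][$words[$k]] = preg_match_all('/(?<!\w)' . $q . '(?!\w)/i', $line, $m);
    }
    return $report;
}

$line = "London Hotels - Book Hotels in London, London UK.";
print_r(keyword_matches("London Hotels", $line));
```

For the sample sentence earlier in this thread, that reports 1 exact match, with "London" found 3 times and "Hotels" found 2 times.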



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.


© WELL HOUSE CONSULTANTS LTD., 2014: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho