regular expression help
Posted by pketans (pketans), 16 March 2005
Hi, I am looking for some regular expression help.
I am trying to get all the attributes of the anchor and image tags in an HTML page.
1) I want to get all attributes of each anchor link. That means the link name, href, title and text for every anchor link on an HTML page, stored in an array.
2) I want to get all image tag attributes. That means src and alt for every image tag on an HTML page, stored in an array.
Please let me know if I am not clear. Thanks in advance for the help.
Posted by admin (Graham Ellis), 17 March 2005
HTML and XML tags form a structure that isn't ideally matched using regular expressions - they were originally designed to be handled by an efficient tag parser such as Expat, written in C or some such. However, you can usually achieve a fairly good match with regular expressions ...
Here's a piece of code:
It copes with quoted and unquoted attributes. It falls over if you try to put an escaped quote character in the attribute, and if you have attributes without values - however, you don't mention that you're looking for any attributes of this format ... the one I commonly come across is CHECKED on a radio box.
I have run this against our home page and it seems to work ... all you need to do is change the innermost loop to put the results into an array rather than print them out, and you have something close to what you're looking for.
Some sample output:
href ... /net/ofcourse.html
href="http://www.wellho.net/cgi-bin/opentalk/YaBB.pl" class="sidelinks" target="forum"
href ... http://www.wellho.net/cgi-bin/opentalk/YaBB.pl
class ... sidelinks
target ... forum
href ... /course/tcl-tk.html
src="/wellimg/WHhead2.gif" alt="Well House Consultants Ltd" width="680" height="100" longdesc="http://www.wellho.net" border="0"
src ... /wellimg/WHhead2.gif
alt ... Well House Consultants Ltd
width ... 680
height ... 100
longdesc ... http://www.wellho.net
border ... 0
P.S. You would probably want to replace the " in the matches with ['"] ... HTML can have its attribute values protected by single as well as double quotes.
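For readers following along, here is a minimal sketch of the approach described in this post. It is not the code originally posted; it is written in Python purely for illustration, and the names TAG_RE, ATTR_RE and extract_attributes are made up for this example. It accepts double-quoted, single-quoted and unquoted values, and it shares the limitations mentioned above: escaped quotes inside a value and attributes without values (such as CHECKED) are not handled, and neither is a ">" inside a value.

```python
import re

# Opening <a ...> or <img ...> tag: capture the tag name and the raw attribute text.
TAG_RE = re.compile(r"<(a|img)\b([^>]*)>", re.IGNORECASE)

# name="value", name='value' or name=value (unquoted).
ATTR_RE = re.compile(
    r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)

def extract_attributes(html):
    """Return a list of (tag_name, {attribute: value}) pairs (hypothetical helper)."""
    results = []
    for tag in TAG_RE.finditer(html):
        attrs = {}
        for m in ATTR_RE.finditer(tag.group(2)):
            # Exactly one of the three value groups matched; the others are None.
            value = next(g for g in m.groups()[1:] if g is not None)
            attrs[m.group(1).lower()] = value
        results.append((tag.group(1).lower(), attrs))
    return results

if __name__ == "__main__":
    sample = ('<a href="http://www.wellho.net/cgi-bin/opentalk/YaBB.pl" '
              'class="sidelinks" target="forum">Forum</a>')
    for name, attrs in extract_attributes(sample):
        for key, value in attrs.items():
            print(key, "...", value)
```

The results are collected into a list of dictionaries rather than printed from the inner loop, which is the change suggested above for storing them in an array. The link text between the opening and closing anchor tags is not captured by this sketch.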
Posted by pketans (pketans), 18 March 2005
Hi Graham,
Works great. I appreciate your quick reply.
I need help with one more regular expression.
Below is my sentence.
London Hotels - Book Hotels in London, London UK.
I want to know how many times the keyword "London Hotels" appears in the above sentence.
"Appears" means either an exact match or a fragmented match.
The exact match is the first two words of the sentence, so it would return 1 exact match and highlight it.
For the fragmented match, "Hotels" comes first and then "London", so 1 fragmented match is found and highlighted.
So in the above sentence we find 2 matches for the keyword "London Hotels".
Posted by admin (Graham Ellis), 18 March 2005
Not sure if it's exactly what you're looking for, but this code finds all occurrences of both words straight after one another, then all times where they occur with anything in between.
There's more on the subject of matching (site searching) in The Horse's Mouth, written on 1st January.
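A rough sketch of that two-step search, for illustration only. It is not the code originally posted; it is written in Python, and find_pair is a made-up name. It assumes a case-insensitive whole-word match, and it treats "anything in between" as at least one character on the same line, with the two words in either order.

```python
import re

def find_pair(text, first, second):
    """Return (adjacent, separated) match lists for two words (hypothetical helper)."""
    a, b = re.escape(first), re.escape(second)
    flags = re.IGNORECASE

    # Both words straight after one another, separated only by whitespace.
    adjacent = re.findall(rf"\b{a}\s+{b}\b", text, flags)

    # Both words with anything in between, in either order (shortest gap first).
    separated = re.findall(
        rf"\b{a}\b.+?\b{b}\b|\b{b}\b.+?\b{a}\b", text, flags
    )
    return adjacent, separated

if __name__ == "__main__":
    sentence = "London Hotels - Book Hotels in London, London UK."
    adjacent, separated = find_pair(sentence, "London", "Hotels")
    print(adjacent)   # ['London Hotels']
    print(separated)  # ['London Hotels', 'Hotels in London']
```

Note that the second pattern also reports the adjacent pair, because a single space counts as "anything in between". If the two counts need to be kept apart, the adjacent hits can be subtracted from the second list.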
Posted by pketans (pketans), 18 March 2005
Hi Graham,
The problem with this script is that it needs exactly 2 words. I need a way to check anything from 1 word up to any number of words.
The goal is to get an exact match or a fragmented match of the keyword. You are going in the right direction.
Posted by admin (Graham Ellis), 18 March 2005
Good. Please feel free to extend my example to meet your needs.
Posted by pketans (pketans), 19 March 2005
Hi Graham,
I have been trying to figure out a way to fix it for the past couple of hours, but no luck. Also, the fragment search is not working in the script you gave me. Please help me if you can.
Posted by admin (Graham Ellis), 19 March 2005
I wish I had all the time in the world to write complete scripts and solutions for everyone ... but alas my time is limited. I'm very happy to help sort out coding and design issues and make suggestions, but there's a limit to how far I can go.
You are very welcome indeed to post up your own code and ask for advice on it / suggestions as to how to improve it or fix problems. That's what I'm here for. There's a posting FAQ on our "Assistance" board to give you guidelines as to how to get the best of this free service.
I'm guessing that you're not very familiar with regular expressions ... that guess is made because I see that you're not quoting back sample code of your own. You would also be very welcome, then, to book on and attend our regular expression course which I suspect would be a great help to you, and where you would be buying a lot more of my time / advice ... you would go away with either an excellent solution, or an excellent route forward to a solution. If you're not familiar with other aspects of PHP either, you may prefer to look at our complete PHP programming course - we still have vacancies on the public course that's coming up in early April.
My understanding of your specification was incomplete at first - I think I understand it better now (but there are likely to be things I haven't yet realised ...). On current spec, I would be tempted to write a loop that checked with a simple regular expression for each word of interest in turn, with an additional check looking for all of the words (if more than one) in order. Do have a look back too at some of my comments on our own search engine, and the links, earlier in this thread.
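The loop suggested in the last paragraph might look something like the sketch below. This is only one reading of the specification: it is written in Python for illustration, keyword_report is a made-up name, matching is assumed to be case-insensitive on whole words, and "all of the words in order" is assumed to mean the words directly following one another, separated by whitespace.

```python
import re

def keyword_report(text, phrase):
    """Count each word of the phrase, plus the whole phrase in order (hypothetical helper)."""
    words = phrase.split()
    flags = re.IGNORECASE

    # A simple regular expression for each word of interest in turn.
    per_word = {}
    for word in words:
        per_word[word] = len(re.findall(rf"\b{re.escape(word)}\b", text, flags))

    # Additional check: all of the words, in order, if there is more than one.
    exact = 0
    if len(words) > 1:
        in_order = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        exact = len(re.findall(in_order, text, flags))

    return per_word, exact

if __name__ == "__main__":
    sentence = "London Hotels - Book Hotels in London, London UK."
    print(keyword_report(sentence, "London Hotels"))
    # ({'London': 3, 'Hotels': 2}, 1)
```

Because the phrase is split on whitespace, the same function works for one word or any number of words. How the per-word counts are combined into a "fragmented match" total is left open here, since the thread does not pin that rule down.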