PHP and Regular Expressions

Posted by Joe_Guynan (Joe_Guynan), 31 January 2007
Hi everyone, I am new here. I recently attended a PHP course at Well House Manor, and very good it was too, and now I am seeking advice, just as Graham had predicted.

Well now, here is my predicament...

I am using the function file() to read in a web page; then, line by line, using eregi(), I am extracting the text src="...." so I can capture all the images and other included JS files etc. on the page.

here is my PHP code:

$urlDoc = file($_REQUEST['urlString']) or die ("Error Accessing URL $_REQUEST[urlString]"); #capture the webpage into an array of lines
#loop through each line and extract any src=
if ($urlDoc[0]) {
     print ("<hr />"); #output a horizontal rule
     for ($i = 0 ; $i < count($urlDoc) ; $i++) {
          print (htmlspecialchars($urlDoc[$i])."<br />"); #outputs the HTML document line by line
          eregi('src="([a-z0-9/.]+)"',$urlDoc[$i],$imgSrc); #extracts the value of src=
          if ($imgSrc[1] != '') { #if an image exists on this line
               print ("$imgSrc[1]<br />"); #output the captured src value
          }
          unset($imgSrc); #remove the array
     }
}

Now, the problem I have is this: there is a line on the HTML page that reads:

</script><script language="JavaScript" src="/js/armph_products.js" type="text/javascript"></script><script language="JavaScript" src="/js/armph.js" type="text/javascript"></script><form action="/markets/home_solutions/armpoweredhouse.html" method="get" target="_blank">

And you can see that it contains two src="..." items: the first is src="/js/armph_products.js" and the second is src="/js/armph.js".

When I am outputting, line by line, I am only getting src="/js/armph.js" from that particular line.

I am guessing that eregi() just happily goes along each line and the last matching expression is the item that is stored into my variable $imgSrc.

Can anyone shed some light, or let me know what I am doing wrong?

Many thanks

Joe Guynan

Posted by admin (Graham Ellis), 31 January 2007
If you have multiple matches on a line, then I would go for preg_match_all (yes, I know, the OTHER regex handler) ... in fact, you could use a file_get_contents to read the whole file, then a preg_match_all to get every src= on the page ... let me know if you would like me to post a worked example in the morning - I'm in Java mode tonight!
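
For anyone reading this later, here is a minimal sketch of the preg_match_all approach suggested above. It is not the worked example offered in the post, only an illustration under stated assumptions. The function name extractSrcValues and the [^"]+ pattern are hypothetical choices, not something from the thread. One thing worth noting: the character class in the original pattern, [a-z0-9/.], has no underscore, so "/js/armph_products.js" can never match it. That would explain why a single match reports only "/js/armph.js" (the first src that fits the pattern), rather than the last one winning. The sketch uses preg_match to show this, on the assumption that the character class behaves the same way there as in eregi().

```php
<?php
// Hypothetical helper (not from the thread): return every src="..." value in $html.
function extractSrcValues(string $html): array
{
    // [^"]+ accepts anything up to the closing quote, so underscores, hyphens
    // and query strings are captured too; the i flag keeps the match
    // case-insensitive, as eregi() was.
    preg_match_all('/src="([^"]+)"/i', $html, $matches);
    return $matches[1]; // capture group 1 of every match, in document order
}

// The problem line quoted in the question (trailing <form> tag left off).
$line = '</script><script language="JavaScript" src="/js/armph_products.js" type="text/javascript"></script>'
      . '<script language="JavaScript" src="/js/armph.js" type="text/javascript"></script>';

// The original character class: no underscore, so the first src is skipped.
preg_match('/src="([a-z0-9\/.]+)"/i', $line, $first);
print $first[1] . "\n"; // /js/armph.js

// The widened pattern with preg_match_all finds both.
foreach (extractSrcValues($line) as $src) {
    print $src . "\n";
}
// /js/armph_products.js
// /js/armph.js
```

To scan a whole page rather than one line, the same function could be fed the result of file_get_contents($_REQUEST['urlString']), assuming the server allows URLs to be opened that way (allow_url_fopen).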

Posted by Joe_Guynan (Joe_Guynan), 1 February 2007
Thanks Graham, I took your advice and it works a treat.

Thanks again.


