PHP and Regular Expressions

Posted by Joe_Guynan (Joe_Guynan), 31 January 2007
Hi everyone, I am new here. I recently attended a PHP course at Well House Manor, and very good it was too, and now I am seeking advice, just like Graham had predicted.

Well now, here is my predicament....

I am using the function file() to read in a web page; then, line by line, using eregi(), I am extracting the text src="...." so I can capture all the images and other included JS files etc. on the page.

here is my PHP code:

$urlDoc = file($_REQUEST['urlString']) or die ("Error Accessing URL $_REQUEST[urlString]"); #capture the webpage into an array of lines
     #loop through each line and extract any src=
 if ($urlDoc[0]) {
           print ("<hr />"); #output a horizontal rule
           for ($i = 0 ; $i < count($urlDoc) ; $i++) {
                 print (htmlspecialchars($urlDoc[$i])."<br />"); #outputs the HTML document line by line
                 eregi('src="([a-z0-9/.]+)"',$urlDoc[$i],$imgSrc); #extracts the value of src=
                 if (!empty($imgSrc[1])) { #if an image exists on this line
                       print ("$imgSrc[1]<br />"); #output the captured src value
                 }
                 unset($imgSrc); #remove the array
           }
 }

Now, the problem I have is this: there is a line on the HTML page that reads:

</script><script language="JavaScript" src="/js/armph_products.js" type="text/javascript"></script><script language="JavaScript" src="/js/armph.js" type="text/javascript"></script><form action="/markets/home_solutions/armpoweredhouse.html" method="get" target="_blank">

And you can see that it contains two src="..." items, the first is src="/js/armph_products.js" and the second is src="/js/armph.js".

When I am outputting, line by line, I am only getting src="/js/armph.js" from that particular line.

I am guessing that eregi() just happily goes along each line and the last matching expression is the item that is stored into my variable $imgSrc.

Can anyone shed some light, or let me know what I am doing wrong?

Many thanks

Joe Guynan
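
For reference, the behaviour described in this post can be reproduced with preg_match(), used here as a stand-in for eregi() (an assumption of this sketch, since eregi() was removed in PHP 7). A single-match call returns the first match on the line, not the last. In this sketch the first src= is skipped because the underscore in armph_products.js is not in the character class [a-z0-9/.]. The variable names and the widened character class are illustrative only.

```php
<?php
// The line quoted in the post, shortened to the two script tags.
$line = '<script language="JavaScript" src="/js/armph_products.js" type="text/javascript"></script>'
      . '<script language="JavaScript" src="/js/armph.js" type="text/javascript"></script>';

// Same character class as the original pattern; the "i" flag stands in
// for the case-insensitivity of eregi().
preg_match('#src="([a-z0-9/.]+)"#i', $line, $narrow);

// Character class widened with "_" and "-" (an assumption about which
// characters the file names may contain).
preg_match('#src="([a-z0-9/._-]+)"#i', $line, $wide);

echo $narrow[1], "\n"; // first match the narrow class allows
echo $wide[1], "\n";   // first src= on the line once "_" is allowed
```

With the narrow class the captured value is /js/armph.js; with the widened class it is /js/armph_products.js. Either way only one value per call is captured.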

Posted by admin (Graham Ellis), 31 January 2007
If you have multiple matches on a line, then I would go for preg_match_all (yes, I know, the OTHER regex handler) ... in fact, you could use file_get_contents to read the whole file, then preg_match_all to get every src= on the page. Let me know if you would like me to post a worked example in the morning - I'm in Java mode tonight!
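
A minimal sketch of the approach suggested above (not Graham's own worked example) might look like the following. The helper name extractSrcValues() and the pattern [^"]+ are assumptions made for illustration. The line from the original post stands in for a page that would normally be read with file_get_contents().

```php
<?php
/**
 * Return every src="..." value found in a chunk of HTML.
 * extractSrcValues() is a hypothetical helper name, not from the thread.
 */
function extractSrcValues(string $html): array
{
    // [^"]+ accepts any character up to the closing quote, so names
    // containing "_" or "-" are captured too (a choice made for this sketch).
    preg_match_all('#src="([^"]+)"#i', $html, $matches);
    return $matches[1]; // capture group 1 of every match, in document order
}

// In the real script the whole page would be read in one go, e.g.
//   $html = file_get_contents($_REQUEST['urlString']);
// Here the line from the original post stands in for the page.
$html = '</script><script language="JavaScript" src="/js/armph_products.js" type="text/javascript"></script>'
      . '<script language="JavaScript" src="/js/armph.js" type="text/javascript"></script>';

foreach (extractSrcValues($html) as $src) {
    echo htmlspecialchars($src), "<br />\n";
}
```

Run against that line, the loop prints both /js/armph_products.js and /js/armph.js. A regular expression is only a rough way to pull attributes out of HTML; for pages with single-quoted or unquoted attributes, a DOM parser would likely be more robust.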

Posted by Joe_Guynan (Joe_Guynan), 1 February 2007
Thanks Graham, I took your advice and it works a treat.

Thanks again.


This page is a thread posted to the opentalk forum and archived here for reference.

© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • SKYPE: wellho