shortest match between XML tags

Posted by aynar (aynar), 6 December 2004

Hiya,

I'm trying to write a regular expression that matches a list of (optionally comma-separated) XML marked-up items followed by the word "and" and a final single item. I have something like:

    $item = '(<ITEM>.*?</ITEM>)(,?)(\s*)'
    $itemlist = '($item)+(and)($item)'

Given the following string:

    " <ITEM>One</ITEM> is first and then we have <ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> "

the intention is to match on only "<ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> " but the current expression matches on "<ITEM>One...Four</ITEM>".

I think this is because of the leftmost-ness of Perl's matching and the fact that the shortest match operator only applies to potential matches starting from the same point. So the earlier (leftmost) match wins over the later (though shorter) one.

How can I fix this? I know that changing to

    $item = '<ITEM>[^<]*?</ITEM>'

works, but I'd rather not exclude the possibility of further mark-up inside the <ITEM> tags if at all possible. What I really want to do is always stop at the first occurrence of the close tag, but I haven't been able to find out how to do this. Can anyone help?

Thanks.

Posted by Custard (Custard), 6 December 2004

Hi,

I haven't time right now to come up with a better tested solution, but have you tried the XML::DOM module?

Modified example from the POD...

    use XML::DOM;

    my $parser = new XML::DOM::Parser;
    my $doc = $parser->parsefile ("file.xml");

    # print all HREF attributes of all CODEBASE elements
    my $nodes = $doc->getElementsByTagName ("CODEBASE");
    my $n = $nodes->getLength;
    my @items;

    for (my $i = 0; $i < $n; $i++) {
        my $node = $nodes->item ($i);
        my $item = $node->getAttributeNode ("ITEM");
        print $item->getValue . "\n";
        push @items, $item->getValue();
    }

Then you can shift off the first element of @items:

    my $first = shift @items;

And the others will be left in the array @items:

    my ($two,$three,$four)=@items;

HTH

If you need help, don't hesitate to ask.

B

Posted by aynar (aynar), 6 December 2004

Thanks, Custard, but I think you've misunderstood what I'm trying to do. Chopping off the first item only fixes this particular instance of the problem. In general, I want to match on sequences (of varying length) of <ITEM>s that end in "and <ITEM>blah</ITEM>".

Posted by dt2 (dt2), 6 December 2004

Perhaps the problem is more fundamental? There won't always be a <ITEM>One</ITEM>, so a mandatory shift is not applicable.

Presumably the goal is to have Perl's regex engine recognize that it's not a match as soon as it hits the first </ITEM>... then it'll move on to the next <ITEM> and correctly match that.

As far as I know, Perl has no implementation for right-to-left parsing of regexes, so that is out. (Although I heard a rumor it's a suggestion for Perl 6?)

Also AFAIK, Perl has no way to say something like:

    m#<ITEM>[^"<ITEM>"]*?</ITEM>#;

(i.e., NOT a given string). Does anyone know of a way to do that?

While I'm a big fan of XML::DOM [...]

--dt2

Posted by Custard (Custard), 6 December 2004

Ok, I am a little confused, but I wonder if this is what you want...

Code:
Which produced... (needs a little tidying...) Code:
Maybe this won't catch every possibility. Have you more test data?

HTH

B

Posted by admin (Graham Ellis), 6 December 2004

on 12/06/04 at 16:44:47, dt2 wrote:
The words "negative look ahead assertion" come into my mind, but it's not exactly common and the syntax doesn't just spring into my head. And I'm currently 170 miles from my Regular Expression book ....

Posted by Custard (Custard), 6 December 2004

Yes, it's those "(?!pattern)" type patterns. I am not 100% sure I understand what output the OP wants though, or indeed whether the input is XML (which it increasingly looks like it is not).

Code:
The s/// cleans up the input string by removing the first <ITEM> </ITEM> pair that isn't preceded by a comma. Still messy and very specific to this one line of test data. Produces Code:
But again, this is still dependent on getting some better test data...

B

Posted by dt2 (dt2), 6 December 2004

Custard's response looks very good. My only thought is there must be a way to get this all in a single line/regex. If I understand the original question right, I think I have a solution.

Assuming that the goal is to use a single regex to match a list of items (as defined in the original post), then the following code works:

Code:
...giving output: Code:
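A rough sketch of the negative-lookahead idea discussed in this thread, for readers following along. This is not dt2's original code, which is not reproduced here; the variable names, the `find_list` helper, and the use of `and\s+` between the list and the final item are assumptions made for illustration. The first pattern tests `(?!<ITEM>)` before every character of an item body. The second is a variant that consumes runs of non-`<` characters and only tests the lookahead at each `<`, wrapped in an atomic group `(?>...)` so the body is not re-split on backtracking. Whether the second form is actually faster on real data has not been measured here.

```perl
use strict;
use warnings;

# Variant 1: before every character of the body, check that a new
# <ITEM> opener does not start at this position.
my $item_slow = qr{<ITEM>(?:(?!<ITEM>).)*?</ITEM>,?\s*}s;

# Variant 2: take runs of non-'<' characters, and accept a '<' only if
# it does not begin <ITEM> or </ITEM>. The atomic group (?>...) stops
# the engine from retrying shorter bodies once the body is consumed.
my $item_fast = qr{<ITEM>(?>(?:[^<]+|<(?!/?ITEM>))*)</ITEM>,?\s*};

# Hypothetical helper: return the first "items ... and item" list
# found in $text using the given item pattern, or undef if none.
sub find_list {
    my ($text, $item) = @_;
    return $text =~ /((?:$item)+and\s+$item)/ ? $1 : undef;
}

# Sample string from the first post in the thread.
my $text = ' <ITEM>One</ITEM> is first and then we have '
         . '<ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> ';

for my $item ($item_slow, $item_fast) {
    my $match = find_list($text, $item);
    print defined $match ? "matched: [$match]\n" : "no match\n";
}
```

On the sample string, both patterns skip `<ITEM>One</ITEM>` and match from `<ITEM>Two</ITEM>` through `<ITEM>Four</ITEM>`. Both allow other tags inside an item, but not a nested `<ITEM>`.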
This pattern ensures that you can embed other tags within the ITEM tag to no detriment, but you cannot embed other ITEM tags within an ITEM tag -- in other words, it won't start with <ITEM>One... and continue through the end of ...Two</ITEM>, because it won't accept the <ITEM> tag that introduces "Two".

Many thanks to Custard and Graham for very helpful leads. On that note, I found the following reference for negative lookahead assertions that proved very helpful (see the section with ?! pattern): http://www.perl.com/doc/manual/html/pod/perlre.html#item__pattern_

aynar, what do you think?

All the best,
--dt2

P.S. Congrats Custard on recently being appointed a moderator!

Posted by aynar (aynar), 7 December 2004

Thanks, dt2, your regular expression definitely does exactly what I want. (I didn't think I could use the negative lookahead operator since I hadn't considered searching for the '<' to fix the point from which to check against the unwanted tag.)

Unfortunately, this seems to make the searching hideously inefficient and I need to process hundreds of thousands of sentences. So, unless someone can suggest how to achieve the same results but faster, it looks like I'm back to using [^<]*.

How frustrating, knowing there's a perfect solution but not being able to use it!

This page is a thread posted to the opentalk forum
at www.opentalk.org.uk and
archived here for reference. To jump to the archive index please
follow this link.
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho |