shortest match between XML tags
Posted by aynar (aynar), 6 December 2004
Hiya,
I'm trying to write a regular expression that matches a list of (optionally comma-separated) XML marked-up items followed by the word "and" and a final single item.
I have something like:
$item = '(<ITEM>.*?</ITEM>)(,?)(\s*)'
$itemlist = '($item)+(and)($item)'
Given the following string:
" <ITEM>One</ITEM> is first and then we have <ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> "
the intention is to match on only "<ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> " but the current expression matches on "<ITEM>One...Four</ITEM>". I think this is because of the leftmost-ness of Perl's matching and the fact that the shortest-match operator only applies to potential matches starting from the same point. So the earlier (leftmost) match wins over the later (though shorter) one.
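A minimal sketch that reproduces the behaviour described above. It assumes the pattern is built with qr// (so that $item interpolates) and drops the capture groups; the variable names are illustrative rather than the exact code in use.

```perl
use strict;
use warnings;

my $text = ' <ITEM>One</ITEM> is first and then we have <ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> ';

# Same shape as the pattern in the post: a non-greedy item body,
# an optional comma, optional whitespace.
my $item     = qr{<ITEM>.*?</ITEM>,?\s*};
my $itemlist = qr{(?:$item)+and\s*$item};

# The engine tries the leftmost start position first. From <ITEM>One,
# the non-greedy .*? is allowed to grow past the first </ITEM> until the
# rest of the pattern can succeed, so the match begins at "One".
my ($match) = $text =~ /($itemlist)/;
print "[$match]\n";
```

The non-greedy quantifier only prefers the shortest body among matches that start at the same position; it never causes the engine to skip a start position that can be made to work.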
How can I fix this? I know that changing to
$item = '<ITEM>[^<]*?</ITEM>' works but I'd rather not exclude the possibility of further mark-up inside the <ITEM> tags if at all possible.
What I really want to do is always stop at the first occurrence of the close tag but I haven't been able to find out how to do this.
Can anyone help?
Posted by Custard (Custard), 6 December 2004
Hi,
I haven't time right now to come up with a better tested solution, but have you tried the XML::DOM parser on CPAN? Modules, particularly XML ones, are always a good idea because XML is complex and you can easily end up with broken code if the XML document changes in some way.
Modified example from the POD...
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("file.xml");
# print the ITEM attribute of every CODEBASE element and collect the values
my $nodes = $doc->getElementsByTagName ("CODEBASE");
my $n = $nodes->getLength;
my @items;
for (my $i = 0; $i < $n; $i++)
{
    my $node = $nodes->item ($i);
    my $item = $node->getAttributeNode ("ITEM");
    print $item->getValue . "\n";
    push @items, $item->getValue();
}
Then you can shift off the first element of @items
my $first = shift @items;
And the others will be left in the array @items;
If you need help, don't hesitate to ask.
Posted by aynar (aynar), 6 December 2004
Thanks, Custard, but I think you've misunderstood what I'm trying to do. Chopping off the first item only fixes this particular instance of the problem. In general, I want to match on sequences (of varying length) of <ITEM>s that end in "and <ITEM>blah</ITEM>".
Posted by dt2 (dt2), 6 December 2004
Perhaps the problem is more fundamental? There won't always be a <ITEM>One</ITEM> so a mandatory shift is not applicable.
Presumably the goal is to have Perl's regex engine recognize that it's not a match as soon as it hits the first </ITEM>...then it'll move on to the next <ITEM> and correctly match that.
As far as I know, Perl has no implementation for right-to-left parsing of regexes, so that is out. (Although I heard a rumor it's a suggestion for Perl 6?) Also AFAIK, Perl has no way to say something like:
(i.e., NOT a given string). Does anyone know of a way to do that?
While I'm a big fan of XML::DOM personally, I don't see how it can be used (cleanly, easily) in this situation to match a pattern such as aynar described. Of course, no offense Custard...I have no doubt your detailed post is much appreciated. (Heck...at least you offered a solution, as opposed to what I've done, which is merely to present more problems! LOL)
Posted by Custard (Custard), 6 December 2004
Ok,
I am a little confused, but I wonder if this is what you want...
(needs a little tidying...)
Maybe this won't catch every possibility. Have you more test data?
Posted by admin (Graham Ellis), 6 December 2004
on 12/06/04 at 16:44:47, dt2 wrote:
The words "negative look ahead assertion" come into my mind, but it's not exactly common and the syntax doesn't just spring into my head. And I'm currently 170 miles from my Regular Expression book ....
Posted by Custard (Custard), 6 December 2004
Yes,
it's those "(?!pattern)" type patterns.
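For reference, Perl writes a negative lookahead as (?!pattern) and a positive lookahead as (?=pattern); neither consumes any text. A small illustrative sketch (the strings and variable names are made up for the example):

```perl
use strict;
use warnings;

# (?!bar) succeeds only where "bar" does NOT follow "foo".
my $neg = qr/foo(?!bar)/;
# (?=bar) succeeds only where "bar" DOES follow "foo".
my $pos = qr/foo(?=bar)/;

for my $s ('foobar', 'foobaz') {
    printf "%s: negative=%s positive=%s\n",
        $s,
        ($s =~ $neg ? 'match' : 'no match'),
        ($s =~ $pos ? 'match' : 'no match');
}
```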
I am not 100% sure I understand what output the OP wants though, or indeed whether the input is XML (which it increasingly looks like it is not).
The s/// cleans up the input string by removing the first <ITEM> </ITEM> pair that isn't preceded by a comma.
Still messy and very specific to this one line of test data.
But again, this is still dependent on getting some better test data...
Posted by dt2 (dt2), 6 December 2004
Custard's response looks very good. My only thought is there must be a way to get this all in a single line/regex. If I understand the original question right, I think I have a solution.
Assuming that the goal is to use a single regex to match a list of items (as defined in the original post), then the following code works:
This ensures that you are able to embed tags within the ITEM tag to no detriment, but you cannot embed other ITEM tags within an ITEM tag -- in other words, it won't start with <ITEM>One... and continue through the end of ...Two</ITEM>, because it won't accept the <ITEM> tag that introduces "Two".
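The approach described above can be sketched roughly as follows. This is a reconstruction under assumptions, not necessarily the exact expression posted: the item body accepts any character other than '<', or a '<' that a negative lookahead confirms is not the start of <ITEM> or </ITEM>. The names $body, $item and $itemlist are illustrative.

```perl
use strict;
use warnings;

my $text = ' <ITEM>One</ITEM> is first and then we have <ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> ';

# Item body: any non-'<' character, or a '<' that does not begin
# an <ITEM> or </ITEM> tag.
my $body     = qr{(?:[^<]|<(?!/?ITEM>))*};
my $item     = qr{<ITEM>$body</ITEM>,?\s*};
my $itemlist = qr{(?:$item)+and\s*$item};

# Starting at <ITEM>One, the body can no longer run past </ITEM>,
# so that attempt fails and the engine moves on to <ITEM>Two.
my ($match) = $text =~ /($itemlist)/;
print "[$match]\n";
```

Other tags such as <B>...</B> are still accepted inside an item, because the lookahead only rejects the ITEM tags themselves.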
Many thanks to Custard and Graham for very helpful leads. On that note, I found the following reference for negative lookahead assertions that proved very helpful (see the section with ?! pattern):
aynar, what do you think?
All the best,
P.S. Congrats Custard on recently being appointed a moderator!
Posted by aynar (aynar), 7 December 2004
Thanks, dt2, your regular expression definitely does exactly what I want. (I didn't think I could use the negative lookahead operator since I hadn't considered searching for the '<' to fix the point from which to check against the unwanted tag.)
Unfortunately, this seems to make the searching hideously inefficient and I need to process hundreds of thousands of sentences. So, unless someone can suggest how to achieve the same results but faster, it looks like I'm back to using [^<]*.
How frustrating, knowing there's a perfect solution but not being able to use it!
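One variant that may reduce the backtracking, offered as a sketch only: consume runs of non-'<' characters in a single step (an "unrolled" loop) and wrap the body in an atomic group so the engine never re-enters it. Whether this is fast enough has not been measured here, so it would need benchmarking on real data. The variable names are illustrative.

```perl
use strict;
use warnings;

my $text = ' <ITEM>One</ITEM> is first and then we have <ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> ';

# Unrolled body: a run of non-'<' characters, then repeatedly a '<' that
# does not begin <ITEM> or </ITEM> followed by another run.
# The atomic group (?>...) stops backtracking into the body once matched.
my $body     = qr{(?>[^<]*(?:<(?!/?ITEM>)[^<]*)*)};
my $item     = qr{<ITEM>$body</ITEM>,?\s*};
my $itemlist = qr{(?:$item)+and\s*$item};

my ($match) = $text =~ /($itemlist)/;
print "[$match]\n";
```

The atomic group is safe in this pattern because the body always stops immediately before the first <ITEM> or </ITEM>, so no shorter body could ever be followed by </ITEM>.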