shortest match between XML tags

Posted by aynar (aynar), 6 December 2004
Hiya,

I'm trying to write a regular expression that matches a list of (optionally comma-separated) XML marked-up items followed by the word "and" and a final single item.

I have something like:

$item = '(<ITEM>.*?</ITEM>)(,?)(\s*)';
$itemlist = '(' . $item . ')+(and)\s*(' . $item . ')';

Given the following string:

" <ITEM>One</ITEM> is first and then we have <ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> "

the intention is to match only "<ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> ", but the current expression matches "<ITEM>One...Four</ITEM>". I think this is because of the leftmost-ness of Perl's matching and the fact that the shortest-match operator only applies to potential matches starting from the same point. So the earlier (leftmost) match wins over the later (though shorter) one.

How can I fix this? I know that changing to

$item = '<ITEM>[^<]*?</ITEM>' works, but I'd rather not exclude the possibility of further mark-up inside the <ITEM> tags if at all possible.

What I really want to do is always stop at the first occurrence of the close tag but I haven't been able to find out how to do this.

Can anyone help?

Thanks.
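
The leftmost-match behaviour described in this post can be reproduced with a short script. This is only a sketch: the helper names list_pattern and leftmost_match are made up for illustration, the groups are non-capturing for brevity, and it assumes optional whitespace is allowed after "and" so that the final item can follow a space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample sentence from the post above.
my $text = '<ITEM>One</ITEM> is first and then we have '
         . '<ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> ';

# Build the list pattern around a given pattern for the text inside an item.
# Assumption: \s* is allowed after "and" so the final item can follow a space.
sub list_pattern {
    my ($content) = @_;
    my $item = '(?:<ITEM>' . $content . '</ITEM>,?\s*)';
    return '(?:' . $item . ')+and\s*' . $item;
}

# Return the text of the leftmost match, or undef if nothing matches.
sub leftmost_match {
    my ($string, $pattern) = @_;
    return $string =~ /($pattern)/ ? $1 : undef;
}

my $lazy   = leftmost_match($text, list_pattern('.*?'));
my $no_tag = leftmost_match($text, list_pattern('[^<]*'));

print "lazy:   [$lazy]\n";
print "no tag: [$no_tag]\n";
```

With '.*?' the match starts at <ITEM>One and stretches across the whole sentence, because the engine prefers any match at the earliest start position over a shorter one that starts later. With '[^<]*' the attempt starting at <ITEM>One fails, so the match begins at <ITEM>Two.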


Posted by Custard (Custard), 6 December 2004
Hi,

I haven't time right now to come up with a better-tested solution, but have you tried the XML::DOM parser on CPAN? Modules, particularly XML ones, are always a good idea because XML is complex and you can easily end up with broken code if the XML document changes in some way.

Modified example from the POD...

use XML::DOM;

my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("file.xml");

# print and collect the ITEM attribute of every CODEBASE element
my $nodes = $doc->getElementsByTagName ("CODEBASE");
my $n = $nodes->getLength;

my @items;
for (my $i = 0; $i < $n; $i++)
{
    my $node = $nodes->item ($i);
    my $item = $node->getAttributeNode ("ITEM");
    print $item->getValue . "\n";
    push @items, $item->getValue();
}

Then you can shift off the first element of @items

my $first = shift @items;

And the others will be left in the array @items;

my ($two,$three,$four)=@items;

HTH

If you need help, don't hesitate to ask.

B

Posted by aynar (aynar), 6 December 2004
Thanks, Custard, but I think you've misunderstood what I'm trying to do. Chopping off the first item only fixes this particular instance of the problem. In general, I want to match on sequences (of varying length) of <ITEM>s that end in "and <ITEM>blah</ITEM>".

Posted by dt2 (dt2), 6 December 2004
Perhaps the problem is more fundamental?  There won't always be a <ITEM>One</ITEM> so a mandatory shift is not applicable.

Presumably the goal is to have Perl's regex engine recognize that it's not a match as soon as it hits the first </ITEM>...then it'll move on to the next <ITEM> and correctly match that.

As far as I know, Perl has no implementation for right-to-left parsing of regexs, so that is out.  (Although I heard a rumor it's a suggestion for Perl 6?)  Also AFAIK, Perl has no way to say something like:

m#<ITEM>[^"<ITEM>"]*?</ITEM>#;

(i.e., NOT a given string).  Does anyone know of a way to do that?

While I'm a big fan of XML::DOM personally, I don't see how it can be used (cleanly, easily) in this situation to match a pattern such as aynar described.  Of course, no offense Custard...I have no doubt your detailed post is much appreciated.  (Heck...at least you offered a solution, as opposed to what I've done, which is merely to present more problems! LOL)

--dt2


Posted by Custard (Custard), 6 December 2004
Ok,

I am a little confused, but I wonder if this is what you want...

Code:
#!/usr/bin/perl

use Data::Dumper;

my @items;
while (<DATA>) {
       @items=(/<ITEM>(\w+)<\/ITEM>\s*(,|and|$)\s*/g);

       print "\n\n";
       print Dumper @items;
       print "\n\n";
       print join(' ', @items);
       print "\n\n";
}

__DATA__
<ITEM>One</ITEM> is first and then we have <ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM>


Which produced...

(needs a little tidying...)

Code:
$VAR1 = 'Two';
$VAR2 = ',';
$VAR3 = 'Three';
$VAR4 = ',';
$VAR5 = 'Four';
$VAR6 = '';


Two , Three , Four



Maybe this won't catch every possibility. Have you more test data?

HTH

B

Posted by admin (Graham Ellis), 6 December 2004
on 12/06/04 at 16:44:47, dt2 wrote:
As far as I know, Perl has no implementation for right-to-left parsing of regexs, so that is out.  (Although I heard a rumor it's a suggestion for Perl 6?)  Also AFAIK, Perl has no way to say something like:

m#<ITEM>[^"<ITEM>"]*?</ITEM>#;

(i.e., NOT a given string).  Does anyone know of a way to do that?

--dt2


The words "negative look ahead assertion" come into my mind, but it's not exactly common and the syntax doesn't just spring into my head.  And I'm currently 170 miles from my Regular Expression book ....
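
For reference, a minimal sketch of the syntax in question: (?!pattern) is Perl's negative lookahead, and repeating "(?!<ITEM>)." consumes one character at a time while refusing to step onto a new opening tag. The sample string and variable names here are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative string, not taken from the thread.
my $text = '<ITEM>One</ITEM> or <ITEM>Two</ITEM> and more';

# Plain lazy match: may run from the first <ITEM> across a second one.
my ($plain) = $text =~ m{(<ITEM>.*?</ITEM> and)};

# Same pattern, but each character is consumed only if it does not
# begin another "<ITEM>".
my ($tempered) = $text =~ m{(<ITEM>(?:(?!<ITEM>).)*?</ITEM> and)};

print "plain:    [$plain]\n";
print "tempered: [$tempered]\n";
```

The plain version matches from <ITEM>One through "and", while the version with the lookahead can only match starting at <ITEM>Two.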

Posted by Custard (Custard), 6 December 2004
Yes,

it's those "(?!pattern)" type patterns.

I am not 100% sure I understand what output the OP wants though,
or indeed whether the input is XML (which it increasingly looks like it is not).

Code:
while (<DATA>) {
       s/(?<!,)\s*<ITEM>\w+<\/ITEM>//;
       @items=(/<ITEM>(\w+)<\/ITEM>/g);

       print "\n\n";
       print Dumper @items;
       print "\n\n";
       print join(' ', @items);
       print "\n\n";
}


The s/// cleans up the input string by removing the first <ITEM> </ITEM> pair that isn't preceded by a comma.
Still messy and very specific to this one line of test data.

Produces

Code:
$VAR1 = 'Two';
$VAR2 = 'Three';
$VAR3 = 'Four';


Two Three Four



But again, this is still dependent on getting some better test data...

B

Posted by dt2 (dt2), 6 December 2004
Custard's response looks very good.  My only thought is there must be a way to get this all in a single line/regex.  If I understand the original question right, I think I have a solution.

Assuming that the goal is to use a single regex to match a list of items (as defined in the original post), then the following code works:

Code:
#!/usr/bin/perl

my $s = '<ITEM>One</ITEM> is first and then we have ' .
     '<ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> ';
my $t = 'Apples and <ITEM>big <i>juicy</i> oranges</ITEM> ' .
     'are good; <ITEM>pears</ITEM>, <ITEM><b>red</b> grapes</ITEM>, ' .
     'and <ITEM>other <note>misc</note> <em>things</em></ITEM> ' .
     'are better.';

my $item = '(?:<ITEM>(?:[^<]*(?:<(?!ITEM))?[^<]*)+<\/ITEM>(?:,?)(?:\s*))';
# note careful use of single quotes as well as interpolated $item
my $itemlist = '(?:'.$item.')+and\s*(?:'.$item.')';

while ($s =~ /$itemlist/g) {
     print STDERR '['.$&."]\n";
}
while ($t =~ /$itemlist/g) {
     print STDERR '['.$&."]\n";
}


...giving output:

Code:
[<ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> ]
[<ITEM>pears</ITEM>, <ITEM><b>red</b> grapes</ITEM>, and <ITEM>other <note>misc</note> <em>things</em></ITEM> ]


This ensures that you are able to embed tags within the ITEM tag to no detriment, but you cannot embed other ITEM tags within an ITEM tag -- in other words, it won't start with <ITEM>One... and continue through the end of ...Two</ITEM>, because it won't accept the <ITEM> tag that introduces "Two".

Many thanks to Custard and Graham for very helpful leads. On that note, I found the following reference for negative lookahead assertions that proved very helpful (see the section with ?! pattern):

http://www.perl.com/doc/manual/html/pod/perlre.html#item__pattern_

aynar, what do you think?

All the best,
--dt2

P.S.  Congrats Custard on recently being appointed a moderator!



Posted by aynar (aynar), 7 December 2004
Thanks, dt2, your regular expression definitely does exactly what I want. (I didn't think I could use the negative lookahead operator since I hadn't considered searching for the '<' to fix the point from which to check against the unwanted tag.)

Unfortunately, this seems to make the searching hideously inefficient and I need to process hundreds of thousands of sentences. So, unless someone can suggest how to achieve the same results but faster, it looks like I'm back to using [^<]*.

How frustrating, knowing there's a perfect solution but not being able to use it!
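
One possible way to get the same matches with much less backtracking is sketched below. The slowdown is likely because dt2's inner group, (?:[^<]*(?:<(?!ITEM))?[^<]*)+, can split a run of ordinary characters in many different ways, and the engine tries them all whenever a match attempt fails. In the variant here every pass through the inner group must begin with a "<", so there is only one way to divide the content. Note two assumptions that differ from dt2's pattern: the lookahead also refuses "</ITEM>" inside an item, and the helper find_item_lists is a made-up name. No timings have been taken, so it would need benchmarking on real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One item: opening tag, then plain text and nested non-ITEM tags, then the
# closing tag, with an optional comma and trailing whitespace.
my $item     = qr{<ITEM>[^<]*(?:<(?!/?ITEM>)[^<]*)*</ITEM>,?\s*};
my $itemlist = qr{(?:$item)+and\s*$item};

# Return every item list found in a string.
sub find_item_lists {
    my ($string) = @_;
    my @lists;
    while ($string =~ /($itemlist)/g) {
        push @lists, $1;
    }
    return @lists;
}

# The two test strings from dt2's post.
my $s = '<ITEM>One</ITEM> is first and then we have ' .
        '<ITEM>Two</ITEM>, <ITEM>Three</ITEM>, and <ITEM>Four</ITEM> ';
my $t = 'Apples and <ITEM>big <i>juicy</i> oranges</ITEM> ' .
        'are good; <ITEM>pears</ITEM>, <ITEM><b>red</b> grapes</ITEM>, ' .
        'and <ITEM>other <note>misc</note> <em>things</em></ITEM> ' .
        'are better.';

print "[$_]\n" for find_item_lists($s), find_item_lists($t);
```

On these two strings it finds the same two lists that dt2's version reported.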



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.

