HTML entity encoding - Perl Programming

Posted by John_Moylan (jfp), 10 October 2003

I have a regex problem that I'd appreciate external input on.

I have some HTML and need to encode the entities; for this I use HTML::Entities.
However, this encodes the whole string.

Take the example html below.

test
Some text with "quotes" in it
<img src="/images/picture.gif" alt="Alt text goes here">

I can use encode_entities on the string to catch all entities.
I then do a
s/lt/</g
and a
s/gt/>/g
to get the tags back (though there's probably a better way).

Problem is that the alt attribute has quotes that are now quot, which is wrong within HTML tags.

I'm currently thinking along the lines of using s/// with the 'e' modifier and evaluating only the captured pattern.
Am I over complicating this?
Is there a more obvious and better way?
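
A minimal sketch of the s///e idea mentioned above, for illustration only. It assumes tags never contain a literal > inside an attribute value, and it uses a hypothetical escape_text helper (covering only & < > and ") as a stand-in for encode_entities so that the example needs no extra modules. The name encode_outside_tags is also made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for encode_entities: handles only & < > and "
my %ENTITY = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

sub escape_text {
    my ($text) = @_;
    $text =~ s/([&<>"])/$ENTITY{$1}/g;
    return $text;
}

sub encode_outside_tags {
    my ($html) = @_;
    # Each match is either a whole tag (captured in $1 and kept as-is)
    # or a run of text outside any tag (captured in $2 and escaped).
    $html =~ s{(<[^>]*>)|([^<]+)}{ defined $1 ? $1 : escape_text($2) }ge;
    return $html;
}

my $html = qq{Some text with "quotes" in it\n}
         . qq{<img src="/images/picture.gif" alt="Alt text goes here">\n};
print encode_outside_tags($html);
```

With this approach the quotes in the text are encoded while those inside the img tag are left alone, so there is no need to turn the tags back afterwards.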

Any help appreciated
jfp

NOTE: ampersand and semicolon deliberately left off, as YaBB was encoding them to special chars

Posted by Custard (Custard), 10 October 2003

Hiya,

I may be missing something here, but my take on this would be to
parse and encode only the text between the tags, not the text of the tags themselves.
I may have missed the point, but I came up with this...

my $html=qq[
bruces test text with "quotes"
<img src="/images/picture.gif" alt="Alt text goes here">
];

for (split("\n",$html)) {
    # Match an opening tag, the text between the tags, and a closing tag
    if (/(<.+>)([^<>]*)(<.+>)/) {
        my ($open, $text, $close) = ($1, $2, $3);
        print( "text to encode = $text\n" );
    }
}

I'm sure Graham would be appalled, and it has weaknesses (it supports only 1 tag pair per line etc.).
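
One possible way around the one-pair-per-line limit is to match globally over the whole string instead of line by line. This is only a sketch: text_between_tags is a hypothetical name, the sample HTML is invented for the example, and text that sits before the first tag or after the last one is deliberately not picked up.

```perl
use strict;
use warnings;

# Hypothetical helper: return each run of text that sits between two tags.
sub text_between_tags {
    my ($html) = @_;
    # In list context, /g returns the capture from every match. The
    # lookahead (?=<) checks that a tag follows without consuming it.
    my @runs = $html =~ />([^<>]+)(?=<)/g;
    return grep { /\S/ } @runs;   # drop whitespace-only runs such as "\n"
}

my $html = qq{<p>first "quoted" text</p>\n<p>second</p><b>third</b>\n};
print "text to encode = $_\n" for text_between_tags($html);
```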

Any help?

Bruce

Posted by admin (Graham Ellis), 11 October 2003

Really interesting question this one. Let me take a step back from writing a piece of code to provide an answer, and look at the problem ....

Under what circumstances do you want to change a < into a &lt;, and a " into a &quot;? I suggest that the question's an impossible one to answer in generality - your data is already too far processed in the form it's presented in the original post, and is potentially ambiguous. Let's say I was actually writing a reply about HTML processing (Oh - I am doing) and wanted to talk about some tags by name ... then what algorithm am I going to use to pick out those < > & and " that need processing and those that don't - the data is already too processed. In other words, if I was processing this very paragraph you're reading, how would I know to leave one tag alone, but to translate the pair that is around the words "what algorithm" just above? The data is too far "gone" at the point you presented it to us.

The only "correct" and general answer to the original question is that you should do your processing before you add in the formatting tags - but if you can't do that, then the answer is that any solution that does what you want is correct in your particular circumstance. Hmm - it's quite early in the morning - hope I've explained my thoughts / concerns in an understandable way.
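
As a rough illustration of doing the processing before the formatting tags go in, the sketch below encodes the raw text first and only then wraps it in markup. The helper names (escape_text, make_paragraph) and the choice of a p tag are assumptions for the example, and escape_text is a simplified stand-in for encode_entities that covers only & < > and ".

```perl
use strict;
use warnings;

# Hypothetical stand-in for encode_entities: handles only & < > and "
my %ENTITY = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

sub escape_text {
    my ($text) = @_;
    $text =~ s/([&<>"])/$ENTITY{$1}/g;
    return $text;
}

# Encode the raw text first, then add the real tags around it.
sub make_paragraph {
    my ($raw_text) = @_;
    return '<p>' . escape_text($raw_text) . '</p>';
}

print make_paragraph('Talking about the <b> tag & "quotes"'), "\n";
# prints: <p>Talking about the &lt;b&gt; tag &amp; &quot;quotes&quot;</p>
```

Because the text is encoded while it is still known to be plain text, there is never any doubt later about which < and " characters are markup.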

OK .... this is a practical world and perhaps you have little choice but to look for a fix that's specific to the type of data that you have? Unless it uses a defined way of marking up from which you can be 100% certain which " < > and & characters are genuinely to be visible and which are part of the markup, no algorithm will be 100%.

Faced with this dilemma, here's a possible solution:

@parts = split (/(<.+?>)/,$html);
foreach (@parts) {
!/^</ and s/"/&quot;/g;
}
$html = join("",@parts);

Better test that:

Code:

$html = << "WOW";
test
Some text with "quotes" in it
<img src="/images/picture.gif" alt="Alt text goes here">
WOW

@parts = split (/(<.+?>)/,$html);
foreach (@parts) {
!/^</ and s/"/&quot;/g;
}
$html = join("",@parts);

print $html;

and run

Code:

[Graham-Elliss-Computer:~] graham% perl jfp
test
Some text with &quot;quotes&quot; in it
<img src="/images/picture.gif" alt="Alt text goes here">
[Graham-Elliss-Computer:~] graham%

Yes - it did what I intended, which is about the best that I think can be offered in the circumstances

Notes - if you put capturing brackets around the pattern on which you're splitting, the text that matched the pattern gets saved into the resultant list. If you loop through all the elements of a list in a foreach, then changing the loop variable ($_ in this case) actually changes the element within the list.
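
The two points in the notes can be seen in isolation in a small sketch like the one below. The helper name upcase_outside_tags and the sample string are made up for the demonstration; upper-casing is used only because it makes the changed elements easy to spot.

```perl
use strict;
use warnings;

# Hypothetical helper: upper-case everything that is not a tag.
sub upcase_outside_tags {
    my ($html) = @_;
    # The capturing brackets make split return the separators (the tags) too.
    my @parts = split /(<.+?>)/, $html;
    for (@parts) {
        # $_ is an alias for the array element, so assigning to it edits @parts.
        $_ = uc $_ unless /^</;
    }
    return @parts;
}

print join('|', upcase_outside_tags('a<b>c</b>d')), "\n";
# prints: A|<b>|C|</b>|D
```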

Posted by admin (Graham Ellis), 11 October 2003

Custard - I'm NOT appalled by your suggestion - in fact you'll notice that my solution has liberally plagiarised yours. If it works well and consistently for the particular data that's thrown at it, then it's a good solution in this case - the incoming data is "dirty" and the real solution is to do the conversions before the real tags are added.

Just as an aside - HTML processing is a bit of an oddity. It's one of the few data types where splitting up the data at \n characters is not normally a good idea, as a \n means no more than "white space".
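
To illustrate why line boundaries matter, here is a sketch with a tag that spans two lines. By default . does not match a newline, so a pattern like <.+?> will not recognise such a tag; the variation below uses [^>]+ instead and works on the whole string rather than line by line. The helper name and the sample data are assumptions for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: encode double quotes everywhere except inside tags.
sub encode_quotes_outside_tags {
    my ($html) = @_;
    # [^>]+ matches newlines as well, so a tag may span several lines.
    my @parts = split /(<[^>]+>)/, $html;
    for (@parts) {
        s/"/&quot;/g unless /^</;   # leave the tags themselves untouched
    }
    return join '', @parts;
}

my $html = qq{<img src="/images/picture.gif"\n}
         . qq{     alt="Alt text goes here">\n}
         . qq{Some text with "quotes" in it\n};
print encode_quotes_outside_tags($html);
```

The same caveat applies as before: an attribute value containing a literal > would still confuse this pattern.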

Anyway - welcome to the board. This one's an intriguing dilemma!

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.