HTML entity encoding

Posted by John_Moylan (jfp), 10 October 2003
I have a regex problem that I'd appreciate external input on.

I have some HTML and need to encode the entities; for this I use HTML::Entities.
However, this encodes the whole string.

Take the example html below.

<p>test</p>
<p>Some text with "quotes" in it</p>
<p><img src="/images/picture.gif" alt="Alt text goes here"></p>

I can use encode_entities on the string to catch all entities.
I then do a
s/lt/</g
and a
s/gt/>/g
to get the tags back (though there's probably a better way).

The problem is that the alt attribute has quotes that are now quot, which is wrong within HTML tags.

I'm currently thinking along the lines of using the s/// with the 'e' modifier and trying to eval the captured pattern only.
Am I over complicating this?
Is there a more obvious and better way?

Any help appreciated
jfp

NOTE: ampersand and semicolon deliberately left off as YaBB was encoding them to special chars
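
The s/// with 'e' idea above could be sketched roughly as follows. This is only one possible reading of it, not jfp's actual code: encode_text and encode_between_tags are hypothetical names, and encode_text is a small stand-in for encode_entities from HTML::Entities so that the sketch runs without the module. It assumes every tag is a "<" followed eventually by a ">", which real-world HTML does not guarantee.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Entities::encode_entities, covering only
# the four characters discussed in this thread.
my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

sub encode_text {
    my ($text) = @_;
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: anything that looks like a tag is copied unchanged,
# and only the runs of text between tags are encoded.
sub encode_between_tags {
    my ($html) = @_;
    $html =~ s{(<[^>]*>)|([^<]+)}{ defined $1 ? $1 : encode_text($2) }ge;
    return $html;
}

print encode_between_tags(qq{<p>Some text with "quotes" in it</p>\n});
print encode_between_tags(qq{<p><img src="/images/picture.gif" alt="Alt text goes here"></p>\n});
```

The replacement part is evaluated as Perl code because of the 'e' modifier, so the choice between "leave alone" and "encode" is made per match, and no second pass is needed to put the tags back.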

Posted by Custard (Custard), 10 October 2003
Hiya,

I may be missing something here, but my take on this would be to
parse and encode only the text between the tags, not the text of the tags themselves.
I may have missed the point, but I came up with this...

my $html=qq[
<p>bruces test text with "quotes"</p>
<p><img src="/images/picture.gif" alt="Alt text goes here"></p>
];

# Look at one line at a time; expects a single tag pair per line.
for (split("\n",$html)) {
    if (/(<.+>)([^<>]*)(<.+>)/) {
        my ($open, $text, $close) = ($1, $2, $3);
        print( "text to encode = $text\n" );
    }
}

I'm sure Graham would be appalled, and it has weaknesses (it supports only 1 tag pair per line etc.).

Any help?

Bruce


Posted by admin (Graham Ellis), 11 October 2003
Really interesting question this one.  Let me take a step back from writing a piece of code to provide an answer, and look at the problem ....

Under what circumstances do you want to change a < into a &lt;, and a " into a &quot; ?    I suggest that the question is impossible to answer in general - your data is already too far processed in the form in which it's presented in the original post, and is potentially ambiguous. Let's say I was actually writing a reply about HTML processing (Oh - I am doing) and wanted to talk about the <p> and </p> tags ... then what algorithm am I going to use to pick out those < > & and " that need processing and those that don't?  The data is already too processed.  In other words, if I was processing this very paragraph you're reading, how would I know to leave the <p> alone, but to translate the <i> and </i> that are around the words "what algorithm" just above?   The data is too far "gone" at the point you presented it to us.

The only "correct" and general answer to the original question is that you should do your processing before you add in the formatting tags - but if you can't do that, then the answer is that any solution that does what you want is correct in your particular circumstance.  Hmm - it's quite early in the morning - hope I've explained my thoughts / concerns in an understandable way.
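
As a rough illustration of "do your processing before you add in the formatting tags", the following minimal sketch encodes the raw text first and only then wraps it in markup. The helper names encode_text and paragraph are hypothetical, and encode_text is a cut-down stand-in for a real encoder such as encode_entities from HTML::Entities.

```perl
use strict;
use warnings;

# Hypothetical cut-down encoder for the four characters under discussion.
my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

sub encode_text {
    my ($text) = @_;
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: the raw text is encoded while it is still known to be
# plain text, and the real tags are added afterwards.
sub paragraph {
    my ($raw_text) = @_;
    return '<p>' . encode_text($raw_text) . '</p>';
}

print paragraph('Talk about the <p> and </p> tags, with "quotes"'), "\n";
```

Done in this order there is no ambiguity: every < > & and " inside the raw text is visible content, and every tag added later is markup.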

OK .... this is a practical world and perhaps you have little choice but to look for a fix that's specific to the type of data that you have?  Unless it uses a defined way of marking up from which you can be 100% certain which " < > and & characters are genuinely to be visible and which are part of the markup, no algorithm will be 100% reliable.

Faced with this dilemma, here's a possible solution:

@parts = split (/(<.+?>)/,$html);
foreach (@parts) {
     !/^</ and s/"/&quot;/g;
}
$html = join("",@parts);


Better test that:

Code:
$html = << "WOW";
<p>test</p>
<p>Some text with "quotes" in it</p>
<p><img src="/images/picture.gif" alt="Alt text goes here"></p>
WOW

@parts = split (/(<.+?>)/,$html);
foreach (@parts) {
     !/^</ and s/"/&quot;/g;
}
$html = join("",@parts);

print $html;


and run

Code:
[Graham-Elliss-Computer:~] graham% perl jfp
<p>test</p>
<p>Some text with &quot;quotes&quot; in it</p>
<p><img src="/images/picture.gif" alt="Alt text goes here"></p>
[Graham-Elliss-Computer:~] graham%


Yes - it did what I intended, which is about the best that I think can be offered in the circumstances.

Notes - if you put capturing brackets around the pattern on which you're splitting, the text matched by the pattern gets saved into the resultant list.   If you loop through all the elements of a list in a foreach, then changing the variable to which each element is "assigned" ($_ in this case) actually changes the element within the list.
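
The two notes above can be seen in isolation in this small sketch. The variable names @without and @with are just illustrative, and the sample string is made up for the demonstration.

```perl
use strict;
use warnings;

my $html = '<p>say "hi"</p>';

# No capturing brackets: the tags act purely as separators and are thrown away.
my @without = split /<.+?>/, $html;

# Capturing brackets: each matched tag is returned as a list element as well.
my @with = split /(<.+?>)/, $html;

print "without capture: ", join('|', @without), "\n";
print "with capture:    ", join('|', @with), "\n";

# $_ is an alias for each element in turn, so the substitution changes
# the contents of @with itself rather than a copy.
for (@with) {
    s/"/&quot;/g unless /^</;
}

print join('', @with), "\n";
```

Because the tags are kept in the list, a plain join puts the document back together after the text pieces have been changed.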



Posted by admin (Graham Ellis), 11 October 2003
Custard - I'm NOT appalled by your suggestion - in fact you'll notice that my solution has liberally plagiarised yours.   If it works well and consistently for the particular data that's thrown at it, then it's a good solution in this case - the incoming data is "dirty" and the real solution is to do the conversions before the real tags are added.

Just as an aside - HTML processing is a bit of an oddity.  It's one of the few data types where splitting up the data at \n characters is not normally a good idea, as a \n means no more than "white space".
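
One consequence of \n being plain white space is that a single tag may run over more than one line. The sketch below compares two tag patterns on such input; it is only an illustration, the helper name encode_quotes_outside_tags and the sample data are made up, and neither pattern copes with every form of HTML (a ">" inside an attribute value, for instance).

```perl
use strict;
use warnings;

# Hypothetical helper: encode double quotes in every piece that is not a tag,
# using whichever tag pattern the caller supplies.
sub encode_quotes_outside_tags {
    my ($html, $tag_pattern) = @_;
    my @parts = split /($tag_pattern)/, $html;
    for (@parts) {
        s/"/&quot;/g unless /^</;
    }
    return join '', @parts;
}

# An img tag whose attributes continue on a second line.
my $multiline = qq{<p>"quoted"</p>\n<img src="/images/picture.gif"\n     alt="Alt text goes here">\n};

# "." does not match a newline by default, so the img tag is not recognised
# as a tag and the quotes inside it are encoded as well.
print encode_quotes_outside_tags($multiline, qr{<.+?>});

# "[^>]+" can cross newlines, so the img tag is left alone.
print encode_quotes_outside_tags($multiline, qr{<[^>]+>});
```

Adding the 's' modifier to the first pattern would be another way to let it cross line ends.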

Anyway - welcome to the board.   This one's an intriguing dilemma!




This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.
