HTML entity encoding

Posted by John_Moylan (jfp), 10 October 2003

I have a regex problem that I'd appreciate external input on. I have some HTML and need to encode the entities; for this I use HTML::Entities. However, this encodes the whole string. Take the example HTML below.

<p>test</p>
<p>Some text with "quotes" in it</p>
<p><img src="/images/picture.gif" alt="Alt text goes here"></p>

I can use encode_entities on the string to catch all entities. I then do a s/lt/</g and a s/gt/>/g to get the tags back (though there's probably a better way). The problem is that the alt attribute has quotes that are now quot, which is wrong within HTML tags.

I'm currently thinking along the lines of using s/// with the 'e' modifier and trying to eval the captured pattern only. Am I overcomplicating this? Is there a more obvious and better way?

Any help appreciated.

jfp

NOTE: ampersand and semicolon deliberately left off as YaBB was encoding them to special chars.

Posted by Custard (Custard), 10 October 2003

Hiya,

I may be missing something here, but my take on this would be to parse and encode only the text between the tags, not the text of the tags themselves. I may have missed the point, but I came up with this...

my $html=qq[
<p>bruces test text with "quotes"</p>
<p><img src="/images/picture.gif" alt="Alt text goes here"></p>
];
for (split("\n",$html)) {
    (/(<.+>)([^<>]*)(<.+>)/) && do {
        $open=$1; $text=$2; $close=$3;
        print( "text to encode = $text\n" );
    };
}

I'm sure Graham would be appalled, and it has weaknesses (it supports only one tag pair per line, etc.). Any help?

Bruce

Posted by admin (Graham Ellis), 11 October 2003

Really interesting question, this one. Let me take a step back from writing a piece of code to provide an answer, and look at the problem ....

Under what circumstances do you want to change a < into a &lt;, and a " into a &quot;?
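To make that question concrete, here is a minimal sketch of the encode-then-repair approach from the first post. It assumes a hand-rolled `encode_all` helper (hypothetical, standing in for a whole-string encoder and handling only the four characters discussed in this thread), and it uses the ampersand and semicolon forms that the first post left off.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a whole-string encoder: it handles only
# the four characters discussed in this thread.
sub encode_all {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Sample line taken from the first post.
my $html = '<p><img src="/images/picture.gif" alt="Alt text goes here"></p>';

# Step 1: encode everything, tags included.
my $encoded = encode_all($html);

# Step 2: naive repair, turning every &lt; and &gt; back into a bracket.
(my $repaired = $encoded) =~ s/&lt;/</g;
$repaired =~ s/&gt;/>/g;

print "$repaired\n";
```

The brackets come back, but the quotes inside the img tag stay encoded, because nothing in the encoded string says which quotes were markup and which were text.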
I suggest that the question's an impossible one to answer in generality - your data is already too far processed in the form it's presented in the original post, and is potentially ambiguous. Let's say I was actually writing a reply about HTML processing (Oh - I am doing

The only "correct" and general answer to the original question is that you should do your processing before you add in the formatting tags - but if you can't do that, then the answer is that any solution that does what you want is correct in your particular circumstance. Hmm - it's quite early in the morning - hope I've explained my thoughts / concerns in an understandable way.

OK .... this is a practical world and perhaps you have little choice but to look for a fix that's specific to the type of data that you have? Unless it uses a defined way of marking up from which you can be 100% certain which " < > and & characters are genuinely to be visible and which are part of the markup, no algorithm will be 100% reliable. Faced with this dilemma, here's a possible solution:

@parts = split (/(<.+?>)/,$html);
foreach (@parts) {
    !/^</ and s/"/&quot;/g;
}
$html = join("",@parts);

Better test that: Code:
and run Code:
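The listings that originally followed here were not preserved in this archive. As a stand-in, this is a minimal sketch of such a test, assuming the sample HTML from the first post and assuming the replacement text is `&quot;`. The variable names `$part` and `$result` are illustrative.

```perl
use strict;
use warnings;

# Sample input taken from the first post in the thread.
my $html = join "\n",
    '<p>test</p>',
    '<p>Some text with "quotes" in it</p>',
    '<p><img src="/images/picture.gif" alt="Alt text goes here"></p>';

# The capturing parentheses make split return the tags themselves
# as well as the text between them.
my @parts = split /(<.+?>)/, $html;

for my $part (@parts) {
    # $part is an alias to the list element, so the substitution
    # changes @parts in place. Elements that start with < are tags
    # and are left alone.
    $part =~ s/"/&quot;/g unless $part =~ /^</;
}

my $result = join '', @parts;
print "$result\n";
```

With this input the quotes in the running text are encoded and the quotes inside the img tag are not. The sketch shares the limits already described: a literal < or > in the text, or a > inside an attribute value, would mislead the split.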
Yes - it did what I intended, which is about the best that I think can be offered in the circumstances.

Notes - if you split and put capturing brackets around the pattern on which you're splitting, the text matched by the split pattern gets saved into the resultant list. If you loop through all the elements of a list in a foreach, then changing the loop variable ($_ in this case) actually changes the element within the list.

Posted by admin (Graham Ellis), 11 October 2003

Custard - I'm NOT appalled by your suggestion - in fact you'll notice that my solution has liberally plagiarised yours. If it works well and consistently for the particular data that's thrown at it, then it's a good solution in this case - the incoming data is "dirty" and the real solution is to do the conversions before the real tags are added.

Just as an aside - HTML processing is a bit of an oddity. It's one of the few data types where splitting up the data at \n characters is not normally a good idea, as a \n means no more than "white space".

Anyway - welcome to the board. This one's an intriguing dilemma!

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho