Extract first 20 words

Posted by TedH (TedH), 11 November 2006
My script is writing to various files. Part of that is a text entry denoted say $input{'content'}

I can grab that fine and write the whole lot as needed.

But how do I just grab, say, the first 20 words of it and disregard the rest?

I've spent about 3 days searching around and not found anything that even remotely approaches the concept (plenty of stuff about extracting lines and records, but not a predefined number of words).

Hope you can help,
thanks - Ted

Posted by admin (Graham Ellis), 12 November 2006
Is this the sort of thing?

Code:
dolphin:~ graham$ cat plop
$text = "The course was good and the tutor was better";
@words = split(/\s+/,$text);
$starter = join(" ",@words[0..6]);
print "$starter ... \n";

dolphin:~ graham$ perl plop
The course was good and the tutor ...
dolphin:~ graham$
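
For an arbitrary word count, the same split-and-join idea can be wrapped in a small subroutine. This is only a sketch: the name first_words is hypothetical and not part of the thread. It uses split ' ' so that leading white space does not produce an empty first word, and it clamps the count so that a text with fewer words than requested does not produce undefined slice elements.

```perl
use strict;
use warnings;

# first_words is a hypothetical helper name, used here for illustration only.
sub first_words {
    my ($text, $count) = @_;

    # split ' ' (a single-space string) splits on runs of white space
    # and skips leading white space, so there is no empty first element.
    my @words = split ' ', $text;

    # Clamp the count so a short text does not yield undefined elements.
    $count = @words if $count > @words;
    return '' if $count < 1;

    return join ' ', @words[0 .. $count - 1];
}

print first_words("The course was good and the tutor was better", 7), " ...\n";
```

Calling first_words($input{'content'}, 20) would then give the first 20 words, or the whole text if it is shorter than that.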



Posted by TedH (TedH), 12 November 2006
Thanks Graham.

I had been looking at a split, but was way off base.

The join - I didn't even know there was such a thing.

I see how it works and that $starter now contains the extracted words from the new array @words. From there I can use $starter and refine my results, like HTML tag exclusions (the input is from a WYSIWYG editor) etc.

many thanks - Ted

Posted by admin (Graham Ellis), 12 November 2006
Been there before, Ted, on the "removing tags" thing.

If you add in:
Code:
$text =~ s/<.*?>/ /gs;

then you'll remove all the tags and replace each of them with a space. Do that before the split, by the way; the split then treats any resulting run of white space as a single separator.

If you need something a bit more detailed / sophisticated, you may also want to get involved with deciding which tags result in a word break (things like <br>) and which can occur in the middle of a word (things like <u>); in that latter case, you would actually want to replace them with nothing rather than a space. You might also want to get involved with replacing sequences like &lt; with < characters ....
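
The distinction between word-breaking tags and in-word tags could be sketched roughly as follows. The helper name strip_tags and the list of inline tags are assumptions made for illustration and are certainly not exhaustive. Regular expressions are also only an approximation of HTML parsing, so treat this as a starting point rather than a complete solution.

```perl
use strict;
use warnings;

# strip_tags is a hypothetical helper; the inline tag list is an assumption.
sub strip_tags {
    my ($html) = @_;

    # Inline formatting tags can sit inside a word, so remove them outright.
    $html =~ s{</?(?:u|b|i|em|strong|span)\b[^>]*>}{}gis;

    # Every remaining tag (<br>, <p>, ...) is treated as a word break.
    $html =~ s/<.*?>/ /gs;

    # Decode a few common entities. &amp; goes last so that a sequence
    # such as "&amp;lt;" is not decoded twice.
    $html =~ s/&nbsp;/ /g;
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&amp;/&/g;

    return $html;
}

my $text  = strip_tags('<p>An <u>un</u>usual&nbsp;day<br>at 3 &lt; 4</p>');
my @words = split ' ', $text;
print join(' ', @words), "\n";    # prints: An unusual day at 3 < 4
```

For a wider range of entities, the decode_entities function from the CPAN module HTML::Entities is an alternative to the hand-written substitutions.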


Posted by TedH (TedH), 12 November 2006
I had done some, but they were after the split, and I noticed inconsistency happening. So I gave the one you suggested (shorter than mine) a go.

The input is from a WYSIWYG editor and has different responses depending on which browser is used. The one thing that IE does is put in the
Code:
&nbsp;

for spaces sometimes. I managed to clear those. It can insert entities on occasion.

There are probably others I will need to attend to as I continue the development.

The extraction is for an RSS feed, and XML does not like entities at all from what I see.
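
XML predefines only five entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot; and &amp;apos;), so an HTML entity such as &amp;nbsp; is rejected by an XML parser unless it is declared. One way to make an extract safe for the feed is sketched below. The helper name xml_safe is hypothetical, and the sketch assumes the text has already had its tags removed and is going into element content rather than an attribute.

```perl
use strict;
use warnings;

# xml_safe is a hypothetical helper name, used here for illustration only.
sub xml_safe {
    my ($text) = @_;

    # &nbsp; is not defined in XML, so turn it into a plain space first.
    $text =~ s/&nbsp;/ /g;

    # Escape the ampersand before the others, so that the entities
    # added on the next two lines are not escaped a second time.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    return $text;
}

print xml_safe('Fish&nbsp;& chips < 5 pounds'), "\n";
# prints: Fish &amp; chips &lt; 5 pounds
```

Any other named HTML entity left in the text has its ampersand escaped too, which keeps the feed well formed but shows the entity name literally, so decoding such entities beforehand may be preferable.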

So far I've been testing it, and my feed is getting written with new entries and updated if I edit one - a feature for those of us who type too fast and put in "teh" instead of "the", then discover it in our feed later on.



Posted by TedH (TedH), 12 November 2006
Well, I haven't broken it yet.  

Thanks for your input Graham.



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.


© WELL HOUSE CONSULTANTS LTD., 2014: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho