convert a MS Word doc into multiple HTML pages
Posted by lang2 (lang2), 4 March 2004
The MS Word file contains text and images (saved into the file, not OLE). For example, if the file has 3 headings (1, 1.1, 1.2), there should be 3 HTML files, one for each heading. The text and images between one heading and the next must be extracted. How do I know when I have read up to the next heading? What options are available for this task, and which is the easiest? Can I use Perl to parse and extract data from the Word file?
Posted by John_Moylan (jfp), 4 March 2004
Just a quick reply because this amused me.
I thought it was an interesting question and decided to google for "microsoft word" and "perl".
The first relevant page (3rd in the list) was:
Well, they say that the best way to get high in the search rankings is to make sure your content is relevant.
Posted by admin (Graham Ellis), 4 March 2004
Word files are accessible through COM (Component Object Model). You need to be running your Perl on a machine that has MS Word installed, as it uses Word's .dll files. You'll find the necessary Perl module is supplied as part of the ActiveState release of Perl.
References - two books
Here's a piece of code we use to extract text from a Word document. You should find it contains examples of the sort of thing you need.
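Separately from that listing, the question of how to tell when the next heading has been reached can be sketched as follows. This is an illustrative sketch only, not the code mentioned above, and it is written in Python purely to show the logic. It assumes the paragraphs have already been read out of the document (for example through COM) as pairs of style name and text, and that headings use style names beginning with "Heading", as Word's built-in English styles do. The function names `split_on_headings` and `page_to_html` are hypothetical, and images are not handled.

```python
from html import escape


def split_on_headings(paragraphs):
    """Group (style_name, text) pairs into pages.

    A paragraph whose style name starts with "Heading" (an assumption
    about how the document is styled) closes the current page and
    starts a new one.
    """
    pages = []
    current = None
    for style, text in paragraphs:
        if style.startswith("Heading"):
            # Reaching a heading means the previous section is complete.
            current = {"title": text, "body": []}
            pages.append(current)
        elif current is not None:
            current["body"].append(text)
        # Text before the first heading is skipped in this sketch.
    return pages


def page_to_html(page):
    """Render one page as a very plain HTML document."""
    title = escape(page["title"])
    body = "\n".join(f"<p>{escape(text)}</p>" for text in page["body"])
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body>\n<h1>{title}</h1>\n{body}\n</body></html>\n"
    )


if __name__ == "__main__":
    # Made-up sample data standing in for paragraphs read from Word.
    sample = [
        ("Heading 1", "1 Introduction"),
        ("Normal", "Some opening text."),
        ("Heading 2", "1.1 Details"),
        ("Normal", "More text."),
        ("Heading 2", "1.2 Summary"),
        ("Normal", "Closing text."),
    ]
    for number, page in enumerate(split_on_headings(sample), start=1):
        print(f"page{number}.html:", page["title"])
```

With the three headings from the original question (1, 1.1, 1.2) this yields three pages, one per heading, each holding the text up to the next heading. The same loop carries over to Perl once the paragraphs and their style names are available.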
Posted by admin (Graham Ellis), 4 March 2004
on 03/04/04 at 19:30:28, jfp wrote:
I thought I had seen it somewhere before.
This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.