Convert an MS Word doc into multiple HTML pages

Posted by lang2 (lang2), 4 March 2004

The MS Word file contains text and images (saved into the file, not OLE). For example, if the file has 3 headings 1, 1.1, 1.2, there should be 3 HTML files (one for each heading). Text and images between one heading and the next must be extracted. How do I know when I have read up to the next heading? What options are available for this task, and which is the easiest? Can I use Perl to parse and extract data from the Word file?
Thanks.

Posted by John_Moylan (jfp), 4 March 2004

Just a quick reply because this amused me.

I thought it was an interesting question and decided to google for "microsoft word" and "perl".

The first relevant page (3rd in the list) was:
http://www.wellho.net/solutions/1480965085.html

Well, they say that the best way to get high in the search rankings is to make sure your content is relevant.

jfp

Posted by admin (Graham Ellis), 4 March 2004

Word files are accessible through COM (Component Object Model) - you need to be running your Perl on a machine that has MS Word installed, as it uses Word's .dll files. You'll find the necessary Perl module (Win32::OLE) is supplied as part of the ActiveState release of Perl.
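
As a rough illustration of the COM route, and of one way to tell when the next heading has been reached, here is a minimal sketch. It assumes Word is installed, that the document path is passed on the command line, and that headings carry Word's outline levels 1 to 9 (body text reports level 10). The helper name is_heading_level is made up for this example and is not part of any module.

```perl
use strict;
use warnings;

# Word outline levels 1-9 mark headings; 10 is body text
# (wdOutlineLevelBodyText in the Word object model).
# is_heading_level is a hypothetical helper for this sketch.
sub is_heading_level {
    my ($level) = @_;
    return ($level >= 1 && $level <= 9) ? 1 : 0;
}

if (@ARGV) {
    # Loaded at run time so the file still compiles where Win32::OLE is absent.
    require Win32::OLE;
    require Win32::OLE::Enum;

    my ($path) = @ARGV;    # full path to the Word document
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: ", Win32::OLE->LastError(), "\n";
    my $doc = $word->Documents->Open($path)
        or die "Cannot open $path: ", Win32::OLE->LastError(), "\n";

    # Walk the paragraphs and report each one that is a heading.
    my $enum = Win32::OLE::Enum->new($doc->Paragraphs);
    while (defined(my $para = $enum->Next)) {
        my $level = $para->{OutlineLevel};
        next unless is_heading_level($level);
        my $text = $para->{Range}->{Text};
        $text =~ s/[\r\n]//g;    # strip the paragraph terminator
        print "Level $level heading: $text\n";
    }
    $doc->Close(0);    # 0 = wdDoNotSaveChanges
}
```

Checking the outline level rather than the style name avoids depending on localised style names, although a document that uses custom heading styles without outline levels would need a different test.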

References - two books
http://www.wellho.net/book/0-7821-2862-9.html
http://www.wellho.net/book/1-57870-067-1.html

Here's a piece of code we use to extract text from a Word document - you should find it contains examples of the sort of thing you need.

Code:

# Perl program - extracting from a Word document ready for upload!

use Win32::OLE;
use Win32::OLE::Enum;

print "Name of Word document: ";
chomp ($doc = <STDIN>);

# Attach to the document through COM (give the full path to the file)
$document = Win32::OLE -> GetObject($doc)
    or die "Cannot open $doc: " . Win32::OLE->LastError() . "\n";
print "Name of upload file: ";
chomp ($upload = <STDIN>);
open (FH,">$upload") or die "Cannot write $upload: $!\n";
print "Extracting Text ...$document \n";

# Write each paragraph as a "+style" line followed by an "=text" line
$paragraphs = $document->Paragraphs();
$enumerate = new Win32::OLE::Enum($paragraphs);
while (defined($paragraph = $enumerate->Next())) {
    $style = $paragraph->{Style}->{NameLocal};
    print FH "+$style\n";
    $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;      # strip paragraph terminators
    $text =~ s/\x0b/\n/g;      # Word's manual line break becomes a newline
    print FH "=$text\n";
}

# Convert up to eight images and append each one as hex text
$prog = "\"C:\\Program Files\\Advanced Batch Converter\\abc.exe\"";
for ($k=2; $k<10; $k++) {
    print ("Name of image file to use: ");
    chomp ($imgname = <STDIN>);
    last unless ($imgname);
    $instruct = $imgname;
    $also = "/resize=(640,0,1)";
    # .wmf drawings become GIFs, everything else becomes JPEG
    $outfile = ($instruct =~ /wmf$/) ? "X$k.gif" : "X$k.jpg";
    print "Converting $instruct ...\n";
    $result = `$prog $instruct $also /convert=$outfile`;
    open (FHI,$outfile) or die "Cannot read $outfile: $!\n";
    binmode FHI;
    read (FHI,$buffer,-s $outfile);
    close (FHI);
    $buffer =~ s/(.)/sprintf("%02x",ord($1))/sge;   # hex-encode every byte
    $buffer =~ s/(.{1,68})/=$1\n/g;                 # 68 hex digits per "=" line
    print FH "+pic$k\n=$imgname\n";
    print FH "+pic_$k\n";
    print FH "$buffer\n";
}
close (FH);
print "Job completed. Press [return] to exit ";
$n = <STDIN>;
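
To get from that "+style" / "=text" file to one HTML page per heading, something along the following lines might do. It is only a sketch: it assumes English style names such as "Heading 1", drops any text that appears before the first heading, ignores continuation lines produced by manual line breaks, and skips the hex-encoded picture records. The function names and the page01.html naming scheme are invented for this example.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text.
sub html_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Group records into sections; a new section starts at each heading style.
# Returns a list of { title => ..., body => [ ... ] } hashes.
sub split_by_heading {
    my @lines = @_;
    my (@sections, $style);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^\+(.*)/) {           # "+style" record
            $style = $1;
            next;
        }
        next unless defined $style && $line =~ /^=(.*)/;
        my $text = $1;
        next if $style =~ /^pic/;           # skip image name and hex data
        if ($style =~ /^Heading \d/) {      # assumes English style names
            push @sections, { title => $text, body => [] };
        }
        elsif (@sections) {                 # text before the first heading is dropped
            push @{ $sections[-1]{body} }, $text;
        }
    }
    return @sections;
}

if (@ARGV) {
    my ($infile) = @ARGV;
    open(my $in, '<', $infile) or die "Cannot open $infile: $!\n";
    my @sections = split_by_heading(<$in>);
    close($in);

    my $n = 0;
    for my $sec (@sections) {
        my $page  = sprintf("page%02d.html", ++$n);   # hypothetical naming scheme
        my $title = html_escape($sec->{title});
        open(my $out, '>', $page) or die "Cannot write $page: $!\n";
        print $out "<html><head><title>$title</title></head>\n<body>\n";
        print $out "<h1>$title</h1>\n";
        print $out "<p>", html_escape($_), "</p>\n" for @{ $sec->{body} };
        print $out "</body></html>\n";
        close($out);
    }
}
```

Run it with the upload file name as the only argument. Every heading level starts a new page here, which matches the "one file for each heading" requirement in the original question.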

Posted by admin (Graham Ellis), 4 March 2004

on 03/04/04 at 19:30:28, jfp wrote:

I thought I had seen it somewhere before

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.