Convert an MS Word doc into multiple HTML pages

Posted by lang2 (lang2), 4 March 2004
The MS Word file contains text and images (saved into the file, not OLE). For example, if the file has 3 headings (1, 1.1 and 1.2), there should be 3 HTML files, one for each heading. The text and images between one heading and the next must be extracted. How do I know when I have read up to the next heading? What options are available, and which is the easiest? Can I use Perl to parse and extract data from the Word file?
Thanks.

Posted by John_Moylan (jfp), 4 March 2004
Just a quick reply because this amused me.

I thought it was an interesting question and decided to google for "microsoft word" and "perl".

The first relevant page (3rd in the list) was:
http://www.wellho.net/solutions/1480965085.html

Well, they say that the best way to get high in the search rankings is to make sure your content is relevant.

jfp



Posted by admin (Graham Ellis), 4 March 2004
Word files are accessible through COM (Component Object Model). You need to run your Perl on a machine that has MS Word installed, because it uses Word's .dll files. The necessary Perl module, Win32::OLE, is supplied as part of the ActiveState release of Perl.

References - two books
http://www.wellho.net/book/0-7821-2862-9.html
http://www.wellho.net/book/1-57870-067-1.html

Here's a piece of code we use to extract text from a Word document. You should find it contains examples of the sort of thing you need.

Code:
# Perl program - extracting from a Word document ready for upload!
# Needs Windows with MS Word installed (Win32::OLE drives Word through COM).

use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

print "Name of Word document: ";
chomp (my $doc = <STDIN>);

my $document = Win32::OLE -> GetObject($doc)
    or die "Cannot open $doc: " . Win32::OLE->LastError() . "\n";
print "Name of upload file: ";
chomp (my $upload = <STDIN>);
open (my $fh, ">", $upload) or die "Cannot write $upload: $!\n";

# Text: one "+style" line followed by one "=text" line per paragraph
print "Extracting Text ... $doc\n";
my $paragraphs = $document->Paragraphs();
my $enumerate = Win32::OLE::Enum->new($paragraphs);
while (defined(my $paragraph = $enumerate->Next())) {
    my $style = $paragraph->{Style}->{NameLocal};
    print $fh "+$style\n";
    my $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;      # strip the paragraph mark
    $text =~ s/\x0b/\n/g;      # vertical tab = manual line break in Word
    print $fh "=$text\n";
}

# Images: up to eight, converted by an external tool and appended as hex
my $prog = '"C:\\Program Files\\Advanced Batch Converter\\abc.exe"';
my $also = "/resize=(640,0,1)";
for (my $k = 2; $k < 10; $k++) {
    print "Name of image file to use: ";
    chomp (my $imgname = <STDIN>);
    last unless ($imgname);
    # .wmf images become GIFs, everything else becomes JPEG
    my $outfile = ($imgname =~ /\.wmf$/i) ? "X$k.gif" : "X$k.jpg";
    print "Converting $imgname ...\n";
    my $result = `$prog $imgname $also /convert=$outfile`;
    open (my $fhi, "<", $outfile) or die "Cannot read $outfile: $!\n";
    binmode $fhi;
    read ($fhi, my $buffer, -s $outfile);
    close ($fhi);
    $buffer = unpack("H*", $buffer);     # binary to hex digits
    $buffer =~ s/(.{1,68})/=$1\n/g;      # "=" lines of up to 68 hex digits
    print $fh "+pic$k\n=$imgname\n";
    print $fh "+pic_$k\n";
    print $fh "$buffer\n";
}
close ($fh);
print "Job completed.  Press [return] to exit ";
my $n = <STDIN>;
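
To get from that upload file to one HTML page per heading, a minimal sketch could look like the following. It assumes the "+style" / "=text" line format written by the program above and Word's default English style names ("Heading 1", "Heading 2", ...). The sub names, the script name split.pl and the page-N.html file names are made up for illustration. Image records, continuation lines from manual line breaks, and any text before the first heading are simply skipped.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical helper: turn "+style" / "=text" lines into a list of
# pages, starting a new page at every heading style.
sub split_into_pages {
    my @lines = @_;
    my (@pages, $style);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^\+(.*)/) {
            $style = $1;                       # style of the next paragraph
            next;
        }
        next unless defined $style && $line =~ /^=(.*)/;
        my $text = escape_html("$1");
        if ($style =~ /^Heading ([1-6])/) {    # assumes English style names
            push @pages, { title => $text, body => "<h$1>$text</h$1>\n" };
        } elsif (@pages && $style !~ /^pic/) { # skip the image records
            $pages[-1]{body} .= "<p>$text</p>\n";
        }
    }
    return @pages;
}

# Write page-1.html, page-2.html, ... and return the number of pages
sub write_pages {
    my @pages = @_;
    my $n = 0;
    for my $page (@pages) {
        $n++;
        open(my $out, ">", "page-$n.html")
            or die "Cannot write page-$n.html: $!\n";
        print $out "<html><head><title>$page->{title}</title></head>\n";
        print $out "<body>\n$page->{body}</body></html>\n";
        close($out);
    }
    return $n;
}

# Usage (script name is hypothetical): perl split.pl upload.txt
if (@ARGV) {
    open(my $in, "<", $ARGV[0]) or die "Cannot read $ARGV[0]: $!\n";
    my @pages = split_into_pages(<$in>);
    close($in);
    print "Wrote ", write_pages(@pages), " page(s)\n";
}
```

Because the style name comes before each paragraph's text, "reading up to the next heading" is just a matter of watching for the next "+Heading ..." line. On a non-English Word installation the NameLocal values will differ, so the pattern would need adjusting.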


Posted by admin (Graham Ellis), 4 March 2004
on 03/04/04 at 19:30:28, jfp wrote:
Just a quick reply because this amused me.

I thought it was an interesting question and decided to google for "microsoft word" and "perl".

The first relevant page (3rd in the list) was:
http://www.wellho.net/solutions/1480965085.html

Well, they say that the best way to get high in the search rankings is to make sure your content is relevant.

jfp



I thought I had seen it somewhere before  



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.


© WELL HOUSE CONSULTANTS LTD., 2014: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 899360 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho