convert a MS Word doc into multiple HTML pages

Posted by lang2 (lang2), 4 March 2004
The MS Word file contains text and images (saved into the file, not OLE).  For example, if the file has 3 headings (1, 1.1 and 1.2), there should be 3 HTML files, one for each heading. The text and images between one heading and the next must be extracted.  How do I know when I have read up to the next heading?  What are the options, or the easiest option, for doing this task?  Can I use Perl to parse and extract data from the Word file?
Thanks.

Posted by John_Moylan (jfp), 4 March 2004
Just a quick reply because this amused me.

I thought it was an interesting question and decided to google for "microsoft word" and "perl"

The first relevant page (3rd in the list) was:
http://www.wellho.net/solutions/1480965085.html

Well, they say that the best way to get high in the search rankings is to make sure your content is relevant.

jfp



Posted by admin (Graham Ellis), 4 March 2004
Word files are accessible through COM (Component Object Model) - you need to be running your Perl on a machine that has MS Word installed, as it uses Word's .dll files.  You'll find the necessary Perl module (Win32::OLE) is supplied as part of the ActiveState release of Perl.

References - two books
http://www.wellho.net/book/0-7821-2862-9.html
http://www.wellho.net/book/1-57870-067-1.html

Here's a piece of code we use to extract text from a Word document - you should find it contains examples of the sort of thing you need.

Code:
# Perl program - extracting from a Word document ready for upload!
# Runs on Windows with MS Word installed (Win32::OLE drives Word through COM).

use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

print "Name of Word document: ";
chomp (my $doc = <STDIN>);

my $document = Win32::OLE -> GetObject($doc)
    or die "Cannot open $doc: ", Win32::OLE->LastError(), "\n";
print "Name of upload file: ";
chomp (my $upload = <STDIN>);
open (my $fh, ">", $upload) or die "Cannot write $upload: $!\n";

# Write one "+style" line and one "=text" line for each paragraph
print "Extracting Text ... $doc\n";
my $paragraphs = $document->Paragraphs();
my $enumerate = Win32::OLE::Enum->new($paragraphs);
while (defined(my $paragraph = $enumerate->Next())) {
    my $style = $paragraph->{Style}->{NameLocal};
    print $fh "+$style\n";
    my $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;     # strip the paragraph terminators
    $text =~ s/\x0b/\n/g;     # a manual line break in Word becomes a newline
    print $fh "=$text\n";
}

# Convert up to eight images named by the user and append them hex-encoded
my $prog = '"C:\\Program Files\\Advanced Batch Converter\\abc.exe"';
for my $k (2 .. 9) {
    print "Name of image file to use: ";
    chomp (my $imgname = <STDIN>);
    last unless ($imgname);
    my $also = "/resize=(640,0,1)";
    # .wmf files are converted to GIF, everything else to JPEG
    my $outfile = ($imgname =~ /wmf$/) ? "X$k.gif" : "X$k.jpg";
    print "Converting $imgname ...\n";
    my $result = `$prog $imgname $also /convert=$outfile`;
    open (my $fhi, "<", $outfile) or die "Cannot read $outfile: $!\n";
    binmode $fhi;
    read ($fhi, my $buffer, -s $outfile);
    close ($fhi);
    $buffer =~ s/(.)/sprintf("%02x",ord($1))/sge;   # each byte as two hex digits
    $buffer =~ s/(.{1,68})/=$1\n/g;                 # "=" prefix, 68 characters per line
    print $fh "+pic$k\n=$imgname\n";
    print $fh "+pic_$k\n";
    print $fh "$buffer\n";
}
close ($fh);
print "Job completed.  Press [return] to exit ";
my $n = <STDIN>;
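
To get from that output to one HTML page per heading, the style line is what tells you that you have reached the next heading. What follows is only a minimal sketch, not part of the program above. It assumes the document uses Word's built-in English style names ("Heading 1", "Heading 2" and so on), and the names split_into_pages, html_escape and the pageN.html output files are made up for illustration. Text before the first heading, text after a manual line break, and the hex-encoded pictures are ignored here.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Turn the "+style" / "=text" lines into a list of pages, one per heading.
# Each page is a hash reference: { title => ..., body => ... }
sub split_into_pages {
    my @lines = @_;
    my (@pages, $style);
    for my $line (@lines) {
        $line =~ s/[\r\n]+$//;
        if ($line =~ /^\+(.*)/) {
            $style = $1;                      # style of the paragraph that follows
        } elsif ($line =~ /^=(.*)/ and defined $style) {
            my $text = html_escape($1);
            if ($style =~ /^Heading (\d)/) {  # assumption: English built-in style names
                # A heading at any level starts a new page
                push @pages, { title => $text, body => "<h$1>$text</h$1>\n" };
            } elsif ($style !~ /^pic/ and @pages and length $text) {
                # Ordinary paragraph: add it to the page currently being built
                $pages[-1]{body} .= "<p>$text</p>\n";
            }
        }
    }
    return @pages;
}

# Write page1.html, page2.html ... when run with the upload file as argument
if (@ARGV) {
    open (my $in, "<", $ARGV[0]) or die "Cannot read $ARGV[0]: $!\n";
    my @pages = split_into_pages(<$in>);
    close ($in);
    my $n = 0;
    for my $page (@pages) {
        $n++;
        open (my $out, ">", "page$n.html") or die "Cannot write page$n.html: $!\n";
        print $out "<html><head><title>$page->{title}</title></head>\n",
                   "<body>\n$page->{body}</body></html>\n";
        close ($out);
    }
    print "Wrote $n page(s)\n";
}
```

Run it with the name of the upload file as its only argument. Because NameLocal returns the localised style name, the /^Heading (\d)/ test would need changing for a non-English copy of Word.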


Posted by admin (Graham Ellis), 4 March 2004
on 03/04/04 at 19:30:28, jfp wrote:
Just a quick reply because this amused me.

I thought it was an interesting question and decided to google for "microsoft word" and "perl"

The first relevant page (3rd in the list) was:
http://www.wellho.net/solutions/1480965085.html

Well, they say that the best way to get high in the search rankings is to make sure your content is relevant.

jfp



I thought I had seen it somewhere before  



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.


© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho