Well House Consultants Ltd

Perl Programming - Perl, mod-perl, CGI, etc.

segment a text using Perl (help)
Posted by agme (agme), 10 December 2004
I'm a beginner in Perl. I would like to write a program that segments a text into paragraphs and sentences and then tokenises it to get all the words.
I have defined 23 rules for "not end of sentence" (NFDP) and 13 rules for "end of sentence" (FDP). I'll use regular expressions to apply them.
The syntax of a rule is (each field is labelled underneath):
regleNFDP1\t    ($regex)\t    non-fin-de-phrase\t    marque_milieu\n
($nom_regle)    ($regex)      ($marque)              ($type)

Example: NFDP
X1= PONCTO = ( . ? ! )
X3=PARENTH=( (), [ ], { })

If                 TXT has X1
AND IF       before X1 there is X3
AND IF       after  X1 there is X3
THEN          X1 is not the end of a sentence
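
Read literally, the example rule above could be sketched as a single pattern. This is only an illustration, assuming "before" and "after" mean immediately adjacent to the punctuation mark; the names `$X1`, `$OUVRANT`, `$FERMANT` and `ponctuation_entre_parentheses` are hypothetical and do not come from the posted program.

```perl
use strict;
use warnings;

# X1: sentence punctuation; X3: opening and closing brackets.
my $X1      = qr/[.?!]/;
my $OUVRANT = qr/[(\[{]/;
my $FERMANT = qr/[)\]}]/;

# Return 1 if the text contains X1 directly enclosed by brackets,
# as in "(!)" or "[?]"; such punctuation does not end a sentence.
sub ponctuation_entre_parentheses {
    my ($texte) = @_;
    return $texte =~ /$OUVRANT$X1$FERMANT/ ? 1 : 0;
}
```

Keeping X1 and X3 as separate `qr//` objects mirrors the way the rule is written and keeps each piece short enough to check by eye.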
Here is what I have done so far. Can someone help me correct it and make it work?

#!/usr/bin/perl -w
use strict;
my $fichier = $ARGV[0];
my $regles = $ARGV[1];

open (F,$fichier);
my $texte = join("\n",<F>);
close (F);
main ($texte);

sub main {

     my $texte = shift;
     $texte = segmenter_phrase($texte);
     $texte = baliser_phrase($texte);
     $texte = traiter_paragraphe($texte);
     $texte = generer_docbook($texte);
     return $texte
     
}

sub segmenter_phrase{
     my $texte = shift;
     open (R, $regles) or die "erreur";
     while (my $regles =<R>){
           chomp($regles);
           $texte =appliquer_regle($texte,$regles);
         }
     }
   close(R);
 open(T, ">trace.txt");
 print T $texte;
 close(T);
   return $texte;
}

#traitement des regles/rule#1
sub appliquer_regle{
     my ($texte,$regles)=@_;
     my @parties = split("\t",$regles);
     my ($nom_regle, $regex, $marque, $type)= @parties;
          if ($type eq "marque_debut"){
$texte =~ s/$regex/$1{$marque\/$nom_regle\}$2$3/g;}
elsif ($type eq "marque_milieu"){
$texte =~ s/$regex/$1$2{$marque\/$nom_regle\}$3/g;}
          else{
$texte =~ s/$regex/$1$2$3{$marque\/$nom_regle\}/g;}
         return $texte;
}                  
sub baliser_phrase{
     my $texte_marque = shift;
     $texte_marque = ~s/\{NFDP/regle NFDP[0-9]+\}//g;
     $texte_marque = "<phrase>".$$texte_marque;
     $texte_marque = ~s/\{FDP/regle FDP[0-9]+\}/<\/phrase><phrase>/g;
     $texte_marque = $texte_marque. "</phrase>";
     $texte_marque = ~s/<phrase>(\n)*<\/phrase>//;
     return $texte_marque;                            
}
sub traiter_paragraphe{
   my $texte = shift;
   $texte = "<para>".$texte;
   $texte = ~s/(\n)/<\/para><para>/g;
   $texte = $texte."</para>"
   $texte = ~s/<para></para>//;
   return $texte;
}
sub generer_docbook{
     my $texte = shift;
     my $entete = "<?xml version=\"1.0\>\"?";
        $entete = $entete."!DOCTYPE article PUBLIC.....";
     my $texte_docbook = $entete."<article>\n<section>";
        $texte_docbook = $texte_docbook "<title/>";
      $texte_docbook =      $texte_docbook.$texte. "</section>\n<\article>";
      return $text_docbook;  
}
Posted by Custard (Custard), 10 December 2004
Hello.

I haven't even tried running it, or understanding it, but a quick skim through reveals..

Code:
my $fichier = $ARGV[0];
my $regles = $ARGV[0].


These are the same command-line argument. Is this what you want?

Code:
main(texte);


Is missing the $ sigil, should be $texte.

Code:
     $texte = segmenter_phrase($texte);
     $texte = traiter_paragrpahe($texte);
     $texte = generer_docboom($texte);


generer_docboom  probably ought to be generer_docbook.

Code:
open(T, ">trace.txt");
 print T$texte;
 close(T);


The print line should have a space after the T, like
print T $texte;


My advice for a happier life would be to put:

use strict;


at the top of your program. This will show up a lot of errors, but once fixed, you will have a program in better shape.
Then I will try and give it a run through.

HTH

B
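
Along the lines of the remarks above, here is a minimal sketch of reading the input with `use strict`, `use warnings` and a checked `open`. The helper name `lire_fichier` and the two-argument command line are assumptions made for illustration. Note also that lines read from a file already end in a newline, so `join("\n", <F>)` would double every line break; reading the file in one go avoids that.

```perl
use strict;
use warnings;

# Read a whole file into one string; die with the OS error if it
# cannot be opened.
sub lire_fichier {
    my ($chemin) = @_;
    open(my $fh, '<', $chemin) or die "Cannot open '$chemin': $!";
    local $/;                 # slurp mode: read everything in one go
    my $contenu = <$fh>;
    close($fh);
    return defined $contenu ? $contenu : '';
}

# Hypothetical command line: script.pl text.txt rules.txt
if (@ARGV == 2) {
    my ($fichier, $regles) = @ARGV;
    my $texte = lire_fichier($fichier);
    print length($texte), " characters read from $fichier\n";
}
```

A lexical filehandle (`my $fh`) is closed automatically when it goes out of scope, and the `or die` with `$!` reports why an `open` failed instead of silently reading nothing.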

Posted by agme (agme), 10 December 2004
Thanks a lot !!!
I changed what you told me to change.
Here are some of the rules that I wrote in a .txt file:

regleNFDP1      ([\(\[\{][A-Z][a-z]*|[A-Z]*|[a-z]*)([\.\:\;\?\!\:])([A-Z][a-z]*|[A-Z]*|[a-z]*)(\.)([A-Z][a-z]*|[A-Z]*|[a-z]*[\)\]\}])      non_fin_de_phrase      marque_milieu1

regle_FDP_1      (\"[\.;?!])      Fin_De_Phrase      marque_debut

I have problems with the sub appliquer_regle.
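
For reference, one way `appliquer_regle` might be written is sketched below. It is not the poster's final code. It assumes the four fields are separated by real tab characters and that each rule has at most three capture groups; the helper `inserer_marque` is hypothetical. Two details of the posted material are worth checking: the first rule's type is written `marque_milieu1`, which is not equal to `marque_milieu` and so would fall through to the final `else`, and writing `$1{` inside a replacement risks being read as a hash element lookup, which is why the sketch builds the marker separately.

```perl
use strict;
use warnings;

# Place the marker after group 1, after group 2, or at the end,
# depending on the rule type. Missing groups become empty strings.
sub inserer_marque {
    my ($type, $balise, @g) = @_;
    @g = map { defined $_ ? $_ : '' } @g[0 .. 2];
    return "$g[0]$balise$g[1]$g[2]" if $type eq 'marque_debut';
    return "$g[0]$g[1]$balise$g[2]" if $type eq 'marque_milieu';
    return "$g[0]$g[1]$g[2]$balise";
}

# Apply one tab-separated rule line to the text.
# Fields: rule name, regex, marker, marker position.
sub appliquer_regle {
    my ($texte, $ligne) = @_;
    my ($nom_regle, $regex, $marque, $type) = split /\t/, $ligne, 4;
    my $re     = qr/$regex/;
    my $balise = '{' . $marque . '/' . $nom_regle . '}';
    $texte =~ s/$re/inserer_marque($type, $balise, $1, $2, $3)/ge;
    return $texte;
}
```

Applied to the posted `regle_FDP_1`, which has a single capture group, this inserts `{Fin_De_Phrase/regle_FDP_1}` right after the matched quote and punctuation without any "uninitialized value" warnings.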

Posted by admin (Graham Ellis), 11 December 2004
Hi .... I've had a look through the original post and the follow-up, and I confess I'm not feeling very brave about jumping in here.  Agme - some comments and some sample data would help (I think I'm being hampered too by the language - the variable names are giving me few clues).  I also look at that regular expression in your final post and I'm reminded that "two short regular expressions are much easier to follow than one long one ..."

What's the current status?  You say you have some problems with one particular sub ... have you tried isolating it and testing it in a test harness, adding in print statements to see what it's doing, etc?   That will help you tie the problem down to a line or two of code, then hopefully to solve it.
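
A small harness of the kind suggested above might look like the following sketch. The names `visible` and `essayer` are hypothetical; `Data::Dumper` is a core module and is used here only to make newlines and tabs visible in the trace.

```perl
use strict;
use warnings;
use Data::Dumper;

# Return a one-line, escaped view of a string so that newlines and
# tabs show up in trace output.
sub visible {
    my ($texte) = @_;
    local $Data::Dumper::Useqq  = 1;
    local $Data::Dumper::Terse  = 1;
    local $Data::Dumper::Indent = 0;
    my $vue = Dumper($texte);
    $vue =~ s/\s+\z//;        # drop any trailing whitespace
    return $vue;
}

# Run one sub on one input and print both sides to STDERR.
sub essayer {
    my ($nom, $code, $entree) = @_;
    my $sortie = $code->($entree);
    print STDERR "$nom\n  in : ", visible($entree),
                 "\n  out: ", visible($sortie), "\n";
    return $sortie;
}

# Stand-in example; replace the anonymous sub with \&baliser_phrase etc.
essayer('majuscules', sub { uc $_[0] }, "une phrase.\n");
```

Calling each sub on a short, known input this way shows exactly which step turns good text into bad text.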
Posted by agme (agme), 11 December 2004
Hi, Graham !!!
I tested all my rules and they work fine. The sub appliquer_regle works fine.
My new problem is in sub baliser_phrase and sub traiter_paragraphe. I don't know what to do.
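
One thing worth checking in those two subs is the spacing of the binding operator: `$texte = ~s/.../.../` is an assignment of a bitwise-negated substitution on `$_`, not `$texte =~ s/.../.../`. The patterns also contain unescaped `/` characters, which end the pattern early. The sketch below shows one possible corrected shape. It assumes the rules leave markers of the form `{NFDP/rule name}` and `{FDP/rule name}` in the text; those marker names are a guess based on the posted patterns, not confirmed by the rule file.

```perl
use strict;
use warnings;

# Turn rule markers into <phrase> elements.
# Assumption: markers look like {NFDP/regleNFDP1} and {FDP/regle_FDP_1}.
sub baliser_phrase {
    my ($texte) = @_;
    $texte =~ s!\{NFDP/[^}]*\}!!g;                   # drop "not end" markers
    $texte =~ s!\{FDP/[^}]*\}!</phrase><phrase>!g;   # split at "end" markers
    $texte = "<phrase>$texte</phrase>";
    $texte =~ s!<phrase>\s*</phrase>!!g;             # remove empty sentences
    return $texte;
}

# Wrap each run of text between newlines in a <para> element.
sub traiter_paragraphe {
    my ($texte) = @_;
    $texte =~ s!\n+!</para><para>!g;
    $texte = "<para>$texte</para>";
    $texte =~ s!<para></para>!!g;                    # remove empty paragraphs
    return $texte;
}
```

Using `!` as the delimiter avoids having to escape every `/` in the closing tags. If both subs are run on the same text, paragraphs are probably best split first, so that a sentence element never spans two paragraphs.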

Posted by admin (Graham Ellis), 12 December 2004
You wrote:
I don't know what to do.

on 12/11/04 at 08:39:17, Graham Ellis wrote:
... add in print statements to see what it's doing ... that will help you tie the problem down to a line or two of code, then hopefully to solve it.



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.

WELL HOUSE CONSULTANTS LTD
404, The Spa • Melksham, Wiltshire SN12 6QL • United Kingdom
PHONE: 01144 1225 708225 • FACSIMILE 01144 1225 707126 • EMAIL: info@wellho.net
Updated Sunday, April 28th 2024 Privacy and Copyright Statement © 2024