Perl Programming - Perl, mod-perl, CGI, etc.
Segment a text using Perl (help)
Posted by agme (agme), 10 December 2004
I'm a beginner in Perl. I would like to write a program that segments a text into paragraphs and sentences and then tokenises it to get all the words. I have defined 23 rules for "not end of sentence" (NFDP) and 13 rules for "end of sentence" (FDP), and I'll use regular expressions to apply them. Each rule is one line of four tab-separated fields, ($nom_regle), ($regex), ($marque) and ($type), for example:
regleNFDP1\t($regex)\tnon-fin-de-phrase\tmarque_milieu\n
Example (NFDP): X1 = PONCTO = ( . ? ! ), X3 = PARENTH = ( ( ), [ ], { } )
If TXT has X1 AND IF before X1 there is X3 AND IF after X1 there is X3 THEN X1 is not the end of a sentence.
Here is what I have done so far. Can someone help me correct it and make it work?
#!/usr/bin/perl -w
my $fichier = $ARGV[0];
my $regles = $ARGV[0].
open (F,$fichier);
my $texte = join("\n",<F>);
close (F);
main(texte);

sub main {
    my @PONCTO = ('\.','?','!');
    my @PONCT1 = ('\.',';','?','!',':');
    my @PONCT2 = ('\.',';','?','!',);
    my @PARENTH = ('(',')','{','}','[',']');
    my @ABREV =('ap','av','cf','env','ex','col','coll','diagr','éd','graph','fig','ill','l','n','p','pp','ouvr','paragr','sect','tabl','vol','M','MM');
    my @ABREVDPT =('exemple','1','2','3','\"','','ceci','Mail','mail','adresse','[N. B. ]','remarque','solution','appeler','nommer','ainsi','voici','comme','voilà','celui-ci');
    my $texte = shift;
    $texte = segmenter_phrase($texte);
    $texte = traiter_paragrpahe($texte);
    $texte = generer_docboom($texte);
    return $texte
}

sub segmenter_phrase{
    my $texte = shift;
    open (R, $regles) or die "erreur";
    while (my $regles =<R>){
        $texte =appliquer_regle($texte,$regles);
        chomp($regles);
        open(T, ">trace.txt");
        print T$texte;
        close(T);
        return $texte;
}

# rule processing / rule #1
sub appliquer_regle{
    my ($texte,$regle)=@_;
    my @ parties = split("\t",$regle);
    my ($nom_regle, $regex, $marque, $type)= @ parties;
    if ($type eq "marque_debut"){
        $texte = ~s/$regex/$1{$marque/$nom_regle}$2$3/g;}
    else if ($type eq "marque_milieu"){
        $texte = ~s/$regex/$1$2{$marque/$nom_regle}$3/g;}
    else{
        $texte = ~s/$regex/$1$2$3{$marque/$nom_regle}/g;}
    return $texte;
}

sub baliser_phrase{
    my $texte_marque = shift;
    $texte_marque = ~s/\{NFDP|/regle NFDP[0-9]+\}//g;
    $texte_marque = "<phrase>".$$texte_marque;
    $texte_marque = ~s/\{FDP|/regle FDP[0-9]+\}/<\/phrase><phrase>/g;
    $texte_marque = $texte_maque. "</phrase>";
    $texte_marque = ~s/<phrase>(\n)*<\/phrase>//;
    return $texte_marque;
}

sub traiter_paragraphe{
    $texte = shift;
    $texte = "<para>".$texte;
    $texte = ~s/(\n)/<\/para><para>/g;
    $texte = $texte."</para>"
    $texte = ~s/<para></para>//;
    return $texte;
}

sub generer_docbook{
    my $texte = shift;
    my $entete = "<?xml version=\"1.0\>"?";
    $entete = $entete."!DOCTYPE article PUBLIC.....";
    my $texte_docbook = $entete."<article>\n<section>";
    $texte_docbook = $texte_docbook "<totle/>";
    $texte_docbook = $texte_docbook.$texte. "</section>\n<\article>";
    return $text_docbook;
}
Posted by Custard (Custard), 10 December 2004
Hello.
I haven't even tried running it, or understanding it, but a quick skim through reveals..
Code: my $fichier = $ARGV[0]; my $regles = $ARGV[0].
Both variables are set from the same command-line argument. Is this what you want?
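If the text and the rules are meant to live in two different files, a minimal sketch of reading them from two separate arguments might look like this. The helper name lire_arguments and the usage message are assumptions made for illustration, not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: expect exactly two arguments, the text file
# first and the rules file second, and return them in that order.
sub lire_arguments {
    my @args = @_;
    die "usage: $0 texte.txt regles.txt\n" unless @args == 2;
    my ($fichier, $regles) = @args;
    return ($fichier, $regles);
}

# In the real script this would be lire_arguments(@ARGV);
# literal names are used here so the sketch runs on its own.
my ($fichier, $regles) = lire_arguments('texte.txt', 'regles.txt');
print "texte: $fichier, regles: $regles\n";
```

Checking the argument count up front means a forgotten rules file is reported immediately rather than surfacing later as a failed open.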
Code: main(texte);
Is missing the $ sigil, should be $texte.
Code: $texte = segmenter_phrase($texte); $texte = traiter_paragrpahe($texte); $texte = generer_docboom($texte);
generer_docboom probably ought to be generer_docbook.
Code: open(T, ">trace.txt"); print T$texte; close(T);
The print line should have a space after the T, like print T $texte;
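Another way to write the trace, sketched here under the assumption that the whole text is written in one go, is a lexical filehandle with the three-argument form of open and an explicit error check. The helper name ecrire_trace is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: write $texte to the file at $chemin,
# replacing any previous contents.
sub ecrire_trace {
    my ($chemin, $texte) = @_;
    # Three-argument open keeps the mode separate from the file name.
    open(my $t, '>', $chemin) or die "cannot open $chemin: $!";
    # The braces make it unambiguous that $t is the filehandle.
    print {$t} $texte;
    close($t) or die "cannot close $chemin: $!";
    return;
}
```

It would be called as ecrire_trace('trace.txt', $texte); the error messages include $! so a permissions or path problem is visible instead of silently producing no trace.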
My advice for a happier life would be to put:
use strict;
at the top of your program. This will show up a lot of errors, but once fixed, you will have a program in better shape. Then I will try and give it a run through.
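To give an idea of the shape the rule step might take once it compiles under use strict, here is a minimal sketch of appliquer_regle. It rests on several assumptions that the original post does not confirm: every rule regex has exactly three capture groups, the inserted tag has the form {marque/nom_regle}, and the rule line is chomped before it is split so that the last field compares cleanly. Note that Perl spells the binding operator =~ with no space, and uses elsif rather than else if.

```perl
use strict;
use warnings;

# Apply one tab-separated rule line (name, regex, mark, type) to a text.
# Assumption: the regex has exactly three capture groups.
sub appliquer_regle {
    my ($texte, $regle) = @_;
    chomp $regle;    # drop the newline so $type compares cleanly
    my ($nom_regle, $regex, $marque, $type) = split /\t/, $regle;

    # Assumed tag format: {marque/nom_regle}
    my $balise = '{' . $marque . '/' . $nom_regle . '}';

    if ($type eq 'marque_debut') {
        $texte =~ s/$regex/$1$balise$2$3/g;    # tag after the first group
    }
    elsif ($type eq 'marque_milieu') {
        $texte =~ s/$regex/$1$2$balise$3/g;    # tag after the second group
    }
    else {
        $texte =~ s/$regex/$1$2$3$balise/g;    # tag after the third group
    }
    return $texte;
}

# Illustrative rule: punctuation enclosed in parentheses does not end a sentence.
my $regle = join("\t", 'regleNFDP1', '(\()([.?!])(\))',
                 'non-fin-de-phrase', 'marque_milieu') . "\n";
print appliquer_regle('Quoi (?) Oui.', $regle), "\n";
# prints: Quoi (?{non-fin-de-phrase/regleNFDP1}) Oui.
```

Building the tag in a variable first keeps the slash out of the s/// replacement, where it would otherwise be read as the closing delimiter.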
HTH
B
This page is a thread posted to the opentalk forum
at www.opentalk.org.uk and
archived here for reference.