Segment a text using Perl (help with TreeTagger)

Posted by agme (agme), 10 December 2004
I'm a beginner in Perl. I would like to write a program that segments a text into paragraphs and sentences and then tokenises it to get all the words.
I'll define 23 rules for "not end of sentence" (NFDP) and 13 rules for "end of sentence" (FDP), and I'll apply them with regular expressions.
The syntax of a rule is:
regleNFDP1\t  non-fin-de-phrase\t marque_milieu\n

my ($nom_regle, $regex, $marque, $type)                                  

Example: NFDP
X1= PONCTO = ( . ? ! )
X3=PARENTH=( (), [ ], { })

If                 TXT has X1
AND IF       before X1 there is X3
AND IF       after  X1 there is X3
THEN          X1 is not the end of a sentence
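
As a rough illustration, the rule above could be written as a single substitution. This is only a sketch: the helper name `marquer_nfdp` and the literal mark text are made up for the example, and it assumes that "before" and "after" mean the characters immediately next to the punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original post.
# Assumption: "before" and "after" mean the characters directly next to X1.
sub marquer_nfdp {
    my ($texte) = @_;
    my $marque = '{non_fin_de_phrase}';
    # X3 (opening bracket), X1 (punctuation), X3 (closing bracket)
    $texte =~ s/([(\[{])([.?!])([)\]}])/$1$2$marque$3/g;
    return $texte;
}

my $resultat = marquer_nfdp("Vraiment (?) oui.");
print "$resultat\n";
```

The mark is inserted right after the bracketed punctuation, so a later pass can tell it apart from the real sentence end after "oui".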
Here is what I have done so far. Can someone help me correct it and make it work?

#!/usr/bin/perl -w
use strict;
use XML::dOM;

my $fichier = $ARGV[0];
my $regles = $ARGV[1];

open (F,$fichier);
my $texte = join("\n",<F>);
close (F);
main ($texte);

sub main {
     my $texte = shift;
     $texte = segmenter_phrase($texte);
     $texte = baliser_phrase($texte);
     $texte = traiter_paragraphe($texte);
     $texte = generer_docbook($texte);
     print $texte;
}

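sub segmenter_phrase{
     my $texte = shift;
     open (R, $regles) or die "erreur";
     # Apply each rule from the rules file in turn
     while (my $regles =<R>){
           chomp $regles;
           $texte = appliquer_regle($texte,$regles);
     }
     close (R);
     # Keep a copy of the marked-up text for debugging
     open (T, ">trace.txt");
     print T $texte;
     close (T);
     return $texte;
}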

# Rule processing: apply one rule to the text
sub appliquer_regle{
     my ($texte,$regles)=@_;
     my @parties = split("\t",$regles);
     my ($nom_regle, $regex, $marque, $type)= @parties;
     # ${1}\{ keeps Perl from reading $1{...} as a hash element
     if ($type eq "marque_debut"){
          $texte =~ s/$regex/${1}\{$marque\/$nom_regle\}$2$3/g;
     }
     elsif ($type eq "marque_milieu"){
          $texte =~ s/$regex/$1${2}\{$marque\/$nom_regle\}$3/g;
     }
     else{
          $texte =~ s/$regex/$1$2${3}\{$marque\/$nom_regle\}/g;
     }
     return $texte;
}
sub baliser_phrase{
     my $texte_marque = shift;
     # Drop the "not end of sentence" marks
     $texte_marque =~ s/\{non_fin_de_phrase\/regleNFDP[0-9]+\}//g;
     $texte_marque = "<phrase>" . $texte_marque;
     # Turn each "end of sentence" mark into a sentence boundary
     $texte_marque =~ s/\{fin_de_phrase\/regleFDP[0-9]+\}/<\/phrase><phrase>/g;
     $texte_marque =~ s/<phrase>[ ]*<\/phrase>//g;
     $texte_marque = $texte_marque . "</phrase>";
     return $texte_marque;
}
sub traiter_paragraphe{
     my $texte = shift;
     $texte = "<para>" . $texte;
     # Each run of newlines closes one paragraph and opens the next
     $texte =~ s/(\n)+/<\/para>\n<para><phrase>/g;
     $texte = $texte . "</para>";
     $texte =~ s/<para>(\n|[ ])*<\/para>//g;
     $texte =~ s/<\/para>/<\/phrase><\/para>/g;
     $texte =~ s/<phrase>[ ]*<\/phrase>//g;
     $texte =~ s/<para><\/phrase><\/para>//;
     return $texte;
}
sub generer_docbook{
     my $texte = shift;
     my $entete = "<?xml version='1.0' encoding='ISO-8859-1'?>\n";
     $entete = $entete . "<!DOCTYPE article PUBLIC.....>";
     my $texte_docbook = $entete . "<article>\n <section>\n";
     $texte_docbook = $texte_docbook . "     <title>";
     $texte_docbook = $texte_docbook . "</title>\n";
     $texte_docbook = $texte_docbook . $texte . "\n  </section>\n</article>\n";
     return $texte_docbook;
}

Posted by Custard (Custard), 10 December 2004

I haven't even tried running it, or understanding it, but a quick skim through reveals..

my $fichier = $ARGV[0];
my $regles = $ARGV[0].

These are the same argument on the command line.. Is this what you want?


Is missing the $ sigil, should be $texte.

     $texte = segmenter_phrase($texte);
     $texte = traiter_paragrpahe($texte);
     $texte = generer_docboom($texte);

generer_docboom  probably ought to be generer_docbook.

open(T, ">trace.txt");
 print T$texte;

The print line should have a space after the T, like
print T $texte;

My advice for a happier life would be to put:

use strict;

at the top of your program. This will show up a lot of errors, but once fixed, you will have a program in better shape.
Then I will try and give it a run through.



Posted by agme (agme), 10 December 2004
Thanks a lot !!!
I changed what you told me.
Here are some of the rules that I wrote in a .txt file:

regleNFDP1      ([\(\[\{][A-Z][a-z]*|[A-Z]*|[a-z]*)([\.\:\;\?\!\:])([A-Z][a-z]*|[A-Z]*|[a-z]*)(\.)([A-Z][a-z]*|[A-Z]*|[a-z]*[\)\]\}])      non_fin_de_phrase      marque_milieu1

regle_FDP_1      (\"[\.;?!])      fin_de_phrase      marque_debut
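
One thing that may be worth checking is how each rule line is split into its fields. The sketch below assumes the four-column layout used by the posted program (name, regex, mark, type) with single tabs between the columns; the sample line itself is hypothetical. Note also that the rules above end in `marque_milieu1` and use the name `regle_FDP_1`, which the `eq "marque_milieu"` test and the `regleFDP[0-9]+` pattern in the posted program would not match.

```perl
use strict;
use warnings;

# Hypothetical rule line in the four-column layout used by the posted program
# (name, regex, mark, type), assumed to be separated by single tabs.
my $ligne = join("\t", 'regleFDP1', '(")([.;?!])()', 'fin_de_phrase', 'marque_milieu') . "\n";

chomp $ligne;    # without this, the last field would keep its trailing newline
my ($nom_regle, $regex, $marque, $type) = split /\t/, $ligne;

# Compile the pattern once; eval traps a malformed expression so it can be reported by name.
my $motif = eval { qr/$regex/ };
die "Rule $nom_regle has a bad regex: $@" unless defined $motif;

print "name=$nom_regle mark=$marque type=[$type]\n";
```

Printing the type inside square brackets makes a stray newline or space at the end of the field easy to spot.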

Posted by admin (Graham Ellis), 11 December 2004
Hi .... I've had a look through the original post and the follow-up, and I confess I'm not feeling very brave to jump in here.  Agme - some comments and some sample data would help (I think I'm being hampered too by the language - the variable names are giving me few clues).  I also look at that regular expression in your final post and I'm reminded that "two short regular expressions are much easier to follow than one long one ..."
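
One way to act on the "short regular expressions" advice in Perl is to build the rule from small named pieces with `qr//`. The pieces below are hypothetical and only loosely modelled on the posted rule; they are a sketch of the technique, not a replacement for it.

```perl
use strict;
use warnings;

# Hypothetical building blocks, loosely modelled on the rule posted above.
my $ouvrant = qr/[(\[{]/;       # opening bracket
my $fermant = qr/[)\]}]/;       # closing bracket
my $mot     = qr/[A-Za-z]*/;    # a run of letters, possibly empty
my $ponct   = qr/[.:;?!]/;      # punctuation that could end a sentence

# The full rule is then assembled from the named pieces.
my $regle = qr/($ouvrant$mot)($ponct)($mot$fermant)/;

my $texte = "voir (cf.infra) ici";
if ($texte =~ $regle) {
    print "before=[$1] punctuation=[$2] after=[$3]\n";
}
```

Each piece can be tested on its own before it is combined, which is much harder with one long expression.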

What's the current status?  You say you have some problems with one particular sub ... have you tried isolating it and testing it in a test harness, adding in print statements to see what it's doing, etc?   That will help you tie the problem down to a line or two of code, then hopefully to solve it.
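
A test harness for a single sub can be very small. The sketch below uses a simplified stand-in for `appliquer_regle` (middle mark only) and a made-up rule and input, so it illustrates the approach rather than the real program; the idea is to paste the real sub in its place.

```perl
use strict;
use warnings;

# Simplified stand-in for the sub under test (middle-mark case only);
# paste the real sub here instead.
sub appliquer_regle {
    my ($texte, $regle) = @_;
    chomp $regle;
    my ($nom_regle, $regex, $marque, $type) = split /\t/, $regle;
    my $balise = '{' . $marque . '/' . $nom_regle . '}';
    $texte =~ s/$regex/$1$2$balise$3/g;
    return $texte;
}

# Hypothetical rule and input, chosen so the expected result is easy to check by eye.
my $regle  = join("\t", 'regleFDP1', '([a-z])([.?!])( )', 'fin_de_phrase', 'marque_milieu');
my $entree = "Un test. Encore un.";

print STDERR "avant: $entree\n";
my $sortie = appliquer_regle($entree, $regle);
print STDERR "apres: $sortie\n";
```

Running one rule against one short sentence like this shows straight away whether the mark lands where it should.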

Posted by admin (Graham Ellis), 12 December 2004
You wrote:
I don't know what to do.

on 12/11/04 at 08:39:17, Graham Ellis wrote:
... add in print statements to see what it's doing ... that will help you tie the problem down to a line or two of code, then hopefully to solve it.

Posted by admin (Graham Ellis), 14 December 2004
I don't think I understand enough about the application as a whole and the program "Tree Tagger" to be able to give you a good answer.  Is "Tree Tagger" another program of yours, or a piece of software from elsewhere?  What's it written in, and does it have hooks through which you can tie in your code?

Posted by agme (agme), 15 December 2004
I'm using Eclipse. When I run the program with the text to be segmented and the file containing all 36 rules, this is the result I get:

<?xml version='1.0' encoding='ISO-8859-1?>
<!DOCTYPE article PUBLIC...>
<para><phrase>L'Irak meurtri après trois jours de violences intenses.</phrase></para>
<para><phrase>LEMONDE.FR</phrase> | 25.10.04 | 09h49.</phrase></para>
<para><phrase>Consultez nos dossiers, l'analyse approfondie de grands sujets d'actualité. </phrase>

The next step is to send each "phrase" (sentence) to TreeTagger to get a morphosyntactic analysis.

Posted by Custard (Custard), 15 December 2004

I'm sure we are all confused here..

Is it this Tree Tagger you are using?

If so, one of the examples on the page is

echo 'Das ist ein Test.' | cmd/tagger-chunker-german

This implies that it is a set of commands that you can run and collect the output from:

Is this what you want to do?


Posted by agme (agme), 16 December 2004
I have updated the program that segments the text, and it works fine. I looked at the page you sent me; I already had TreeTagger, and that is the one I want. I now also have Flemm, because I need both of them.

sub TreeTagger {
     print "Fichier de sortie $fichier.tagger.xml\n";
     open( F, ">$fichier.tagger.xml" ) or die "Erreur";
     my $dom_parser = new XML::dOM::parser;
     my $doc = $dom_parser->parsefile($fichier);

# Process one sentence
     print "Traitement de la phrase ...\n";
     print F "<" .my $noeud->getTagName . ">";
     my $fichier_temp = "temp.txt";
     open( PHRASE, ">$fichier_temp" ) or die "Erreur\n";
     print PHRASE $noeud->getFirstChild()->getData;
     close(PHRASE);    # flush the sentence to disk before TreeTagger reads it

# Run TreeTagger
     my @tab = ( "tag-french", $fichier_temp, ">$fichier_temp.tag" );
     system @tab;

# Run Flemm on the TreeTagger output
     my @tab2 = (
           "perl",         "D:/Flemm/flemm.perl",
           "--entree",     "$fichier_temp.tag",
           "--repertoire", "D:/Flemmv2",
           "--sortie",     "$fichier_temp.flemm",
           "--tagger",     "TreeTagger"
     );
     system @tab2;

# Read the Flemm output and write one <mot> element per line
     open( G, "$fichier_temp.flemm" ) or die "Erreur";
     while ( my $ligne = <G> ) {
           my @tab = split( "\t", $ligne );
           print F "<mot ";
           if ( $tab[0] eq "\"" ) {
                 print F "token=\'$tab[0]\' ";
           }
           else {
                 print F " token=\"$tab[0]\" ";
           }
           print F " cat=\"$tab[1]\" ";
           if ( $tab[2] eq "\"" ) {
                 print F " lemme=\'$tab[2]\' ";
           }
           else {
                 print F " lemme=\"$tab[2]\" ";
           }
           print F "/>\n";
     }
# End of Tag
     print F "</phrase>";

     print F "</para>";
}

This is the second part of the program, which doesn't work because I don't know how to call TreeTagger. Can you help me, or have I still not explained myself? Sorry....

Posted by Custard (Custard), 16 December 2004

Can you post the error you are getting?

I know nothing of the Tree Tagger, apart from trying to find out what it was you were using.  So I am not really able to help you there.
It did look though from one of the examples that it takes its input on STDIN,
so in your system calls you might put something like

cat source_text.txt | tag-french
echo $some_text | tag-french

in the command somewhere.
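
From Perl, one way to feed text to a command on STDIN and read back what it prints is the core module IPC::Open2. The sketch below assumes, as the example above suggests, that the tagger reads STDIN and writes to standard output; the helper name `filtrer` is made up, and perl itself stands in for `tag-french` so the example runs without TreeTagger installed. It is also worth knowing that when `system` is given a list, at least on Unix-like systems, no shell is involved, so an element such as ">temp.txt.tag" is handed to the program as an ordinary argument rather than treated as a redirection.

```perl
use strict;
use warnings;
use IPC::Open2;

# Hypothetical helper: send $texte to a command on STDIN and return its output lines.
sub filtrer {
    my ($texte, @commande) = @_;
    my $pid = open2(my $sortie, my $entree, @commande);
    print {$entree} $texte;
    close $entree;               # tell the command there is no more input
    my @lignes = <$sortie>;
    close $sortie;
    waitpid($pid, 0);            # reap the child process
    return @lignes;
}

# Stand-in command: perl itself, upper-casing whatever it reads.
# With TreeTagger installed this might instead be filtrer($phrase, 'tag-french').
my @lignes = filtrer("Das ist ein Test.\n", $^X, '-pe', '$_ = uc');
print @lignes;
```

This works well for one sentence at a time. For very large inputs, writing everything before reading anything back can block, so a temporary file may be the safer choice there.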

Also, you have a small 'd' in XML::dOM, which would produce a compile-time error.

Have a go at these, and let us know what happens.
And if you get errors, it would be very helpful to see them.
We don't understand your program as well as you do, so we need all the clues we can get to help us to help you.



This page is a thread posted to the opentalk forum and archived here for reference.
