FILE CONVERSION TO AN ARRAY OF ARRAYS

Posted by geez_itsjustme (geez_itsjustme), 21 February 2005
Hi!

Given the following file input, where each line holds a Document Id, a Sentence Id and a Noun Phrase from a sentence (the phrase is to be treated as a single string, white space included):


NYT19990603.0323 1 FED-RIVLIN (Washington) _ Alice Rivlin
NYT19990603.0323 1 the number
NYT19990603.0323 1 2 official
NYT19990603.0323 1 the Federal Reserve Board
NYT19990608.0356 2 Rivlin's marriage
NYT19990608.0356 2 's
NYT19990608.0356 2 Alice Rivlin
NYT19990608.0356 2 the vice chairwoman
NYT19990608.0356 2 the Federal Reserve
NYT19990608.0356 2 divorce
NYT19990608.0356 2 more than 20 years


Any tips on how I can efficiently read this file into an array of arrays, where each line is an array with 3 elements and those arrays are then put into an array?

Secondly, later I intend to cluster similar noun phrases together with their attached doc_id and qu_id (I will determine the similarity measure; it doesn't have to be an exact match, e.g. it could be the same head noun), identifying those that I think are similar and grouping them together, probably in an array of arrays.

I thought I would use a hash. Anyway, any suggestions on how to efficiently match each noun phrase against every other, considering that I'll have at least 500 noun phrases to match (500*500)?

Thanks in advance

lulu lili

Posted by admin (Graham Ellis), 22 February 2005
Hi, the following example should help you get started - I've read your data in and set up a list of lists and also a hash that contains the data where you can look it up based on the phrase.  In order to represent your similarity grouping, I've "squashed case" and removed white space - you might want to look at a metaphone or soundex system as we do on our site search.

You are concerned about 500 x 500 searches. You don't need to be - the hash provides you with a better and far more efficient way: each phrase is filed under its key in a single pass, so nothing has to be compared pair by pair. I duplicated your data up to 500 lines during testing and it came straight back to me, completed. And I'm using just an ordinary little laptop.

Code:
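# Sample program by Graham Ellis
# of Well House Consultants

use strict;
use warnings;

my (@table, %byphrase);

while (my $line = <DATA>) {
       chomp($line);
       next unless $line =~ /\S/;   # skip blank lines

# Setting up a list of lists with the data
# (the limit of 3 keeps the phrase, spaces and all, as one field)
       my @parts = split(/\s+/,$line,3);
       push @table,\@parts;

# Setting up a hash based on phrases
# Key is the phrase in lower case with white space removed
       my $squashedkey = lc($parts[2]);
       $squashedkey =~ s/\s//g;
       push @{$byphrase{$squashedkey}},\@parts;
       }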

# Test a few elements out by printing them

# @table - a list of input lines. Each row has elements
# 0 - document id 1 - sentence id 2 - noun phrase

print $table[0][0],"\n";
print $table[0][1],"\n";
print $table[0][2],"\n";
print $table[9][0],"\n";

print "==============\n";

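# %byphrase - a hash of input lines
# Key - the phrase (lower case, white space removed)
# contains a list of rows, one for each line matching that key
# First subscript - which match; second subscript - 0 for the
# document id, 1 for the sentence id and 2 for the phrase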

print $byphrase{"thenumber"}[0][0],"\n";
print $byphrase{"thenumber"}[0][1],"\n";
print $byphrase{"thenumber"}[0][2],"\n";
print $byphrase{"thenumber"}[1][0],"\n";
print $byphrase{"thenumber"}[1][1],"\n";
print $byphrase{"thenumber"}[1][2],"\n";
print $byphrase{"divorce"}[0][0],"\n";


__END__
NYT19990603.0323 1 FED-RIVLIN (Washington) _ Alice Rivlin
NYT19990603.0323 1 the number
NYT19990603.0323 1 2 official
NYT19990603.0323 1 the Federal Reserve Board
NYT19990608.0356 2 Rivlin's marriage
NYT19990608.0356 2 's
NYT19990608.0356 2 Alice Rivlin
NYT19990608.0356 2 the vice chairwoman
NYT19990608.0356 2 the Federal Reserve
NYT19990608.0356 2 divorce
NYT19990608.0356 2 more than 20 years
NYT19990713.0378 4 TheNumber


Just to note - I duplicated and altered one line of data to test and demonstrate the "fuzzy" matching and multiple hits to the same phrase.
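
To illustrate how the hash does the grouping: once every line has been filed under its squashed key, the number of rows under a key is the frequency of that phrase. The following is only a minimal sketch and not part of the program above; the sub names squash_key and group_by_phrase are made up for illustration, and the three sample lines are taken from the data in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case a phrase and strip all white space,
# the same "squashing" used for the hash keys above
sub squash_key {
    my ($phrase) = @_;
    my $key = lc $phrase;
    $key =~ s/\s//g;
    return $key;
}

# Hypothetical helper: file each "docid sentenceid phrase" line under
# its squashed phrase. Returns a reference to a hash of lists of rows.
sub group_by_phrase {
    my @lines = @_;
    my %byphrase;
    for my $line (@lines) {
        my @parts = split /\s+/, $line, 3;
        next unless @parts == 3;          # ignore short or blank lines
        push @{ $byphrase{ squash_key($parts[2]) } }, \@parts;
    }
    return \%byphrase;
}

# Sample lines taken from the data in this thread
my @sample = (
    "NYT19990603.0323 1 the number",
    "NYT19990608.0356 2 divorce",
    "NYT19990713.0378 4 TheNumber",
);

my $groups = group_by_phrase(@sample);

# The frequency of a phrase is the number of rows under its key
for my $key (sort keys %$groups) {
    my $rows = $groups->{$key};
    printf "%d - %s\n", scalar(@$rows), $key;
    print "      $_->[0] sentence $_->[1]\n" for @$rows;
}
```

Each row keeps its document id and sentence id, so the source of every phrase in a group can still be looked up later.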

Posted by admin (Graham Ellis), 22 February 2005
Just to add - here's the output from my example program ...

Code:
earth-wind-and-fire:~/feb05 grahamellis$ perl lol
NYT19990603.0323
1
FED-RIVLIN (Washington) _ Alice Rivlin
NYT19990608.0356
==============
NYT19990603.0323
1
the number
NYT19990713.0378
4
TheNumber
NYT19990608.0356
earth-wind-and-fire:~/feb05 grahamellis$


Posted by geez_itsjustme (geez_itsjustme), 23 February 2005
Hi!

Thanks for da lead. However, I've realised I didn't give a very clear picture of what the program is meant to do. In essence I'm attempting to group "similar" noun phrases together (note that I don't want to get rid of similar noun phrases). The similarity in this case is not in the doc_id or the sentence_id but in the noun phrase itself. The attached ids are for later reference, to identify which doc and sentence each phrase was obtained from.

Similarity of each noun phrase could be an exact word match, a 50% word match, the same head noun (last word in the noun phrase) etc. Hence, depending on the similarity measure, I'll group similar noun phrases together, probably in an array.

Hence the plan is to input the previous file into an array of arrays, then do the matching and output into another array of arrays (or probably use the same array/hash and "sort" it, in this case by applying the similarity measure). The reason I want to output all the similar NPs to an array is that later I will want the frequency (the length of that array).

So, my question is how can I match each NP which is a single string (be able to identify each word in the string separately) to other NP. Then put all similar NPs in an array. Or any other tips on the general program...

thanks,

lili lulu  

Posted by geez_itsjustme (geez_itsjustme), 23 February 2005
EXAMPLE
Given the following list of noun phrases obtained from two sentences describing the same person, you can see that there is a word overlap in the three identified NPs (marked *). Hence these similar phrases are clustered and a frequency is obtained, as seen further below.


*Federal Reserve Vice Chairman Alice Rivlin
more than one interview
the current stock market
any kind
valuation
Last week
*Alice Rivlin
whom
Clinton
appointed
*vice chairman
*the Federal Reserve
her resignation
the central bank effective July 16

******************************************

3 - Federal Reserve Vice Chairman
1 - more than one interview
1 - the current stock market
1 - any kind
1 - valuation
1 - Last week
1 - whom
1 - Clinton
1 - her resignation
1 - the central bank effective July 16


The NP "federal reserve vice chairman" has three instances (the intended length of an array), whereas the rest have only one instance.

hope dat puts it into perspective.

lulu lili
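
The clustering described in this post could be sketched roughly as follows. This is only one possible reading, under stated assumptions: similarity is taken as the number of shared words divided by the word count of the shorter phrase, a phrase joins the first cluster whose first member it matches at 50% or more, and the names words_of, word_overlap and cluster_phrases are invented for illustration. The sample phrases come from the list above; the counts this particular measure produces need not match the hand-made counts in the post.

```perl
use strict;
use warnings;

# Hypothetical helper: split a phrase into lower-case words
sub words_of {
    my ($phrase) = @_;
    return grep { length($_) } split /\s+/, lc $phrase;
}

# Assumed similarity measure: shared words divided by the number of
# distinct words in the shorter phrase, giving a value from 0 to 1
sub word_overlap {
    my ($first, $second) = @_;
    my %in_first  = map { $_ => 1 } words_of($first);
    my %in_second = map { $_ => 1 } words_of($second);
    my $shared    = grep { $in_second{$_} } keys %in_first;
    my $n_first   = keys %in_first;
    my $n_second  = keys %in_second;
    my $shorter   = $n_first < $n_second ? $n_first : $n_second;
    return $shorter ? $shared / $shorter : 0;
}

# Greedy clustering: a phrase joins the first cluster whose first
# member it overlaps with by at least $threshold, else starts its own
sub cluster_phrases {
    my ($threshold, @phrases) = @_;
    my @clusters;                      # an array of arrays of phrases
    PHRASE: for my $phrase (@phrases) {
        for my $cluster (@clusters) {
            if (word_overlap($cluster->[0], $phrase) >= $threshold) {
                push @$cluster, $phrase;
                next PHRASE;
            }
        }
        push @clusters, [$phrase];
    }
    return @clusters;
}

my @nps = (
    "Federal Reserve Vice Chairman Alice Rivlin",
    "more than one interview",
    "the current stock market",
    "any kind",
    "Alice Rivlin",
    "vice chairman",
    "the Federal Reserve",
);

my @clusters = cluster_phrases(0.5, @nps);

# The frequency of each cluster is the length of its array
for my $cluster (@clusters) {
    printf "%d - %s\n", scalar(@$cluster), $cluster->[0];
}
```

Each phrase is compared only with one representative per cluster, so the work grows with the number of clusters rather than with every possible pair of phrases.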

Posted by admin (Graham Ellis), 23 February 2005
Do have a look at metaphones, soundex, Levenshtein distances, etc. - for example, starting with the link I gave you earlier; that's about how I've done some similar stuff in PHP rather than Perl, but the principle is similar.
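
As a rough illustration of the Levenshtein idea, here is a minimal pure-Perl sketch of the standard dynamic-programming calculation. The sub name levenshtein and the sample words are chosen here for illustration only, and how small a distance counts as "similar" is an assumption you would have to settle for your own data.

```perl
use strict;
use warnings;

# Levenshtein distance: the number of single-character insertions,
# deletions and substitutions needed to turn $from into $to
sub levenshtein {
    my ($from, $to) = @_;
    my @from = split //, $from;
    my @to   = split //, $to;

    # @prev holds the distances for the previous row of the table
    my @prev = (0 .. scalar(@to));
    for my $i (1 .. scalar(@from)) {
        my @curr = ($i);
        for my $j (1 .. scalar(@to)) {
            my $cost = $from[$i - 1] eq $to[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                         # substitute
            $best = $prev[$j] + 1     if $prev[$j] + 1     < $best;   # delete
            $best = $curr[$j - 1] + 1 if $curr[$j - 1] + 1 < $best;   # insert
            $curr[$j] = $best;
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print levenshtein("chairman", "chairwoman"), "\n";   # prints 2
```

A small distance relative to the word length, as here, suggests the two words might belong in the same group.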

You might also want to get involved with noun lists and grouping code that will eliminate words like "the" and "a" and "is".
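
A minimal sketch of that idea, assuming a tiny hand-picked stop list and taking the head noun to be the last remaining word (the definition given earlier in this thread). The names %stopword, content_words and head_noun are hypothetical, and a real stop list would be a good deal longer.

```perl
use strict;
use warnings;

# Assumed stop list, purely for illustration
my %stopword = map { $_ => 1 } qw(the a an is of);

# Hypothetical helper: lower-case words of a phrase, stop words removed
sub content_words {
    my ($phrase) = @_;
    return grep { length($_) && !$stopword{$_} } split /\s+/, lc $phrase;
}

# Hypothetical helper: the head noun, taken as the last remaining word
sub head_noun {
    my ($phrase) = @_;
    my @words = content_words($phrase);
    return @words ? $words[-1] : '';
}

# Group some phrases from this thread by head noun
my %byhead;
for my $np ("the Federal Reserve", "the Federal Reserve Board",
            "Federal Reserve", "the vice chairwoman") {
    push @{ $byhead{ head_noun($np) } }, $np;
}

for my $head (sort keys %byhead) {
    printf "%d - %s: %s\n", scalar(@{ $byhead{$head} }), $head,
        join(", ", @{ $byhead{$head} });
}
```

With the stop words gone, "the Federal Reserve" and "Federal Reserve" end up under the same key without any pairwise comparison.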

If you have to match similar numbers too, have a look at http://www.wellho.net/mouth/202_Searching-for-numbers.html

Quote:
So, my question is how can I match each NP which is a single string (be able to identify each word in the string separately) to other NP. Then put all similar NPs in an array. Or any other tips on the general program...


I would be very tempted to group by metaphone.  
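
To show what grouping by a phonetic key looks like, here is a sketch that uses a hand-written Soundex as a simpler stand-in for metaphone (it is not metaphone, and the two give different codes). The sub name soundex, the %code table and the choice to key on the last word of each phrase are all assumptions made for this example.

```perl
use strict;
use warnings;

# Soundex digit for each coded consonant; vowels, H, W and Y have none
my %code;
@code{qw(B F P V)}         = (1) x 4;
@code{qw(C G J K Q S X Z)} = (2) x 8;
@code{qw(D T)}             = (3) x 2;
$code{L}                   = 4;
@code{qw(M N)}             = (5) x 2;
$code{R}                   = 6;

# Hand-written Soundex: first letter plus three digits
sub soundex {
    my ($word) = @_;
    my $letters = uc $word;
    $letters =~ s/[^A-Z]//g;
    return '' unless length $letters;

    my @chars  = split //, $letters;
    my $result = $chars[0];
    my $last   = $code{ $chars[0] } || 0;
    for my $ch (@chars[1 .. $#chars]) {
        next if $ch eq 'H' || $ch eq 'W';   # these do not separate codes
        my $digit = $code{$ch} || 0;
        $result .= $digit if $digit && $digit != $last;
        $last = $digit;                     # a vowel resets to 0
    }
    return substr($result . '000', 0, 4);
}

# Group phrases by the Soundex code of their last word
my %bysound;
for my $np ("vice chairman", "the vice chairwoman", "the Federal Reserve") {
    my ($head) = $np =~ /(\S+)\s*$/;
    push @{ $bysound{ soundex($head) } }, $np;
}

for my $key (sort keys %bysound) {
    printf "%s - %s\n", $key, join(", ", @{ $bysound{$key} });
}
```

Here "chairman" and "chairwoman" share a code, so the two phrases land in the same group even though the words differ.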



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.

