FILE CONVERSION TO AN ARRAY OF ARRAYS
Posted by geez_itsjustme (geez_itsjustme), 21 February 2005
Hi!
Given the following file input, where each line represents a Document Id, a Sentence Id and a Noun Phrase from a sentence (the noun phrase is to be treated as a single string containing white space):
NYT19990603.0323 1 FED-RIVLIN (Washington) _ Alice Rivlin
NYT19990603.0323 1 the number
NYT19990603.0323 1 2 official
NYT19990603.0323 1 the Federal Reserve Board
NYT19990608.0356 2 Rivlin's marriage
NYT19990608.0356 2 's
NYT19990608.0356 2 Alice Rivlin
NYT19990608.0356 2 the vice chairwoman
NYT19990608.0356 2 the Federal Reserve
NYT19990608.0356 2 divorce
NYT19990608.0356 2 more than 20 years
Any tips on how I can efficiently read this file into an array of arrays, where each line becomes an array with 3 elements and those arrays are then put into an outer array?
Secondly, later I intend to cluster similar noun phrases together with their attached doc_id and qu_id. I will determine the similarity measure; it does not have to be an exact match (e.g. it could be the same head noun). I would identify the phrases that I think are similar and group them together, probably in an array of arrays.
I thought I would use a hash. Anyway, any suggestions on how to efficiently match each noun phrase against every other? Consider that I'll have at least 500 noun phrases to match (500*500 comparisons).
Thanks in advance
Posted by admin (Graham Ellis), 22 February 2005
Hi, the following example should help you get started - I've read your data in and set up a list of lists and also a hash that contains the data where you can look it up based on the phrase. In order to represent your similarity grouping, I've "squashed case" and removed white space - you might want to look at a metaphone or soundex system as we do on our site search.
You are concerned about 500 x 500 searches. You don't need to be - the hash gives you a better and far more efficient way, because each phrase is looked up by its key rather than compared against every other phrase. I duplicated your data up to 500 lines during testing and it came straight back to me, completed. And I'm using just an ordinary little laptop.
Just to note - I duplicated and altered one line of data to test and demonstrate the "fuzzy" matching and multiple hits to the same phrase.
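The example program itself is not reproduced in this thread. As a rough stand-in, here is a minimal sketch of the approach described above, written in Python rather than Perl (the same structure maps onto a Perl array of arrays plus a hash of arrays). The function names, the `SAMPLE` data and the last sample line (an altered duplicate made up to show a "fuzzy" hit) are assumptions for illustration only.

```python
import io
from collections import defaultdict

# A few lines from the question, plus one made-up altered duplicate (last line).
SAMPLE = """\
NYT19990603.0323 1 the number
NYT19990603.0323 1 the Federal Reserve Board
NYT19990608.0356 2 Alice Rivlin
NYT19990608.0356 2 the Federal Reserve
NYT19990608.0356 2 The  Federal reserve
"""

def read_records(handle):
    """Return a list of [doc_id, sentence_id, phrase] lists."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Split on the first two runs of white space only,
        # so the noun phrase keeps its internal spaces.
        parts = line.split(None, 2)
        if len(parts) == 3:
            records.append(parts)
    return records

def normalise(phrase):
    """Squash case and remove all white space (hypothetical key function)."""
    return "".join(phrase.lower().split())

def index_by_phrase(records):
    """Map each normalised phrase to every record that produced it."""
    index = defaultdict(list)
    for doc_id, sentence_id, phrase in records:
        index[normalise(phrase)].append((doc_id, sentence_id, phrase))
    return index

records = read_records(io.StringIO(SAMPLE))
index = index_by_phrase(records)
for key, hits in index.items():
    print(len(hits), key)
```

For a real file, pass an open file handle to `read_records` instead of the `io.StringIO` wrapper. Building the index is a single pass over the data, which is why 500 phrases do not need 500 x 500 comparisons when the grouping key can be computed from each phrase on its own.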
Posted by admin (Graham Ellis), 22 February 2005
Just to add - here's the output from my example program ...
Posted by geez_itsjustme (geez_itsjustme), 23 February 2005
Hi!
Thanks for da lead. However, I've realised I didn't give a very clear picture of the program's intention. In essence I'm attempting to group "similar" noun phrases together (note that I don't want to get rid of similar noun phrases). The similarity in this case is not in the doc_id or the sentence_id but in the noun phrase itself. The attached ids are for later reference, to identify which doc and sentence each phrase was obtained from.
The similarity of two noun phrases could be an exact word match, a 50% word match, the same head noun (the last word in the noun phrase), etc. Hence, depending on the similarity measure, I'll group similar noun phrases together, probably in an array.
So the plan is to read the previous file into an array of arrays, then do the matching and output into another array of arrays (or possibly use the same array/hash and "sort" it, in this case by applying the similarity measure). The reason I want to output all the similar NPs to an array is because later I will want the frequency (length of array+1).
So, my question is: how can I match each NP, which is a single string (while still being able to identify each word in the string separately), against the other NPs, and then put all similar NPs in an array? Any other tips on the general program are welcome too...
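One possible reading of the three similarity measures mentioned above, sketched in Python rather than Perl. The function names, the choice to ignore case, and the decision to divide the overlap by the shorter phrase's word count are all assumptions; the post does not say how the "50% word match" should be computed.

```python
def words(phrase):
    """Split a noun phrase into lower-cased words."""
    return phrase.lower().split()

def exact_match(a, b):
    """True when both phrases contain the same words in the same order."""
    return words(a) == words(b)

def word_overlap(a, b):
    """Fraction of the shorter phrase's distinct words found in the other."""
    wa, wb = set(words(a)), set(words(b))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

def same_head_noun(a, b):
    """True when the last word (taken as the head noun) is the same."""
    wa, wb = words(a), words(b)
    return bool(wa) and bool(wb) and wa[-1] == wb[-1]

def similar(a, b, threshold=0.5):
    """Combine the three measures; the 0.5 threshold is the '50%' rule."""
    return (exact_match(a, b)
            or same_head_noun(a, b)
            or word_overlap(a, b) >= threshold)

print(similar("the Federal Reserve",
              "Federal Reserve Vice Chairman Alice Rivlin"))
```

Splitting each phrase into words only at comparison time keeps the stored phrase as a single string, so the doc and sentence ids can stay attached to it unchanged.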
Posted by geez_itsjustme (geez_itsjustme), 23 February 2005
EXAMPLE
Given the following list of noun phrases obtained from two sentences describing the same person, you can see that there is a word overlap in the three identified NPs. Hence these similar NPs are clustered and the frequency is obtained, as seen further below.
*Federal Reserve Vice Chairman Alice Rivlin
more than one interview
the current stock market
*the Federal Reserve
the central bank effective July 16
3 - Federal Reserve Vice Chairman
1 - more than one interview
1 - the current stock market
1 - any kind
1 - valuation
1 - Last week
1 - whom
1 - Clinton
1 - her resignation
1 - the central bank effective July 16
The NP "federal reserve vice chairman" has three instances (the intended length of its array); the rest have only one instance each.
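A minimal sketch of that grouping step, in Python rather than Perl. It uses a simple greedy pass in which each phrase joins the first group whose first member it resembles; this is an assumed strategy, not something specified in the thread, and the result can depend on input order. `shares_words` is a hypothetical 50% word-overlap test, and the phrases are the five listed in the example above.

```python
def shares_words(a, b, threshold=0.5):
    """Hypothetical test: at least half of the shorter phrase's words overlap."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return False
    return len(wa & wb) / min(len(wa), len(wb)) >= threshold

def group_phrases(phrases, is_similar=shares_words):
    """Greedy grouping: compare each phrase with the first member of each group."""
    groups = []
    for phrase in phrases:
        for group in groups:
            if is_similar(group[0], phrase):
                group.append(phrase)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append([phrase])
    return groups

phrases = [
    "Federal Reserve Vice Chairman Alice Rivlin",
    "more than one interview",
    "the current stock market",
    "the Federal Reserve",
    "the central bank effective July 16",
]

groups = group_phrases(phrases)
for group in sorted(groups, key=len, reverse=True):
    # The frequency is the number of phrases stored in the group.
    print(len(group), "-", group[0])
```

If every member, including the first, is stored in its group, the frequency is just the group's length. To keep the doc and sentence ids, store whole records in the groups and compare on the phrase field only.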
hope dat puts it into perspective.
Posted by admin (Graham Ellis), 23 February 2005
Do have a look at metaphones, Soundex, Levenshtein distances, etc - for example, starting with the link I gave you earlier; that's how I've done some similar stuff in PHP rather than Perl, but the principle is similar.
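Of the techniques named here, the Levenshtein distance is the easiest to show in a few lines. This is a plain sketch of the standard dynamic-programming version, in Python rather than Perl or PHP; it is not the code used on the site, and any cut-off for "close enough" is left for you to choose.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

print(levenshtein("chairman", "chairwoman"))
```

A small distance between two head nouns (for instance "chairman" and "chairwoman") could be treated as a match, although this is still a pairwise comparison, unlike a metaphone or Soundex key that can be used directly as a hash key.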
You might also want to get involved with noun lists and grouping code that will eliminate words like "the" and "a" and "is".
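As a small illustration of that idea (again in Python rather than Perl), common words can be filtered out before phrases are compared. The `STOP_WORDS` set below is a tiny made-up list, not a standard one, and would need extending for real data.

```python
# Illustrative stop-word list only; extend it to suit your own data.
STOP_WORDS = {"the", "a", "an", "is", "of"}

def content_words(phrase, stop_words=STOP_WORDS):
    """Lower-case the phrase and drop any stop words."""
    return [w for w in phrase.lower().split() if w not in stop_words]

print(content_words("the Federal Reserve"))
```

Comparing on `content_words` keeps two phrases from being grouped just because both start with "the".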
If you have to match similar numbers too, have a look at http://www.wellho.net/mouth/202_Searching-for-numbers.html
I would be very tempted to group by metaphone.