merging files with similar records

Posted by mila1982 (mila1982), 22 November 2006

guys

I'm totally new to Perl and am struggling to understand how to complete the following task. If you could help me out I would appreciate it a lot. This is not homework or anything; I'm just trying to understand how Perl works.

I have two files, e.g. SCI1 and SCI2, and both contain records. Some of these records are identical, the only difference being that one of them may have an additional comment which needs to be included. If both records contain different comments, SCI2's comments take precedence over SCI1's, so I want to store SCI2's comments together with the rest of the record in the merged file. In short, I would like to compare these files and merge them into a third file.
How do I compare these two files? Should I base the comparison on a unique key, e.g. CSN?

to make it easier to understand here is an example ..

Field: CSN Time Date Contact Details
SCI2: 2567 0530 3Nov - this is POI record
SCI1: 2567 0530 3Nov - this is a POI record
SCI1: 2567 0730 3 Nov Richard Russell, 300 Madison Ave, Reston, VA

When merged would lead to:
SCI3: 2567 0530 3Nov - this is POI record
SCI3: 2567 0730 3 Nov Richard Russell, 300 Madison Ave, Reston, VA

Can someone please guide me on how this could be done? If you could tell me what I should read about and maybe post some examples, I would appreciate it a lot.

thanks
milos

Posted by admin (Graham Ellis), 22 November 2006

Merging files / comparing files to look for differences is quite a tricky algorithm - certainly not the one I would choose as my first piece of Perl coding. If you are really doing it as an exercise, you may take a look at my post here and decide that you can find something that's more code intensive and less algorithmically complex to learn on.

Still reading?

Question one - how big are your data files? If they're so huge that they won't fit into memory, you'll have to go one way; otherwise you'll have a choice.
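For the "won't fit into memory" case, one possible route is to stream both files line by line and remember only the SCI2 keys rather than whole records. This is only a sketch under assumptions: the key is taken to be the first three whitespace-separated fields (as in the example records), the set of keys itself is assumed to fit in memory, and `record_key` and `merge_streams` are made-up names for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the key is ASSUMED to be the first three
# whitespace-separated fields (CSN, Time, Date in the example).
sub record_key {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return if @fields < 3;              # too short to carry a key
    return join ' ', @fields[0 .. 2];
}

# Copy every SCI2 record to the output while remembering only its key,
# then copy the SCI1 records whose key was not seen in SCI2.
sub merge_streams {
    my ($sci2_fh, $sci1_fh, $out_fh) = @_;
    my %seen;
    while (my $line = <$sci2_fh>) {
        my $key = record_key($line);
        $seen{$key} = 1 if defined $key;
        print {$out_fh} $line;
    }
    while (my $line = <$sci1_fh>) {
        my $key = record_key($line);
        print {$out_fh} $line unless defined $key && $seen{$key};
    }
    return;
}
```

It would be called with three already opened filehandles, for example ones opened on SCI2, SCI1 and a new SCI3. Only one line of each file is held in memory at a time, at the price of the fixed "SCI2 first, then SCI1 extras" output order.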

Question two - what order are the incoming files in? Can you guarantee that duplicate records will be in the same order in both files?

Question three - what about the output order? Any old order? Sorted as per one of the incoming files or in a new way? What to do if you're inserting a new record from SCI1? Scenario - SCI2 contains records ABDE and SCI1 contains records ACE - should the output be ABDCE, ACBDE, ABCDE (how would it know to put C there?), CABDE, ABDEC or what?
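If the answer to question two happens to be "yes, both files are sorted by the same key", then the ABCDE ordering can come from a classic sorted merge that walks both lists in step. A minimal sketch, assuming the key is the first three whitespace-separated fields and that plain string comparison (`cmp`) matches the order the files are sorted in; `key_of` and `merge_sorted` are made-up names:

```perl
use strict;
use warnings;

# Hypothetical helper: key = first three whitespace-separated fields (assumption).
sub key_of {
    my ($line) = @_;
    my @fields = split ' ', $line;
    splice @fields, 3 if @fields > 3;   # drop everything after the third field
    return join ' ', @fields;
}

# Merge two lists of lines that are each already sorted by key.
# When both lists hold the same key, the SCI2 line is kept.
sub merge_sorted {
    my ($sci2, $sci1) = @_;             # array references
    my @out;
    my ($i, $j) = (0, 0);
    while ($i < @{$sci2} && $j < @{$sci1}) {
        my $order = key_of($sci2->[$i]) cmp key_of($sci1->[$j]);
        if    ($order < 0) { push @out, $sci2->[$i++]; }
        elsif ($order > 0) { push @out, $sci1->[$j++]; }
        else               { push @out, $sci2->[$i++]; $j++; }
    }
    # One list is exhausted; append whatever remains of the other.
    push @out, @{$sci2}[$i .. $#{$sci2}], @{$sci1}[$j .. $#{$sci1}];
    return @out;
}
```

With SCI2 holding A, B, D, E and SCI1 holding A, C, E, this yields A, B, C, D, E, with A and E taken from SCI2. If the files are not sorted, this approach does not apply and a hash lookup is the safer choice.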

OK - here's a scheme
... smallish files
... files NOT assumed to be in order
... output file to contain (a) all SCI2 records then (b) extras from SCI1.

a) Slurp SCI2 and SCI1 into lists
b) Set up a hash of keys of the SCI2 list
c) Pick out the SCI1 records whose key is not in that hash
d) Output the SCI2 list followed by leftovers from SCI1.

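# Slurp both files into lists of lines
open (FH, "<", "SCI1") or die "Cannot read SCI1: $!";
@sci1 = <FH>;
close FH;
open (FH, "<", "SCI2") or die "Cannot read SCI2: $!";
@sci2 = <FH>;
close FH;

# Build a lookup table keyed on the first three fields of each SCI2 record
foreach (@sci2) {
    ($key) = /(\S+\s+\S+\s+\S+)/;
    $table{$key} = 1;
}

# Keep only the SCI1 records whose key does not appear in SCI2
@extra = ();
foreach (@sci1) {
    ($key) = /(\S+\s+\S+\s+\S+)/;
    push @extra, $_ if (! $table{$key});
}

# Write all SCI2 records, then the SCI1 leftovers
open (FH, ">", "SCI3") or die "Cannot write SCI3: $!";
print FH @sci2, @extra;
close FH;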

Note - my code is untested and you'll probably need to correct a couple of things - but the intent is to give you some pointers as to a first approach to problems like this. The syntax is fine / tested though. If you find the individual Perl statements a bit challenging, then you may want to start with some easier algorithms or one of these
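The scheme above always keeps the SCI2 record when both files share a key. The original question also asks that a comment present in only one of the two records be carried into the merged file. A possible sketch of that refinement follows, under loud assumptions: the key is the first three whitespace-separated fields, everything after the key counts as the "comment", and an SCI2 record with nothing after its key gives way to an SCI1 record that has something there. `split_record` and `merge_prefer_sci2` are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into (key, rest). The key is ASSUMED
# to be the first three fields; "rest" is whatever follows on the line.
sub split_record {
    my ($line) = @_;
    my ($key, $rest) = $line =~ /^\s*(\S+\s+\S+\s+\S+)\s*(.*)/;
    return unless defined($key);
    $key =~ s/\s+/ /g;                  # normalise spacing inside the key
    return ($key, $rest);
}

# SCI2 records come first and win on conflicts, unless the SCI2 record
# carries no comment and the matching SCI1 record does.
sub merge_prefer_sci2 {
    my ($sci2, $sci1) = @_;             # array references of lines
    my (%sci1_with_comment, %in_sci2, @merged);

    for my $line (@{$sci1}) {
        my ($key, $rest) = split_record($line);
        $sci1_with_comment{$key} = $line
            if defined($key) && length($rest) > 0;
    }
    for my $line (@{$sci2}) {
        my ($key, $rest) = split_record($line);
        my $keep = $line;
        if (defined($key)) {
            $in_sci2{$key} = 1;
            $keep = $sci1_with_comment{$key}
                if length($rest) == 0 && exists $sci1_with_comment{$key};
        }
        push @merged, $keep;
    }
    # Finally append the SCI1 records whose key never appeared in SCI2.
    for my $line (@{$sci1}) {
        my ($key) = split_record($line);
        push @merged, $line unless defined($key) && $in_sci2{$key};
    }
    return @merged;
}
```

Whether "nothing after the key" is the right test for a missing comment depends on the real record layout, so treat that rule as a placeholder to adjust.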

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.