merging files with similar records

Posted by mila1982 (mila1982), 22 November 2006
guys

I'm totally new to Perl and am struggling to understand how to complete the following task ... if you could please help me out I would appreciate it a lot ... this is not homework or anything .. I'm just trying to understand how Perl works ...

I have two files, e.g. SCI1 and SCI2 .. both of these files contain records .. some of these records are identical, with the only difference being that one of the records may have an additional comment in it which needs to be included .. if both records contain different comments, SCI2's comments take precedence over SCI1's, so I want to store SCI2's comments together with the rest of the record in the merged file ... so I would like to compare these files and merge them into a 3rd file ...
How do I compare these two files .. should I base the comparison on a unique key, e.g. CSN ??

to make it easier to understand here is an example ..

Field:  CSN   Time  Date                        Contact Details
SCI2:   2567  0530  3Nov - this is POI record
SCI1:   2567  0530  3Nov - this is a POI record
SCI1:   2567  0730  3 Nov                       Richard Russell, 300 Madison Ave, Reston, VA

When merged would lead to:
SCI3:   2567  0530  3Nov - this is POI record
SCI3:   2567  0730  3 Nov                       Richard Russell, 300 Madison Ave, Reston, VA


Can someone please guide me on how this could be done ... If you could tell me what I should read about and maybe post some examples, I would appreciate it a lot ...

thanks
milos

Posted by admin (Graham Ellis), 22 November 2006
Merging files / comparing files to look for differences involves quite a tricky algorithm - certainly not what I would choose as my first piece of Perl coding. If you are really doing it as an exercise, you may take a look at my post here and decide that you can find something that's more code intensive and less algorithmically complex to learn on.

Still reading?

Question one - how big are your data files? If they're so huge that they won't fit into memory, you'll have to go one way ... otherwise you'll have a choice.

Question two - what order are the incoming files in? Can you guarantee that duplicate records will be in the same order in both files?

Question three - what about the output order? Any old order? Sorted as per one of the incoming files or in a new way? What to do if you're inserting a new record from SCI1? Scenario - SCI2 contains records ABDE and SCI1 contains records ACE - should the output be ABDCE, ACBDE, ABCDE (how would it know to put C there?), CABDE, ABDEC or what?
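
For the "won't fit into memory" case in question one, here is one possible direction, offered as a sketch rather than as the approach meant above. It assumes the key is the first three whitespace-separated fields (CSN, Time, Date), that the set of SCI2 keys is small enough to hold in a hash even if the full records are not, and that output order is "all of SCI2, then extras from SCI1". The names record_key and merge_streams are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build a key from the first three
# whitespace-separated fields; returns undef if the line has fewer.
sub record_key {
    my ($line) = @_;
    my @fields = $line =~ /(\S+)\s+(\S+)\s+(\S+)/ or return;
    return "@fields";    # fields joined by single spaces
}

# Copy SCI2 straight to the output, remembering only its keys,
# then append the SCI1 records whose key has not been seen.
sub merge_streams {
    my ($sci2_fh, $sci1_fh, $out_fh) = @_;
    my %seen;
    while (my $line = <$sci2_fh>) {
        my $key = record_key($line);
        $seen{$key} = 1 if defined $key;
        print {$out_fh} $line;
    }
    while (my $line = <$sci1_fh>) {
        my $key = record_key($line);
        print {$out_fh} $line if defined $key && !$seen{$key};
    }
    return;
}

# Usage with in-memory filehandles standing in for the real files
my $sci2 = "2567    0530  3Nov - this is POI record\n";
my $sci1 = "2567  0530  3Nov - this is a POI record\n"
         . "2567  0730  3 Nov  Richard Russell, 300 Madison Ave, Reston, VA\n";
open my $in2, '<', \$sci2 or die "Cannot open: $!";
open my $in1, '<', \$sci1 or die "Cannot open: $!";
my $merged = '';
open my $out, '>', \$merged or die "Cannot open: $!";
merge_streams($in2, $in1, $out);
close $out;
print $merged;
```

Only one line of each file is held at a time, so memory use grows with the number of distinct SCI2 keys rather than with file size. With real files you would open SCI2, SCI1 and SCI3 by name in place of the in-memory strings.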

OK - here's a scheme
... smallish files
... files NOT assumed to be in order
... output file to contain (a) all SCI2 records then (b) extras from SCI1.

a) Slurp SCI2 and SCI1 into lists
b) Set up a hash whose keys are the record keys found in the SCI2 list
c) Pick out the SCI1 records whose key is not in that hash (i.e. drop the ones SCI2 already has)
d) Output the SCI2 list followed by the leftovers from SCI1.

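use strict;
use warnings;

# Slurp both files into lists
open (my $fh, "<", "SCI1") or die "Cannot read SCI1: $!";
my @sci1 = <$fh>;
close $fh;
open ($fh, "<", "SCI2") or die "Cannot read SCI2: $!";
my @sci2 = <$fh>;
close $fh;

# Record the key (first three fields) of every SCI2 record.
# Fields are captured separately so differing spacing still matches.
my %table;
foreach (@sci2) {
    my @fields = /(\S+)\s+(\S+)\s+(\S+)/ or next;
    $table{"@fields"} = 1;
    }

# Keep only the SCI1 records whose key is not in SCI2
my @extra;
foreach (@sci1) {
    my @fields = /(\S+)\s+(\S+)\s+(\S+)/ or next;
    push @extra, $_ unless $table{"@fields"};
    }

# All of SCI2 first, then the leftovers from SCI1
open ($fh, ">", "SCI3") or die "Cannot write SCI3: $!";
print $fh @sci2, @extra;
close $fh or die "Cannot close SCI3: $!";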

Note - my code is untested and you'll probably need to correct a couple of things - but the intent is to give you some pointers as to a first approach to problems like this. The syntax is fine / tested though. If you find the individual Perl statements a bit challenging, then you may want to start with some easier algorithms or one of these
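
A variation on the scheme above, for the case in question three where a sorted output is wanted instead of "SCI2 then extras". This is only a sketch under assumptions: the key is taken to be the first three whitespace-separated fields, the sample records are typed in from the example earlier in the thread, and sorting by plain string comparison of the key is an arbitrary choice made for illustration.

```perl
use strict;
use warnings;

# Sample records from the thread (assumed layout: CSN, Time, Date, rest)
my @sci1 = (
    "2567  0530  3Nov - this is a POI record\n",
    "2567  0730  3 Nov  Richard Russell, 300 Madison Ave, Reston, VA\n",
);
my @sci2 = (
    "2567    0530  3Nov - this is POI record\n",
);

# Store each record under its key. Loading SCI2 last means an SCI2
# record replaces any SCI1 record that has the same key.
my %record;
foreach my $line (@sci1, @sci2) {
    my @fields = $line =~ /(\S+)\s+(\S+)\s+(\S+)/ or next;
    $record{"@fields"} = $line;
}

# Output in key order (plain string sort - an assumption, not a rule)
my @merged = map { $record{$_} } sort keys %record;
print @merged;
```

Because the hash holds one whole record per key, the "SCI2 takes precedence" rule comes for free from the load order, at the cost of losing the original file order.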



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.


© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho