Training, Open Source computer languages

This is page http://www.wellho.net/forum/Perl-Programming/Yet-anot ... files.html

Our email: info@wellho.net • Phone: 01144 1225 708225

 
For 2023 (and 2024 ...) - we are now fully retired from IT training.
We have made many, many friends over 25 years of teaching about Python, Tcl, Perl, PHP, Lua, Java, C and C++ - and MySQL, Linux and Solaris/SunOS too. Our training notes are now very much out of date, but due to upward compatability most of our examples remain operational and even relevant ad you are welcome to make us if them "as seen" and at your own risk.

Lisa and I (Graham) now live in what was our training centre in Melksham - happy to meet with former delegates here - but do check ahead before coming round. We are far from inactive - rather, enjoying the times that we are retired but still healthy enough in mind and body to be active!

I am also active in many other area and still look after a lot of web sites - you can find an index ((here))
Yet another question about merging files

Posted by pmarkc (pmarkc), 5 March 2008
I have another question about merging files.  I am also new to PERL and this seems like it should be easy enough but I haven't been able to make it work.  I have 2 files with ~16M lines each with 3 columns.  The number of lines are not exactly the same but the x and y min and max are known and are the same in both files.  I would like to combine the 2 files into one file with 4 columns.  If the value exists in either file, print it. if no value exists, leave it blank.  I thought it would be a simple loop but it is taking entirely too long for me to get it too work.  I have tried many available data analysis tools and every one chokes on the size of the data except one and that ones does not allow scripting.  Could you please help me?  I am not a student, just a data junky trying to verify a hypothesis.

Example:

File 1
x,y,z
1,1,1100
1,2,2100
1,4,4100
1,5,5100
1,7,7100

File 2
x,y,z
1,1,1200
1,3,3200
1,4,4200
1,5,5200

Output file:
x,y,z1,z2
1,1,1100,1200
1,2,2100,
1,3,,3200
1,4,4100,4200
1,5,5100,5200
1,7,7100,

BR,
Mark

Posted by admin (Graham Ellis), 5 March 2008
Are the data values (x and y) in the same order in both files, or may they be different?  And are they predicable?  (It looks like the answer is "same" and "yes", but that's a guess from a tiny data sample!).  Is any particular order required for the output?  Suggestions will differ depending on your answer!

Posted by KevinAD (KevinAD), 5 March 2008
~16M

is that 16,000,000 or 16,000?

Posted by pmarkc (pmarkc), 5 March 2008
The X range is 2047-0
The Y range is 0-8191

The data is fast increment Y and then decrement X.

X_phy      Y_phy      VT
2047      0      7000
2047      2      6500
2047      4      6500
2047      5      6700
2047      6      6300
2047      8      6600
2047      9      6800
2047      10      6400
.
.
.
2047      8187      6700
2047      8188      6400
2047      8189      6500
2047      8190      6600
2047      8191      6800
2046      0      6700
2046      1      6800
2046      2      6500
2046      3      7000
2046      4      6600
2046      5      6900
2046      6      6400


There are slightly less than 16777216
records in each file.  Everything I tried blew up due to memory contraints and I was noyt sure it there was a way to read and output without having to load the whole file and work around the occasional missing record in one file or the other.  Thank you for your help.

Posted by pmarkc (pmarkc), 5 March 2008
BTW,  No particular order is required.

Posted by KevinAD (KevinAD), 6 March 2008
The problem with running out of memory is not easy to get around. You may have to try and do this in more than one pass over the files to avoid using too much memory. That would more than likely mean the script will take longer to run, or maybe Graham has a trick up his sleeve.

Posted by admin (Graham Ellis), 6 March 2008
on 03/06/08 at 00:09:40, KevinAD wrote:
... or maybe Graham has a trick up his sleeve.


Mark, Kevin - I can never resist a challenge.

Since both files are in order, you can
* read a record from the SECOND
* loop through each record from the first and
a) if it comes before the current second record, output it
b) if it has the same x and y, merge them and output both, then read another from second
c) if it comes after the current second output record, output second and read another record from second

You need to take care to handle the end conditions correctly, and to handle sections where there are multiple records coming in from one file with no matching record from the other.   Here is a first test:

Code:
open (FH1,"input1");
open (FH2,"input2");

@second = reader(2);

while (@first = reader(1)) {
       while (1) {
               if ($first[0] > $second[0]) {
                       writer(@first);
                       last;
                       }
               if ($first[0] < $second[0]) {
                       writer(@second);
                       @second = reader(2);
                       next;
               }
               merge();
               @second = reader(2);
               writer(@first);
               last;
       }
}
writer(@second) if (@second);
while (@second = reader(2)) {
       writer(@second);
       }

sub reader {
       @rline = split(/\s+/,<FH1>) if ($_[0] == 1);
       @rline = split(/\s+/,<FH2>) if ($_[0] == 2);
       return if (! @rline);
       @rl2 = ($rline[0] * 10000 - $rline[1],@rline,"");
       if ($_[0] == 2) {
               $rl2[4] = $rl2[3];
               $rl2[3] = "";
               }
       return @rl2;
       }

sub writer {
       ($key,@vals) = @_;
       $line = join(",",@vals);
       print ($line,"\n");
       }
sub merge {
       $first[4] = $second[4];
       }


Bad practise in terms of use of globals all over the place - but it worked for me with a set of data that I doctored from your set, Mark, to produce a simple test case.   By the way Mark, I noted that your inputs (second example) were space delimited and your outputs comma delimited, and I have stuck to that.

Posted by admin (Graham Ellis), 6 March 2008
input1:

2047 0 7000
2047 2 6500
2047 4 6500
2047 5 6700
2047 6 6300
2047 8 6600
2047 9 6800
2047 10 6400
2047 8187 6700
2047 8188 6400
2047 8189 6500
2047 8190 6600
2047 8191 6800
2046 0 6700
2046 1 6800
2046 2 6500
2046 3 7000
2046 4 6600
2046 5 6900
2046 6 6400

input2:

2047 0 7000
2047 2 6500
2047 3 6500
2047 4 8500
2047 5 6700
2047 8 6600
2047 9 8800
2047 10 6400
2047 8187 6700
2047 8188 6400
2047 8189 8500
2047 8190 6600
2047 8191 6800
2046 0 6700
2046 1 6800
2046 2 6500
2046 3 7000
2046 5 6900
2046 6 6400

result

dolphin:~ graham$ perl bis
2047,0,7000,7000
2047,2,6500,6500
2047,3,,6500
2047,4,6500,8500
2047,5,6700,6700
2047,6,6300,
2047,8,6600,6600
2047,9,6800,8800
2047,10,6400,6400
2047,8187,6700,6700
2047,8188,6400,6400
2047,8189,6500,8500
2047,8190,6600,6600
2047,8191,6800,6800
2046,0,6700,6700
2046,1,6800,6800
2046,2,6500,6500
2046,3,7000,7000
2046,4,6600,
2046,5,6900,6900
2046,6,6400,6400
dolphin:~ graham$

Posted by pmarkc (pmarkc), 6 March 2008
Graham,
 Thank You!  I guess I should have paid more attention when I put in the example data because my files are comma delimited.  But with your help and your code I was able to edit to get exactly what I wanted.  Thank You very much for the code.  I ran it and it took ~10 minutes to run on the full 16M line files and I added delta calculations and default values for the data missing in one file.  Here is my final code ( I really should say here is my modifications to your code.)

open (FH1,$ARGV[0]) or die "Cannot open the file: $! " ,$ARGV[0];  
open (FH2,$ARGV[1]) or die "Cannot open the file: $! " ,$ARGV[1];

@second = reader(2);

while (@first = reader(1)) {
       while (1) {
               if ($first[0] > $second[0]) {  
                       writer(@first);
                       last;
                       }
               if ($first[0] < $second[0]) {
                       writer(@second);
                       @second = reader(2);
                       next;
                         }
               merge();
               @second = reader(2);
               writer(@first);
     last;
       }
}
writer(@second) if (@second);
while (@second = reader(2)) {
       writer(@second);
       }

sub reader {
       @rline = split(/\,/,<FH1>) if ($_[0] == 1);
       @rline = split(/\,/,<FH2>) if ($_[0] == 2);
     chomp(@rline);
       return if (! @rline);
       @rl2 = ($rline[0] * 10000 - $rline[1],@rline);
       if ($_[0] == 2) {
               $rl2[4] = $rl2[3];
               $rl2[3] = "7100";
               }
       return @rl2;
       }

sub writer {
       ($key,@vals) = @_;
           if($vals[2]!= "" && $vals[3]!= "") {
           $vals[4]=$vals[2]-$vals[3];
           }
       $line = join(",",@vals);
       print ($line,"\n");
       }

sub merge {
       $first[4] = $second[4];
       }

Thank You again,
Mark



Posted by pmarkc (pmarkc), 6 March 2008
Graham,
 Thank You!  I guess I should have paid more attention when I put in the example data because my files are comma delimited.  But with your help and your code I was able to edit to get exactly what I wanted.  Thank You very much for the code.  I ran it and it took ~10 minutes to run on the 2 full 16M line files and I added delta calculations and default values for the data missing in one file.  Also, the amount of memory it uses is amazingly small.  Here is my final code ( I really should say here is my modifications to your code.)

open (FH1,$ARGV[0]) or die "Cannot open the file: $! " ,$ARGV[0];  
open (FH2,$ARGV[1]) or die "Cannot open the file: $! " ,$ARGV[1];

@second = reader(2);

while (@first = reader(1)) {
       while (1) {
               if ($first[0] > $second[0]) {  
                       writer(@first);
                       last;
                       }
               if ($first[0] < $second[0]) {
                       writer(@second);
                       @second = reader(2);
                       next;
                         }
               merge();
               @second = reader(2);
               writer(@first);
     last;
       }
}
writer(@second) if (@second);
while (@second = reader(2)) {
       writer(@second);
       }

sub reader {
       @rline = split(/\,/,<FH1>) if ($_[0] == 1);
       @rline = split(/\,/,<FH2>) if ($_[0] == 2);
     chomp(@rline);
       return if (! @rline);
       @rl2 = ($rline[0] * 10000 - $rline[1],@rline);
       if ($_[0] == 2) {
               $rl2[4] = $rl2[3];
               $rl2[3] = "7100";
               }
       return @rl2;
       }

sub writer {
       ($key,@vals) = @_;
           if($vals[2]!= "" && $vals[3]!= "") {
           $vals[4]=$vals[2]-$vals[3];
           }
       $line = join(",",@vals);
       print ($line,"\n");
       }

sub merge {
       $first[4] = $second[4];
       }

Output is:


X_phy,Y_phy,VT,VT
2045,2306,6400,6400,0
2045,2307,7000,6900,100
2045,2308,6300,6300,0
2045,2309,6800,6700,100
2045,2310,6400,6300,100
2045,2311,6700,6600,100
2045,2312,6600,6500,100
2045,2313,7000,6900,100
2045,2314,6400,6300,100
2045,2315,6800,6800,0
2045,2316,6700,6600,100
2045,2318,6600,6600,0
2045,2319,7100,7000,100
2045,2320,6300,6300,0
2045,2321,6700,6700,0

I just realized I need to go back and modifiy the column headers a little.  Anyway, it works great,.

Thank You Again,
Mark



Posted by KevinAD (KevinAD), 6 March 2008
Nicely done Graham



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.

© WELL HOUSE CONSULTANTS LTD., 2024: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho