Yet another question about merging files

Posted by pmarkc (pmarkc), 5 March 2008

I have another question about merging files. I am also new to PERL and this seems like it should be easy enough but I haven't been able to make it work. I have 2 files with ~16M lines each with 3 columns. The number of lines are not exactly the same but the x and y min and max are known and are the same in both files. I would like to combine the 2 files into one file with 4 columns. If the value exists in either file, print it. if no value exists, leave it blank. I thought it would be a simple loop but it is taking entirely too long for me to get it too work. I have tried many available data analysis tools and every one chokes on the size of the data except one and that ones does not allow scripting. Could you please help me? I am not a student, just a data junky trying to verify a hypothesis.

Example:

File 1
x,y,z
1,1,1100
1,2,2100
1,4,4100
1,5,5100
1,7,7100

File 2
x,y,z
1,1,1200
1,3,3200
1,4,4200
1,5,5200

Output file:
x,y,z1,z2
1,1,1100,1200
1,2,2100,
1,3,,3200
1,4,4100,4200
1,5,5100,5200
1,7,7100,

BR,
Mark

Posted by admin (Graham Ellis), 5 March 2008

Are the data values (x and y) in the same order in both files, or may they be different? And are they predicable? (It looks like the answer is "same" and "yes", but that's a guess from a tiny data sample!). Is any particular order required for the output? Suggestions will differ depending on your answer!

Posted by KevinAD (KevinAD), 5 March 2008

~16M

is that 16,000,000 or 16,000?

Posted by pmarkc (pmarkc), 5 March 2008

The X range is 2047-0
The Y range is 0-8191

The data is fast increment Y and then decrement X.

X_phy Y_phy VT
2047 0 7000
2047 2 6500
2047 4 6500
2047 5 6700
2047 6 6300
2047 8 6600
2047 9 6800
2047 10 6400
.
.
.
2047 8187 6700
2047 8188 6400
2047 8189 6500
2047 8190 6600
2047 8191 6800
2046 0 6700
2046 1 6800
2046 2 6500
2046 3 7000
2046 4 6600
2046 5 6900
2046 6 6400

There are slightly less than 16777216
records in each file. Everything I tried blew up due to memory contraints and I was noyt sure it there was a way to read and output without having to load the whole file and work around the occasional missing record in one file or the other. Thank you for your help.

Posted by pmarkc (pmarkc), 5 March 2008

BTW, No particular order is required.

Posted by KevinAD (KevinAD), 6 March 2008

The problem with running out of memory is not easy to get around. You may have to try and do this in more than one pass over the files to avoid using too much memory. That would more than likely mean the script will take longer to run, or maybe Graham has a trick up his sleeve.

Posted by admin (Graham Ellis), 6 March 2008

on 03/06/08 at 00:09:40, KevinAD wrote:

... or maybe Graham has a trick up his sleeve.

Mark, Kevin - I can never resist a challenge.

Since both files are in order, you can
* read a record from the SECOND
* loop through each record from the first and
a) if it comes before the current second record, output it
b) if it has the same x and y, merge them and output both, then read another from second
c) if it comes after the current second output record, output second and read another record from second

You need to take care to handle the end conditions correctly, and to handle sections where there are multiple records coming in from one file with no matching record from the other. Here is a first test:

Code:

open (FH1,"input1");
open (FH2,"input2");

@second = reader(2);

while (@first = reader(1)) {
while (1) {
if ($first[0] > $second[0]) {
writer(@first);
last;
}
if ($first[0] < $second[0]) {
writer(@second);
@second = reader(2);
next;
}
merge();
@second = reader(2);
writer(@first);
last;
}
}
writer(@second) if (@second);
while (@second = reader(2)) {
writer(@second);
}

sub reader {
@rline = split(/\s+/,<FH1>) if ($_[0] == 1);
@rline = split(/\s+/,<FH2>) if ($_[0] == 2);
return if (! @rline);
@rl2 = ($rline[0] * 10000 - $rline[1],@rline,"");
if ($_[0] == 2) {
$rl2[4] = $rl2[3];
$rl2[3] = "";
}
return @rl2;
}

sub writer {
($key,@vals) = @_;
$line = join(",",@vals);
print ($line,"\n");
}
sub merge {
$first[4] = $second[4];
}

Bad practise in terms of use of globals all over the place - but it worked for me with a set of data that I doctored from your set, Mark, to produce a simple test case. By the way Mark, I noted that your inputs (second example) were space delimited and your outputs comma delimited, and I have stuck to that.

Posted by admin (Graham Ellis), 6 March 2008

input1:

2047 0 7000
2047 2 6500
2047 4 6500
2047 5 6700
2047 6 6300
2047 8 6600
2047 9 6800
2047 10 6400
2047 8187 6700
2047 8188 6400
2047 8189 6500
2047 8190 6600
2047 8191 6800
2046 0 6700
2046 1 6800
2046 2 6500
2046 3 7000
2046 4 6600
2046 5 6900
2046 6 6400

input2:

2047 0 7000
2047 2 6500
2047 3 6500
2047 4 8500
2047 5 6700
2047 8 6600
2047 9 8800
2047 10 6400
2047 8187 6700
2047 8188 6400
2047 8189 8500
2047 8190 6600
2047 8191 6800
2046 0 6700
2046 1 6800
2046 2 6500
2046 3 7000
2046 5 6900
2046 6 6400

result

dolphin:~ graham$ perl bis
2047,0,7000,7000
2047,2,6500,6500
2047,3,,6500
2047,4,6500,8500
2047,5,6700,6700
2047,6,6300,
2047,8,6600,6600
2047,9,6800,8800
2047,10,6400,6400
2047,8187,6700,6700
2047,8188,6400,6400
2047,8189,6500,8500
2047,8190,6600,6600
2047,8191,6800,6800
2046,0,6700,6700
2046,1,6800,6800
2046,2,6500,6500
2046,3,7000,7000
2046,4,6600,
2046,5,6900,6900
2046,6,6400,6400
dolphin:~ graham$

Posted by pmarkc (pmarkc), 6 March 2008

Graham,
Thank You! I guess I should have paid more attention when I put in the example data because my files are comma delimited. But with your help and your code I was able to edit to get exactly what I wanted. Thank You very much for the code. I ran it and it took ~10 minutes to run on the full 16M line files and I added delta calculations and default values for the data missing in one file. Here is my final code ( I really should say here is my modifications to your code.)

open (FH1,$ARGV[0]) or die "Cannot open the file: $! " ,$ARGV[0];
open (FH2,$ARGV[1]) or die "Cannot open the file: $! " ,$ARGV[1];

@second = reader(2);

while (@first = reader(1)) {
while (1) {
if ($first[0] > $second[0]) {
writer(@first);
last;
}
if ($first[0] < $second[0]) {
writer(@second);
@second = reader(2);
next;
}
merge();
@second = reader(2);
writer(@first);
last;
}
}
writer(@second) if (@second);
while (@second = reader(2)) {
writer(@second);
}

sub reader {
@rline = split(/\,/,<FH1>) if ($_[0] == 1);
@rline = split(/\,/,<FH2>) if ($_[0] == 2);
chomp(@rline);
return if (! @rline);
@rl2 = ($rline[0] * 10000 - $rline[1],@rline);
if ($_[0] == 2) {
$rl2[4] = $rl2[3];
$rl2[3] = "7100";
}
return @rl2;
}

sub writer {
($key,@vals) = @_;
if($vals[2]!= "" && $vals[3]!= "") {
$vals[4]=$vals[2]-$vals[3];
}
$line = join(",",@vals);
print ($line,"\n");
}

sub merge {
$first[4] = $second[4];
}

Thank You again,
Mark

Posted by pmarkc (pmarkc), 6 March 2008

Graham,
Thank You! I guess I should have paid more attention when I put in the example data because my files are comma delimited. But with your help and your code I was able to edit to get exactly what I wanted. Thank You very much for the code. I ran it and it took ~10 minutes to run on the 2 full 16M line files and I added delta calculations and default values for the data missing in one file. Also, the amount of memory it uses is amazingly small. Here is my final code ( I really should say here is my modifications to your code.)

open (FH1,$ARGV[0]) or die "Cannot open the file: $! " ,$ARGV[0];
open (FH2,$ARGV[1]) or die "Cannot open the file: $! " ,$ARGV[1];

@second = reader(2);

while (@first = reader(1)) {
while (1) {
if ($first[0] > $second[0]) {
writer(@first);
last;
}
if ($first[0] < $second[0]) {
writer(@second);
@second = reader(2);
next;
}
merge();
@second = reader(2);
writer(@first);
last;
}
}
writer(@second) if (@second);
while (@second = reader(2)) {
writer(@second);
}

sub reader {
@rline = split(/\,/,<FH1>) if ($_[0] == 1);
@rline = split(/\,/,<FH2>) if ($_[0] == 2);
chomp(@rline);
return if (! @rline);
@rl2 = ($rline[0] * 10000 - $rline[1],@rline);
if ($_[0] == 2) {
$rl2[4] = $rl2[3];
$rl2[3] = "7100";
}
return @rl2;
}

sub writer {
($key,@vals) = @_;
if($vals[2]!= "" && $vals[3]!= "") {
$vals[4]=$vals[2]-$vals[3];
}
$line = join(",",@vals);
print ($line,"\n");
}

sub merge {
$first[4] = $second[4];
}

Output is:

X_phy,Y_phy,VT,VT
2045,2306,6400,6400,0
2045,2307,7000,6900,100
2045,2308,6300,6300,0
2045,2309,6800,6700,100
2045,2310,6400,6300,100
2045,2311,6700,6600,100
2045,2312,6600,6500,100
2045,2313,7000,6900,100
2045,2314,6400,6300,100
2045,2315,6800,6800,0
2045,2316,6700,6600,100
2045,2318,6600,6600,0
2045,2319,7100,7000,100
2045,2320,6300,6300,0
2045,2321,6700,6700,0

I just realized I need to go back and modifiy the column headers a little. Anyway, it works great,.

Thank You Again,
Mark

Posted by KevinAD (KevinAD), 6 March 2008

Nicely done Graham

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.