| |||||||||||
| |||||||||||
Yet another question about merging files Posted by pmarkc (pmarkc), 5 March 2008 I have another question about merging files. I am also new to PERL and this seems like it should be easy enough but I haven't been able to make it work. I have 2 files with ~16M lines each with 3 columns. The number of lines are not exactly the same but the x and y min and max are known and are the same in both files. I would like to combine the 2 files into one file with 4 columns. If the value exists in either file, print it. if no value exists, leave it blank. I thought it would be a simple loop but it is taking entirely too long for me to get it too work. I have tried many available data analysis tools and every one chokes on the size of the data except one and that ones does not allow scripting. Could you please help me? I am not a student, just a data junky trying to verify a hypothesis.Example: File 1 x,y,z 1,1,1100 1,2,2100 1,4,4100 1,5,5100 1,7,7100 File 2 x,y,z 1,1,1200 1,3,3200 1,4,4200 1,5,5200 Output file: x,y,z1,z2 1,1,1100,1200 1,2,2100, 1,3,,3200 1,4,4100,4200 1,5,5100,5200 1,7,7100, BR, Mark Posted by admin (Graham Ellis), 5 March 2008 Are the data values (x and y) in the same order in both files, or may they be different? And are they predicable? (It looks like the answer is "same" and "yes", but that's a guess from a tiny data sample!). Is any particular order required for the output? Suggestions will differ depending on your answer!Posted by KevinAD (KevinAD), 5 March 2008 ~16Mis that 16,000,000 or 16,000? Posted by pmarkc (pmarkc), 5 March 2008 The X range is 2047-0The Y range is 0-8191 The data is fast increment Y and then decrement X. X_phy Y_phy VT 2047 0 7000 2047 2 6500 2047 4 6500 2047 5 6700 2047 6 6300 2047 8 6600 2047 9 6800 2047 10 6400 . . . 2047 8187 6700 2047 8188 6400 2047 8189 6500 2047 8190 6600 2047 8191 6800 2046 0 6700 2046 1 6800 2046 2 6500 2046 3 7000 2046 4 6600 2046 5 6900 2046 6 6400 There are slightly less than 16777216 records in each file. Everything I tried blew up due to memory contraints and I was noyt sure it there was a way to read and output without having to load the whole file and work around the occasional missing record in one file or the other. Thank you for your help. Posted by pmarkc (pmarkc), 5 March 2008 BTW, No particular order is required.Posted by KevinAD (KevinAD), 6 March 2008 The problem with running out of memory is not easy to get around. You may have to try and do this in more than one pass over the files to avoid using too much memory. That would more than likely mean the script will take longer to run, or maybe Graham has a trick up his sleeve. Posted by admin (Graham Ellis), 6 March 2008 on 03/06/08 at 00:09:40, KevinAD wrote:
Mark, Kevin - I can never resist a challenge. Since both files are in order, you can * read a record from the SECOND * loop through each record from the first and a) if it comes before the current second record, output it b) if it has the same x and y, merge them and output both, then read another from second c) if it comes after the current second output record, output second and read another record from second You need to take care to handle the end conditions correctly, and to handle sections where there are multiple records coming in from one file with no matching record from the other. Here is a first test: Code:
Bad practise in terms of use of globals all over the place - but it worked for me with a set of data that I doctored from your set, Mark, to produce a simple test case. By the way Mark, I noted that your inputs (second example) were space delimited and your outputs comma delimited, and I have stuck to that. Posted by admin (Graham Ellis), 6 March 2008 input1:2047 0 7000 2047 2 6500 2047 4 6500 2047 5 6700 2047 6 6300 2047 8 6600 2047 9 6800 2047 10 6400 2047 8187 6700 2047 8188 6400 2047 8189 6500 2047 8190 6600 2047 8191 6800 2046 0 6700 2046 1 6800 2046 2 6500 2046 3 7000 2046 4 6600 2046 5 6900 2046 6 6400 input2: 2047 0 7000 2047 2 6500 2047 3 6500 2047 4 8500 2047 5 6700 2047 8 6600 2047 9 8800 2047 10 6400 2047 8187 6700 2047 8188 6400 2047 8189 8500 2047 8190 6600 2047 8191 6800 2046 0 6700 2046 1 6800 2046 2 6500 2046 3 7000 2046 5 6900 2046 6 6400 result dolphin:~ graham$ perl bis 2047,0,7000,7000 2047,2,6500,6500 2047,3,,6500 2047,4,6500,8500 2047,5,6700,6700 2047,6,6300, 2047,8,6600,6600 2047,9,6800,8800 2047,10,6400,6400 2047,8187,6700,6700 2047,8188,6400,6400 2047,8189,6500,8500 2047,8190,6600,6600 2047,8191,6800,6800 2046,0,6700,6700 2046,1,6800,6800 2046,2,6500,6500 2046,3,7000,7000 2046,4,6600, 2046,5,6900,6900 2046,6,6400,6400 dolphin:~ graham$ Posted by pmarkc (pmarkc), 6 March 2008 Graham,Thank You! I guess I should have paid more attention when I put in the example data because my files are comma delimited. But with your help and your code I was able to edit to get exactly what I wanted. Thank You very much for the code. I ran it and it took ~10 minutes to run on the full 16M line files and I added delta calculations and default values for the data missing in one file. Here is my final code ( I really should say here is my modifications to your code.) open (FH1,$ARGV[0]) or die "Cannot open the file: $! " ,$ARGV[0]; open (FH2,$ARGV[1]) or die "Cannot open the file: $! " ,$ARGV[1]; @second = reader(2); while (@first = reader(1)) { while (1) { if ($first[0] > $second[0]) { writer(@first); last; } if ($first[0] < $second[0]) { writer(@second); @second = reader(2); next; } merge(); @second = reader(2); writer(@first); last; } } writer(@second) if (@second); while (@second = reader(2)) { writer(@second); } sub reader { @rline = split(/\,/,<FH1>) if ($_[0] == 1); @rline = split(/\,/,<FH2>) if ($_[0] == 2); chomp(@rline); return if (! @rline); @rl2 = ($rline[0] * 10000 - $rline[1],@rline); if ($_[0] == 2) { $rl2[4] = $rl2[3]; $rl2[3] = "7100"; } return @rl2; } sub writer { ($key,@vals) = @_; if($vals[2]!= "" && $vals[3]!= "") { $vals[4]=$vals[2]-$vals[3]; } $line = join(",",@vals); print ($line,"\n"); } sub merge { $first[4] = $second[4]; } Thank You again, Mark Posted by pmarkc (pmarkc), 6 March 2008 Graham,Thank You! I guess I should have paid more attention when I put in the example data because my files are comma delimited. But with your help and your code I was able to edit to get exactly what I wanted. Thank You very much for the code. I ran it and it took ~10 minutes to run on the 2 full 16M line files and I added delta calculations and default values for the data missing in one file. Also, the amount of memory it uses is amazingly small. Here is my final code ( I really should say here is my modifications to your code.) open (FH1,$ARGV[0]) or die "Cannot open the file: $! " ,$ARGV[0]; open (FH2,$ARGV[1]) or die "Cannot open the file: $! " ,$ARGV[1]; @second = reader(2); while (@first = reader(1)) { while (1) { if ($first[0] > $second[0]) { writer(@first); last; } if ($first[0] < $second[0]) { writer(@second); @second = reader(2); next; } merge(); @second = reader(2); writer(@first); last; } } writer(@second) if (@second); while (@second = reader(2)) { writer(@second); } sub reader { @rline = split(/\,/,<FH1>) if ($_[0] == 1); @rline = split(/\,/,<FH2>) if ($_[0] == 2); chomp(@rline); return if (! @rline); @rl2 = ($rline[0] * 10000 - $rline[1],@rline); if ($_[0] == 2) { $rl2[4] = $rl2[3]; $rl2[3] = "7100"; } return @rl2; } sub writer { ($key,@vals) = @_; if($vals[2]!= "" && $vals[3]!= "") { $vals[4]=$vals[2]-$vals[3]; } $line = join(",",@vals); print ($line,"\n"); } sub merge { $first[4] = $second[4]; } Output is: X_phy,Y_phy,VT,VT 2045,2306,6400,6400,0 2045,2307,7000,6900,100 2045,2308,6300,6300,0 2045,2309,6800,6700,100 2045,2310,6400,6300,100 2045,2311,6700,6600,100 2045,2312,6600,6500,100 2045,2313,7000,6900,100 2045,2314,6400,6300,100 2045,2315,6800,6800,0 2045,2316,6700,6600,100 2045,2318,6600,6600,0 2045,2319,7100,7000,100 2045,2320,6300,6300,0 2045,2321,6700,6700,0 I just realized I need to go back and modifiy the column headers a little. Anyway, it works great,. Thank You Again, Mark Posted by KevinAD (KevinAD), 6 March 2008 Nicely done Graham This page is a thread posted to the opentalk forum
at www.opentalk.org.uk and
archived here for reference. To jump to the archive index please
follow this link.
|
| ||||||||||
PH: 01144 1225 708225 • FAX: 01144 1225 793803 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho |