Yet another question about merging files

Posted by pmarkc (pmarkc), 5 March 2008

I have another question about merging files. I am also new to Perl and this seems like it should be easy enough, but I haven't been able to make it work. I have 2 files with ~16M lines each and 3 columns. The number of lines is not exactly the same, but the x and y min and max are known and are the same in both files. I would like to combine the 2 files into one file with 4 columns. If a value exists in either file, print it; if no value exists, leave it blank. I thought it would be a simple loop, but it is taking entirely too long for me to get it to work. I have tried many available data analysis tools, and every one chokes on the size of the data except one, and that one does not allow scripting. Could you please help me? I am not a student, just a data junkie trying to verify a hypothesis.

Example:

File 1
x,y,z
1,1,1100
1,2,2100
1,4,4100
1,5,5100
1,7,7100

File 2
x,y,z
1,1,1200
1,3,3200
1,4,4200
1,5,5200

Output file:
x,y,z1,z2
1,1,1100,1200
1,2,2100,
1,3,,3200
1,4,4100,4200
1,5,5100,5200
1,7,7100,

BR,
Mark

Posted by admin (Graham Ellis), 5 March 2008

Are the data values (x and y) in the same order in both files, or may they be different? And are they predictable? (It looks like the answers are "same" and "yes", but that's a guess from a tiny data sample!) Is any particular order required for the output? Suggestions will differ depending on your answers!

Posted by KevinAD (KevinAD), 5 March 2008

~16M

is that 16,000,000 or 16,000?

Posted by pmarkc (pmarkc), 5 March 2008

The X range is 2047-0
The Y range is 0-8191

Y increments fastest; each time Y reaches the end of its range, X decrements.

X_phy Y_phy VT
2047 0 7000
2047 2 6500
2047 4 6500
2047 5 6700
2047 6 6300
2047 8 6600
2047 9 6800
2047 10 6400
.
.
.
2047 8187 6700
2047 8188 6400
2047 8189 6500
2047 8190 6600
2047 8191 6800
2046 0 6700
2046 1 6800
2046 2 6500
2046 3 7000
2046 4 6600
2046 5 6900
2046 6 6400

There are slightly fewer than 16777216 records in each file. Everything I tried blew up due to memory constraints, and I was not sure if there was a way to read and output without having to load the whole file, while working around the occasional missing record in one file or the other. Thank you for your help.
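On the question of reading without loading the whole file: a minimal sketch, assuming whitespace-delimited records as in the sample above. Reading the filehandle in scalar context fetches one line per loop iteration, so memory use should stay flat however long the file is. The in-memory sample data and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for the real file
my $data = "2047 0 7000\n2047 2 6500\n2047 4 6500\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory data: $!";

my $count = 0;
my $sum   = 0;

# Scalar-context readline returns one line per iteration,
# so only the current record is held in memory at any time
while (my $line = <$fh>) {
    my ($x, $y, $vt) = split ' ', $line;
    $count++;
    $sum += $vt;
}
close $fh;

print "$count records, mean VT ", $sum / $count, "\n";
```

By contrast, assigning the filehandle to an array (list context) would pull every line into memory at once, which is the usual cause of the blow-ups described above.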

Posted by pmarkc (pmarkc), 5 March 2008

BTW, No particular order is required.

Posted by KevinAD (KevinAD), 6 March 2008

The problem with running out of memory is not easy to get around. You may have to try and do this in more than one pass over the files to avoid using too much memory. That would more than likely mean the script will take longer to run, or maybe Graham has a trick up his sleeve.

Posted by admin (Graham Ellis), 6 March 2008

on 03/06/08 at 00:09:40, KevinAD wrote:

... or maybe Graham has a trick up his sleeve.

Mark, Kevin - I can never resist a challenge.

Since both files are in the same order, you can
* read a record from the SECOND file
* loop through each record from the first file and
  a) if it comes before the current second record, output it
  b) if it has the same x and y, merge the two records and output the result, then read another record from the second file
  c) if it comes after the current second record, output the second record, read another record from the second file, and compare again

You need to take care to handle the end conditions correctly, and to handle sections where there are multiple records coming in from one file with no matching record from the other. Here is a first test:

Code:

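open (FH1,"input1") or die "Cannot open input1: $!";
open (FH2,"input2") or die "Cannot open input2: $!";

# Prime the merge with the first record from the second file
@second = reader(2);

while (@first = reader(1)) {
    while (1) {
        # First record sorts earlier (or second file is exhausted): output it
        if (! @second or $first[0] > $second[0]) {
            writer(@first);
            last;
        }
        # Second record sorts earlier: output it and fetch the next one
        if ($first[0] < $second[0]) {
            writer(@second);
            @second = reader(2);
            next;
        }
        # Same x and y: merge into a single output record
        merge();
        @second = reader(2);
        writer(@first);
        last;
    }
}
# Flush whatever remains in the second file
writer(@second) if (@second);
while (@second = reader(2)) {
    writer(@second);
}

# Read a record from file 1 or file 2; returns (key, x, y, z1, z2)
sub reader {
    @rline = split(/\s+/,<FH1>) if ($_[0] == 1);
    @rline = split(/\s+/,<FH2>) if ($_[0] == 2);
    return if (! @rline);
    # The key falls steadily through the file: x decrements, y increments
    @rl2 = ($rline[0] * 10000 - $rline[1],@rline,"");
    if ($_[0] == 2) {
        # Values from the second file belong in the last column
        $rl2[4] = $rl2[3];
        $rl2[3] = "";
    }
    return @rl2;
}

# Output a record as comma separated values, dropping the key
sub writer {
    ($key,@vals) = @_;
    $line = join(",",@vals);
    print ($line,"\n");
}

# Copy the second file's value into the current first record
sub merge {
    $first[4] = $second[4];
}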

Bad practice in terms of use of globals all over the place - but it worked for me with a set of data that I doctored from your set, Mark, to produce a simple test case. By the way Mark, I noted that your inputs (second example) were space delimited and your outputs comma delimited, and I have stuck to that.
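For comparison, here is a sketch of the same merge written with lexical variables instead of globals. It is not Graham's code: the names next_record and merge_sorted are invented for this example, the sample data is made up, and it assumes space-delimited input with no header line, sorted with x decrementing and y (below 10000) incrementing.

```perl
use strict;
use warnings;

# Read one record; returns [key, x, y, z], or nothing at end of file.
# The key x*10000 - y falls steadily through the file because x
# decrements slowly while y increments quickly.
sub next_record {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        my ($x, $y, $z) = split ' ', $line;
        next unless defined $z;              # skip blank or short lines
        return [$x * 10000 - $y, $x, $y, $z];
    }
    return;
}

# Merge two streams sorted by descending key; $emit receives (x, y, z1, z2)
sub merge_sorted {
    my ($fh1, $fh2, $emit) = @_;
    my $r1 = next_record($fh1);
    my $r2 = next_record($fh2);
    while ($r1 || $r2) {
        if (!$r2 || ($r1 && $r1->[0] > $r2->[0])) {
            # Record present only in the first file
            $emit->($r1->[1], $r1->[2], $r1->[3], '');
            $r1 = next_record($fh1);
        }
        elsif (!$r1 || $r1->[0] < $r2->[0]) {
            # Record present only in the second file
            $emit->($r2->[1], $r2->[2], '', $r2->[3]);
            $r2 = next_record($fh2);
        }
        else {
            # Same x and y in both files
            $emit->($r1->[1], $r1->[2], $r1->[3], $r2->[3]);
            $r1 = next_record($fh1);
            $r2 = next_record($fh2);
        }
    }
}

# Hypothetical sample data read through in-memory filehandles
my $in1 = "2047 0 7000\n2047 2 6500\n2046 0 6700\n";
my $in2 = "2047 0 7000\n2047 3 6500\n2046 0 6700\n2046 1 6800\n";
open(my $fh1, '<', \$in1) or die "Cannot open first input: $!";
open(my $fh2, '<', \$in2) or die "Cannot open second input: $!";

my @out;
merge_sorted($fh1, $fh2, sub { push @out, join(',', @_) });
print "$_\n" for @out;
```

Passing the output step in as a code reference keeps the merge logic separate from the formatting, so a different delimiter or an extra column only needs a different callback.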

Posted by admin (Graham Ellis), 6 March 2008

input1:

2047 0 7000
2047 2 6500
2047 4 6500
2047 5 6700
2047 6 6300
2047 8 6600
2047 9 6800
2047 10 6400
2047 8187 6700
2047 8188 6400
2047 8189 6500
2047 8190 6600
2047 8191 6800
2046 0 6700
2046 1 6800
2046 2 6500
2046 3 7000
2046 4 6600
2046 5 6900
2046 6 6400

input2:

2047 0 7000
2047 2 6500
2047 3 6500
2047 4 8500
2047 5 6700
2047 8 6600
2047 9 8800
2047 10 6400
2047 8187 6700
2047 8188 6400
2047 8189 8500
2047 8190 6600
2047 8191 6800
2046 0 6700
2046 1 6800
2046 2 6500
2046 3 7000
2046 5 6900
2046 6 6400

result

dolphin:~ graham$ perl bis
2047,0,7000,7000
2047,2,6500,6500
2047,3,,6500
2047,4,6500,8500
2047,5,6700,6700
2047,6,6300,
2047,8,6600,6600
2047,9,6800,8800
2047,10,6400,6400
2047,8187,6700,6700
2047,8188,6400,6400
2047,8189,6500,8500
2047,8190,6600,6600
2047,8191,6800,6800
2046,0,6700,6700
2046,1,6800,6800
2046,2,6500,6500
2046,3,7000,7000
2046,4,6600,
2046,5,6900,6900
2046,6,6400,6400
dolphin:~ graham$

Posted by pmarkc (pmarkc), 6 March 2008

Graham,
Thank you! I guess I should have paid more attention when I put in the example data, because my files are comma delimited. But with your help and your code I was able to edit it to get exactly what I wanted. Thank you very much for the code. It took ~10 minutes to run on the 2 full 16M line files, and I added delta calculations and default values for the data missing in one file. Also, the amount of memory it uses is amazingly small. Here is my final code (I really should say here are my modifications to your code).

open (FH1,$ARGV[0]) or die "Cannot open the file: $! " ,$ARGV[0];
open (FH2,$ARGV[1]) or die "Cannot open the file: $! " ,$ARGV[1];

@second = reader(2);

while (@first = reader(1)) {
    while (1) {
        if ($first[0] > $second[0]) {
            writer(@first);
            last;
        }
        if ($first[0] < $second[0]) {
            writer(@second);
            @second = reader(2);
            next;
        }
        merge();
        @second = reader(2);
        writer(@first);
        last;
    }
}
writer(@second) if (@second);
while (@second = reader(2)) {
    writer(@second);
}

sub reader {
    @rline = split(/\,/,<FH1>) if ($_[0] == 1);
    @rline = split(/\,/,<FH2>) if ($_[0] == 2);
    chomp(@rline);
    return if (! @rline);
    @rl2 = ($rline[0] * 10000 - $rline[1],@rline);
    if ($_[0] == 2) {
        $rl2[4] = $rl2[3];
        $rl2[3] = "7100";
    }
    return @rl2;
}

sub writer {
    ($key,@vals) = @_;
    if($vals[2]!= "" && $vals[3]!= "") {
        $vals[4]=$vals[2]-$vals[3];
    }
    $line = join(",",@vals);
    print ($line,"\n");
}

sub merge {
    $first[4] = $second[4];
}

Output is:

X_phy,Y_phy,VT,VT
2045,2306,6400,6400,0
2045,2307,7000,6900,100
2045,2308,6300,6300,0
2045,2309,6800,6700,100
2045,2310,6400,6300,100
2045,2311,6700,6600,100
2045,2312,6600,6500,100
2045,2313,7000,6900,100
2045,2314,6400,6300,100
2045,2315,6800,6800,0
2045,2316,6700,6600,100
2045,2318,6600,6600,0
2045,2319,7100,7000,100
2045,2320,6300,6300,0
2045,2321,6700,6700,0

I just realized I need to go back and modify the column headers a little. Anyway, it works great.

Thank You Again,
Mark
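The delta column and the header tidy-up Mark mentions could be sketched as follows. This is an illustration only: format_row, the column names and the sample values are all hypothetical, and the default value of 7100 used in Mark's code is not modelled. It tests for a missing reading with length() rather than the numeric comparison != "", so a reading of 0 would still count as present.

```perl
use strict;
use warnings;

# Build one CSV line; the delta is filled in only when both readings exist
sub format_row {
    my ($x, $y, $z1, $z2) = @_;
    my $delta = (length($z1) && length($z2)) ? $z1 - $z2 : '';
    return join(',', $x, $y, $z1, $z2, $delta);
}

# Hypothetical column names, one per output field
my @header = qw(X_phy Y_phy VT1 VT2 delta);
print join(',', @header), "\n";

# Hypothetical rows: one with both readings, one missing the second
print format_row(2045, 2307, 7000, 6900), "\n";
print format_row(2045, 2317, 6500, ''), "\n";
```

Printing the header separately, before any data rows are processed, avoids passing the header text through the numeric key and delta calculations.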

Posted by KevinAD (KevinAD), 6 March 2008

Nicely done Graham

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.

