removing and counting duplicates in file

Posted by hjortur (hjortur), 4 November 2007

Hi. I'm new to Perl and I am trying to write code that takes a file, removes any duplicate lines, and then reports how many times each line occurred.
I am trying to combine my knowledge of C with Perl to do this, but I am unfamiliar with the syntax and have been unsuccessful in finding any clues on how to do it.

The file to be processed would look something like this:

192.168.1.1
192.168.1.1
200.1.1.2
201.43.43.1
10.0.0.1
200.1.1.2
192.168.1.1

The resulting output would be...

3 192.168.1.1
2 200.1.1.2
1 201.43.43.1
1 10.0.0.1

Tks for any assistance...

Hjortur..

Posted by admin (Graham Ellis), 4 November 2007

You probably want to go "beyond" C and look at Perl's hashes.

Have a look at:

http://www.wellho.net/resources/ex.php4?item=p211/web_count

which is the answer to a practical exercise on our Perl Programming Course. We ask the delegates to read an access log file of many thousand lines, and report on the number of times each client (identified by a unique host name in the first column) has visited us ...
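As a rough illustration of the hash approach suggested above, here is a minimal sketch using the sample addresses from the original question. The names count_lines and @sample are made up for this example and do not come from the linked course material.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_lines is a hypothetical helper: it takes a list of lines and
# returns [count, value] pairs in order of first appearance.
sub count_lines {
    my @lines = @_;
    my (%count, @order);
    for my $line (@lines) {
        chomp $line;                                 # strip a trailing newline, if any
        push @order, $line unless $count{$line}++;   # remember the first sighting only
    }
    return map { [ $count{$_}, $_ ] } @order;
}

# Sample data taken from the question at the top of the thread.
my @sample = qw(192.168.1.1 192.168.1.1 200.1.1.2 201.43.43.1
                10.0.0.1 200.1.1.2 192.168.1.1);

print "$_->[0] $_->[1]\n" for count_lines(@sample);
```

To run it against a real file, the lines read from an opened filehandle could be passed in instead of @sample.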

Posted by KevinAD (KevinAD), 5 November 2007

This is a very simple task for perl.

Code:

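#!/usr/bin/perl

use strict;
use warnings;

my $file = '/path/to/file.txt';
my %seen = ();
{
    local @ARGV = ($file);       # make <> read from $file
    local $^I = '.bac';          # edit in place; the original is kept as file.txt.bac
    while (<>) {
        chomp;                   # drop the newline so the hash keys are clean
        $seen{$_}++;             # count every occurrence of the line
        next if $seen{$_} > 1;   # skip repeats
        print "$_\n";            # the first occurrence goes back into the file
    }
}
# report the counts, most frequent first
foreach my $key ( sort { $seen{$b} <=> $seen{$a} } keys %seen ) {
    print "$key = $seen{$key}\n";
}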

Posted by george_Ball (george), 5 November 2007

Are you running on Unix/Linux? If so, then I'd suggest you don't bother with Perl, and use straight Unix:

sort file | uniq -c

which will generate the counts you want (pipe the result through sort -rn if you want the most frequent first).

Of course if you aren't using *x then... well, sorry for you!!

Posted by KevinAD (KevinAD), 5 November 2007

But if a person is trying to learn Perl, using *nix commands is not going to help them learn it. On the other hand, if all they want is to accomplish the task, that is a good suggestion.

Posted by hjortur (hjortur), 5 November 2007

Tks George

This worked...! I was trying to write the code using basically the same techniques as in C, so the code was rather ugly, with nested for loops...
As I am new to Perl, I am not exactly sure how this works,
for example: local $^I = '.bac';
I have been reading up on Perl and was trying to program associative arrays... I guess this is related...
Tks again..
Hjortur

Posted by george_Ball (george), 5 November 2007

Sorry, I didn't think when I posted and just assumed you had the problem to solve whatever way...

The solution that Kevin has posted is, as you have probably worked out, the best way to do this from Perl. One of the things I find continually when I am teaching Perl is that people don't appreciate just how much hashes can do for you, and problems like this are the perfect example of how they can save you work.

Happy hacking!!

Posted by KevinAD (KevinAD), 5 November 2007

on 11/05/07 at 17:23:57, hjortur wrote:

Tks George

"$^I" is a perl variable. Perl has many predefined variables that affect the way perl works. Here is the list of perl 5.8.8 variables:

http://perldoc.perl.org/perlvar.html

"$^I" holds the backup extension for Perl's in-place editing (the same feature as the -i command-line switch). Giving it a value turns in-place editing on: whatever you print inside the while(<>) loop goes back into the file, and the original is kept as a backup with '.bac' added to its name. It has nothing to do with associative arrays though. "local" gives the variable a temporary value for the duration of the enclosing block, so it does not affect the rest of your Perl program. Once the block is exited, Perl restores the old value. Using "local" is just a good habit to get into when changing Perl's predefined variables in your programs.
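To see "local" at work without touching any files, here is a small sketch that uses a different predefined variable, $" (the separator Perl puts between array elements inside a double-quoted string). The names @ip and show are invented for this example. The same mechanism applies to local $^I in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# @ip and show() are hypothetical names used only for this demonstration.
my @ip = (192, 168, 1, 1);

# Interpolating an array in a string joins its elements with $"
# (a single space by default).
sub show { return "@ip" }

print show(), "\n";        # default separator
{
    local $" = '.';        # temporary value, in effect until the block ends
    print show(), "\n";    # show() sees the temporary value
}
print show(), "\n";        # old value restored automatically
```

The first and last lines printed use the default space, while the middle one is joined with dots, because the change made with "local" disappears as soon as the block is left.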

The associative array (hash) is %seen which is used to find duplicates and remove them from the file. Because hash keys must be unique, they are well suited for finding duplicate data in files as well as other things.

Posted by hjortur (hjortur), 5 November 2007

Tks KevinAD

This code is very short and straight to the point...looking forward to learning Perl... it seems simple...but..
What does the "next if $seen{$_} > 1;" do by the way?
and how are the duplicates deleted? It is a bit hard to grasp...

H

Posted by KevinAD (KevinAD), 6 November 2007

The hash %seen is counting how many times each line is found in the file:

$seen{$_}++;

If the quantity is greater than one the line is skipped:

next if $seen{$_} > 1;

"next" is a loop control. It tells perl to jump to the next iteration of the loop immediately. In this case it's the "while" loop.

So the "print" line is never reached when $seen{$_} is greater than 1, which means the repeated line is not written back, and that effectively deletes it from the file.
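The same counting and skipping logic can be tried on a plain list, without any in-place editing. This is only a sketch; unique_lines is a made-up helper name and the letters stand in for lines of a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# unique_lines is a hypothetical helper: it returns its arguments with
# repeats dropped, keeping the first occurrence of each.
sub unique_lines {
    my (@kept, %seen);
    for my $line (@_) {
        $seen{$line}++;               # count this occurrence
        next if $seen{$line} > 1;     # a repeat: jump to the next item
        push @kept, $line;            # only reached on the first sighting
    }
    return @kept;
}

print "$_\n" for unique_lines(qw(a b a c b a));
```

Here "push" plays the role that "print" plays in the file version: it is only reached the first time a value is seen. Perl programmers often shorten the whole loop to grep { !$seen{$_}++ } @lines, which does the same filtering.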

Posted by hjortur (hjortur), 6 November 2007

Great

Tks a lot..!

Now I can continue experimenting....

Good to know I can get such great help here if I get stuck...!

H

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.