Removing duplicate lines with a regular expression

Posted by enquirer (enquirer), 5 August 2003
I found your email address on the site www.regularexpression.info

I need a regular expression to remove the duplicate lines in my code. Please help me.

Example code:

Code:
edit(abcd);
edit(abcd);
edit(1234);


And it should be

Code:
edit(abcd);
edit(1234);


I need a regular expression for this.


Posted by admin (Graham Ellis), 5 August 2003
What language did you want to use? Here's a possible solution in Perl:

Code:
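#!/usr/bin/perl
use strict;
use warnings;
my $name = shift or die ("No input file\n");
open (my $fh, '<', $name) or die ("Cannot open $name: $!\n");
my $file = do { local $/; <$fh> };   # slurp the whole file into one string
# Replace each run of identical consecutive lines with a single copy
$file =~ s/^(.*[\n\r]+)\1+/$1/mg;
print $file;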


Regular expressions are REALLY powerful - it's worth learning how to
write them yourself!

-- Technical note -- Nice example of using back references to previous matches (\1, etc) and also the "m" modifier to have ^ match at the start of every embedded line, not just at the start of the string.
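
To see what the back reference and the "m" modifier each contribute, here is a small sketch using the sample lines from the original question. The sub name collapse_adjacent and the inline test strings are just for illustration; the regular expression itself is the one from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of identical consecutive lines.
sub collapse_adjacent {
    my ($text) = @_;
    # ^            with /m, matches at the start of every line
    # (.*[\n\r]+)  captures one whole line plus its line ending as group 1
    # \1+          back reference: one or more further copies of that line
    $text =~ s/^(.*[\n\r]+)\1+/$1/mg;
    return $text;
}

my $input = "edit(abcd);\nedit(abcd);\nedit(1234);\n";
print "With /m:\n", collapse_adjacent($input);

# Without /m, ^ can only match at the very start of the string, so a
# duplicate pair that begins on the second line is never found.
my $later = "first;\nedit(abcd);\nedit(abcd);\n";
(my $no_m = $later) =~ s/^(.*[\n\r]+)\1+/$1/g;
print "Without /m:\n", $no_m;
```

The pattern expects every line, including the last one, to end with a line ending; a final line with no newline after it will not be recognised as a copy of the line before.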

Posted by John_Moylan (jfp), 11 August 2003
I know you asked for a regular expression... but another way to do this would be to use a hash (or associative array in other languages).

Loop through your line entries as an array and put them in a hash as the key.
Each key in a hash is unique, so a duplicate line simply lands on the entry that is already there instead of creating a new one.

Code:
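my %uniques;

# @array_of_lines is assumed to hold the input, one line per element
foreach my $key (@array_of_lines) {
   $uniques{$key}++;   # key is the line, value counts its occurrences
}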


The hash %uniques now holds each distinct line exactly once as a key; the value is a count of how many times that particular line occurred.
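
Here is a self-contained sketch of that counting approach, with a few sample lines filled in so that it runs on its own. The sub name count_lines and the sample data are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref mapping each distinct line
# to the number of times it was seen.
sub count_lines {
    my @lines = @_;
    my %uniques;
    $uniques{$_}++ foreach @lines;
    return \%uniques;
}

my @array_of_lines = ("edit(abcd);", "edit(abcd);", "edit(1234);");
my $uniques = count_lines(@array_of_lines);

# Hash keys come back in no particular order, so sort them for stable output.
foreach my $line (sort keys %$uniques) {
    print "$uniques->{$line} x $line\n";
}
```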

I'm sure I'll be corrected but this should be faster than a regex?

jfp

Posted by admin (Graham Ellis), 15 August 2003
Yes, that will indeed remove duplicate lines, but in a rather different way:

a) With a regular expression, duplicate lines are only removed if they follow one another directly; if a line occurs early in the file then happens to crop up again much later, the later copy will not be removed.

b) With a regular expression, the order of the lines is maintained.  Using a hash you can certainly output each unique line, but if they have to be in the same order as in the original input file, that's going to be an interesting task!
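
The two differences above can be seen side by side in a short sketch. It also shows one common way of keeping the original order while still using a hash: remember which lines have already gone by, and pass each line through only the first time. The sub names unique_in_order and collapse_adjacent and the sample lines are made up for this illustration, and it is only one possible approach.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop every repeat of a line, keeping first occurrences
# in their original order. %seen records which lines have already gone by.
sub unique_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical helper: the regular expression approach, adjacent repeats only.
sub collapse_adjacent {
    my ($text) = @_;
    $text =~ s/^(.*[\n\r]+)\1+/$1/mg;
    return $text;
}

# "edit(abcd);" appears twice in a row and then crops up again later.
my @lines = ("edit(abcd);\n", "edit(abcd);\n", "edit(1234);\n", "edit(abcd);\n");

print "Regular expression:\n", collapse_adjacent(join "", @lines);
print "Hash, order kept:\n",   unique_in_order(@lines);
```

The regular expression leaves the late copy of edit(abcd); in place, while the hash version removes it and keeps the remaining lines in the order they first appeared.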

Looking back at the original question - I'm really not sure which problem it was - the one that's solved with a hash, or the one that's solved with a regular expression. My original enquirer has written back "thanks for the reply", which doesn't give me any clue either; I guess we'll never know what the customer wanted.



This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference.
