How to get rid of duplicate data?? - The MySQL Relational Database

Posted by rhservice (rhservice), 11 July 2005

We are all pretty much database newbies here, but have learned a lot in a short time. However.............

We have a MySQL database of all our customers along with the details about each customer. There are currently a total of 28,629 records (rows) in the database. As of now all data is in 1 table with 29 fields. We want to break it out into 6 tables with each one holding a different range of the data, ie. Customers, Addresses, Misc Data, Rate Plans, Phone Models, and Sales Outlets as the table names. The problem is that when the data was originally imported from our accounting package, because of how it was entered all along and because of how the data exports, we now have many customers along with their addresses that are in the database multiple times with the the rest of the data for the customer being different for each instance. For example this customer may be in the database 4 times: Customer_John Q. Public - Address_123 N. Main St, - City_Anytown - State_Oh, from that point the rest of the data may be mostly or even completely different for each row that the customer is in. For example: different Rate Plans, different Phone Models, etc.

This has resulted in making it hard for us to break it out into the 6 separate tables and also makes it harder for the folks who are entering the data now as they are re-entering the customer names, addresses & etc when they get a new phone or rate plan in order to keep all of the existing data for historical reasons. They do not want to just change the current data to the new information since they frequently need to access the old information for several reasons.

The real question is: How can we get rid of all the duplicate Customer and address entries and still keep 1 entry for each customer still linked to all of the Customer info in the other tables. Then we can create a set of HTML pages with forms for entering data, searches, & etc. We want to be able to enter data for an existing customer on one form and have all the data go into the proper tables, then when the sales folks look up a customer they will get all of the details about that customer on one page or in one form instead of multiple rows because of the duplicate entries. We have not been able to come up with a good way to do this yet. The HTML pages should not be a problem, we just need to get the data broken down and get rid of the duplicates first.

Any help will be greatly appreciated and I can provide more info if needed and even sql files of the table structure as it is now and what we want to go to if needed.

Thank You, Larry

Posted by admin (Graham Ellis), 12 July 2005

Superb question .... and you're not the first to be going this route. May I paraphrase the question into slightly more general terms?

You have a single table database where some of the information in each row is repeated for a number of records. This is relatively quick and easy setup to do initially, but starts giving problems later on where you want to (for example) change the value of one of the fields that repeats, and you find you have to do it a number of times. You're at this stage now, where you want to "normalise" your database so that
* each piece of data is only there once
* there's only one piece of data in each cell
* calculated information is calculated each time an not held in the table
(I think the first issue there is the big one for you).

You haven't told me what language you use for handling the data on your site - I'm going to write an example in PHP (simply because I'm doing some PHP this morning and it will get me in the flow). Similar principle could be used for Perl or other languages. But you ARE going to have to write a program and you ARE going to have to modify all your existing applications to use the new table structures.

Approach:

Create your two (or more) new tables.
Write a piece of code to select all records.
Spray the data out to the two (or more) new tables, adding keys

I'm spending a bit of time putting an example toghether here as I suspect that the problem you have may be of interest to many others too. Here's some data I have made up and placed in a table

Code:

mysql> select * from tonw;
+-----------+-------+----------+------------+-----------+
| company | ttype | whenhere | goesto | whenthere |
+-----------+-------+----------+------------+-----------+
| wessex | train | 06:02 | Chippenham | 06:12 |
| wessex | train | 06:02 | Swindon | 06:30 |
| wessex | train | 08:05 | Chippenham | 08:16 |
| first | bus | 07:22 | Chippenham | 07:43 |
| first | bus | 08:05 | Chippenham | 08:45 |
| wessex | train | 08:05 | Swindon | 08:34 |
| faresaver | bus | 08:07 | Chippenham | 08:35 |
+-----------+-------+----------+------------+-----------+

It lists trains and buses from our town of Melksham to North West Wiltshire and for a single departure there may be more than one desitination listed.

Here's the code to split the tables:

Code:

#!/usr/bin/php -q

Normalising a database - demo

<?php

$common = array("company","ttype","whenhere");
$cid = 0; # Key to generate
$sofar = array();

mysql_connect("localhost","trainee","abc123");
mysql_select_db("test");
mysql_query ("create table common (company text, ttype text, whenhere text, cid int primary key not null auto_increment)");
mysql_query ("create table differ (goesto text, whenthere text, cid int, did int primary key not null auto_increment)");

$q = mysql_query("select * from tonw");

while ($row = mysql_fetch_assoc($q)) {
$superkey = "";
foreach ($common as $fld) {
$superkey .= ":".$row[$fld];
}
print ("$superkey\n");
if (! $sofar[$superkey]) {
$sofar[$superkey] = ++$cid;
$qout = "insert into common (company, ttype, whenhere, cid) values (\"$row[company]\", \"$row[ttype]\", \"$row[whenhere]\", $cid)";
mysql_query($qout);
print ("$qout\n");
}
$qout = "insert into differ (goesto, whenthere, cid) values (\"$row[goesto]\", \"$row[whenthere]\", $sofar[$superkey])";
mysql_query($qout);
print ("$qout\n");
}

?>

And here are the resulting tables ...

Code:

mysql> select * from differ;
+------------+-----------+------+-----+
| goesto | whenthere | cid | did |
+------------+-----------+------+-----+
| Chippenham | 06:12 | 1 | 1 |
| Swindon | 06:30 | 1 | 2 |
| Chippenham | 08:16 | 2 | 3 |
| Chippenham | 07:43 | 3 | 4 |
| Chippenham | 08:45 | 4 | 5 |
| Swindon | 08:34 | 2 | 6 |
| Chippenham | 08:35 | 5 | 7 |
+------------+-----------+------+-----+
7 rows in set (0.01 sec)

mysql> select * from common;
+-----------+-------+----------+-----+
| company | ttype | whenhere | cid |
+-----------+-------+----------+-----+
| wessex | train | 06:02 | 1 |
| wessex | train | 08:05 | 2 |
| first | bus | 07:22 | 3 |
| first | bus | 08:05 | 4 |
| faresaver | bus | 08:07 | 5 |
+-----------+-------+----------+-----+
5 rows in set (0.00 sec)

mysql>

If your data tables were huge, you would want to get the incoming MySQL query ordered to group them by "superkey" and save the need for the superkey array $sofar.

In order to make my code followable, I have used specific field and table names but in the more general case I would suggest you use arrays of field names to avoid to much repitition; indeed, it's a bit ironic that my "spike solution" here to your question about removing duplicated information has itself a lot of duplication contained in it!

I notice that you're talking names and addresses, and user entered information, by the way. That's may be highly unreliable when you come to matching up duplicated fields as someone may use capital letters one day and lower case the next, or may enter a comma in the middle of their address sometimes but not others. But we'll come back to this issue later if you indicate the need ...

Posted by rhservice (rhservice), 12 July 2005

Thanks for the reply. We use PHP here which we are pretty new at also but learning as we go so we should be able to handle it OK. As soon as I have the time (next day or so) I will use your example to start working on this and reply back if we hit a snag.

I don't know how much data most folks have but I think that over 29,600 rows with 29 fields in each row is quite a bit for we newbies to work with. When we work on this we always export the data to work on then when we are sure that all is well we import it back in with the changes meanwhile making sure noone is going to enter any new info while we are working on it.

Also if it will help, here is the current structure and what we are looking to go to...Feel free to criticize and offer any suggestions on how to do it better. We may be way off base here and not realize it !!!

CURRENT:
/*
MySQL Backup
Source Host: localhost
Source Server Version: 4.1.12
Source Database: cbtest1
Date: 2005/07/11 15:09:52
*/

use cbtest1;
SET FOREIGN_KEY_CHECKS=0;
#----------------------------
# Table structure for customer_list_revised_2
#----------------------------
CREATE TABLE `customer_list_revised_2` (
`Customer_Name` varchar(150) collate latin1_general_ci NOT NULL default '',
`Bus_Nm` varchar(150) collate latin1_general_ci default NULL,
`Address_Line1` varchar(150) collate latin1_general_ci default NULL,
`Address_Line2` varchar(150) collate latin1_general_ci default NULL,
`City_Name` varchar(150) collate latin1_general_ci default NULL,
`State_Cd` char(2) collate latin1_general_ci default NULL,
`Zip5_Cd` decimal(5,0) default NULL,
`Zip4_Cd` decimal(4,0) default NULL,
`Bus_Tel_Num` decimal(25,0) default NULL,
`Home_Tel_Num` decimal(25,0) default NULL,
`Tax_ID_Number` decimal(25,0) default NULL,
`Account_Number` decimal(25,0) default NULL,
`Mobile_Number` decimal(25,0) default NULL,
`Esn_Num` varchar(25) collate latin1_general_ci default NULL,
`New_ESN` varchar(25) collate latin1_general_ci default NULL,
`Cntrct_Eff_Dt` date default '0000-00-00',
`Sls_Outlet_Id` varchar(25) collate latin1_general_ci default NULL,
`Rate_Plan_Description` varchar(150) collate latin1_general_ci default NULL,
`Monthly_Access` decimal(10,2) default NULL,
`Term` decimal(2,0) default NULL,
`Contract_End_Date` date default '0000-00-00',
`Rate_Plan_Code` varchar(25) collate latin1_general_ci default NULL,
`Phone_Model` varchar(50) collate latin1_general_ci default NULL,
`Eqp_Mode` varchar(25) collate latin1_general_ci default NULL,
`Last_4_SS#` decimal(4,0) default NULL,
`Activating_Sales_person` varchar(50) collate latin1_general_ci default NULL,
`Comments` varchar(150) collate latin1_general_ci default NULL,
`Last_Update_Date` date default NULL,
`Index_Key` int(5) NOT NULL auto_increment,
PRIMARY KEY (`Index_Key`),
UNIQUE KEY `Index_Key` (`Index_Key`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
#----------------------------
# No records for table customer_list_revised_2
#----------------------------

GOING TO: (The "Users Table" is where login info for the database will go)
/*
MySQL Backup
Source Host: localhost
Source Server Version: 4.1.12
Source Database: cb-test-2
Date: 2005/07/11 15:15:16
*/

use `cb-test-2`;
SET FOREIGN_KEY_CHECKS=0;
#----------------------------
# Table structure for addresses
#----------------------------
CREATE TABLE `addresses` (
`Customer_Key` int(11) NOT NULL auto_increment,
`Address_Line_1` varchar(255) collate latin1_general_ci default NULL,
`Address_Line_2` varchar(255) collate latin1_general_ci default NULL,
`City_Name` varchar(255) collate latin1_general_ci default NULL,
`State_Code` varchar(255) collate latin1_general_ci default NULL,
`Zip_5_Code` varchar(255) collate latin1_general_ci default NULL,
`Zip_4_Code` varchar(255) collate latin1_general_ci default NULL,
PRIMARY KEY (`Customer_Key`),
KEY `address_key` (`Customer_Key`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
#----------------------------
# No records for table addresses
#----------------------------

#----------------------------
# Table structure for customers
#----------------------------
CREATE TABLE `customers` (
`Customer_Key` int(11) NOT NULL auto_increment,
`Customer_Name` varchar(255) collate latin1_general_ci default NULL,
`Business_Name` varchar(255) collate latin1_general_ci default NULL,
PRIMARY KEY (`Customer_Key`),
UNIQUE KEY `customer_key` (`Customer_Key`),
KEY `customer` (`Customer_Name`),
KEY `business` (`Business_Name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
#----------------------------
# No records for table customers
#----------------------------

#----------------------------
# Table structure for misc_data
#----------------------------
CREATE TABLE `misc_data` (
`Customer_Key` int(11) NOT NULL auto_increment,
`Business_Tel_Number` varchar(255) collate latin1_general_ci default NULL,
`Home_Tel_Number` varchar(255) collate latin1_general_ci default NULL,
`Tax_ID_Number` varchar(255) collate latin1_general_ci default NULL,
`Account_Number` varchar(255) collate latin1_general_ci default NULL,
`Mobile_Number` varchar(255) collate latin1_general_ci default NULL,
`ESN_Number` varchar(255) collate latin1_general_ci default NULL,
`New_ESN_Number` varchar(255) collate latin1_general_ci default NULL,
`Last_4_SSN` varchar(255) collate latin1_general_ci default NULL,
`Comments` varchar(255) collate latin1_general_ci default NULL,
`Last_Update_Date` varchar(255) collate latin1_general_ci default NULL,
PRIMARY KEY (`Customer_Key`),
KEY `IDX_Misc_Data1` (`Customer_Key`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
#----------------------------
# No records for table misc_data
#----------------------------

#----------------------------
# Table structure for outlets
#----------------------------
CREATE TABLE `outlets` (
`Customer_Key` int(11) NOT NULL auto_increment,
`Outlet_ID` varchar(255) collate latin1_general_ci default NULL,
`Sales_Reps` varchar(255) collate latin1_general_ci default NULL,
PRIMARY KEY (`Customer_Key`),
KEY `IDX_Outlets1` (`Customer_Key`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
#----------------------------
# No records for table outlets
#----------------------------

#----------------------------
# Table structure for phones
#----------------------------
CREATE TABLE `phones` (
`Customer_Key` int(11) NOT NULL auto_increment,
`Phone_Model` varchar(255) collate latin1_general_ci default NULL,
`Equip_Mode` varchar(255) collate latin1_general_ci default NULL,
PRIMARY KEY (`Customer_Key`),
KEY `IDX_Phones1` (`Customer_Key`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
#----------------------------
# No records for table phones
#----------------------------

#----------------------------
# Table structure for rate_plans
#----------------------------
CREATE TABLE `rate_plans` (
`Customer_Key` int(11) NOT NULL auto_increment,
`Rate_Plan_Description` varchar(255) collate latin1_general_ci default NULL,
`Monthly_Access` varchar(255) collate latin1_general_ci default NULL,
`Term` varchar(255) collate latin1_general_ci default NULL,
`Rate_Plan_Code` varchar(255) collate latin1_general_ci default NULL,
`Contract_Eff_Date` varchar(255) collate latin1_general_ci default NULL,
`Contract_End_Date` varchar(255) collate latin1_general_ci default NULL,
PRIMARY KEY (`Customer_Key`),
KEY `IDX_Rate_Plans1` (`Customer_Key`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
#----------------------------
# No records for table rate_plans
#----------------------------

#----------------------------
# Table structure for users_tab
#----------------------------
CREATE TABLE `users_tab` (
`ID_user` int(10) unsigned NOT NULL auto_increment,
`user_type_user` varchar(50) collate latin1_general_ci NOT NULL default '',
`username_user` varchar(50) collate latin1_general_ci NOT NULL default '',
`password_user` varchar(32) collate latin1_general_ci NOT NULL default '',
PRIMARY KEY (`ID_user`),
UNIQUE KEY `username_user` (`username_user`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
#----------------------------
# No records for table users_tab
#----------------------------

Posted by rhservice (rhservice), 23 August 2005

I thought I should take a few minutes to let you know that the script you posted here worked just fine. However we still were unable to get rid of all of the dupes because of this, and I quote you "someone may use capital letters one day and lower case the next, or may enter a comma in the middle of their address sometimes but not others". There are cases where a customers phone number was entered correctly one time and was entered with one number wrong another time, example - 4192296010 then 4192296011. Also names are misspelled, differences in address entries, and many more differences in records that are actually the same customer and should have identical info.

I was able to eliminate over 5000 dupes from the database that currently has a total of 29372 records before the dupes are removed. That helped a lot, but now I am trying to figure out how to eliminate the rest without someone having to go in and manually edit the rest of the duplicates to make them the same so I can rerun the script to remove the rest. I really don't know how many more duplicates there are but just from browsing through, my guess is that there may be another 5000 or more. It could be less or more than that. Any ideas?? Or are we stuck with someone having to edit them before we can get rid of them??

Thanks, Larry

Posted by admin (Graham Ellis), 24 August 2005

I'm thinking you could make good use of the levenshtein function in PHP which computes the "distance" between two strings - in other words, how many changes you have to make to get from one to the other.

Take pairs of customer records from your database, implode each of the pair into a single string, force them all to lower case as users never manage to enter correct case

and run the levenshtein. If you find that there are (say) less that 5 changes necessary to get from one string to the other, you've probably found a duplicate .... the number will need a bit of experimentation though and you'll probably find a proportion of special cases that it gets the wrong way ... for example two "John Smith"s living on the same block ...

This page is a thread posted to the opentalk forum at www.opentalk.org.uk and archived here for reference. To jump to the archive index please follow this link.