1SmartSolution Blog: Dealing With Duplicates

Posted At : Sep 22, 2010 14:05 PM | Posted By : Ed Tabara
Related Categories: SQL

Tables often end up containing duplicate rows, and we need either to select only the unique rows or to delete the duplicates. Earlier this year I posted about Partitioning; now let's see how it works in practice. Microsoft SQL Server 2005 added the Row_Number() Over(Partition By ... Order By ...) feature, which handles such situations efficiently.

First, let's build the scenario.
Create the table:

create table Emp_Details
(  Emp_Name varchar(10)
 , Company varchar(15)
 , Join_Date datetime
 , Resigned_Date datetime
)
go

and insert the sample rows (the multi-row VALUES syntax used here requires SQL Server 2008 or later):

insert into Emp_Details (Emp_Name, Company, Join_Date, Resigned_Date)
values ('John', 'Software', '20060101', '20061231')
,('John', 'Software', '20060101', '20061231')
,('John', 'Software', '20060101', '20061231')
,('John', 'SuperSoft', '20070101', '20071231')
,('John', 'UltraSoft', '20070201', '20080131')
,('John', 'ImproSoft', '20080201', '20081231')
,('John', 'ImproSoft', '20080201', '20081231')
,('Mary', 'Software', '20060101', '20081231')
,('Mary', 'SuperSoft', '20090101', '20090531')
,('Mary', 'SuperSoft', '20090101', '20090531')
,('Mary', 'UltraSoft', '20090601', '20100531')
,('Mary', 'UltraSoft', '20090601', '20100531')

So, what effect does Row_Number() Over() have when it is used with the Partition By clause? The function groups the rows of the Emp_Details table that share the same values in the Emp_Name, Company, Join_Date and Resigned_Date columns. The first row of each such combination is assigned RowNumber = 1, and the following rows with the same combination are assigned 2, 3, and so on. When a new combination of the four columns is encountered, it is treated as a new partition and, thanks to the Partition By clause, the numbering starts from 1 again. In essence, rows are grouped by the columns listed in the Partition By clause and then numbered within each group in the order given by the Order By clause. Here the Order By columns are the same as the Partition By columns, so the order inside a partition is arbitrary, which is fine because the rows in a partition are identical:

select Emp_Name
      ,Company
      ,Join_Date
      ,Resigned_Date
      ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date
                         ,Resigned_Date
                          order by Emp_Name, Company, Join_Date
                         ,Resigned_Date) RowNumber
from Emp_Details

And the result is:

Emp_Name Company      Join_Date      Resigned_Date  RowNumber
-------- ------------ -------------- -------------- ----------
John     ImproSoft    2008-02-01     2008-12-31     1
John     ImproSoft    2008-02-01     2008-12-31     2
John     Software     2006-01-01     2006-12-31     1
John     Software     2006-01-01     2006-12-31     2
John     Software     2006-01-01     2006-12-31     3
John     SuperSoft    2007-01-01     2007-12-31     1
John     UltraSoft    2007-02-01     2008-01-31     1
Mary     Software     2006-01-01     2008-12-31     1
Mary     SuperSoft    2009-01-01     2009-05-31     1
Mary     SuperSoft    2009-01-01     2009-05-31     2
Mary     UltraSoft    2009-06-01     2010-05-31     1
Mary     UltraSoft    2009-06-01     2010-05-31     2
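
For comparison, here is a minimal sketch of the more traditional Group By / Having check. It is not part of the Row_Number() approach, and the Copies alias is only illustrative. It shows which combinations are duplicated and how many times, but unlike Row_Number() it gives no handle on the individual surplus rows.

```sql
-- List every combination that occurs more than once, with its count.
-- "Copies" is an illustrative alias, not a name from the original post.
select Emp_Name, Company, Join_Date, Resigned_Date
      ,COUNT(*) as Copies
from Emp_Details
group by Emp_Name, Company, Join_Date, Resigned_Date
having COUNT(*) > 1
```

With the sample data above this should report four combinations: John/Software with 3 copies, and John/ImproSoft, Mary/SuperSoft and Mary/UltraSoft with 2 copies each.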

If we need to get unique rows:

select a.Emp_Name, a.Company, a.Join_Date, a.Resigned_Date, a.RowNumber
from
(select Emp_Name
       ,Company
       ,Join_Date
       ,Resigned_Date
       ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date
                          ,Resigned_Date
                           order by Emp_Name, Company, Join_Date
                          ,Resigned_Date) RowNumber
from Emp_Details) a
where a.RowNumber = 1
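
When the rows are exact duplicates in every column, as in this sample table, a plain Select Distinct should return the same seven unique rows (without the RowNumber column). The sketch below is only for comparison; the Row_Number() form becomes the more useful one when just some of the columns define what counts as a duplicate.

```sql
-- Same unique rows for the exact-duplicate case; no row numbering involved.
select distinct Emp_Name, Company, Join_Date, Resigned_Date
from Emp_Details
```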

To see the rows with duplicates:

select a.Emp_Name, a.Company, a.Join_Date, a.Resigned_Date, a.RowNumber
from
(select Emp_Name
       ,Company
       ,Join_Date
       ,Resigned_Date
       ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date
                          ,Resigned_Date
                           order by Emp_Name, Company, Join_Date
                          ,Resigned_Date) RowNumber
from Emp_Details) a
where a.RowNumber > 1

And finally to remove the duplicates:

delete from a
from
(select Emp_Name, Company, Join_Date, Resigned_Date
       ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date
                          ,Resigned_Date
                           order by Emp_Name, Company, Join_Date
                          ,Resigned_Date) RowNumber
from Emp_Details) a
where a.RowNumber > 1
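
Since a delete is hard to undo, it can help to try it inside a transaction first. The following is a minimal sketch of the same delete written with a common table expression. The Numbered CTE name and the DeletedRows and RemainingRows aliases are illustrative, and the rollback at the end is there only so the trial run changes nothing.

```sql
begin transaction;

-- Number the rows inside each group of identical rows.
-- "Numbered" is an illustrative CTE name.
with Numbered as (
    select ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date
                             ,Resigned_Date
                              order by Emp_Name) as RowNumber
    from Emp_Details
)
-- Deleting through the CTE removes the underlying Emp_Details rows.
delete from Numbered
where RowNumber > 1;

-- Rows removed by the delete just above.
select @@ROWCOUNT as DeletedRows;

-- Rows left in the table.
select COUNT(*) as RemainingRows from Emp_Details;

-- Trial run: undo the delete. Replace with "commit transaction;" to keep it.
rollback transaction;
```

With the twelve sample rows above, DeletedRows should be 5 and RemainingRows should be 7.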

Enjoy!

Related entries:
Data Access Optimization In SQL Server: Partitioning
