SQL: Identify & Remove Duplicate data from a SQL Table

Before we dive into the code, let us understand what is meant by Duplicate data first.

What is Duplicate Data?

  • Duplicate data, as the name suggests, refers to the presence of identical records in a table. OR
  • We label a row as duplicated when it is inserted into a table again, even though an identical entry already exists within that table
How to identify Duplicate data?
  • Duplicate rows are typically identified by referencing the primary or unique key column within the table. 
  • However, in cases where the table lacks any key columns, it is recommended to examine all the columns to accurately pinpoint duplicate data.
For the demonstration purpose let's create a sales table with few records,


















Here, To detect the duplicate rows we are considering SalesOrderID and SalesOrderLineItemID columns as unique Key columns and using the SQL Window function "ROW_NUMBER()".












How to Delete Duplicate Data?

To Delete the duplicate data, we use the unique behavior of "Common Table Expression (CTE)" in the SQL Server which is, if a CTE is created with only one table and any DML operations performed on that CTE will directly be reflected on to the underlying physical table. let us understand this with a demonstration.













Conclusion: 

The process of identifying and eliminating duplicate data from an SQL table is made straightforward with the ROW_NUMBER() Window function. Alternatively, one can employ the GROUP BY clause to filter for unique records, although it's important to note that this approach creates a distinct dataset without physically deleting the data from the table

Please find below the complete code used in this blog post:


CREATE TABLE dbo.Sales
(
ID INT IDENTITY(1,1) NOT NULL,
SalesOrderID BIGINT NOT NULL,
SalesOrderLineItemID INT NOT NULL,
LineItemDesc NVARCHAR(250) NULL,
OrderedDate DATE NULL,
CustomerID NVARCHAR(250) NULL
) 

INSERT INTO dbo.Sales 
(SalesOrderID,SalesOrderLineItemID,LineItemDesc,OrderedDate,CustomerID)
VALUES
(10001,1,'Sugar','2023-09-06','XYZ10'),
(10001,2,'Daal','2023-09-06','XYZ10'),
(10001,1,'Sugar','2023-09-06','XYZ10'),
(10002,1,'CloseUP','2023-10-15','PQR25'),
(10002,2,'MouthWash','2023-10-15','PQR25'),
(10003,1,'USBCable','2023-01-25','ABC99'),
(10003,1,'USBCable','2023-01-25','ABC99'),
(10003,1,'USBCable','2023-01-25','ABC99')

SELECT * FROM dbo.Sales
---------Detect Duplicate Rows
SELECT *, 
ROW_NUMBER() OVER(PARTITION BY SalesOrderID, SalesOrderLineItemID ORDER BY ID ASC) AS RowNumber
FROM dbo.Sales

----------Delete Duplicate Rows
;WITH CTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY SalesOrderID, SalesOrderLineItemID ORDER BY ID ASC) AS RowNumber
FROM dbo.Sales
)
DELETE FROM CTE WHERE RowNumber >1

SELECT * FROM dbo.Sales

Thanks for reading 🙏.



No comments:

Post a Comment