SQL: Identify & Remove Duplicate data from a SQL Table

Before we dive into the code, let us understand what is meant by Duplicate data first.

What is Duplicate Data?

Duplicate data, as the name suggests, refers to the presence of identical records in a table. OR
We label a row as duplicated when it is inserted into a table again, even though an identical entry already exists within that table

How to identify Duplicate data?

Duplicate rows are typically identified by referencing the primary or unique key column within the table.
However, in cases where the table lacks any key columns, it is recommended to examine all the columns to accurately pinpoint duplicate data.

For the demonstration purpose let's create a sales table with few records,

Here, To detect the duplicate rows we are considering SalesOrderID and SalesOrderLineItemID columns as unique Key columns and using the SQL Window function "ROW_NUMBER()".

How to Delete Duplicate Data?

To Delete the duplicate data, we use the unique behavior of "Common Table Expression (CTE)" in the SQL Server which is, if a CTE is created with only one table and any DML operations performed on that CTE will directly be reflected on to the underlying physical table. let us understand this with a demonstration.

Conclusion:

The process of identifying and eliminating duplicate data from an SQL table is made straightforward with the ROW_NUMBER() Window function. Alternatively, one can employ the GROUP BY clause to filter for unique records, although it's important to note that this approach creates a distinct dataset without physically deleting the data from the table

Please find below the complete code used in this blog post:



CREATE TABLE dbo.Sales
(
ID INT IDENTITY(1,1) NOT NULL,
SalesOrderID BIGINT NOT NULL,
SalesOrderLineItemID INT NOT NULL,
LineItemDesc NVARCHAR(250) NULL,
OrderedDate DATE NULL,
CustomerID NVARCHAR(250) NULL
) 

INSERT INTO dbo.Sales 
(SalesOrderID,SalesOrderLineItemID,LineItemDesc,OrderedDate,CustomerID)
VALUES
(10001,1,'Sugar','2023-09-06','XYZ10'),
(10001,2,'Daal','2023-09-06','XYZ10'),
(10001,1,'Sugar','2023-09-06','XYZ10'),
(10002,1,'CloseUP','2023-10-15','PQR25'),
(10002,2,'MouthWash','2023-10-15','PQR25'),
(10003,1,'USBCable','2023-01-25','ABC99'),
(10003,1,'USBCable','2023-01-25','ABC99'),
(10003,1,'USBCable','2023-01-25','ABC99')

SELECT * FROM dbo.Sales
---------Detect Duplicate Rows
SELECT *, 
ROW_NUMBER() OVER(PARTITION BY SalesOrderID, SalesOrderLineItemID ORDER BY ID ASC) AS RowNumber
FROM dbo.Sales

----------Delete Duplicate Rows
;WITH CTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY SalesOrderID, SalesOrderLineItemID ORDER BY ID ASC) AS RowNumber
FROM dbo.Sales
)
DELETE FROM CTE WHERE RowNumber >1

SELECT * FROM dbo.Sales

Thanks for reading 🙏.

SQL: Identify & Remove Duplicate data from a SQL Table

Post a Comment

Popular Posts

PySpark Posts

MS SQL Posts

Home

SQL Puzzles

Contact form

SQL: Identify & Remove Duplicate data from a SQL Table

You Might Like

Post a Comment

Popular Posts

Contact form