What is Duplicate Data?
- Duplicate data, as the name suggests, refers to the presence of identical records in a table. OR
- We label a row as duplicated when it is inserted into a table again, even though an identical entry already exists within that table
How to identify Duplicate data?
- Duplicate rows are typically identified by referencing the primary or unique key column within the table.
- However, in cases where the table lacks any key columns, it is recommended to examine all the columns to accurately pinpoint duplicate data.
For the demonstration purpose let's create a sales table with few records,
Here, To detect the duplicate rows we are considering SalesOrderID and SalesOrderLineItemID columns as unique Key columns and using the SQL Window function "ROW_NUMBER()".
To Delete the duplicate data, we use the unique behavior of "Common Table Expression (CTE)" in the SQL Server which is, if a CTE is created with only one table and any DML operations performed on that CTE will directly be reflected on to the underlying physical table. let us understand this with a demonstration.
Conclusion:
The process of identifying and eliminating duplicate data from an SQL table is made straightforward with the ROW_NUMBER() Window function. Alternatively, one can employ the GROUP BY clause to filter for unique records, although it's important to note that this approach creates a distinct dataset without physically deleting the data from the table
Please find below the complete code used in this blog post:
CREATE TABLE dbo.Sales
(
ID INT IDENTITY(1,1) NOT NULL,
SalesOrderID BIGINT NOT NULL,
SalesOrderLineItemID INT NOT NULL,
LineItemDesc NVARCHAR(250) NULL,
OrderedDate DATE NULL,
CustomerID NVARCHAR(250) NULL
)
INSERT INTO dbo.Sales
(SalesOrderID,SalesOrderLineItemID,LineItemDesc,OrderedDate,CustomerID)
VALUES
(10001,1,'Sugar','2023-09-06','XYZ10'),
(10001,2,'Daal','2023-09-06','XYZ10'),
(10001,1,'Sugar','2023-09-06','XYZ10'),
(10002,1,'CloseUP','2023-10-15','PQR25'),
(10002,2,'MouthWash','2023-10-15','PQR25'),
(10003,1,'USBCable','2023-01-25','ABC99'),
(10003,1,'USBCable','2023-01-25','ABC99'),
(10003,1,'USBCable','2023-01-25','ABC99')
SELECT * FROM dbo.Sales
---------Detect Duplicate Rows
SELECT *,
ROW_NUMBER() OVER(PARTITION BY SalesOrderID, SalesOrderLineItemID ORDER BY ID ASC) AS RowNumber
FROM dbo.Sales
----------Delete Duplicate Rows
;WITH CTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY SalesOrderID, SalesOrderLineItemID ORDER BY ID ASC) AS RowNumber
FROM dbo.Sales
)
DELETE FROM CTE WHERE RowNumber >1
SELECT * FROM dbo.Sales
Thanks for reading 🙏.
No comments:
Post a Comment