Comment on page
Lesson 52. Finding Duplicate Records In A Table
Sometimes data in a data warehouse can get duplicated. There are a lot of reasons for this, including but not limited to:
The data pull may have a join that makes duplicating data unavoidable. There may be an error in the source system. There might be duplicate data in a staging table, and nobody wrote a check for that. There might be no master data management in place.
Whatever the reason, having duplicate data in your data warehouse means your numbers may not add up right. The best place to get rid of dups is in the ETL process, but sometimes you find mistakes in data after you load it to prod. When you are confident in your code, and you are sure any duplicates are in fact duplicate records from a business perspective, you will need a way to get rid of them. The technique below utilizes the ROW_NUMBER() function.
This example was developed and tested on SQL Server 2016 Development Edition.
Extensive Duplicate Record Removal Process
Let’s create our tables. We’re going to create two tables: one fact table and one dimension table.
Run the following script.
In [ ]:
USE demo
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.FactProductSales') AND type in (N'U'))
ALTER TABLE dbo.FactProductSales DROP CONSTRAINT IF EXISTS FK_ProductSales_Products
GO
DROP TABLE IF EXISTS dbo.FactProductSales
GO
--Create our product table
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.DimProducts') AND type in (N'U'))
ALTER TABLE dbo.DimProducts DROP CONSTRAINT IF EXISTS FK_Products_Products
GO