Tutorials
Practical T-SQL
Practical T-SQL
  • Practical T-SQL Pocket Guide For Beginners
  • Preface
    • Section I. A Note From The Author
    • Section II. Tutorial Overview
    • Section III. Running The Examples
    • Section IV. How To Report An Issue
    • Section V. Join The MSU Community On Discord
    • Section VI. Supplementary Material
  • Language Basics
    • Lesson 1. Obligatory Hello World
    • Lesson 2. Code Comments With T-SQL
    • Lesson 3. Basic Syntax
    • Lesson 4. Your First Query
    • Lesson 5. Filtering Data
    • Lesson 6. Sorting Data
    • Lesson 7. Complex Data Filtering
    • Lesson 8. Aliases
    • Lesson 9. String Functions
    • Lesson 10. Creating New Columns From Existing Data (Calculated Fields)
    • Lesson 11. Displaying Data Based On Conditions (Case Statement)
    • Lesson 12. Aggregate Functions
    • Lesson 13. Grouping And Summarizing Data
    • Lesson 14. Querying More Than One Table
    • Lesson 15. Combining Queries
    • Lesson 16. Subqueries
    • Lesson 17. Creating Data
    • Lesson 18. Updating Data
    • Lesson 19. Deleting Data
    • Lesson 20. Common Table Expressions (CTEs)
    • Lesson 21. Derived Tables
    • Lesson 22. Putting It All Together
  • Advanced Topics
    • Lesson 23. Selecting Unique Values
    • Lesson 24. Updating Data With A Join
    • Lesson 25. Data Types
    • Lesson 26. Casting Data Types
    • Lesson 27. Creating Tables
    • Lesson 28. Altering Tables
    • Lesson 29. Dropping Tables
    • Lesson 30. Variables
    • Lesson 31. Controlling Flow
    • Lesson 32. Looping
    • Lesson 33. Error Processing
    • Lesson 34. Temporary Tables
    • Lesson 35. Views
    • Lesson 36. Indexed Views
    • Lesson 37. User Defined Functions
    • Lesson 38. Stored Procedures
    • Lesson 39. BULK INSERT
    • Lesson 40. Loading Tables With MERGE
    • Lesson 41. Partitioning A Dataset
    • Lesson 42. Pivoting Data
    • Lesson 43. Dynamic SQL
    • Lesson 44. Cursors
  • Solutions To Real World Problems
    • Lesson 45. Listing All Tables In A SQL Server Database
    • Lesson 46. Listing All Columns In A SQL Server Database
    • Lesson 47. Pull Records From A Table At Random
    • Lesson 48. A Better Alternative To WITH (NOLOCK)
    • Lesson 49. Boost Performance When Calling A Stored Proc From SSIS
    • Lesson 50. Setting Up Queries For Ablation Testing
    • Lesson 51. Reduce Code And Save Time With Default Column Values
    • Lesson 52. Finding Duplicate Records In A Table
    • Lesson 53. Why You Cannot Have More Than One Clustered Index On A Table
    • Lesson 54. Converting Dates To YYYYMMDD
    • Lesson 55. Sending Notification Emails With T-SQL Without Using Hardcoded Email Addresses
    • Lesson 56. Troubleshooting Long Running Queries
    • Lesson 57. Loading Large CSVs Into Data Warehouse Staging Tables
    • Lesson 58. The Only Bloody Good Reason To Use Cursors
    • Lesson 59. Loading A Type II Slowly Changing Dimension With SQL Merge
    • Lesson 60. A Clearer Explanation Of The Parameters Of The Numeric Data Type
    • Lesson 61. Why You Cannot Join On Null Values
    • Lesson 62. A Deep Dive On How The Where Clause Functions
    • Lesson 63. Using HASHBYTES() To Compare Character Strings
    • Lesson 64. Using Pipe To Hash Multiple Columns For Matching
    • Lesson 65. Why People That Indent Code Drive Me Nuts
    • Lesson 66. How To Rapidly Stand Up A Data Warehouse From Scratch
    • Lesson 67. How To Pivot Data With T-SQL When Columns Are Not Predefined
    • Lesson 68. Prepopulating A Junk Dimension
Powered by GitBook
On this page
  1. Solutions To Real World Problems

Lesson 52. Finding Duplicate Records In A Table

Sometimes data in a data warehouse can get duplicated. There are a lot of reasons for this, including but not limited to:

The data pull may have a join that makes duplicating data unavoidable. There may be an error in the source system. There might be duplicate data in a staging table, and nobody wrote a check for that. There might be no master data management in place.

Whatever the reason, having duplicate data in your data warehouse means your numbers may not add up right. The best place to get rid of dups is in the ETL process, but sometimes you find mistakes in data after you load it to prod. When you are confident in your code, and you are sure any duplicates are in fact duplicate records from a business perspective, you will need a way to get rid of them. The technique below utilizes the ROW_NUMBER() function.

This example was developed and tested on SQL Server 2016 Development Edition.

Examples

Extensive Duplicate Record Removal Process

Let’s create our tables. We’re going to create two tables: one fact table and one dimension table.

Run the following script.

In [ ]:

USE demo

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.FactProductSales') AND type in (N'U'))
ALTER TABLE dbo.FactProductSales DROP CONSTRAINT IF EXISTS FK_ProductSales_Products
GO

DROP TABLE IF EXISTS dbo.FactProductSales
GO

--Create our product table
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.DimProducts') AND type in (N'U'))
ALTER TABLE dbo.DimProducts DROP CONSTRAINT IF EXISTS FK_Products_Products
GO

DROP TABLE IF EXISTS dbo.DimProducts
GO


SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.DimProducts') AND type in (N'U'))
BEGIN
CREATE TABLE dbo.DimProducts(
ProductID bigint IDENTITY(1,1) NOT NULL,
ProductName nvarchar(50) NULL,
 CONSTRAINT PK_Products PRIMARY KEY CLUSTERED 
(
	ProductID ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
END
GO

IF NOT EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'dbo.FK_Products_Products') AND parent_object_id = OBJECT_ID(N'dbo.DimProducts'))
ALTER TABLE dbo.DimProducts  WITH CHECK ADD  CONSTRAINT FK_Products_Products FOREIGN KEY(ProductID)
REFERENCES dbo.DimProducts (ProductID)
GO

IF  EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'dbo.FK_Products_Products') AND parent_object_id = OBJECT_ID(N'dbo.DimProducts'))
ALTER TABLE dbo.DimProducts CHECK CONSTRAINT FK_Products_Products
GO


--Create our product sales table
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.FactProductSales') AND type in (N'U'))
BEGIN
CREATE TABLE dbo.FactProductSales(
InvoiceID bigint IDENTITY(1,1) NOT NULL,
ProductID bigint NOT NULL,
Price money NOT NULL,
ItemCount int NOT NULL,
InvoiceTotal  AS (Price*ItemCount),
 CONSTRAINT PK_ProductSales PRIMARY KEY CLUSTERED 
(
	InvoiceID ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
END
GO

IF NOT EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'dbo.FK_ProductSales_Products') AND parent_object_id = OBJECT_ID(N'dbo.FactProductSales'))
ALTER TABLE dbo.FactProductSales  WITH CHECK ADD  CONSTRAINT FK_ProductSales_Products FOREIGN KEY(ProductID)
REFERENCES dbo.DimProducts (ProductID)
GO

IF  EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'dbo.FK_ProductSales_Products') AND parent_object_id = OBJECT_ID(N'dbo.FactProductSales'))
ALTER TABLE dbo.FactProductSales CHECK CONSTRAINT FK_ProductSales_Products
GO

Expand the database in the object explorer and take a look at the tables that got created. One table called DimProducts holds our products and the other table called FactProductSales holds the sales transactions of those items.

Now let’s create some sample data. Run this script.

In [ ]:

USE demo

INSERT INTO DimProducts(ProductName)
SELECT 'Microsoft Office 365'
UNION ALL
SELECT 'Microsoft Access 2013'
UNION ALL
SELECT 'Microsoft Access 2013'
UNION ALL
SELECT 'Microsoft SQL Server 2012'
UNION ALL
SELECT 'Microsoft SQL Server 2012'
UNION ALL
SELECT 'Microsoft SQL Server 2012'
UNION ALL
SELECT 'Microsoft Visual Studio 2013'
UNION ALL
SELECT 'Microsoft Visual Studio 2013'

--let's check our work
SELECT * FROM DimProducts

--Now let's load some sales
DECLARE @TempSales TABLE (Price MONEY, ItemCount INT)
INSERT INTO @TempSales
SELECT 300.00, 5
UNION
SELECT 520.00, 10

--Each record in @TempSales doesn't represent a transaction.
--We just want to have more than one invoice for a particular
--product.
INSERT INTO FactProductSales(ProductID, Price, ItemCount)
SELECT p.ProductID, ts.Price, ts.ItemCount
FROM DimProducts p
CROSS JOIN @TempSales ts

--let's check our work
SELECT * FROM FactProductSales ORDER BY ProductID

--And with the join
SELECT p.ProductName, ps.ProductID, ps.InvoiceID AS InvoiceNumber, ps.Price, ps.ItemCount, ps.InvoiceTotal
FROM DimProducts p
JOIN FactProductSales ps
ON p.ProductID = ps.ProductID
ORDER BY p.ProductName

This generates three outputs so you can see the results of the table load operation. The first output is from the products dimension table where you can see that there are duplicates in the data. The second is from the product sales fact table. If you’re sharp, you can see that there are fact records attached to the duplicate dimension records. The final query shows the results of a join between the two tables where you can again see the impact of having duplicate dimension records.

Since we know that there are duplicate records in the dimension table, we now have to get rid of those records, but we cannot do it by simply deleting them. First, we have to reassign those fact records to the prime dimension record, then we can blow away the unnecessary dimension records. Picking which dimension record is the prime dimension record is complex and beyond the scope of this tutorial. For now, we’ll just choose an arbitrary record and call that the prime record.

There are two key SQL objects that we’re going to use to make this change happen: CTEs, and the ROW_NUMBER() function. These are both complex objects and merit further discussion. For now, just go on faith and follow this example. When you get a minute, you can read the TSQL documentation. CTEs are fairly common, but the ROW_NUMBER() function might be new to you.

The process of de-duplication works like this. First, we group and number the records in the dimension table. The numbering will tell you if there is more than one instance of a specific record. We use ROW_NUMBER() to assign ordinal values to the dimension records and group the counting by ProductName. We load these records into a table variable, so they are easy to work with. Any records with a Row Number value of 1 are original records. Any record with a Row Number value greater than 1 is a duplicate.

The next step is to pair duplicate record IDs with prime record IDs. We do this so we can reassign the duplicate record IDs in the fact table to the proper prime dimension record. We do that by being clever with some table variables, a CTE, and a join.

Now we can finally fix our tables. From here, it’s a pretty straight-forward process to update our fact table with the proper IDs. Once that is done, we’re clear to blow away our duplicates in our dimension table. Run the script below.

In [ ]:

USE demo

--Looks peachy but we know some of those product IDs are bogus.
--Let's fix it.

--First we need keep track of the dups
--If you have a large amount of dups, use a
--temp table instead.
DECLARE @DuplicateProductIDs TABLE(RowNumber INT, ProductID BIGINT, ProductName NVARCHAR(50))

INSERT INTO @DuplicateProductIDs(RowNumber, ProductID, ProductName)
SELECT ROW_NUMBER() OVER (PARTITION BY ProductName ORDER BY ProductName) AS RowNumber, ProductID, ProductName
FROM DimProducts

SELECT * FROM @DuplicateProductIDs

--Let's fix our child table so the records point to the right product
--First we need to transform the Dup ID info so we know which dup ID
--belongs to the original ID
DECLARE @RealProductIDsWithDupes TABLE(OriginalProductID BIGINT, DuplicateProductID BIGINT, ProductName NVARCHAR(50))

;
WITH OriginalProductIDs(OriginalProductID, ProductName)
AS(
SELECT ProductID, ProductName
FROM @DuplicateProductIDs
WHERE RowNumber = 1
),
DuplicateProductIDs(DuplicateProductID, ProductName)
AS(
SELECT ProductID, ProductName
FROM @DuplicateProductIDs
WHERE RowNumber <> 1
)
INSERT INTO @RealProductIDsWithDupes(OriginalProductID, DuplicateProductID, ProductName)
SELECT opid.OriginalProductID, dpid.DuplicateProductID, opid.ProductName
FROM OriginalProductIDs opid
JOIN DuplicateProductIDs dpid
ON opid.ProductName = dpid.ProductName

--check
SELECT * FROM @RealProductIDsWithDupes

--Now let's fix our child table

--Before
SELECT * FROM FactProductSales ORDER BY ProductID

UPDATE ps
SET ps.ProductID = rpids.OriginalProductID
FROM FactProductSales ps
JOIN @RealProductIDsWithDupes rpids
ON rpids.DuplicateProductID = ps.ProductID

--After
--Number of rows should not have changed but the
--number of distinct product IDs should be reduced.
SELECT * FROM FactProductSales ORDER BY ProductID

--Now that we won't violate any FK restraints,
--we can safely blow away dups in the parent

DELETE 
FROM DimProducts
WHERE ProductID IN(
SELECT ProductID
FROM @DuplicateProductIDs
WHERE RowNumber <> 1
)

--Check
SELECT * FROM DimProducts

--Final result
SELECT ps.InvoiceID AS InvoiceNumber, ps.ProductID, p.ProductName,  ps.Price, ps.ItemCount, ps.InvoiceTotal
FROM DimProducts p
JOIN FactProductSales ps
ON p.ProductID = ps.ProductID
ORDER BY p.ProductID


--Clean up your mess
--DROP TABLE FactProductSales
--DROP TABLE DimProducts

This was a complex example. Ideally, you want to get rid of duplicate records early in the pipeline. The most common case I find is getting rid of them in the staging tables. However, this is not always possible.

The techniques taught here can be generalized to other use cases. Below is the example code in its entirety.

In [ ]:

USE demo

--Create our product table
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.DimProducts') AND type in (N'U'))
ALTER TABLE dbo.DimProducts DROP CONSTRAINT IF EXISTS FK_Products_Products
GO

DROP TABLE IF EXISTS dbo.DimProducts
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.DimProducts') AND type in (N'U'))
BEGIN
CREATE TABLE dbo.DimProducts(
ProductID bigint IDENTITY(1,1) NOT NULL,
ProductName nvarchar(50) NULL,
 CONSTRAINT PK_Products PRIMARY KEY CLUSTERED 
(
	ProductID ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
END
GO

IF NOT EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'dbo.FK_Products_Products') AND parent_object_id = OBJECT_ID(N'dbo.DimProducts'))
ALTER TABLE dbo.DimProducts  WITH CHECK ADD  CONSTRAINT FK_Products_Products FOREIGN KEY(ProductID)
REFERENCES dbo.DimProducts (ProductID)
GO

IF  EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'dbo.FK_Products_Products') AND parent_object_id = OBJECT_ID(N'dbo.DimProducts'))
ALTER TABLE dbo.DimProducts CHECK CONSTRAINT FK_Products_Products
GO


--Create our product sales table
GO

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.FactProductSales') AND type in (N'U'))
ALTER TABLE dbo.FactProductSales DROP CONSTRAINT IF EXISTS FK_ProductSales_Products
GO

DROP TABLE IF EXISTS dbo.FactProductSales
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.FactProductSales') AND type in (N'U'))
BEGIN
CREATE TABLE dbo.FactProductSales(
InvoiceID bigint IDENTITY(1,1) NOT NULL,
ProductID bigint NOT NULL,
Price money NOT NULL,
ItemCount int NOT NULL,
InvoiceTotal  AS (Price*ItemCount),
 CONSTRAINT PK_ProductSales PRIMARY KEY CLUSTERED 
(
	InvoiceID ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
END
GO

IF NOT EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'dbo.FK_ProductSales_Products') AND parent_object_id = OBJECT_ID(N'dbo.FactProductSales'))
ALTER TABLE dbo.FactProductSales  WITH CHECK ADD  CONSTRAINT FK_ProductSales_Products FOREIGN KEY(ProductID)
REFERENCES dbo.DimProducts (ProductID)
GO

IF  EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'dbo.FK_ProductSales_Products') AND parent_object_id = OBJECT_ID(N'dbo.FactProductSales'))
ALTER TABLE dbo.FactProductSales CHECK CONSTRAINT FK_ProductSales_Products
GO

INSERT INTO DimProducts(ProductName)
SELECT 'Microsoft Office 365'
UNION ALL
SELECT 'Microsoft Access 2013'
UNION ALL
SELECT 'Microsoft Access 2013'
UNION ALL
SELECT 'Microsoft SQL Server 2012'
UNION ALL
SELECT 'Microsoft SQL Server 2012'
UNION ALL
SELECT 'Microsoft SQL Server 2012'
UNION ALL
SELECT 'Microsoft Visual Studio 2013'
UNION ALL
SELECT 'Microsoft Visual Studio 2013'

--let's check our work
SELECT * FROM DimProducts

--Now let's load some sales
DECLARE @TempSales TABLE (Price MONEY, ItemCount INT)
INSERT INTO @TempSales
SELECT 300.00, 5
UNION
SELECT 520.00, 10

--Each record in @TempSales doesn't represent a transaction.
--We just want to have more than one invoice for a particular
--product.
INSERT INTO FactProductSales(ProductID, Price, ItemCount)
SELECT p.ProductID, ts.Price, ts.ItemCount
FROM DimProducts p
CROSS JOIN @TempSales ts

--let's check our work
SELECT * FROM FactProductSales ORDER BY ProductID

--And with the join
SELECT p.ProductName, ps.ProductID, ps.InvoiceID AS InvoiceNumber, ps.Price, ps.ItemCount, ps.InvoiceTotal
FROM DimProducts p
JOIN FactProductSales ps
ON p.ProductID = ps.ProductID
ORDER BY p.ProductName


--Looks peachy but we know some of those product IDs are bogus.
--Let's fix it.

--First we need keep track of the dups
--If you have a large amount of dups, use a
--temp table instead.
DECLARE @DuplicateProductIDs TABLE(RowNumber INT, ProductID BIGINT, ProductName NVARCHAR(50))

INSERT INTO @DuplicateProductIDs(RowNumber, ProductID, ProductName)
SELECT ROW_NUMBER() OVER (PARTITION BY ProductName ORDER BY ProductName) AS RowNumber, ProductID, ProductName
FROM DimProducts

SELECT * FROM @DuplicateProductIDs

--Let's fix our child table so the records point to the right product
--First we need to transform the Dup ID info so we know which dup ID
--belongs to the original ID
DECLARE @RealProductIDsWithDupes TABLE(OriginalProductID BIGINT, DuplicateProductID BIGINT, ProductName NVARCHAR(50))

;
WITH OriginalProductIDs(OriginalProductID, ProductName)
AS(
SELECT ProductID, ProductName
FROM @DuplicateProductIDs
WHERE RowNumber = 1
),
DuplicateProductIDs(DuplicateProductID, ProductName)
AS(
SELECT ProductID, ProductName
FROM @DuplicateProductIDs
WHERE RowNumber <> 1
)
INSERT INTO @RealProductIDsWithDupes(OriginalProductID, DuplicateProductID, ProductName)
SELECT opid.OriginalProductID, dpid.DuplicateProductID, opid.ProductName
FROM OriginalProductIDs opid
JOIN DuplicateProductIDs dpid
ON opid.ProductName = dpid.ProductName

--check
SELECT * FROM @RealProductIDsWithDupes

--Now let's fix our child table

--Before
SELECT * FROM FactProductSales ORDER BY ProductID

UPDATE ps
SET ps.ProductID = rpids.OriginalProductID
FROM FactProductSales ps
JOIN @RealProductIDsWithDupes rpids
ON rpids.DuplicateProductID = ps.ProductID

--After
--Number of rows should not have changed but the
--number of distinct product IDs should be reduced.
SELECT * FROM FactProductSales ORDER BY ProductID

--Now that we won't violate any FK restraints,
--we can safely blow away dups in the parent

DELETE 
FROM DimProducts
WHERE ProductID IN(
SELECT ProductID
FROM @DuplicateProductIDs
WHERE RowNumber <> 1
)

--Check
SELECT * FROM DimProducts

--Final result
SELECT ps.InvoiceID AS InvoiceNumber, ps.ProductID, p.ProductName,  ps.Price, ps.ItemCount, ps.InvoiceTotal
FROM DimProducts p
JOIN FactProductSales ps
ON p.ProductID = ps.ProductID
ORDER BY p.ProductID



--Clean up your mess
DROP TABLE FactProductSales
DROP TABLE DimProducts
PreviousLesson 51. Reduce Code And Save Time With Default Column ValuesNextLesson 53. Why You Cannot Have More Than One Clustered Index On A Table

Last updated 3 years ago