Practical T-SQL Pocket Guide For Beginners

Lesson 59. Loading A Type II Slowly Changing Dimension With SQL Merge

If you are a regular user of SSIS for ETL work, you are probably aware of the Slowly Changing Dimension transformation. You are probably also keenly aware of its performance limitations. Even if it were the highest performing component, I still would not use it. My personal philosophy when it comes to ETL is to use SSIS for orchestration only; all the heavy lifting of loading and cleaning data, I do in T-SQL.

Loading data into tables is usually a straightforward process. Loading Type II SCDs is anything but straightforward. Fortunately, this job is made easier by the T-SQL MERGE statement. As far as I know, MERGE is the most powerful, complex operation in the entire T-SQL toolbox. It is chock full of features and options that give engineers a lot of flexibility.
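For readers who have not used it before, a minimal sketch of the shape of MERGE is worth having in mind before the full script below. The table and column names here are illustrative only, not part of the script:

```sql
--Skeleton of a MERGE (hypothetical Target/Source tables)
MERGE Target AS t
USING Source AS s
    ON t.BusinessKey = s.BusinessKey
WHEN MATCHED AND s.SomeColumn <> t.SomeColumn THEN
    UPDATE SET SomeColumn = s.SomeColumn        --change existing rows
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, SomeColumn)
    VALUES (s.BusinessKey, s.SomeColumn)        --add new rows
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;                                     --remove rows absent from source
```

Each WHEN branch is optional, which is what makes MERGE flexible enough to drive an SCD load from a single statement.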

Much has already been written on how to load SCDs, and that material is of varying quality. I do not want to spend time going over well-trodden ground.

What I want to do instead is show how this is done in the real world. There is a restriction in SQL Server that many code samples do not address: a MERGE written with the common techniques found on the internet will fail when you attempt to load a Type II SCD this way, because you are not allowed to insert the output of a nested DML statement into a table that participates in a foreign key relationship. The workaround is to stage the output in a temp table first. That technique is shown below.
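In outline, the failure and the workaround look like this. DimExample and FactExample are hypothetical names, and the error text is paraphrased, not quoted verbatim:

```sql
--Hypothetical setup: DimExample is the parent of a fact table.
--CREATE TABLE FactExample (
--    ExampleCK bigint NOT NULL
--        CONSTRAINT FK_FactExample_DimExample REFERENCES DimExample (ExampleCK)
--)

--The common pattern found online inserts straight from the MERGE:
--INSERT INTO DimExample (...)
--SELECT ... FROM (MERGE DimExample AS target ... OUTPUT $action, ...) AS m
--WHERE m.Action_Out = 'UPDATE';

--Against a dimension with a foreign key, that fails with an error along
--the lines of: "The target table of the INSERT statement cannot be on
--either side of a (primary key, foreign key) relationship when the FROM
--clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement."

--Routing the OUTPUT rows through #DimExample (which has no foreign key),
--then copying them into DimExample with a plain INSERT...SELECT,
--sidesteps the restriction. The full script demonstrates this.
```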

To run this code, you will need a copy of SQL Server 2016, whose Developer Edition is free. I explain what I am doing step by step below, but the script can be run in its entirety with a single press of F5.

This is fairly close to how I do ETL in the real world. In general, the pattern is:

  1. Import data to staging.

  2. Load new data into the common model.

  3. Load the data warehouse from the common model.
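A sketch of that three-step pattern in T-SQL. The schema names, file path, and options here are placeholders, not part of the script below:

```sql
--1. Import data to staging (file path and format options are placeholders).
BULK INSERT stage.TickersStage
FROM 'C:\data\tickers.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')

--2. Load new, clean data into the common model.
TRUNCATE TABLE cm.DimTickersCM

INSERT INTO cm.DimTickersCM(Symbol, CompanyName)
SELECT Symbol, CompanyName
FROM stage.TickersStage
WHERE Processed = 0
AND ErrorRecord = 0

--3. Load the data warehouse from the common model (MERGE, as in the script).
```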

The following are differences between the script and a real-world process:

  1. I put common model tables in their own schema.

  2. Staging tables are also grouped by schema according to the source data.

  3. All of the data quality checks have been removed.

  4. Post processing ETL load checks have been removed.

Key to loading SCDs is how you handle effective from and to dates. Some people think that “to” dates are not even necessary, but the to/from scheme is what I encounter the most out in the wild. While they are called SLOWLY changing dimensions, stuff happens. To protect against a change within the same day, I have chosen to end the previous record one second before the new one begins. This approach has its issues:

  • You have overlapping days.

  • Should you have to reload a fact table, overlapping days could cause problems.

  • This approach could play havoc with any BI tool that sits on this data.

I am sure there are more issues than I can think of. At the end of the day, there is no single right answer. This approach works for my use case; under different circumstances, I would use a different solution. The main thing is that executives get their data and leave you alone so you can create more cool things.
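One payoff of keeping both dates: point-in-time lookups reduce to a simple range predicate. A sketch against the DimTickers table built in the script below; the @AsOf value is arbitrary:

```sql
--Which version of each ticker was in effect at a given moment?
DECLARE @AsOf AS DATETIME = '20230615 09:30:00'

SELECT Symbol, CompanyName, EffectiveFrom, EffectiveTo
FROM DimTickers
WHERE @AsOf BETWEEN EffectiveFrom AND EffectiveTo

--The current version needs no date math at all.
SELECT Symbol, CompanyName
FROM DimTickers
WHERE IsMostRecentRecord = 1
```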

Examples

Comprehensive Type II SCD Load Example


USE demo

--Begin by clearing your work area.
IF OBJECT_ID('DimTickers') IS NOT NULL DROP TABLE DimTickers
IF OBJECT_ID('tempdb..#DimTickers') IS NOT NULL DROP TABLE #DimTickers
IF OBJECT_ID('DimTickersCM') IS NOT NULL DROP TABLE DimTickersCM
IF OBJECT_ID('TickersStage') IS NOT NULL DROP TABLE TickersStage

--Set up date variables.
--Every organization should have an arbitrary high and low date
DECLARE @LowDate AS DATETIME = '19000101'
DECLARE @HighDate AS DATETIME = '99991231'

--Now create all the tables we will need.
--Create our stage table
CREATE TABLE TickersStage(
ETLKey uniqueidentifier NOT NULL,
Symbol nvarchar(255) NULL,
CompanyName nvarchar(255) NULL,
UniqueDims varbinary(35) NULL,
UniqueRows varbinary(16) NULL,
SourceSystem nvarchar(255) NULL,
ErrorRecord bit NULL,
Processed bit NULL,
RunDate datetime NULL,
RowHash  AS (CONVERT(binary(16),hashbytes('MD5',CompanyName),0)) PERSISTED,
 CONSTRAINT PK_TickersStage PRIMARY KEY CLUSTERED 
(
ETLKey ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
ALTER TABLE TickersStage ADD  CONSTRAINT DF_TickersStage_ETLKey  DEFAULT (newid()) FOR ETLKey

--Let's create our warehouse table.
--In the real world, this table would have a FK constraint
--as it is the parent to a fact table.
CREATE TABLE DimTickers(
TickersCK bigint IDENTITY(1,1) NOT NULL,
Symbol nvarchar(50) NULL,
CompanyName nvarchar(100) NULL,
CreatedBy nvarchar(50) NULL,
CreatedOn datetime NULL,
UpdatedBy nvarchar(50) NULL,
UpdatedOn datetime NULL,
SourceSystem nvarchar(100) NULL,
SourceSystemKey nvarchar(100) NULL,
EffectiveFrom datetime NULL,
EffectiveTo datetime NULL,
IsMostRecentRecord bit NULL,
RowHash  AS (CONVERT(binary(16),hashbytes('MD5',CompanyName),0)) PERSISTED,
 CONSTRAINT PK_tickers PRIMARY KEY CLUSTERED 
(
TickersCK ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]


--Let's create our common model table
CREATE TABLE DimTickersCM(
TickersCK bigint NULL,
Symbol nvarchar(50) NULL,
CompanyName nvarchar(100) NULL,
CreatedBy nvarchar(50) NULL,
CreatedOn datetime NULL,
UpdatedBy nvarchar(50) NULL,
UpdatedOn datetime NULL,
SourceSystem nvarchar(100) NULL,
SourceSystemKey nvarchar(100) NULL,
EffectiveFrom datetime NULL,
EffectiveTo datetime NULL,
IsMostRecentRecord bit NULL,
RowHash  AS (CONVERT(binary(16),hashbytes('MD5',CompanyName),0)) PERSISTED
)


--Let's create our mirror temp table
CREATE TABLE #DimTickers(
TickersCK bigint NULL,
Symbol nvarchar(50) NULL,
CompanyName nvarchar(100) NULL,
CreatedBy nvarchar(50) NULL,
CreatedOn datetime NULL,
UpdatedBy nvarchar(50) NULL,
UpdatedOn datetime NULL,
SourceSystem nvarchar(100) NULL,
SourceSystemKey nvarchar(100) NULL,
EffectiveFrom datetime NULL,
EffectiveTo datetime NULL,
IsMostRecentRecord bit NULL,
RowHash  AS (CONVERT(binary(16),hashbytes('MD5',CompanyName),0)) PERSISTED
)



--Now we will insert our initial load. 
--Funnily enough, I actually did find an error in my production data. 
--We will use that as an example of something messed up.
INSERT INTO TickersStage(Symbol,CompanyName,SourceSystem,ErrorRecord,Processed,RunDate)
SELECT 'AAPL','Apple Inc','Yahoo',0,0,CURRENT_TIMESTAMP
UNION
SELECT 'UMBF','Ump Financial Corp','Yahoo',0,0,CURRENT_TIMESTAMP --This is an actual error I pulled from prod!
UNION
SELECT 'ACN','Accenture Plc','Yahoo',0,0,CURRENT_TIMESTAMP


--Now we will begin the process of actually loading the warehouse.
--Move data from staging to common model.
TRUNCATE TABLE DimTickersCM

INSERT INTO DimTickersCM(
Symbol,
CompanyName,
SourceSystem,
SourceSystemKey
)
SELECT
Symbol,
CompanyName,
SourceSystem,
Symbol AS SourceSystemKey
FROM TickersStage
WHERE Processed = 0
AND ErrorRecord = 0


--Handle New Records
MERGE DimTickers AS target
USING (
SELECT
Symbol,
CompanyName,
SourceSystem,
Symbol AS SourceSystemKey
FROM DimTickersCM
) AS source
ON target.SourceSystemKey = source.SourceSystemKey

WHEN NOT MATCHED THEN
INSERT (
Symbol,
CompanyName,
SourceSystem,
SourceSystemKey,
EffectiveFrom,
EffectiveTo,
IsMostRecentRecord,
CreatedBy,
CreatedOn
)
VALUES (
Symbol,
CompanyName,
SourceSystem,
SourceSystemKey,
@LowDate,
@HighDate,
1,
SYSTEM_USER,
CURRENT_TIMESTAMP
);

--Let’s check our work so far. You should get three records in DimTickers.
SELECT * FROM DimTickers


--That's our initial load. Let's go to day 2 and fix that
--embarrassing data entry error for UMB!
TRUNCATE TABLE TickersStage

INSERT INTO TickersStage(Symbol,CompanyName,SourceSystem,ErrorRecord,Processed,RunDate)
SELECT 'UMBF','UMB Financial Corp','Yahoo',0,0,CURRENT_TIMESTAMP --Let's fix the company name so it's correct

--Now to fix things.
TRUNCATE TABLE DimTickersCM

INSERT INTO DimTickersCM(
Symbol,
CompanyName,
SourceSystem,
SourceSystemKey
)
SELECT
Symbol,
CompanyName,
SourceSystem,
Symbol AS SourceSystemKey
FROM TickersStage
WHERE Processed = 0
AND ErrorRecord = 0



--Handle changed records

--Here is where things start to differ
--We are going to use a MERGE like normal, 
--but we are going to add an OUTPUT statement 
--and dump the results to a temporary table 
--that has no foreign key constraint
--which we'll later flush to prod.
INSERT INTO #DimTickers(
Symbol,
CompanyName,
SourceSystem,
SourceSystemKey,
EffectiveFrom,
EffectiveTo,
IsMostRecentRecord,
CreatedBy,
CreatedOn
)
SELECT
Symbol,
CompanyName,
SourceSystem,
SourceSystemKey,
EffectiveFrom,
EffectiveTo,
IsMostRecentRecord,
CreatedBy,
CreatedOn 
FROM(--Here we're using the MERGE statement as a derived table.
MERGE DimTickers AS target
USING (
SELECT
Symbol,
CompanyName,
SourceSystem,
Symbol AS SourceSystemKey, 
RowHash
FROM DimTickersCM
) AS source
ON target.SourceSystemKey  = source.SourceSystemKey 
WHEN MATCHED
AND source.RowHash <> target.RowHash --Look for changed data
AND target.IsMostRecentRecord = 1
THEN
UPDATE
SET
UpdatedBy = SYSTEM_USER,
UpdatedOn = CURRENT_TIMESTAMP,
EffectiveTo = DATEADD(ss,-1,CURRENT_TIMESTAMP), -- Make sure things don't overlap. There is some controversy doing it this way.
IsMostRecentRecord = 0
OUTPUT --Output columns need to match your select statement
$action Action_Out,
source.Symbol,
source.CompanyName,
source.SourceSystem,
source.SourceSystemKey,
CURRENT_TIMESTAMP AS EffectiveFrom,
@HighDate AS EffectiveTo,
1 AS IsMostRecentRecord,
SYSTEM_USER AS CreatedBy,
CURRENT_TIMESTAMP AS CreatedOn
) AS MERGE_OUT
WHERE MERGE_OUT.Action_Out = 'UPDATE'
;

--Now we flush the data in the temp table into prod.
INSERT INTO DimTickers(
Symbol,
CompanyName,
SourceSystem,
SourceSystemKey,
EffectiveFrom,
EffectiveTo,
IsMostRecentRecord,
CreatedBy,
CreatedOn
)
SELECT
Symbol,
CompanyName,
SourceSystem,
SourceSystemKey,
EffectiveFrom,
EffectiveTo,
IsMostRecentRecord,
CreatedBy,
CreatedOn
FROM #DimTickers


--Go check DimTickers again. You should have four records now - three original and one extra for UMB. 
SELECT * FROM DimTickers ORDER BY Symbol, EffectiveTo 

--Do not forget to clean up your environment.
DROP TABLE DimTickers
DROP TABLE DimTickersCM
DROP TABLE #DimTickers
DROP TABLE TickersStage