Updating Type 2 SCD Attributes using the T-SQL Merge Statement

Andrew Smith Posted on

During Matt Masson’s SSIS Performance Design Patterns pre-con day at SQLBits last year, Matt discussed the use of the T-SQL Merge statement for implementing Type 2 SCD attribute updates.

In one of his posts on Microsoft’s SSIS Team Blog, he references the source of this idea as being a design tip on the Kimball Group’s website – see here for that tip.

In SSIS environments, the use of Merge for Type 2 updates can be considered alongside the native SSIS Slowly Changing Dimension component, as well as with third-party SSIS SCD components such as Todd McDermid’s SSIS Dimension Merge SCD Component.

Performance is a key criteria to consider when evaluating alternatives – the native SSIS SCD component is known to not be performant for non-trivial dimensions, due to it using non-cached lookups and row-by-row updates, whereas McDermid’s component is stated as being 100x faster. The latter still relies on the use of row-by-row updates via the SSIS OLE DB Command component to perform updates for Type 1 and Type 2 changes though, so it will generally not outperform the T-SQL Merge technique.

Setup and maintainability are other important criteria to consider when evaluating SCD update options – the SSIS components have the benefit of a GUI to make setup generally easier than constructing a Merge statement, although with the native SSIS component the wizard for setting it up is ‘one way’, i.e. running it again to change column selections means any customisations made to the output of its previous run will be lost.

Another factor to consider is where the source data is located. The T-SQL Merge statement relies upon the source data for the update being in a table, whereas the SSIS components mentioned above can take it from any SSIS source.

For demonstrations of all 3 SSIS SCD update alternatives outlined above, see the YouTube videos on Todd McDermid’s CodePlex page here.

I’m specifically demonstrating the Merge technique in this post, and I’m doing it within this simple SSIS control flow:

MergeSCDControlFlow

The first 2 tasks set up the staging table containing the dimension updates that will be used by the Merge statement, and the 3rd task contains the Merge T-SQL:

MergeSCDTSQL

There are 3 T-SQL statements here, rather than just a single Merge statement, because this implementation solves a problem that occurs when the Merge statement is used on its own and where the target dimension table is involved in a foreign key relationship. An example of this error is:

Msg 356, Level 16, State 1, Line 68
The target table ‘dimProduct’ of the INSERT statement cannot be on either side of a (primary key, foreign key) relationship when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. Found reference constraint ‘fk_ProductKey’.

So what happens here instead is that the inserts resulting from Type 2 updates to existing rows go into a staging table, and then in the final statement of the batch they are loaded into the dimension table.

Note also this section from the Merge statement:

MergeSCDTSQLNullCheck

This idea comes from this post, and replaces the more conventional construct for considering NULLs in the source or target columns. When multiple columns are being considered for Type 2 updates, the SQL can get quite clunky for working out which attributes have changed for a particular business key, e.g. for just 2 columns we might have:

WHEN MATCHED
  AND (Target.CustomerName <> Source.CustomerName
  OR Target.CustomerName IS NULL
  OR Source.CustomerName IS NULL
  OR Target.Planet <> Source.Planet
  OR Target.Planet IS NULL
  OR Source.Planet IS NULL)

In addition if both source and target values are NULL for a column, then this expression evaluates to True, which isn’t what’s desired since there’s been no change. A more elegant solution, which also copes with both source and target being NULL is:

WHEN MATCHED
  AND EXISTS
      (SELECT Source.CustomerName, Source.Planet
       EXCEPT
       SELECT Target.CustomerName, Target.Planet)

The remainder of the Merge statement from the SSIS Execute SQL task above is relatively straight forwards – see this excellent series of blog posts if required for more details on how to use Merge.

Note that SCD Type 1 attribute changes could also be handled within the T-SQL shown above, by following the Merge logic with an Update statement that reads from the same staging table.

The results from running the SSIS package above are shown below. The initial dimension table is as follows, where Category is an SCD Type 2 attribute:

MergeSCDStart

The source for the dimension update is this:

MergeSCDSource

And the output of the package is this:

MergeSCDResults