This post completes my mini-series on implementing the Kimball-based SSIS patterns described by John Welch in his SQLBits session on Handling Advanced Data Warehouse Scenarios in SSIS.
The task here is how to best implement inserts and updates to slowly changing dimensions with Type 6 attributes. The Type 6 name was coined because this method borrows from the Type 1, Type 2 and Type 3 SCD techniques (and 1+2+3 = 6). Basically (i) full history of the Type 6 attribute for a particular business key is captured by adding a new row for each change (Type 2), (ii) there are 2 columns representing the attribute in a row – one for the latest value and one for the prior (in this case original) value (Type 3), and then (iii) any time the attribute changes again, the latest value column of all history rows for the same business key gets overwritten to be the new latest value (Type 1). (Also see here for a description of hybrid dimensions on Ralph Kimball’s website.)
The benefit of doing this is that the data warehouse can be filtered or grouped by any of an attribute’s historical values or by its current value.
The Product dimension used for this example will begin empty and then be loaded with 3 successive sets of data:
So it can be seen that the category is changing for each novel for each load. The end result of these loads after using the Type 6 logic in the SSIS project to follow is a Product dimension that looks like:
I’ve added the surrogate key (ProductKey) and CurrentRow columns, which weren’t in Welch’s implementation (for simplicity I presume). To avoid clouding the core logic here with surrogate key creation though, I’ve just used the Identity property on the Product table’s ProductKey column, whereas in practice creating the surrogate key value within the data flow (e.g. via a script component – see here) can be a better alternative, especially when the value of the key being generated needs to be known at the time of creation.
The SSIS control flow and data flow for implementing the above are:
The first and third tasks in the control flow truncate and read from a staging table that’s used to implement set-based updates to the Category column in the Product dimension. Specifically it sets this column for all existing rows affected by the load to be the new category. It also sets the CurrentRow indicator column to ‘N’ for these same rows. This is the SQL:
SET Category = S.Category,
CurrentRow = ‘N’
FROM dimProductStage AS S
WHERE dimProduct.BookID = S.ID
AND dimProduct.Category <> S.Category
For the data flow, the initial lookup does a fully cached key lookup on the Product dimension to see whether the rows coming in are new or not. If they are new then they go down the No Match path, have their CategoryHistory column set to be the same as their Category column, and are inserted into the dimension with CurrentRow = ‘Y’.
If they are existing rows, then a simple check is made to see whether the category column retrieved by the lookup is the same as the incoming category, and if not then the row goes both into the staging table (previously discussed) and into the Product dimension as a new insert, the latter with CurrentRow = ‘Y’.
This is a relatively simple, fast pattern which can be extended as required for more complex ETL processing.
As an aside from the specifics of this particular implementation, the use of Type 6 SCD attributes in dimensional modelling in general should be treated with some caution, since although they provide for flexible interrogation by reporting queries, the complexity they add to the model may make it difficult for some users to understand, in addition to the need for a more complex ETL.