Inserting Facts with Late Arriving Dimensions

Andrew Smith Posted on

In John Welch’s SQLBits session on Handling Advanced Data Warehouse Scenarios in SSIS, he describes a pattern for handling late arriving dimension members for dimensional data warehouses.

The pattern is oriented around high performance, and so avoids row-by-row SQL updates and non-cached lookups.

The bulk of the logic is concerned with identifying missing dimension members and generating stub inserts for them, where only one stub is created even if the same missing dimension member is referenced many times within a load. So if we start with this customer dimension table:

LADStartDim

and load from this (simple!) fact table source:

LADStartFactSource

then the resulting dimension table is:

LADEndDim

So the late arriving DE3926 customer with 2 sales only appears once as an inferred member in the dimension table.

The control flow for the pattern is:

LADControlFlow

and the data flow (showing execution of the data load above) is:

LADDataFlow

For the data flow, the single record on the Lookup Match Output path to the left is the existing AB2394 customer, the single record going down the middle ProcessedRecord path is the 2nd instance of the new DE3926 customer, and the 2 records going down the NewRecord path contain details for the DE3926 and FF7726 inferred member inserts.

Key to this pattern working is the use of the SSIS script component. Custom C# within this component is:

LADScript1

and:

LADScript2

The 2 private members are (i) an instance of a dictionary collection that maps business (natural) keys to surrogate keys, and (ii) an integer that records the last used surrogate key. This integer is set to an SSIS variable that’s populated by the Execute SQL task in the control flow.

Each time a new customer record enters the script component, its business key is checked to see if it’s already in the dictionary. If it is then it’s assigned the surrogate key already generated for it and it’s sent out down the ProcessedRecord path, and if it’s not then a new surrogate key is generated for it, it’s added to the dictionary, and it’s sent out down the NewRecord path.

Thus already processed new dimension members are always found, with no duplicates being created (which wouldn’t necessarily be the case if the database was being looked up directly for new dimension inserts, since SSIS components doing such inserts will be waiting for full buffers to arrive before doing them).

In the data flow above I’ve shown the fact and dimension destinations being staging tables, which offers the most flexibility for a generic pattern. Set-based operations would then be done to implement fast loads into the target data warehouse tables. The pattern could alternatively write directly into either the target fact table or dimension table or both, although this is subject to restrictions such as foreign key constraints, where inferred dimension inserts would need to be completed before the fact table inserts that refer to them.