Inserting into Large Parent-Child Dimensions

Andrew Smith

In this post I’m implementing another of John Welch’s patterns from his SQLBits session on Handling Advanced Data Warehouse Scenarios in SSIS.

It’s an efficient pattern for performing inserts into large parent-child dimensions, and it gets its speed from needing only two passes over the source data. It involves a little complexity to accomplish this, so it would be overkill for small dimensions, but for large-scale situations it’s a great pattern to work with.

I’m using the same sample data as John, where my initial parent-child Employees dimension table and load data table are as follows:

[Image: PCDimTables]

The control flow for the pattern is this:

[Image: PCDimControlFlow]

and the first data flow is this:

[Image: PCDimDataFlow1]

The control flow queries for the highest key used in the Employees dimension, stores it in a package variable, and then executes two data flows, which implement the two passes over the inserts for the Employees dimension.
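In the package itself this is an Execute SQL Task writing a single-row result into a variable, but as a sketch of the equivalent logic in plain C# (the connection string, table and column names are assumptions for illustration):

```csharp
using System.Data.SqlClient;

static class MaxKeyQuery
{
    // Equivalent of the control flow's Execute SQL Task: fetch the highest
    // surrogate key currently in the dimension (0 for an empty table) so
    // that the script component can continue the sequence from there.
    public static int GetMaxEmployeeId(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT ISNULL(MAX(EmployeeID), 0) FROM dbo.Employees", conn))
        {
            conn.Open();
            return (int)cmd.ExecuteScalar();
        }
    }
}
```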

The first data flow starts by doing a fully cached lookup on the Employees dimension to see whether the incoming records already exist. If they do, they are ignored (for simplicity here; in practice some processing may take place). The No Match output of the lookup flows to a C# script component, which has a private member initialised in its PreExecute() method to the latest key value retrieved by the control flow; its ProcessInputRow() method then increments this for each row processed and puts the resulting value into the current row, overwriting the source key value.
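A minimal sketch of that script component, assuming the control flow stored the highest key in a read-only package variable named MaxEmployeeID and that the key column on the buffer is EmployeeID (both names are assumptions):

```csharp
public class ScriptMain : UserComponent
{
    // Seeded once per execution, then incremented for every new row.
    private int nextKey;

    public override void PreExecute()
    {
        base.PreExecute();
        // Seed the counter from the variable populated by the control flow.
        // The variable must be listed in the component's ReadOnlyVariables.
        nextKey = Variables.MaxEmployeeID;
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Assign the next surrogate key, overwriting the source key value.
        nextKey++;
        Row.EmployeeID = nextKey;
    }
}
```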

The output of the script component goes into a Cache Transform component, which loads an in-memory cache that will be accessed during the second pass. (The “Use file cache” option on the cache’s cache connection manager is not checked, so this is appropriate when the cache is known to fit comfortably into memory.)
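Conceptually the cache is just an in-memory index of the newly keyed employees, looked up by name in the second pass. A sketch, assuming employee names are unique in the load data:

```csharp
using System.Collections.Generic;

static class NewEmployeeCache
{
    // What the Cache Transform builds, in essence: a map from employee
    // name to the newly assigned surrogate key, used by the second pass
    // to resolve manager references.
    public static Dictionary<string, int> Build(
        IEnumerable<(int EmployeeId, string EmployeeName)> newRows)
    {
        var cache = new Dictionary<string, int>();
        foreach (var (id, name) in newRows)
            cache[name] = id;
        return cache;
    }
}
```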

Then a lookup is used to see whether the new record’s parent (i.e. manager) is already in the Employees dimension table. If it is, the new record is inserted into the dimension table, using a ManagerID column output by the lookup that contains the parent’s EmployeeID value. But if the new record’s parent is itself a new record (assuming the ETL is structured so that parents are never missing entirely), the new record is written to a raw file destination, ready for fast reloading in the second pass.
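The routing at this step amounts to the following, sketched in plain C#, where PendingEmployee, existingByName and the two delegates are hypothetical stand-ins for the buffer row type, the dimension lookup, and the OLE DB and raw file destinations:

```csharp
using System;
using System.Collections.Generic;

public record PendingEmployee(int EmployeeId, string EmployeeName, string ManagerName);

static class FirstPassRouting
{
    public static void Route(
        IEnumerable<PendingEmployee> newRows,
        IReadOnlyDictionary<string, int> existingByName,   // name -> EmployeeID already in the dimension
        Action<PendingEmployee, int> insertIntoDimension,  // stand-in for the OLE DB destination
        Action<PendingEmployee> writeToRawFile)            // stand-in for the raw file destination
    {
        foreach (var row in newRows)
        {
            if (existingByName.TryGetValue(row.ManagerName, out var managerId))
                insertIntoDimension(row, managerId);   // parent already present: insert now
            else
                writeToRawFile(row);                   // parent is also new: defer to the second pass
        }
    }
}
```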

The second data flow is:

[Image: PCDimDataFlow2]

The raw file written by the first pass is loaded, and then a conditional split handles the special case where the top-level manager for the organisation (i.e. the one having no parent) is being loaded. All other cases go into a lookup component, which looks up against the cache connection manager used by the Cache Transform in the first pass, matching the manager name in the load data with the employee name in the cached data of new employee records. A Union All transformation then writes records from both paths to the Employees dimension table (which could alternatively be a staging table).
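Again as a plain C# sketch with the same hypothetical names, the second pass resolves each deferred row against the cache, with the conditional split passing the top-level manager through with no parent:

```csharp
using System.Collections.Generic;

// As in the first-pass sketch.
public record PendingEmployee(int EmployeeId, string EmployeeName, string ManagerName);

static class SecondPass
{
    // rawFileRows are the records deferred by the first pass;
    // newEmployeeCache maps employee name -> newly assigned key.
    public static IEnumerable<(int EmployeeId, string Name, int? ManagerId)> Resolve(
        IEnumerable<PendingEmployee> rawFileRows,
        IReadOnlyDictionary<string, int> newEmployeeCache)
    {
        foreach (var row in rawFileRows)
        {
            if (string.IsNullOrEmpty(row.ManagerName))
                yield return (row.EmployeeId, row.EmployeeName, null);  // top-level manager: no parent
            else
                yield return (row.EmployeeId, row.EmployeeName,
                              newEmployeeCache[row.ManagerName]);       // parent's newly assigned key
        }
    }
}
```

Every deferred row’s manager is guaranteed to be in the cache, since rows are only deferred when their parent is itself one of the new records.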

Note that for inserts directly into the Employees table as above, any foreign key constraint from the ManagerID column back to the EmployeeID column would need to be disabled for the duration of the load (e.g. `ALTER TABLE dbo.Employees NOCHECK CONSTRAINT FK_Employees_Manager` beforehand and `ALTER TABLE dbo.Employees WITH CHECK CHECK CONSTRAINT FK_Employees_Manager` afterwards, with the constraint name here purely illustrative), since this pattern does not order insertions to ensure that parents arrive before their children.

The SSDT screenshots above show the project being run against the sample data shown previously, with the resulting Employees dimension table being:

[Image: PCDimTableResults]

It’s a nice pattern: relatively easy to understand, very efficient in its operation, and easy to extend to include more complex ETL logic.