Staging the ETL

Andrew Smith

Whilst a key strength of SSIS is the in-memory pipeline used to implement its data flow, the staging of data to disk remains an important strategy, and can be essential for meeting some requirements.

In this post I look at a number of reasons for staging data within SSIS-based ETL processes, together with an approach for minimising the cost of staging and a worksheet for spec’ing out the staging area.

Auditing

For any non-trivial set of ETL transformations, validating that data produced at the destination is correct can be a challenge. Reasons for this include:
  • the lineage from the destination back to source being lost
  • the complexity of the transformations involved
  • the sheer volume of data being processed

Normal testing procedures need to cope with this, but end-user support queries or formal audits may also require a way of checking the data flow from source to target, possibly days or weeks after a particular ETL job has run.

To help with this, data can be staged after the initial source extract and after each significant transformation, with the staged tables/files then archived; this makes it much easier to establish an audit trail of what happened during processing at any point back in time.

This can of course require a great deal of archive space over time, but that may be a small price to pay for being able to demonstrate with confidence the integrity of the destination data, and it may be essential for meeting formal regulatory or compliance policies. Subject to audit-related requirements, older archives can be purged over time.
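
As a minimal sketch of the archiving just described (the table and column names are assumed for illustration, not taken from any particular system), each staging table can be copied into an archive table tagged with the load that produced it:

-- Hypothetical names: stg.CustomerExtract is a staging table, audit.CustomerExtract_Archive its archive.
-- After the staging step completes, copy the staged rows into the archive, tagged with a load identifier,
-- so the state of the data at this point in the ETL can be inspected later.
INSERT INTO audit.CustomerExtract_Archive (LoadId, ArchivedAt, CustomerId, CustomerName, SourceModifiedDate)
SELECT @LoadId, GETDATE(), CustomerId, CustomerName, SourceModifiedDate
FROM   stg.CustomerExtract;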

Pushing work down to the RDBMS or File System

Large database modification operations such as UPDATE and DELETE cannot be performed efficiently within the SSIS data flow, since the transformations available for them (such as the OLE DB Command) execute on a row-by-row basis. Much higher performance can be obtained by writing intermediate data to staging tables and using the Execute SQL task to run set-based SQL operations on them.
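
As a sketch of what such an Execute SQL task might run (the staging and destination table names below are assumed purely for illustration):

-- Assumed names: stg.CustomerUpdates holds staged changes, dbo.Customer is the destination table.
-- One set-based UPDATE replaces what would otherwise be a row-by-row OLE DB Command in the data flow.
UPDATE c
SET    c.CustomerName = s.CustomerName,
       c.Email        = s.Email
FROM   dbo.Customer AS c
JOIN   stg.CustomerUpdates AS s
  ON   s.CustomerId = c.CustomerId;

-- Deletes can be handled the same way, driven by a staged list of keys.
DELETE c
FROM   dbo.Customer AS c
JOIN   stg.CustomerDeletes AS d
  ON   d.CustomerId = c.CustomerId;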

Sorting is another candidate for RDBMS help via staging. For example, imported flat files that need sorting can be written to a staging table with a sort-friendly index built on it, with the subsequent query that reads the table using ORDER BY to retrieve its contents. This avoids the potentially heavy memory use of the SSIS Sort transformation, as well as being much faster in some cases.
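
A minimal sketch of this, with an assumed staging table for an imported flat file and an assumed sort order:

-- Hypothetical staging table for an imported flat file; the clustered index makes the
-- subsequent ORDER BY cheap, so no SSIS Sort transformation is needed downstream.
CREATE TABLE stg.OrderExtract
(
    OrderId    INT           NOT NULL,
    CustomerId INT           NOT NULL,
    OrderDate  DATETIME      NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
);

CREATE CLUSTERED INDEX IX_stg_OrderExtract_Sort
    ON stg.OrderExtract (CustomerId, OrderDate);

-- The data flow source then reads the staged rows back in the required order
-- (and its output can be marked as sorted on these columns).
SELECT OrderId, CustomerId, OrderDate, Amount
FROM   stg.OrderExtract
ORDER BY CustomerId, OrderDate;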

The file system is another candidate for increasing the performance of ETL operations. If a filtered extract is required from large flat file sources, for example, it may well be quicker to run grep against them (on Unix, or findstr on Windows) and have the operating system extract the required rows than to perform the filtering within the data flow.

Restartability

Staging data after extraction and after significant transformations is a key strategy for restartability. Assuming the appropriate package logic, restarts can just begin from the last staging point prior to any failure.
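
As an illustration of the kind of package logic involved (the control table below is a hypothetical example, not part of the original design), a simple checkpoint table can record which staging points a load has completed, so a restarted package can branch straight to the last good point:

-- Hypothetical control table recording completed staging points per load.
CREATE TABLE etl.StagingCheckpoint
(
    LoadId      INT         NOT NULL,
    StagePoint  VARCHAR(50) NOT NULL,   -- e.g. 'Extract', 'Cleanse', 'Conform'
    CompletedAt DATETIME    NOT NULL,
    CONSTRAINT PK_StagingCheckpoint PRIMARY KEY (LoadId, StagePoint)
);

-- On restart, find the last staging point the failed load completed and resume from there.
SELECT TOP (1) StagePoint, CompletedAt
FROM   etl.StagingCheckpoint
WHERE  LoadId = @LoadId
ORDER BY CompletedAt DESC;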

Aside from saving time, staging immediately after extraction is key for source systems that are continually modifying their own data. Without having staged the extraction initially, a second read from a source system may result in different data, possibly missing data that’s been overwritten or deleted in the meantime. It also avoids putting excessive load on the system by having to re-query extract data for every restart. And it provides consistency by having the source extract run at the same time every day/night.

Staging for restartability can be an important input into the design of the ETL workflow. Although there may be natural points for staging within an ETL process (e.g. after extraction and after cleansing), it may aid restartability to break down individual data flows still further, using staging via SSIS raw files to link them. (Raw files are relatively fast to write and re-read, since they’re implemented using the native SSIS buffer structure – making them preferable to staging via the RDBMS in some cases.)

Horizontal Scaling using Multiple Servers

When processing is complicated or data sets are large, it may be appropriate to run parts of the ETL in parallel over multiple servers. In this case, one part of the ETL process partitions the incoming data – e.g. using an SSIS Conditional Split transformation – with the outputs staged to raw files:

[Figure: StagingParallel – incoming data partitioned and staged to raw files for parallel processing]

Each raw file is then picked up by a parallel server and processed, either with the results concurrently loaded by each server into the destination, or with the data being re-merged and loaded as a single batch. (Note that with this design any ordering required for the final load must not be lost during the parallelisation.)

Processing Windows

If an ETL process is loading data from multiple source systems, and these extracts are not all available at the same time, it may be necessary to stage each one until the full data set is available. The same principle applies when the extract window for the source system(s) does not line up with the load window for the destination system.
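
As a simple illustration (the control tables here are assumed, not from the original post), a load can check whether every expected source extract has been staged before the main processing starts:

-- Hypothetical control tables: etl.ExpectedExtract lists the source extracts a load needs,
-- etl.StagedExtract records each extract as it lands in the staging area.
SELECT e.SourceSystem
FROM   etl.ExpectedExtract AS e
LEFT JOIN etl.StagedExtract AS s
  ON   s.SourceSystem = e.SourceSystem
 AND   s.LoadDate = @LoadDate
WHERE  s.SourceSystem IS NULL;   -- any rows returned are extracts still outstanding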

Source Backpressure

In some situations SSIS will throttle the rate at which it reads from source systems – for example, when downstream processing would otherwise result in more than 5 active buffers for a data flow execution tree. Where it’s important to extract from the source system in as little time as possible, a simple initial extract to staging allows this, with all the time-intensive processing happening afterwards.

Minimising the Cost of Staging

Whilst the previous sections described reasons for staging, here I look at a novel approach for implementing it, as detailed in the Wrox Professional book SQL Server 2014 Integration Services.

The book considers whether it’s possible to accomplish staging without staging I/O slowing down the data flow, such that SSIS’s in-memory pipeline can still be leveraged. The answer is yes, albeit only for a particular set of staging requirements. The idea is this:

[Figure: Staging – a Multicast transformation writes to staging while also passing data on down the data flow]

So rather than the ETL having to synchronously write to staging and then read back out again, it uses a Multicast transformation to write to staging at the same time as passing the data down the data flow. Thus the data flow isn’t impeded, yet the data is still staged immediately after extraction.

This is only applicable for some staging requirements – e.g. when staged data is needed for potential restarts, or archived for future audits – but in these cases it’s a very useful technique to have to hand.

Spec’ing out the Staging Area

Due to the volume and nature of data that can be written to disk during staging, it’s important to properly think through and document the use of staging area(s) during ETL design and development. This is particularly true if the staging area grows over time as the source system(s) and their extracts grow.

Ralph Kimball and Joe Caserta’s book The Data Warehouse ETL Toolkit includes a template for a volumetric worksheet that can be used as a reference for staging area structure and content.

I’ve created a (slightly modified) version here.

This very concisely lists which tables are in the staging area (for a particular ETL stage), how they’re updated, what their current sizes are expected to be, and how they are likely to grow over time. Whilst the worksheet is table-based, it can easily be extended to accommodate staged files too.

Summary

Staging is an important consideration for many types of ETL process, and often results in a balancing act between satisfying key requirements and not sacrificing too much performance. Here I’ve described a number of diverse reasons for staging ETL data, together with a creative approach for implementing it, and a worksheet that documents its use.