In some situations it’s appropriate to push operations down to the underlying RDBMS that could otherwise have been performed by SSIS.
This is subject to the RDBMS having the capacity available to perform these operations, but assuming it does, this is a useful option to have to hand. Several options for doing this are described below, with SQL Server used as the RDBMS.
In this case, built-in SQL functions are used within extract queries to perform simple transformations. Examples include using string functions to trim or concatenate fields, and using COALESCE() to replace NULLs with a default value when NULLs aren’t allowed for a target column.
One point to note here is how the use of such functions can affect data type sizing within the data flow. For example, if I have this SQL within an SSIS Lookup transformation query:
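The original query isn’t reproduced here, but a minimal sketch of the kind of Lookup query being described might look like this (the Customers table and custid column names are assumptions; contactname is the nvarchar(30) column from the text):

```sql
-- Hypothetical Lookup query: contactname is defined as nvarchar(30) in the source
SELECT custid, contactname
FROM dbo.Customers;
```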
and look at its representation in the Advanced Editor for the next downstream component, we see a 30 character Unicode string, which is exactly how the contactname column was defined in the source:
But if I change the source query to include the REPLACE() function:
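A sketch of such a modified query (hypothetical table and column names, as the original isn’t shown):

```sql
-- Wrapping the column in REPLACE() means SSIS can no longer assume the
-- result fits in 30 characters, so the output column is sized much larger
SELECT custid, REPLACE(contactname, ',', '') AS contactname
FROM dbo.Customers;
```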
then the Advanced Editor now shows:
So suddenly much more buffer space is taken up, which aside from making the dataflow pipeline less efficient, also significantly increases the system RAM used by the lookup cache. The fix here is to CAST() the result of the REPLACE() back to a 30 character Unicode string:
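A sketch of the corrected query (hypothetical names, matching the earlier example):

```sql
-- CAST() restores the original nvarchar(30) sizing in the data flow
SELECT custid,
       CAST(REPLACE(contactname, ',', '') AS NVARCHAR(30)) AS contactname
FROM dbo.Customers;
```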
which results in the Advanced Editor showing 30 characters again. I’m assuming the original issue occurred because REPLACE() can potentially increase the size of the source string above its original column size, and so SSIS must allow for a resulting string of almost any size.
Aside from casting within the source query for the reason just mentioned, use of CAST() within the database engine has other important roles to play.
Microsoft’s Matt Masson presented a slide similar to this at SQLBits this year, which showed the number of conversions a single source column’s data type can go through whilst being loaded into the SSIS data flow:
Each of these conversions takes processing power and time, so to minimise type conversions along this path, the SQL CAST() function can be used in the extract query to map the incoming data to a type that closely matches its use in the data flow.
An important point here concerns the source input column types used within the data flow. It’s safest not to assume that all source table columns have sensible types, or that mapping them directly to SSIS types is the right thing to do. It may be that a table column has a type that’s much larger than is required for any value that will ever be entered into it. This excess sizing would reduce the number of rows that SSIS could fit into each buffer (and take up unnecessary space at the target).
So assuming the data flow has been explicitly designed with optimal types, the purpose of the CAST() in this case is to correct the RDBMS type(s) on-the-fly to something smaller/sharper, whilst still making sure they map optimally to the target SSIS types, with regard to the above diagram.
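As an illustration of this kind of narrowing cast (the table and column names here are hypothetical, invented for the example):

```sql
-- Source columns declared wider than their data requires:
-- narrowing them in the extract query shrinks the SSIS buffer row size
SELECT orderid,
       CAST(orderstatus AS VARCHAR(10)) AS orderstatus, -- declared VARCHAR(255) in the table
       CAST(quantity AS SMALLINT) AS quantity           -- declared INT, but values never exceed a few thousand
FROM dbo.Orders;
```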
A simple control flow for performing set-based SQL transformations looks like:
So the first Data Flow Task loads data into a staging area, the Execute SQL Task does the cleansing, and the second Data Flow Task loads the cleansed output into production. The Execute SQL Task could be calling SQL statements (e.g. UPDATE) directly, or it could be calling a stored procedure that contains the cleansing logic.
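As a sketch of the kind of set-based cleansing the Execute SQL Task might run against the staging table (the stg.Customers table and its columns are assumptions for illustration):

```sql
-- Set-based cleansing of the staged rows before loading into production:
-- trim whitespace and replace NULLs with a default in a single statement
UPDATE stg.Customers
SET    contactname = LTRIM(RTRIM(contactname)),
       country     = COALESCE(country, 'Unknown');
```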
One very significant advantage of this type of cleansing via the database engine is that set-based operations can be performed, as opposed to the row-by-row operations that occur within SSIS.
Sort is a fully blocking transformation in SSIS and is only efficient if it can perform its sort within memory. This means firstly that it can use a lot of memory, and secondly that once available memory has been used up it will start spooling buffers to disk. So pushing a sort down to the database engine where possible can be a better alternative, especially if the database has indexes to support the sort.
The SSIS Aggregate transformation is also fully blocking, and so for the same reason as Sort it’s a candidate for being pushed down to the database engine. Queries with a GROUP BY clause can calculate count, sum, average, minimum and maximum aggregates to give the same results as the SSIS Aggregate transform.
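A sketch of an aggregate query of this kind, using hypothetical table and column names:

```sql
-- Set-based equivalent of an SSIS Aggregate transform grouped on custid
SELECT custid,
       COUNT(*)       AS order_count,
       SUM(total)     AS total_value,
       AVG(total)     AS avg_value,
       MIN(orderdate) AS first_order,
       MAX(orderdate) AS last_order
FROM dbo.Orders
GROUP BY custid;
```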
If an SSIS Merge Join can be pushed down to the database engine as a relational join then it will execute faster through running as a set-based operation. It also avoids any sorting in the data flow that’s required to give the Merge Join its sorted inputs. Index support for the join in the RDBMS will help further.
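For illustration, the relational equivalent of a two-input Merge Join might look like the following (hypothetical table and column names):

```sql
-- A plain relational join replaces the Merge Join and its sorted inputs;
-- an index on the join columns lets the engine avoid any sort entirely
SELECT o.orderid, o.orderdate, c.contactname
FROM dbo.Orders AS o
INNER JOIN dbo.Customers AS c
    ON c.custid = o.custid;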
A typical SSIS “Upsert” pattern looks like this:
So this takes an input file, uses a key lookup to see whether each incoming row already exists, and then either inserts the row as new or updates an existing row as appropriate.
This row-by-row approach can be replaced by the T-SQL MERGE statement, which can replicate the same functionality (and more!). So in the case of the data flow shown above, the flat file would first need to be loaded into a staging table, and then a T-SQL MERGE statement could be executed from an Execute SQL Task in the control flow.
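A sketch of such a MERGE statement, upserting from a staging table into production (the table and column names are assumptions for illustration):

```sql
-- Set-based upsert: rows matched on the key are updated,
-- unmatched rows are inserted, all in one statement
MERGE dbo.Customers AS tgt
USING stg.Customers AS src
    ON tgt.custid = src.custid
WHEN MATCHED THEN
    UPDATE SET tgt.contactname = src.contactname,
               tgt.country     = src.country
WHEN NOT MATCHED BY TARGET THEN
    INSERT (custid, contactname, country)
    VALUES (src.custid, src.contactname, src.country);
```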