SSIS provides for parallelism at multiple levels: (i) by running packages concurrently, (ii) by running tasks concurrently within a package, and (iii) by running transformations concurrently within a data flow.
This post investigates parallelism within the data flow, by comparing the time taken between non-parallel and parallel approaches.
This simple data flow is used as the non-parallel starting point for the tests:
It reads 5 million rows from a table that includes 5 datetime2 columns, adds 5 derived columns that each process a different datetime2 column into a local specific string format, and then writes all the columns out to another table.
Each derived column uses a form of this expression:
(DT_WSTR,4)DATEPART(“yyyy”,date1) + “-” + (DT_WSTR,2)DATEPART(“mm”,date1) + “-” + (DT_WSTR,2)DATEPART(“dd”,date1) + ” ” + (DT_WSTR,2)DATEPART(“hh”,date1) + “:” + (DT_WSTR,2)DATEPART(“mi”,date1) + “:” + (DT_WSTR,2)DATEPART(“ss”,date1)
(Such a conversion could have been pushed down to SQL Server and performed via FORMAT(), but it is used here to serve as a non-trivial expression.)
The time taken to run the data flow via the SSIS catalog with basic logging was 68 seconds.
I then rewrote the data flow to look like this:
Here, each derived column transformation just processes 1 column, with each transformation assigned to a different column. This means all 5 derived column expressions are now candidates for running in parallel. The time taken for this version to run was 50 seconds – a healthy 26% improvement.
This works because each execution tree within a data flow can receive multiple threads, with each transformation within the tree being a candidate for having its own thread. The SSIS backpressure mechanism limits activity to 5 active buffers per execution tree though, so the extent to which this degree of parallelism can be extended is limited.
So here, the first buffer was passed down from the source to the first derived column transformation. Once it had processed its single column, the buffer was then passed down to the second transformation. So at this point the second transformation could process its column in that first buffer, at the same time the first transformation was processing the second buffer passed down from the source. Once the fifth buffer has left the source, all 5 transformations have the potential to be processing their own buffers concurrently.
The SSIS Data Flow Task EngineThreads property comes into play here, which is a hint regarding the maximum number of threads a data flow can run, for transforms and for source components. For the above test it was left at its default of 10.
Careful benchmarking in a representative environment is required for optimising hard-coded parallelism such as this, e.g. to take into consideration Microsoft’s warning for this context: “The general rule is not to run more threads in parallel than the number of available processors. Running more threads than the number of available processors can hinder performance because of the frequent context-switching between threads.”
Another approach to parallelism within the data flow is to use the Balanced Data Distributor (BDD):
The BDD is available from Microsoft as an optional download for SSIS on SQL Server 2012 and 2014, and is now included as standard for SQL Server 2016. It assigns incoming buffers sequentially to its outputs in a round-robin fashion. The processing that occurs on each output path then has the potential to run in parallel. So here, as in the original non-parallel example above, each derived column transformation is back to running the expressions for all 5 datetime2 columns.
To confirm that the BDD output paths have the potential to run in parallel, I turned on PipelineExecutionTrees in SSDT logging. This shows that a Subpath in the tree is created for each output, thus providing for inherent parallelism.
This data flow ran the benchmark test in 54 seconds, so usefully quicker than the non-parallel case, but a little slower than the prior parallel example. This slow down may be accounted for by the fact that each BDD derived column transformation has 5 expressions to calculate, which must be done serially per row.
One point to make regarding all the above is that of maintainability of the package. The original non-parallel case would be the simplest to maintain, and the BDD version the most difficult, i.e. due to the 5 derived column expressions being duplicated across each derived column transformation, as well as the duplicated settings across the OLE DB destinations. (Although this latter duplication could be removed by doing a Union All to merge the derived column outputs, and sending the result to just a single destination. This would also accommodate destinations that don’t support parallel inserts.)
In summary, SSIS has an array of parallelism options, and where the best approach for any given ETL requirement will usually be a compromise between competing demands. Careful benchmarking is always required to ascertain the optimal solution. As the simple test case above shows, the effect of parallelism on performance can be very significant.