The Work Pile Pattern is a design approach that can be used to implement elastic horizontal scalability for SSIS processing.
In contrast to parallelism that can be hard-coded into the SSIS control flow or data flow (e.g. using the Balanced Data Distributor), a Work Pile implementation allows any number of instances of a single package to be executed in parallel, e.g. across multiple cores or multiple servers.
At its heart is a database table into which new work items are inserted, and an agent (or agents) that periodically interrogates the table to pull out work items for processing via SSIS packages. Agents may run on parallel servers, each of which requires an installation of SSIS, a SQL Server instance and SQL Server Agent. The same set of SSIS packages is deployed to the SSIS catalogs on all nodes.
I attended Matt Masson’s SSIS Performance Design Patterns pre-con training day at SQLBits this year, where he spent time discussing work pile patterns, including examples currently in use in industry. One suggestion he made was to fire off packages using the SSIS T-SQL Catalog API (for SQL Server 2012 & onwards), i.e. using catalog.create_execution() and catalog.start_execution().
In this post I describe a simple work pile setup that I’ve implemented to test the fundamentals of this technique.
In my example, SSIS has just one task: loading flat files into a single table. So if there are hundreds of files to load, the work pile technique allows the SSIS package to be run in parallel to minimise the total elapsed time for the load (assuming concurrent bulk loading, i.e. using the SSIS OLE DB destination’s fast load and TABLOCK options to provide for BU locks on the target table).
This is my Work Pile table, with some sample data:
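A minimal sketch of the table follows. The columns named in this post (ItemKey, SourceFileName, SSISPackage, StatusFlag, HostName, StartTime, EndTime, RowsProcessed, plus a priority) are included; the table name, data types and sizes are my assumptions for illustration.

```sql
-- Sketch of the work pile table; types and sizes are assumptions.
CREATE TABLE dbo.WorkPile
(
    ItemKey        int IDENTITY(1,1) PRIMARY KEY,
    SourceFileName nvarchar(260) NOT NULL,          -- flat file to be processed
    SSISPackage    nvarchar(260) NOT NULL,          -- package to invoke via the Catalog API
    Priority       int           NOT NULL DEFAULT 0,
    StatusFlag     char(1)       NOT NULL DEFAULT 'U',  -- U -> P -> C
    HostName       sysname       NULL,              -- server that checked the item out
    StartTime      datetime2(0)  NULL,
    EndTime        datetime2(0)  NULL,
    RowsProcessed  int           NULL
);

-- Sample unprocessed work items
INSERT INTO dbo.WorkPile (SourceFileName, SSISPackage, Priority)
VALUES (N'Sales_Day1.txt', N'LoadFlatFile.dtsx', 1),
       (N'Sales_Day2.txt', N'LoadFlatFile.dtsx', 1);
```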
Notable columns are SourceFileName, which contains the name of the flat file to be processed and SSISPackage, which contains the name of the package to be invoked via the Catalog API. StatusFlag flows from “U” (Unprocessed) to “P” (Processing) to “C” (Complete).
As an aside, note that in addition to being a vital part of the work pile mechanism, this table can be used as the basis of a monitoring tool for the work pile. It would be relatively easy to set up a web page or other GUI that reads this table and uses it to display near real-time status of work pile items.
SQL Server Agent is set up to run a job every minute that uses T-SQL EXEC to call a stored procedure that looks for unprocessed work items. The procedure checks one out (by highest priority) and asynchronously calls the SSIS package named in the work item row via the Catalog API. It can be seen here:
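The procedure the Agent job runs might look like the following sketch. The procedure name, catalog folder and project names are assumptions; the Catalog API calls (catalog.create_execution, catalog.set_execution_parameter_value, catalog.start_execution) are the documented SSISDB procedures.

```sql
-- Hedged sketch of the dispatcher procedure run every minute by SQL Server Agent.
-- Folder, project and procedure names are assumptions for illustration.
CREATE PROCEDURE dbo.sp_RunNextWorkItem
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @ItemKey int, @SourceFileName nvarchar(260), @SSISPackage nvarchar(260);

    -- Check out the next unprocessed item (marks it 'P', stamps StartTime/HostName).
    EXEC dbo.sp_GetWorkItem
        @ItemKey        = @ItemKey        OUTPUT,
        @SourceFileName = @SourceFileName OUTPUT,
        @SSISPackage    = @SSISPackage    OUTPUT;

    IF @ItemKey IS NULL
        RETURN;  -- nothing waiting

    DECLARE @execution_id bigint;

    EXEC SSISDB.catalog.create_execution
        @folder_name  = N'WorkPile',         -- assumed catalog folder
        @project_name = N'WorkPileProject',  -- assumed project name
        @package_name = @SSISPackage,
        @execution_id = @execution_id OUTPUT;

    -- Pass the work item values into the package's required parameters.
    EXEC SSISDB.catalog.set_execution_parameter_value
        @execution_id, @object_type = 30,
        @parameter_name = N'SourceFileName', @parameter_value = @SourceFileName;

    EXEC SSISDB.catalog.set_execution_parameter_value
        @execution_id, @object_type = 30,
        @parameter_name = N'ItemKey', @parameter_value = @ItemKey;

    -- Asynchronous by default: start_execution returns immediately.
    EXEC SSISDB.catalog.start_execution @execution_id;
END
```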
So the code just calls the sp_GetWorkItem() stored procedure to find any waiting work items, and then uses the SSIS Catalog API to execute the appropriate SSIS package. The SourceFileName and ItemKey values from the work pile table are passed into the package as required parameters.
The sp_GetWorkItem() procedure that’s called above is:
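A sketch of that procedure is below, assuming the table and column names above. Note the single UPDATE with a subquery, which is what the Locking section discusses, and the application lock held for the duration of the transaction.

```sql
-- Sketch of sp_GetWorkItem; serialises access to the work pile via an app lock.
CREATE PROCEDURE dbo.sp_GetWorkItem
    @ItemKey        int           OUTPUT,
    @SourceFileName nvarchar(260) OUTPUT,
    @SSISPackage    nvarchar(260) OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    -- Exclusive lock on a named resource, released when the transaction ends.
    EXEC sp_getapplock @Resource = N'WorkPile',
                       @LockMode = 'Exclusive',
                       @LockOwner = 'Transaction';

    -- Retrieve the next work item and mark it as Processing in one statement.
    UPDATE dbo.WorkPile
    SET StatusFlag      = 'P',
        StartTime       = SYSDATETIME(),
        HostName        = HOST_NAME(),
        @ItemKey        = ItemKey,
        @SourceFileName = SourceFileName,
        @SSISPackage    = SSISPackage
    WHERE ItemKey =
        (SELECT TOP (1) ItemKey
         FROM dbo.WorkPile
         WHERE StatusFlag = 'U'
         ORDER BY Priority DESC, ItemKey);

    COMMIT TRANSACTION;  -- releases the app lock
END
```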
The use of the exclusive app lock is explained in the Locking section below, but aside from that there’s just an Update statement that retrieves the next work item for processing and updates its StatusFlag, StartTime and HostName columns.
The control flow for the SSIS package executed by the Catalog API is very simple:
It’s just a Data Flow Task – which in my example loads a flat file into a table – and an Execute SQL Task, which calls this stored procedure:
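A sketch of the completion procedure, again assuming the names used above; the procedure and parameter names are my assumptions.

```sql
-- Sketch of the completion procedure called by the Execute SQL Task.
CREATE PROCEDURE dbo.sp_CompleteWorkItem
    @ItemKey       int,
    @RowsProcessed int
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    -- Same exclusive app lock as sp_GetWorkItem, to serialise table access.
    EXEC sp_getapplock @Resource = N'WorkPile',
                       @LockMode = 'Exclusive',
                       @LockOwner = 'Transaction';

    UPDATE dbo.WorkPile
    SET StatusFlag    = 'C',   -- Complete
        EndTime       = SYSDATETIME(),
        RowsProcessed = @RowsProcessed
    WHERE ItemKey = @ItemKey;

    COMMIT TRANSACTION;
END
```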
This updates the EndTime and RowsProcessed values and changes the StatusFlag of the work pile item to completed. The exclusive app lock is explained in the next section.
Of the two parameters passed into the SSIS package, the source file name is used by a parameterised connection string in the data flow’s flat file connection manager, and the work item key is used in the stored procedure just above.
Locking

It’s well documented that using a database table as a queue is problematic, because concurrent updates to it are susceptible to deadlocks. If the updates were simple this wouldn’t be a problem, since SQL Server would take Update locks and then convert them to Exclusive locks, specifically to avoid deadlocks. The problem with the update in the sp_GetWorkItem() stored procedure above, though, is the subquery: it acquires Shared locks, and it’s this that raises the possibility of deadlocks.
There are some colourful workarounds to this in various posts on the internet, typically using multiple lock-related table hints in conjunction with common table expressions. I’d want to do a lot of high concurrency testing with any of them though, before I began to trust them.
So instead, for the latter two stored procedures above, I’ve serialised access to the work pile table via an application lock, which takes an exclusive lock on a named resource for the duration of the transaction. This is simple, easy to understand and foolproof. And in work pile ETL scenarios where the parallelism amounts to a handful or a few dozen concurrent packages running in an overnight load window, there will be little activity on the work pile table anyway, and any brief waits for the lock will be negligible relative to the time each package takes to run. For massively concurrent scenarios where it does become a problem, another solution would be needed, e.g. a proper message queue.
As a side note, I’ve used explicit transactions in both stored procedures to make clear where I’m happy for the current transaction to end, and so for the application lock to be released.
Using the SSIS T-SQL Catalog API to implement part of the Work Pile Pattern was straightforward and intuitive, and was easy to get working. It avoids the need for the SSIS package that work pile SQL Server Agent jobs typically execute: one containing logic to access the work pile table, pull out new work items and then call an appropriate child package.
Ultimately, though, whether T-SQL or an SSIS package is the best choice for this task is a decision to be made against the demands of each individual work pile implementation.