Restartability in the SSIS Data Flow

Andrew Smith

Whilst there is no shortage of strategies for implementing restartability within the SSIS control flow (e.g. see here), achieving restartability within the data flow is a completely different kettle of fish.

This is because whilst control flow strategies can often be implemented independently of the content of the ETL (and thus are available in generic frameworks), data flow restartability is generally specific to exactly what’s going on within a particular data flow.

Microsoft do present ideas for doing this, though, in their SSIS Operational and Tuning Guide white paper. This post explores those ideas.

As Microsoft say, the key issue for restarting a data flow after failure is that: “In each data move, you must have some way of knowing what data has already landed at the destination, and what has not.” Ways of doing this are implemented in several different scenarios below.

Loading from a Table to a Table

Restartability in this case can be anchored off a primary key, when one exists that increments indefinitely. The idea is that upon the restart, the control flow finds the highest key value already in the destination, and then runs the data flow with a dynamic source query that reloads source data with only those rows having higher primary key values. (This assumes the data flow is only doing inserts.)

Inherent to this strategy is that the source query must present the data to the data flow ordered ascending by the primary key (which adds no overhead to the RDBMS for a source table that’s already clustered on the primary key).

So the control flow looks like this, where the first 2 Execute SQL tasks between them derive the appropriate starting key value for the source:

[Image: DataFlowRestart1a]

and where the data flow source query reads from a variable using this expression:

"SELECT TransactionID, ProductID, ReferenceOrderID, ReferenceOrderLineID, TransactionDate, TransactionType, Quantity, ActualCost, ModifiedDate FROM Production.TransactionHistory WHERE TransactionID >= " + (DT_WSTR, 12) @[User::sourceStartTransactionID] + " ORDER BY TransactionID"
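The starting key that those first two Execute SQL tasks derive could boil down to something as simple as the single query below. This is just a sketch: the destination table name (Production.TransactionHistory_Dest) and the mapping of the one-row result to @[User::sourceStartTransactionID] are my assumptions.

-- Highest key already committed at the destination, plus 1.
-- On an empty destination ISNULL gives 0, so the source query reloads everything.
SELECT ISNULL(MAX(TransactionID), 0) + 1 AS sourceStartTransactionID
FROM Production.TransactionHistory_Dest;

The ISNULL guard means the same package works unchanged for both an initial load and a restart.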

I engineered a failure in the data towards the end of the load, and the initial run of the data flow, followed by the restart (after fixing the data), looked like this:

[Image: DataFlowRestart1b]

So the second run loaded just the remaining rows, and the final row count at the destination was correct. Fast Load was being used for the insert, with one commit occurring per buffer, which explains why some data was successfully committed on the first, failed run.

Loading from a Single File to a Table

In this scenario, a single file is being loaded into a table, and the file rows are sorted ascending by an indefinitely incrementing primary key. Unlike in the previous example, it’s not possible to read only the rows required for the restart, so all of them must be read from the file every time. The destination can still be queried, though, to find the maximum primary key value that it holds. So the control flow now is just:

[Image: DataFlowRestart2a]

The data flow – shown here before and after a restart – looks like:

[Image: DataFlowRestart2b]

The significant transformation here is the conditional split. It has a single condition, which sends matching rows to a trash output:

[TransactionID] <= @[User::maxTransactionIDAtDestination]

(The variable referenced here is the one populated by the Execute SQL task in the control flow.)
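For completeness, that Execute SQL task could run something like the query below; again a sketch, with the destination table name assumed and the single-row result mapped to @[User::maxTransactionIDAtDestination].

-- Highest key already committed at the destination.
-- On an empty table ISNULL gives 0, so no source rows are sent to the trash output on a clean first run.
SELECT ISNULL(MAX(TransactionID), 0) AS maxTransactionIDAtDestination
FROM Production.TransactionHistory_Dest;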

So upon a restart, all source rows that are already in the destination are filtered out, and the remaining rows are inserted. The above image shows an initial run which failed after 9956 rows. Once the data was corrected, the restart wrote only the remaining 44 rows to the destination.

Loading from a Table to Multiple Files

In this scenario, a source table is loaded into multiple text files. The text files are chunked (partitioned) by a suitable source field, and the source data is read in that order. The example source data here contains transactions that span multiple days, and the ETL creates a destination file for each day, with the day’s date included in the file name. This means that when restarting after a failure, the SSIS package can determine which destination files already exist, and then adapt the source query to start extracting data from the latest such day onwards (that day’s file may be incomplete from the failed run, so it is simply regenerated).

This is my control flow:

[Image: DataFlowRestart3a]

I forced an error to occur on the generation of the 9th file. Once I undid this and ran the package again, the target directory showed these files:

[Image: DataFlowRestart3b]

The timestamps show that the 8 complete files from the initial run were retained, and only the subsequent files were generated on the second run.

There are 3 key expressions used in the package. The Expression Task uses this nested conditional to determine the latest file already existing from any prior failed run:

@[User::latestDay] = ISNULL(@[User::latestDay]) ? SUBSTRING(@[User::existingTxFileName], 14, 8) : SUBSTRING(@[User::existingTxFileName], 14, 8) > @[User::latestDay] ? SUBSTRING(@[User::existingTxFileName], 14, 8) : @[User::latestDay]
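The SUBSTRING offsets imply that the file name held in @[User::existingTxFileName] carries an 8-character yyyymmdd date starting at character 14, i.e. after a 13-character prefix. Purely as an illustration (the Transactions_ prefix and the User::destinationFolder variable are hypothetical, not taken from the package), a flat file connection string expression for the current day’s file could then be:

@[User::destinationFolder] + "Transactions_" + REPLACE(@[User::currentDay], "-", "") + ".txt"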

The Execute SQL Task that determines the range of days uses a variable with this expression for its SQL:

"SELECT DISTINCT CAST(CAST(TransactionDate AS DATE) AS VARCHAR(10)) TransactionDate
FROM Production.TransactionHistory_dest " + (LEN(@[User::latestDay]) == 0 ? "" : "
WHERE TransactionDate >= '" + SUBSTRING(@[User::latestDay], 1, 4) + "-" + SUBSTRING(@[User::latestDay], 5, 2) + "-" + SUBSTRING(@[User::latestDay], 7, 2) + "' ") +
"ORDER BY TransactionDate"
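So if, say, the latest file left over from the failed run was for 9 May 2014 (making @[User::latestDay] the hypothetical value "20140509"), the expression evaluates to this SQL:

SELECT DISTINCT CAST(CAST(TransactionDate AS DATE) AS VARCHAR(10)) TransactionDate
FROM Production.TransactionHistory_dest
WHERE TransactionDate >= '2014-05-09'
ORDER BY TransactionDate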

The OLE DB Source in the data flow uses a variable with this expression for its SQL, in which the TransactionDate predicate is sargable:

"SELECT TransactionID, ProductID, ReferenceOrderID, ReferenceOrderLineID, TransactionDate, TransactionType, Quantity, ActualCost, ModifiedDate FROM Production.TransactionHistory_Dest WHERE TransactionDate >= '" + @[User::currentDay] + "' AND TransactionDate < DATEADD(d, 1, '" + @[User::currentDay] + "') ORDER BY TransactionID"
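Continuing the same hypothetical value, when the loop sets @[User::currentDay] to "2014-05-09" the evaluated query uses a half-open date range, which is what keeps the TransactionDate predicate sargable:

SELECT TransactionID, ProductID, ReferenceOrderID, ReferenceOrderLineID, TransactionDate, TransactionType, Quantity, ActualCost, ModifiedDate
FROM Production.TransactionHistory_Dest
WHERE TransactionDate >= '2014-05-09' AND TransactionDate < DATEADD(d, 1, '2014-05-09')
ORDER BY TransactionID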

Summary

All the techniques above successfully implement data flow restarts, where work that was already done did not have to be repeated. Importantly, the logic for determining where to restart did not have to compare all source rows with all destination rows one-by-one to see which already existed in the destination and which did not. Instead, chunking and/or ordering was used to allow the restart point to be efficiently calculated.

The 3 cases above were relatively simple in that they dealt with new inserts only. Data flows implementing other kinds of changes to the destination (e.g. updates or deletes) would require different approaches. The general way of thinking about the problem described above should still be useful, though.