The title of this post came from Matt Masson’s SSIS session at SQLBits this year, when he used it whilst describing his high level strategy for designing SSIS solutions for performance; namely (i) Do Less Work, and (ii) Parallelise.
“Do Less Work” is in fact a useful ethic to abide by in many aspects of SSIS project design & implementation, which I discuss in this post.
Doing less work with a view to maximising performance is to a significant degree project specific, since it’s a matter of designing ETL algorithms efficiently in order to meet the requirements at hand. But in addition there are many elements of SSIS best practice which also revolve around doing less work, e.g.:
- Loading a Lookup cache with only the rows required, and no more
- Only pulling into the data flow the columns required from a source, and no more
- Doing set-based SQL updates rather than row-by-row SSIS operations
- Configuring the database and its bulk-load operations to allow for minimal logging (where possible)
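As a sketch of the last two points, assuming hypothetical `dbo.DimCustomer` and `stg.Customer` tables: a restricted Lookup cache query and a single set-based UPDATE in place of a row-by-row OLE DB Command transformation.

```sql
-- Lookup cache query: only the columns the data flow needs, and only
-- the rows that can actually match (hypothetical table and columns).
SELECT CustomerKey, CustomerID
FROM   dbo.DimCustomer
WHERE  IsCurrent = 1;

-- One set-based UPDATE instead of firing a statement per data-flow row.
UPDATE dim
SET    dim.CustomerName = stg.CustomerName,
       dim.City         = stg.City
FROM   dbo.DimCustomer AS dim
JOIN   stg.Customer    AS stg
       ON stg.CustomerKey = dim.CustomerKey
WHERE  dim.CustomerName <> stg.CustomerName
   OR  dim.City         <> stg.City;
```

The WHERE clause on the UPDATE is itself "do less work": rows that haven't changed aren't touched and aren't logged.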
Whilst the notion of auditability in ETL spans a whole gamut of possibilities, one element can be the need for an audit trail that relates destination data back to source data, i.e. establishing the path for each processed data item through any number of transformations.
Whilst the need for an ETL audit trail may initially come from audit or compliance requirements, it’s also an extremely useful mechanism to have whilst testing an ETL process. Testing processes that change databases is hard, and tracking down the origin of an unexpected destination value can be harder still, particularly with big and/or complicated loads. Having an audit trail to trace back along can be a big time saver in these situations – an example of doing less work during testing, and one which is equally applicable during production support thereafter.
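A minimal version of such an audit trail, assuming a hypothetical `dbo.FactSales` destination, is a few lineage columns populated by the load:

```sql
-- Hypothetical audit columns: which execution loaded the row, and which
-- source row it came from.
ALTER TABLE dbo.FactSales ADD
    AuditExecutionId BIGINT      NOT NULL DEFAULT 0,
    SourceSystem     VARCHAR(20) NULL,
    SourceRowId      BIGINT      NULL;

-- Tracing a suspect destination value back to its source then becomes
-- a simple query rather than a reverse-engineering exercise.
SELECT SourceSystem, SourceRowId, AuditExecutionId
FROM   dbo.FactSales
WHERE  SalesAmount < 0;   -- the unexpected destination rows
```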
Logging can result in doing less work during troubleshooting in the same way that having an audit trail does. Having some key logging facts at your fingertips after an unexpected failure can mean far less theorising and fewer false starts at resolutions compared with having to work blind or almost blind.
SSIS catalog logging can go a long way to help here, but some application-specific custom logging may also be required to supplement it.
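For example, the SSISDB catalog views can answer "what failed, and why" directly – a sketch along these lines:

```sql
-- Recent failed executions and their error messages, straight from
-- the SSIS catalog (status 4 = failed, message_type 120 = OnError).
SELECT e.execution_id, e.package_name, e.start_time,
       m.message_time, m.message
FROM   SSISDB.catalog.executions     AS e
JOIN   SSISDB.catalog.event_messages AS m
       ON m.operation_id = e.execution_id
WHERE  e.status = 4
  AND  m.message_type = 120
ORDER  BY m.message_time DESC;
```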
One of the fundamental reasons for employing restartability is to do less work at runtime: if a package fails partway through, and it has been designed for restartability, it can pick up approximately where it left off, without having to redo work that has already completed successfully.
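One simple way to implement this – a sketch, with a hypothetical `etl.BatchStep` control table and step names – is to record each completed step so a restarted run can skip it:

```sql
-- Hypothetical batch-control table: each package step records its
-- completion, keyed by batch and step name.
CREATE TABLE etl.BatchStep (
    BatchId     INT          NOT NULL,
    StepName    VARCHAR(100) NOT NULL,
    CompletedAt DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_BatchStep PRIMARY KEY (BatchId, StepName)
);

-- An Execute SQL Task at the top of each step asks "has this already
-- run for this batch?" (? is the OLE DB parameter marker).
SELECT CASE WHEN EXISTS (
           SELECT 1 FROM etl.BatchStep
           WHERE  BatchId = ? AND StepName = 'LoadDimCustomer'
       ) THEN 1 ELSE 0 END AS AlreadyDone;
```

The result feeds a precedence-constraint expression that bypasses the step when `AlreadyDone = 1`.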
There can be a difference between the data that source systems should contain and what they actually do contain. By profiling representative samples of source data ahead of starting SSIS package design, additional transformations (e.g. cleansing) may be identified over and above the transformation requirements provided by business analysts or source system end users. Designing these transformations into the ETL process from the beginning results in doing less work compared with retrofitting them later.
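Profiling needn't be elaborate – a few queries against a representative sample, run before design starts, reveal a lot. A sketch against a hypothetical `src.Customer` table:

```sql
-- How complete and how clean is each candidate column?
-- (Table, columns and the email pattern are illustrative only.)
SELECT COUNT(*)                            AS TotalRows,
       COUNT(Email)                        AS NonNullEmails,
       COUNT(DISTINCT CustomerID)          AS DistinctIds,
       SUM(CASE WHEN Email NOT LIKE '%_@_%._%'
                THEN 1 ELSE 0 END)         AS SuspectEmails
FROM   src.Customer;
```

A gap between `TotalRows` and `DistinctIds`, say, signals a deduplication transformation that no requirements document mentioned.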
This is about code reuse, where being able to use an existing module for a new task results in doing less work. As with much of the advice in this post it requires initial work up-front in order to achieve work saving and time saving downstream. The essence is to keep packages small and focused on specific tasks, and where if there’s some chance of the functionality being useful again elsewhere, to parameterise the package as appropriate, e.g. by passing in connection strings and directory paths as package-level parameters.
Additionally, having connection strings and directory paths as parameters that are passed into the project as a whole not only means less work at design time than smattering the same strings over multiple places, but it obviously results in doing less maintenance work in the future if servers change and packages then need to run against different connections.
Well written SSIS packages result in doing less work because it’s much easier and quicker for a new developer working on the package to understand what’s going on, as well as helping the original developer, who may need to return to it weeks or months after writing it.
Having documentation within the SSIS package – implicit or explicit – is important in many environments, since the alternative of isolating it in electronic or hardcopy documents can be of little use if developers don’t read them or don’t even know they exist. Documents are essential for some purposes – e.g. detailing requirements or mapping out the staging area – but can be less useful with regard to package logic.
For SSIS, good package implementation for understandability will include:
- Using annotations within the control and data flows
- Using a naming convention for task and transformation names; this particularly helps identify what they do when looking at log output. Also add descriptions to tasks and transformations where meaningful
- Keeping packages focused and small in scope, i.e. avoiding spaghetti packages. Keeping the control flow small enough to fit on the screen with little or no scrolling is one useful guideline
- Using sequence containers in the control flow to organise tasks into logical units of work
- Scoping SSIS variables as narrowly as possible – this is implicit documentation that says “this variable is only used by this container and nowhere else”
- Keeping algorithms & SQL minimal and declarative, e.g. via T-SQL window functions or the T-SQL Merge statement
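On the last point, a single MERGE statement keeps an upsert declarative – one statement in an Execute SQL Task in place of separate lookup, conditional-split, update and insert plumbing. A sketch, with hypothetical `dbo.DimProduct` and `stg.Product` tables:

```sql
-- Declarative upsert: update changed rows, insert new ones, in one pass.
MERGE dbo.DimProduct AS tgt
USING stg.Product    AS src
      ON src.ProductCode = tgt.ProductCode
WHEN MATCHED AND tgt.ProductName <> src.ProductName THEN
    UPDATE SET tgt.ProductName = src.ProductName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductCode, ProductName)
    VALUES (src.ProductCode, src.ProductName);
```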
A mantra I live by as a software developer is “Write Less Code”, which falls very neatly under the “Do Less Work” umbrella. Through a methodical and thorough understanding of the requirements and the target environment, an elegant and minimal algorithm often pops out.
Whilst SSIS is primarily a GUI driven environment (excepting custom scripting), there’s a degree of SQL coding required in the Execute SQL Task. A great example of writing less code and doing less work here is through the use of window functions (for T-SQL, or else the equivalent analytic functions for Oracle). The reduction in code and the improvement in efficiency when using these functions can be staggering, not to mention the benefits gained from improved readability & maintainability.
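A typical example, using hypothetical `src.Orders` columns: pick the latest row per customer with `ROW_NUMBER()`, where the pre-window-function version needed a correlated subquery or a self-join.

```sql
-- Latest order per customer, declaratively and in one pass.
WITH Ranked AS (
    SELECT CustomerID, OrderDate, OrderTotal,
           ROW_NUMBER() OVER (PARTITION BY CustomerID
                              ORDER BY OrderDate DESC) AS rn
    FROM   src.Orders
)
SELECT CustomerID, OrderDate, OrderTotal
FROM   Ranked
WHERE  rn = 1;
```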
The emphasis here is on testing units – discrete components of a package such as individual control flow tasks – rather than testing a package or set of packages as a whole. Whilst this generates more work up front during development, over the course of a project’s lifetime it will almost certainly result in doing less work, especially if the unit tests are automated (see here for an example of how this can be done).
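At its simplest, a unit test for a single Execute SQL Task can itself be a T-SQL assertion run after the task executes against known fixture data – a sketch, with hypothetical table, filter and expected count:

```sql
-- Fail loudly if the unit's expected output is not there.
DECLARE @rows INT =
    (SELECT COUNT(*) FROM dbo.DimCustomer WHERE IsCurrent = 1);
IF @rows <> 3   -- the fixture loaded exactly three current customers
    THROW 50001, 'LoadDimCustomer unit test failed: unexpected row count', 1;
```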
There are so many reasons for using source control, but the ability to:
- quickly compare different revisions of code (even if it is the SSIS internal XML format)
- quickly find out which developer made a particular change
- quickly revert to prior versions of a project when required
- prevent accidental concurrent updates to the same package
- easily retrieve repository contents when setting up a new development PC
all result in doing less work.