Auditing ETL Processes

Andrew Smith

In the context of ETL, support for auditing can be driven by a variety of requirements.

In some cases a complete audit trail of all changes made to a data item may be required, such that the path from source to destination can be interrogated in full. In others, a simple check that the number of rows extracted from the source system matches the number of rows loaded into the destination may be sufficient.

In Tim Mitchell’s post on the subject, he generalises audit requirements into two categories:
  • Check for data anomalies beyond simply checking for hard errors
  • Capture and store an electronic trail of any material changes made to the data during transformation

In this post, I’m going to explore a concrete implementation of the second category above, by extending an idea detailed by Ralph Kimball & Joe Caserta in their book The Data Warehouse ETL Toolkit.

Kimball states that one deliverable from the data cleansing process should be the Error Event Table. In actuality this is a star schema that contains a fact table with one row per cleansing exception, and several dimension tables that provide context for these facts. So if one run of an ETL process does 10 cleansing checks and each check finds 10 records needing work, then 100 rows will be written to the fact table.

His particular example is centred on cleansing checks that simply pass or fail a source data item without attempting to transform it, so I’ve extended the overall idea to accommodate transformations that may take place during the cleansing phase to rectify bad data. (I’ve also made some simplifications, so as not to obscure the core point.) My version is:

[Diagram: ErrorEventSchema – the extended Error Event star schema]

So ErrorEventFact has one row per rejected or modified data item, BatchDim has one row per run of the ETL process, and TransformDim has one row per cleansing check or transformation that exists in the ETL process. SourceSystemDim and TableDim identify the source system and table within the source system that the check or transformation refers to (these can be amended for file-based sources).

ErrorRowID within ErrorEventFact is anything that can identify a row in the source, e.g. a primary key. The BeforeValue and AfterValue columns contain the state of a data item within the row, pre- and post-transformation (or are left blank for simple pass/fail ETL checks).
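
To make this concrete, here is a minimal T-SQL sketch of the schema described above. The key names, data types and the audit columns on BatchDim are assumptions for illustration rather than a prescribed design:

    -- Minimal sketch of the extended Error Event schema (types and key names are assumptions)
    CREATE TABLE BatchDim (
        BatchID       INT IDENTITY(1,1) PRIMARY KEY,
        StartTime     DATETIME2 NOT NULL,
        EndTime       DATETIME2 NULL,
        RowsExtracted INT NULL,
        RowsLoaded    INT NULL
    );

    CREATE TABLE SourceSystemDim (
        SourceSystemKey  INT IDENTITY(1,1) PRIMARY KEY,
        SourceSystemName VARCHAR(100) NOT NULL
    );

    CREATE TABLE TableDim (
        TableKey  INT IDENTITY(1,1) PRIMARY KEY,
        TableName VARCHAR(128) NOT NULL
    );

    CREATE TABLE TransformDim (
        TransformKey   INT IDENTITY(1,1) PRIMARY KEY,
        TransformName  VARCHAR(200) NOT NULL,
        ProcessingStep INT          NOT NULL   -- ordering of transforms within each ETL stage
    );

    CREATE TABLE ErrorEventFact (
        ErrorEventID    INT IDENTITY(1,1) PRIMARY KEY,
        BatchID         INT NOT NULL REFERENCES BatchDim (BatchID),
        TransformKey    INT NOT NULL REFERENCES TransformDim (TransformKey),
        SourceSystemKey INT NOT NULL REFERENCES SourceSystemDim (SourceSystemKey),
        TableKey        INT NOT NULL REFERENCES TableDim (TableKey),
        ErrorRowID      VARCHAR(100) NOT NULL,  -- identifies the source row, e.g. its primary key
        BeforeValue     VARCHAR(MAX) NULL,      -- NULL for simple pass/fail checks
        AfterValue      VARCHAR(MAX) NULL
    );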

Lineage is clearly important here: while the above has value on its own in making the effects of an ETL process transparent, a full audit trail will not exist if the final data produced by the ETL cannot be mapped back to information in the error event schema. The ErrorRowID therefore needs to flow forward through the ETL in some form, e.g. appearing as lineage columns in the destination. The ProcessingStep column in TransformDim provides additional support for the audit trail, since it is a numeric ordering of the sequence of transforms within each ETL stage (where parallel transforms can share the same step number).
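
As an illustration, if a destination table carried ErrorRowID as a lineage column, the audit trail for any loaded row could be recovered with a query along these lines. DimCustomer and its columns are hypothetical; the joins assume the key names used in the schema sketch above:

    -- Trace a loaded row back to any cleansing events recorded against it
    -- (in practice the join might also restrict on TableDim/SourceSystemDim
    --  to avoid ErrorRowID collisions across source tables)
    DECLARE @CustomerKey INT = 42;   -- the destination row being audited

    SELECT d.CustomerKey,
           t.TransformName,
           t.ProcessingStep,
           f.BeforeValue,
           f.AfterValue,
           b.StartTime AS BatchStartTime
    FROM   DimCustomer    AS d
    JOIN   ErrorEventFact AS f ON f.ErrorRowID   = d.ErrorRowID
    JOIN   TransformDim   AS t ON t.TransformKey = f.TransformKey
    JOIN   BatchDim       AS b ON b.BatchID      = f.BatchID
    WHERE  d.CustomerKey = @CustomerKey
    ORDER  BY b.StartTime, t.ProcessingStep;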

Clearly the above may be overkill in many situations, and could easily end up complicating logic for little value. For example, using LTRIM() and RTRIM() in an SQL source query or staged update query is a simple, minimal and efficient way of removing excess spaces. Unless audit requirements demand it, there would be little reason to add logic to this operation to find which rows would be affected by the trimming and to log corresponding events in the Error Event schema.
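
For comparison, an audited version of that trim would look roughly like the following, with the extra logging insert being exactly the overhead that the plain LTRIM()/RTRIM() update avoids. The staging table, its columns and the @... variables are illustrative only:

    -- Log the rows that the trim will change, then apply the trim
    -- (DATALENGTH is used because T-SQL's <> ignores trailing spaces when comparing strings)
    INSERT INTO ErrorEventFact (BatchID, TransformKey, SourceSystemKey, TableKey,
                                ErrorRowID, BeforeValue, AfterValue)
    SELECT @BatchID, @TrimTransformKey, @SourceSystemKey, @TableKey,
           CAST(CustomerID AS VARCHAR(100)),
           CustomerName,
           LTRIM(RTRIM(CustomerName))
    FROM   StagingCustomer
    WHERE  DATALENGTH(CustomerName) <> DATALENGTH(LTRIM(RTRIM(CustomerName)));

    UPDATE StagingCustomer
    SET    CustomerName = LTRIM(RTRIM(CustomerName))
    WHERE  DATALENGTH(CustomerName) <> DATALENGTH(LTRIM(RTRIM(CustomerName)));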

When it does fit the bill, it would be easy to design into a solution, at least for SSIS. The control flow could begin and end with Execute SQL tasks: the first creates a new BatchDim row and puts its BatchID in a variable, and the last updates that row with timestamps and row counts. The data flows then just need Conditional Splits that perform the cleansing checks, with failed rows going down a path that inserts an ErrorEventFact row and performs any transformation required. Each cleansing check needs to know its own TransformKey, but aside from that, everything else for the insert can be determined on the fly.
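
As a sketch, the SQL behind those two Execute SQL tasks might be no more than the following; the variable names are assumptions, and in SSIS the row counts would typically be mapped in from variables populated by Row Count transformations:

    -- Start of the control flow: create the batch row and capture its ID
    -- (the result is stored in an SSIS variable for use by the data flows)
    DECLARE @BatchID INT;
    INSERT INTO BatchDim (StartTime) VALUES (SYSDATETIME());
    SET @BatchID = SCOPE_IDENTITY();
    SELECT @BatchID AS BatchID;

    -- End of the control flow: close the batch off with timestamps and row counts
    UPDATE BatchDim
    SET    EndTime       = SYSDATETIME(),
           RowsExtracted = @RowsExtracted,
           RowsLoaded    = @RowsLoaded
    WHERE  BatchID = @BatchID;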

In summary, the Error Event schema is a clear and comprehensive approach to logging significant changes to data, one that will support many kinds of checks and queries regarding the processing and output of the ETL. This applies to formal audit situations as well as to support queries, testing and debugging.

Data Quality Improvement

Aside from providing audit support, the Error Event schema can also be used to monitor source data quality within an organisation, and thus be pivotal in driving quality improvement. It easily supports questions such as the following (a sample query is sketched after the list):

  • Which source systems are generating the most errors?
  • Which ETL checks are generating the most errors?
  • Are there particular source values that are commonly failing checks?
  • Are there more errors on certain days of the week or times during the year?
  • Is the number of source system errors increasing or decreasing over time?
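
For instance, the first question in the list could be answered with a query along these lines (column names follow the sketch earlier in the post):

    -- Which source systems are generating the most errors, month by month?
    SELECT s.SourceSystemName,
           DATEFROMPARTS(YEAR(b.StartTime), MONTH(b.StartTime), 1) AS ErrorMonth,
           COUNT(*) AS ErrorCount
    FROM   ErrorEventFact  AS f
    JOIN   SourceSystemDim AS s ON s.SourceSystemKey = f.SourceSystemKey
    JOIN   BatchDim        AS b ON b.BatchID         = f.BatchID
    GROUP  BY s.SourceSystemName,
              DATEFROMPARTS(YEAR(b.StartTime), MONTH(b.StartTime), 1)
    ORDER  BY ErrorCount DESC;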

In an ideal world the results from such analysis would be used to improve the quality of data within the source systems. This not only benefits the source systems, but also reduces the long-term burden on the ETL and makes the source-to-destination mapping more transparent (and easier to audit), through fewer error-related transformations taking place.