The Logical Data Map

Andrew Smith

The logical data map is a document that shows how source fields map to target fields within an ETL process.

In this post I discuss a template for the logical data map presented by Ralph Kimball in his book The Data Warehouse ETL Toolkit. The template is oriented around star-schema data warehouse ETL, but can easily be adapted for other ETL environments.

The logical data map (LDM) is a high-level specification for the ETL team or developer, and thus should exist before ETL development starts. It can be produced once the data warehouse schema is close to finalisation and the source systems that will feed the warehouse have been identified. In addition to providing a business specification for ETL development, it can also be used by QA to help validate the ETL.
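For instance, QA can turn individual LDM rows into mechanical checks. Here's a minimal sketch, assuming a Python DB-API connection and an LDM row stating that the target column is NOT NULL; the table and column names are hypothetical.

```python
def assert_no_nulls(conn, table: str, column: str) -> None:
    """Fail if the target column contains NULLs, violating the LDM's target constraint."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL")
    null_count = cur.fetchone()[0]
    assert null_count == 0, f"{table}.{column}: {null_count} NULL rows"

# e.g. assert_no_nulls(conn, "dim_customer", "customer_name")
```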

Before creating the LDM, some data profiling should already have been carried out on the source systems. Profiling helps define the required transformations, supplementing business knowledge of what data the source systems 'should' contain.
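As an illustration, here's a minimal profiling sketch in Python, assuming the source table has been extracted to a CSV file; the path and column names are placeholders, and a real project would more likely profile in the source database or with a dedicated tool.

```python
import csv
from collections import Counter

def profile_column(path: str, column: str, sample: int = 100_000) -> dict:
    """Rough profile of one source column: null count, max length, most common values."""
    nulls, max_len, values = 0, 0, Counter()
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if i >= sample:
                break
            value = row.get(column)
            if not value:                     # treat empty strings as NULLs
                nulls += 1
            else:
                max_len = max(max_len, len(value))
                values[value] += 1
    return {"nulls": nulls, "max_length": max_len, "top_values": values.most_common(5)}

# e.g. profile_column("customers_extract.csv", "region")
```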

Given the above, established business logic for transforming from the source(s) to the target(s) can be combined with logic for handling the data anomalies found during profiling. A final input into the required transformations can be the need to conform multiple sources feeding a single dimension or fact table, so that the end product is consistent regardless of origin. (In that case, matching/deduplication may well be required in addition to the data transformations.)
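To make conforming concrete, here's a small sketch: two source systems that encode a gender attribute differently are mapped to a single conformed set of values. The encodings shown are invented for illustration.

```python
# Source system A uses single letters; source system B spells values out.
CONFORMED_GENDER = {
    "M": "Male", "F": "Female",
    "male": "Male", "female": "Female",
}

def conform_gender(raw: str | None) -> str:
    """Map a raw source value to the conformed value, defaulting to 'Unknown'."""
    return CONFORMED_GENDER.get((raw or "").strip(), "Unknown")

assert conform_gender("M") == conform_gender("male") == "Male"
```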

I’ve created a version of Kimball’s LDM template here, with some rows completed.

Although the template is quite simple, it holds a lot of information that can be digested relatively easily. (I added the two 'Constraint' columns; Kimball's original doesn't have them, but I find them useful.)
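As a rough sketch of the template's shape, the dataclass below approximates the columns: Kimball's layout as I understand it, plus the two added 'Constraint' columns. The exact field names are mine, not quoted from the book.

```python
from dataclasses import dataclass

@dataclass
class LdmRow:
    target_table: str
    target_column: str
    target_type: str
    target_constraint: str   # e.g. "NOT NULL" (one of the added columns)
    scd_type: str            # slowly changing dimension type, e.g. "1" or "2"
    source_database: str
    source_table: str
    source_column: str
    source_type: str
    source_constraint: str   # e.g. "NULL allowed" (the other added column)
    transformation: str      # the business rule, in plain language or pseudo-SQL

row = LdmRow(
    "dim_customer", "customer_name", "VARCHAR(50)", "NOT NULL", "1",
    "crm_prod", "customers", "cust_name", "VARCHAR(100)", "NULL allowed",
    "Trim; default to 'UNKNOWN' when NULL; truncate to 50 characters",
)
```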

It spells out, for example, whether (see the sketch below):
  • truncations are possible whilst transforming across string types of different lengths
  • data is going from a column that allows NULLs to one that doesn't, such that default values may need to be added
  • type conversions are taking place, e.g. from floating point decimals to exact numeric types, or from strings to dates
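A minimal sketch of those three cases, with invented column names and lengths: truncating to the target length, substituting a default where a NULLable source feeds a NOT NULL target, and converting floating point values to an exact numeric type.

```python
from decimal import Decimal, ROUND_HALF_UP

TARGET_LEN = 50             # target column is, say, VARCHAR(50); the source is VARCHAR(100)
DEFAULT_REGION = "UNKNOWN"  # default for a NOT NULL target fed from a NULLable source

def transform_row(source_row: dict) -> dict:
    """Apply the column-level transformations that three LDM rows might specify."""
    return {
        # Truncation across string types of different lengths
        "customer_name": (source_row["cust_name"] or "")[:TARGET_LEN],
        # NULLable source -> NOT NULL target: substitute a default value
        "region": source_row["region"] if source_row["region"] is not None else DEFAULT_REGION,
        # Floating point -> exact numeric: quantise to two decimal places
        "unit_price": Decimal(str(source_row["unit_price"])).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP
        ),
    }

print(transform_row({"cust_name": "Acme" * 20, "region": None, "unit_price": 19.999}))
```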

The example rows I've filled in use only simple transformations, whereas in many projects some of the transformations will be more complicated. Consider, for example, a freeform text field in a source system into which dates have been entered as any of "2016-07-27", "27th July '16" or "27 Jul 2016", and which needs to be transformed to a target field of type date.
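Here's a hedged sketch of how that transformation might look, assuming the three formats above are the ones profiling found; a real implementation would grow the format list as profiling uncovers more variants, and route unparseable values to a reject queue rather than guessing.

```python
import re
from datetime import date, datetime

# Formats assumed to have been observed during profiling of the freeform field.
_FORMATS = ("%Y-%m-%d", "%d %B %y", "%d %b %Y")

def parse_freeform_date(raw: str) -> date | None:
    """Coerce a freeform date string to a date; return None if no known format matches."""
    # Normalise: strip ordinal suffixes ("27th" -> "27") and apostrophes ("'16" -> "16")
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", raw.strip()).replace("'", "")
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None

for raw in ("2016-07-27", "27th July '16", "27 Jul 2016"):
    print(raw, "->", parse_freeform_date(raw))
```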

Note that the LDM transformations are confined specifically to how a source data item needs to change to become a target data item. They say nothing about the route by which that change takes place. So it's not the role of the LDM to document that the two source dimensions in my example may first need to be staged to a single initial dimension so that deduplication can take place. Similarly, it doesn't include the process by which slowly changing dimensions are updated, how data items that fail cleansing are handled, or how late-arriving dimensions or facts are dealt with. Specification of the ETL pipeline belongs in another document.

In summary, the Logical Data Map is a very useful device for explicitly stating the mapping between columns in a source system and columns in a destination system. It will be valuable to project team members in many roles, including business analysts, data architects, project managers, ETL developers and QA personnel.