Verifying Minimal Logging for Bulk Load Operations

Andrew Smith

SQL Server provides several options for bulk loading data, both from other tables and from data files.

In this post I briefly discuss each of them, consider which can be minimally logged, and then demonstrate by looking at transaction log contents when minimal logging really is occurring.

Native bulk load mechanisms are as follows:

  • BCP
  • BULK INSERT …
  • INSERT INTO … SELECT * FROM OPENROWSET(BULK …)
  • INSERT INTO … SELECT * FROM [table]
  • SELECT INTO …

SSIS also has two bulk load mechanisms – the Bulk Insert Task and the Fast Load option of the OLE DB Destination. SQL Server Profiler shows that the former just runs the standard T-SQL BULK INSERT command (discussed below), whereas the latter executes an “insert bulk” call:

[Profiler trace: the OLE DB Destination Fast Load executing an “insert bulk” statement]

Unlike BULK INSERT, insert bulk cannot be called directly; it is used under the covers by client library bulk load APIs.

I’m going to start off by recording the time taken by each of the 5 native bulk load mechanisms to load 1 million rows into an empty heap-structured table (with no indexes, constraints or triggers) within a simple recovery model database on a SQL Server 2014 server. This isn’t a strict like-for-like comparison as some commands will be loading from files and others will be loading from tables, but it serves as a useful baseline for what follows.

In each case I run CHECKPOINT; DBCC DROPCLEANBUFFERS; prior to the test to ensure I am starting with an empty buffer pool, and I truncate or delete the target table as appropriate.
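
For reference, the shape of the test harness is sketched below. The table and column definitions here are illustrative rather than the exact ones used in my tests:

-- Illustrative load target: a heap with no indexes, constraints or triggers
CREATE TABLE dbo.LoadTarget
(
    Id    INT,
    Name  VARCHAR(50),
    Added DATETIME
);

-- Run before each test so we start with an empty buffer pool
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

-- Reset the target between tests
TRUNCATE TABLE dbo.LoadTarget;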

BCP

BCP can be used from the command line (or from T-SQL via xp_cmdshell if enabled) to both import and export bulk data. Here I’ve used a basic BCP command line to load from a file into a table, with no performance optimisations specified:

bcp tsql2012.dbo.loadtarget in testdata.txt -T -f testdata.fmt

This took 3.3 seconds.
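
As an aside, since xp_cmdshell was mentioned: the same load could be launched from T-SQL as below, assuming xp_cmdshell has been enabled via sp_configure. The file paths here assume the same locations as the BULK INSERT example that follows:

-- Run the same BCP load from T-SQL; requires xp_cmdshell to be enabled
EXEC xp_cmdshell 'bcp tsql2012.dbo.loadtarget in C:\BulkLoad\TestData.txt -T -f C:\BulkLoad\TestData.Fmt';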

BULK INSERT

BULK INSERT is a T-SQL statement that can be used to import data files. The BULK INSERT equivalent of the prior example is:

BULK INSERT LoadTarget
   FROM 'C:\BulkLoad\TestData.txt'
   WITH (FORMATFILE = 'C:\BulkLoad\TestData.Fmt');

This took 2.2 seconds.

INSERT INTO … SELECT * FROM OPENROWSET(BULK …)

OPENROWSET(BULK …) is very similar to BULK INSERT, with a notable difference being that it returns the file being read as a table. Basic syntax for it is:

INSERT INTO LoadTarget
   SELECT * FROM
   OPENROWSET(BULK 'C:\BulkLoad\TestData.txt', FORMATFILE = 'C:\BulkLoad\TestData.Fmt') AS T1;

This took 2.7 seconds.

INSERT INTO … SELECT * FROM [table]

This bulk load construct uses a table (or view, or table-valued function) as its source rather than a file. Again I'm starting with the basic form, with no TABLOCK or other optimisations:

INSERT INTO LoadTarget
SELECT * FROM TestData

This took 2.2 seconds.

SELECT INTO

Finally we have the following, which also uses a table as its source but this time creates a new table prior to inserting its rows:

SELECT * INTO LoadTarget
FROM TestData

This was noticeably faster at 1.1 seconds.

Minimal Logging

We can now see whether we can get minimal logging for any or all of these methods. Databases need to be running under either the simple or bulk logged recovery model to support this, and as mentioned above we have the former. Several other conditions must also be met, as Microsoft documents in the Data Loading Performance Guide.
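
As a quick sanity check before testing, the current recovery model can be confirmed from sys.databases:

-- Confirm the recovery model of the test database
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'tsql2012';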

I'll be using the query below to check the contents of the transaction log. This query was presented by Microsoft's Peter Hansen in his SQLBits session. The fn_dblog() function it relies on is undocumented and unsupported by Microsoft, so it should not be used as part of any production code, but it is suitable for ad-hoc analysis and is widely referenced (including on Microsoft blogs).

DECLARE @xact_id BIGINT;
BEGIN TRAN

-- -- Bulk Load statement under test -- --
INSERT INTO LoadTarget
SELECT * FROM TestData
-- -- -- -- -- -- -- -- -- -- -- -- -- --

SELECT @xact_id = transaction_id FROM sys.dm_tran_current_transaction;
COMMIT TRAN;

SELECT [Current LSN], [Operation], [AllocUnitName], [Context], [Transaction ID], [Transaction Name], [Xact ID]
FROM ::fn_dblog(NULL, NULL)
WHERE [Transaction ID] = (
        SELECT TOP 1 [Transaction ID]
        FROM ::fn_dblog(NULL, NULL)
        WHERE [Xact ID] = @xact_id);

Sample output is below; what we're really interested in is the number of rows returned, i.e. the number of transaction log records generated:

[Query output: the fn_dblog rows for the test transaction, with the total row count highlighted]

So as it stands, the INSERT INTO statement used in the script, which took 2.2 seconds to run (above), generated 1,022,226 transaction log records. That's definitely not minimally logged, which isn't surprising since minimally logged inserts into heaps require the TABLOCK hint. So if we modify the query to be:

INSERT INTO LoadTarget WITH (TABLOCK)
SELECT * FROM TestData

and run the script again we get:

[Query output: the fn_dblog rows for the TABLOCK insert, now showing far fewer log records]

Phew – quite a difference! 1,022,226 transaction log records reduced to 1,985. The query time reduced from 2.2 seconds to 1.1 seconds also.
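
If only the counts are of interest, the final SELECT in the script can be replaced with an aggregate form like the one below (a sketch; it must run in the same batch so that @xact_id is still in scope):

-- Summarise the test transaction's log records by operation type
SELECT [Operation], COUNT(*) AS LogRecords
FROM ::fn_dblog(NULL, NULL)
WHERE [Transaction ID] = (
        SELECT TOP 1 [Transaction ID]
        FROM ::fn_dblog(NULL, NULL)
        WHERE [Xact ID] = @xact_id)
GROUP BY [Operation]
ORDER BY LogRecords DESC;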

We can now look at the other 4 bulk load methods listed above and see if we can get minimally logged loads, and show with confidence that this really is happening.

If we put the SELECT INTO statement into the above script, we get 2,927 rows. That's a few more than the last case, but it's definitely being minimally logged by default. (Peter Hansen also covers this case in the SQLBits presentation mentioned previously.)

Rerunning the BULK INSERT query from above showed 1,005,048 transaction log writes. If we change it to include TABLOCK:

BULK INSERT LoadTarget
   FROM 'C:\BulkLoad\TestData.txt'
   WITH (FORMATFILE = 'C:\BulkLoad\TestData.Fmt', TABLOCK);

this reduces to 1,920 rows. The time taken reduces from 2.2 seconds to 1.3 seconds also.

The OPENROWSET query also reduced from 1,005,048 to 1,920 rows when adding WITH (TABLOCK) to the INSERT, and its time reduced from 2.7 seconds to 1.4 seconds.
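
For reference, the modified OPENROWSET statement is:

INSERT INTO LoadTarget WITH (TABLOCK)
   SELECT * FROM
   OPENROWSET(BULK 'C:\BulkLoad\TestData.txt', FORMATFILE = 'C:\BulkLoad\TestData.Fmt') AS T1;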

Finally, adding the TABLOCK hint to the BCP load:

bcp tsql2012.dbo.loadtarget in testdata.txt -T -f testdata.fmt -h”TABLOCK”

reduced the number of transaction log writes from 1,053,306 to 7,679, and its load time from 3.3 seconds to 1.2 seconds.

Non-Empty Heaps and Heaps with Indexes

Non-empty heaps behave the same way as empty heaps with regard to minimal logging, in that the TABLOCK hint will result in minimally logged inserts. Heaps having (nonclustered) indexes result in fully logged inserts though, even with TABLOCK set. (Trace flag 610 can be set to work around this, although Microsoft's Data Loading Performance Guide says that for heaps the outcome ultimately “depends on the plan chosen by the optimizer”.)

Disabling (or dropping) indexes prior to a heap load and rebuilding them afterwards will allow for minimally logged heap loads. For large tables, partitioning will be a good solution, since partition switching is minimally logged given certain constraints (an aligned index and a check constraint on the partitioning column). A partition to be worked on can be switched out, have its indexes dropped, have new data inserted, have its indexes rebuilt, and then be switched back in again.
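
A sketch of the disable/rebuild pattern is below (the table names match the illustrative examples above; on a real table the indexes disabled would be whatever nonclustered indexes exist):

-- Disable all nonclustered indexes on the heap before the load
ALTER INDEX ALL ON dbo.LoadTarget DISABLE;

-- Perform the minimally logged load
INSERT INTO dbo.LoadTarget WITH (TABLOCK)
SELECT * FROM dbo.TestData;

-- Rebuild the indexes afterwards
ALTER INDEX ALL ON dbo.LoadTarget REBUILD;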

Summary

We were able to show here when bulk load mechanisms were using minimal logging and when they weren’t, which is a very useful ability to have when troubleshooting problem loads. This was a deliberately simple setup – using an empty heap with no indexes and using the simple recovery model – since the main aim of this post was to show how to track transaction log usage.

Elements of this setup do have applicability for production situations though. The simple recovery model is suitable for data warehouses that just have nightly loads and otherwise remain read-only, and in the SQLBits presentation previously mentioned, Microsoft's Peter Hansen recommended the use of heaps for data warehouse fact tables rather than clustered indexes. This recommendation was largely due to the complexity of achieving minimally logged inserts with clustered indexes. (For SQL Server 2014 and later, clustered columnstore indexes are another table representation option for fact tables.)
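
As an illustration only (the table name here is hypothetical), moving a fact table to a clustered columnstore index on SQL Server 2014 and later is a single statement:

-- Convert a rowstore fact table to a clustered columnstore index (SQL Server 2014+)
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;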