Hans Michiels

Temporal Table applications in SQL Data Warehouse environments

March 7, 2017 by

Today the subject of investigation is the Temporal Table, which is a new feature in SQL Server 2016. My focus will slightly be on how to use it in Data Warehouse environments, but there is some general information passing by as I write.

I want to cover next topics:

  1. What is a temporal table (in short)?
  2. Can I use a temporal table for a table in the PSA (Persistent Staging Area)?
  3. Can I use a temporal table for a Data Vault Satellite?
  4. Is using temporal tables for full auditing in an OLTP system a good idea?

What is a temporal table (in short)?

In short, a temporal table is a table that tracks changes in a (temporal) table to a second “History” table, and this process is managed by SQL Server 2016 itself, you do not have to write extra code for that.

Stated in the free eBook Introducing Microsoft SQL Server 2016 is the following:

“When you create a temporal table, you must include a primary key and non-nullable period columns, a pair of columns having a datetime2 data type that you use as the start and end periods for which a row is valid.”

For more details please read the book.

You can also read more about temporal tables on MSDN.

An example of how a temporal table looks in SQL Server Management Studio. You can see by the icon and the suffix (System Versioned) (both marked red) that this is a temporal table. Underneath the table node is history table is shown (marked with green). Start- and enddatetime are required columns, but you can give them a custom name (marked blue).

There are three ways to create the history table for a temporal table:

  1. Create a temporal table with an anonymous history table: you don’t bother really about the history table, SQL Server gives it a name and creates it.
  2. Create a temporal table with a default history table: same as anonymous, but you provide the name for the history table to use.
  3. Create a temporal table with an existing history table: you create the history table first, and can optimize storage and/or indexes. Then you create the temporal table and in the CREATE statement provide the name of the existing history table.

So the first two are the “lazy” options, and they might be good enough for smaller tables. The third option allows you to fully tweak the history table.

I have used the third option in my Persistent Staging Area, see below.

Can I use a temporal table for a table in the PSA (Persistent Staging Area)?

In my previous blog post – Using a Persistent Staging Area: What, Why, and How – you could read what a Persistent Staging Area (or PSA for short) is.

Today I want to share my experiences on my lab tests using temporal tables in the PSA.

But besides a temporal table, I have also created a “normal” staging table, for loading the data. This is because:

  1. A temporal table cannot be truncated, and because truncate is much faster than delete, I create a normal staging table to load the data from the source.
  2. I want load the source data as fast as possible, so I prefer plain insert instead of doing change detection with the rows currently in the temporal table. This would be slower, and I preferably do that later in parallel with loading the rest of the EDW.
  3. Because I want the PSA to stay optional and not a core part of the EDW. If the PSA is additional to a normal Staging Area, it is easier to switch off later.

Here is the script I used to create the temporal table:

Note: I have not included all the scripts that I used for my test in this article, because it could be overwhelming. But if you are interested you can download all the scripts and the two SSIS testpackages here.

Maybe you are as curious as me to know if using temporal tables for a PSA is a good idea.

Considerations are the following:

  • Speed of data loading.
  • Speed of reading rows at a certain moment in time (time travel mechanism).
  • Ability to adopt changes in the datamodel.
  • Simplicity and reliability of the solution
  • Ability to do historic loads, for instance from archived source files or an older data warehouse.

And of course we need something to compare with. Let that be a plain SQL Server table with a start- and enddatetime.

Before I present you the testresults, I just want to tell a few details about the test:

  • For testing I use a “Customer” table that is filled with half a million rows of dummy data.
  • I simulate 50 daily loads with deletes, inserts and updates in the staging table. After those 50 runs, the total number of rows has more than quadrupled to just over 2.2 million (2237460 to be exactly).
  • For the DATETIME2 columns, I use a precision of centiseconds, so DATETIME2(2). For justification see one of my older blog posts: Stop being so precise! and more about using Load(end)dates (Datavault Series). If needed you can use a higher precision for your data warehouse.
  • For change detection I use a [RowHash] column, which is a MD5 hashvalue of all columns of a row that are relevant for a change (so start- and enddate are not used for the hashvalue). This is done primarily for having a better performance while comparing new and existing rows.
  • I have compared all data in both the Non temporal PSA table and the temporal table with backing history table to check that the rows where exactly the same and this was the case (except for Start- and Enddates).

Speed of data loading

Using T-SQL for synchronizing data from a staging table to a PSA table I got the following testresults:

 Testcase Average duration (50 loads, in ms)
 Synchronize PSA temporal 6159
 Synchronize PSA Non-temporal 24590

Data loading test results

So we have a winner here, it’s the temporal table! It’s four times faster!

Speed of reading rows at a certain moment in time (time travel mechanism)

For reading, I used two views and two SSIS Packages with a time travel mechanism and a data flow task.

The views return the rows valid at a certain point in time, selected from the temporal and non-temporal history table, respectively.

The data flow tasks in the SSIS packages have a conditional split that is used to prevent that the rows actually are inserted into the OLE DB Destination. In this way it is a more pure readtest.

Here are the views that were used:


Example of one of the two SSIS Packages that were used for the readtest.

For measuring the duration I simply used my logging framework (see A Plug and Play Logging Solution) and selected the start- and enddatetime of the package executions from the [logdb].[log].[Execution] table.

Here are the results:

 Testcase Total duration (50 reads, in seconds)
 Read PSA temporal 164
Read PSA Non-temporal 2686

And again, a very convincing winner, it is the temporal table again. It is even 16 times faster! I am still wondering how this is possible. Both tables have similar indexes, of which one columnstore, and whatever I tried, I kept getting the same differences.

Ability to adopt changes in the datamodel

Change happens. So if a column is deleted or added in the source, we want to make a change in the PSA:

  • if a column is deleted, we make keep in the PSA to retain history (and make it NULLABLE when required).
  • if a column is added, we also add a column.

I have tested the cases above plus the deletion of a column for the temporal table.

And yes, this works. You only have to change the temporal table (add, alter or drop column), the backing history table is changed automaticly by SQL Server.

There are however a few exceptions when this is not the case, e.g. IDENTITY columns. You can read more about this on MSDN.

Temporal table with added column “CreditRating”: when added to the temporal table the column is also automaticly added to the backing history table. (I removed some other columns from the picture for simplicity)

But the conclusion is that a temporal table structure can be changed when needed. This is what I wanted to know.

Simplicity and reliability of the solution

Unless you use code generation tools that generate the loading process for you, and the code that comes out is thoroughly tested, I would say the code to keep track of changing using a temporal table is less complex and thus less prone to errors. Especially the enddating mechanism is handled by SQL Server, and that sounds nice to me.

There is however also a disadvantage of using a temporal table: the start- and end-datetime are out of your control, SQL Server gives it a value and there is nothing you can do about that. For Data Vault loading it is a common best practice to set the LoadDts of a Satellite to the same value for the entire load and you could defend that this would also be a good idea for a PSA table.

But, as you might have noticed, my solution for that is to just add a SessionStartDts to the table in addition to the start and end Dts that SQL Server controls. I think this is an acceptable workaround.

By the way, SQL Server always uses the UTC date for start and end datetimes of a temporal table, keep that in mind!

Ability to do historic loads

For this topic I refer to Data Vault best practices again. When using Data Vault, the LoadDateTimeStamp always reflects the current system time except when historic data is loaded: then the LoadDateTimeStamp is changed to the value of the (estimated) original date/time of the delivery of the datarow.

This can be a bit problematic when you use a PSA with system generated start and end dates, at least that is what I thought for a while. I thought this was spoiling all the fun of the temporal table.

But suddenly I realized it is not!

Let me explain this. Suppose you have this staging table SessionStartDts (or LoadDts if you like) for which you provide the value.

Besides that you have the EffectiveStartDts and EffectiveEndDts (or whatever name you give to these columns) of the temporal table that SQL Server controls.

Be aware of the role that both “timelines” must play:

  • The columns that SQL Server controls are only used to select the staging rows at a point in time. They are ignored further down the way into the EDW.
  • The manually set SessionStartDts, which can be set to a historic date/time, is used further down the way into the EDW to do the enddating of satellites and so on.

How this would work? As an example a view that I used for the readtest, which contain both the [SessionStartDts] and the [PointInTimeDts] (for technical reasons converted to VARCHAR). The math to get the right rows out works on the ‘technical timeline’ (SQL Server controlled columns), while the [SessionStartDts] is available later for creating timelines in satellites.

Views used for time travelling

Drawing a conclusion about using temporal tables for a PSA

 Consideration And the winner is ..
 Speed of data loading Temporal table (4 times faster!)
 Speed of reading rows at a certain moment in time (time travel mechanism) Temporal table (16 times faster!)
 Ability to adopt changes in the datamodel Ex aequo (only in exceptional cases changing the temporal table is more complex).
  Simplicity and reliability of the solution Temporal table.
 Ability to do historic loads Ex aequo, if you know what you are doing.

I think there are enough reasons for using temporal tables for a PSA! Do you agree?

Can I use a temporal table for a Data Vault Satellite?

Due to the similarities with a table in the Persistent Staging Area, I think those test results on read- and write performance also hold true for satellites.

However in satellites you cannot get away with the system generated start- and enddatetimestamps when you have to deal with historic loads, unless you do serious compromises on the technical design.

What does not work is removing SYSTEM_VERSIONING temporarily ( ALTER TABLE [psa].[Customer_temporal] SET (SYSTEM_VERSIONING = OFF)) and update the dates then. Because the columns are created as GENERATED ALWAYS this is not allowed.

Besides that, this would be a clumsy solution that still requires manual management of the timeline in a more complex way than when normal satellites were used!

So that leaves only one other solution, which requires – as said – a serious compromise on the technical design.

If you make the use of point in time tables mandatory for every hub and its satellites, you could decouple historical and technical timelines. Using a similar mechanism as the view for time travelling, you could attach the point in time date 2013-02-02 (historical) to the EffectiveStartDts (or LoadDts if you like) of 2017-02-28 06:34:42.98 (technical date from temporal table) of a certain satellite row.

And .. if you follow the holy rule that the Business Vault (in which the point in time tables exist) should always be rebuildable from the Raw Vault, you must also store the historical Startdate as an additional attribute in the satellite, but you exclude it for change detection.

Is it worth this sacrifice in order to be able to use temporal tables?

I don’t know, “it depends”. It feels like bending the Data Vault rules, but at least it can be done, keep that in mind.

Is using temporal tables for full auditing in an OLTP system a good idea?

When auditing is needed due to legislation or internal audit requirements, I certainly think it is a good idea to use temporal tables. They are transparent to front end applications that write to the database and the performance seems quite okay (see above). Obviously the performance will always be a bit worse than non-temporal tables in an OLTP scenario, but that is not unique for temporal tables. Every solution to track history will cost performance.

Conclusion / Wrap up

In this article I discussed some possible applications for the temporal table, a new feature in SQL Server 2016.

And it can be used for PSA (Persistent Staging Area) tables, Data Vault Satellites and tables in OLTP systems. If you know what you are doing, temporal tables can be of great value. That’s at least what I think.

Resources on the web

And again. if you are interested you can download all scripts and SSIS Packages used for my test here, also the ones not published inline in this article.

 
Hans Michiels

Hans Michiels

Hans is an Independent Business Intelligence and Data warehouse Consultant, working in the Netherlands.

He works in the software industry since 1996, with SQL Server since the year 2001, and since 2008 he has a primary focus on data warehouse and business intelligence projects using Microsoft technology, preferably a Data Vault and Kimball architecture.

He has a special interest in Data warehouse Automation and Metadata driven solutions.

* Certified in Data Vault Modeling: Certified Data Vault 2.0 Practitioner, CDVDM (aka Data Vault 1.0)

* Certified in MS SQL Server: MCSA (Microsoft Certified Solutions Associate) SQL Server 2012, - MCITP Business Intelligence Developer 2005/2008, MCITP Database Developer 2005/2008, MCITP Database Administrator 2005/2008

His web site and blog is at www.hansmichiels.com, where you can find other contact and social media details.

View all posts by Hans Michiels
Hans Michiels
SSIS packages

About Hans Michiels

Hans is an Independent Business Intelligence and Data warehouse Consultant, working in the Netherlands. He works in the software industry since 1996, with SQL Server since the year 2001, and since 2008 he has a primary focus on data warehouse and business intelligence projects using Microsoft technology, preferably a Data Vault and Kimball architecture. He has a special interest in Data warehouse Automation and Metadata driven solutions. * Certified in Data Vault Modeling: Certified Data Vault 2.0 Practitioner, CDVDM (aka Data Vault 1.0) * Certified in MS SQL Server: MCSA (Microsoft Certified Solutions Associate) SQL Server 2012, - MCITP Business Intelligence Developer 2005/2008, MCITP Database Developer 2005/2008, MCITP Database Administrator 2005/2008 His web site and blog is at www.hansmichiels.com, where you can find other contact and social media details. View all posts by Hans Michiels

392 Views