Dinesh Asanka
List of database snapshots in SQL Server to perform ETL using database snapshots.

Incremental Data Extraction for ETL using Database Snapshots

January 10, 2022 by

This article is to explain how to perform ETL using database snapshots and how to utilize database snapshots to perform incremental data extraction from an operational database to a data warehouse.

Introduction

As you are aware, ETL (Extract-Transform-Load) operations are one of the most challenging tasks in data warehousing due to many reasons. In the ETL, one of the most difficult tasks is identifying the incremental data set from the sources. In the previous article, Use Replication to Improve the ETL Process in SQL Server, we looked at how to use replication in order to improve the ETL process in the Datawarehouse.

Following is the schematic diagram of the data warehouse and it shows how ETL bridge the operational data sources to the Data warehouse in order to perform various types of data analysis such as OLAP Cubes, Dashboards, Reports, Pivot Tables etc.

Schema diagram of Data warehouse.

As you know, data warehouses extract data from multiple sources that can be databases or any other sources. Since ETL runs in a batch mode, typically once a day, it is important to identify the data that was modified between the last run to the new run. As you know, most of the operational databases will go through many changes and it is essential to extract these modifications that include inserts, updates and deletes.

There are many different options for incremental data extraction depending on the different data sources and we will be discussing how incremental data extraction can be done on SQL Server databases as operational data sources. In this article, we will be specifically discussing how to use Database snapshots in operational data sources for ETL to extract incremental data.

How Database Snapshot works

Before we discuss the implementation of ETL using Database Snapshots, let us see how the database snapshot works and its architecture.

As the name indicates, the database snapshot gives you a state of a database at a given time. So let us see how to create a database snapshot in SQL Server. As of now, there is no user interface provided for creating a database snapshot in SQL Server and you have to use the following script.

In the above script, the TrnsactionSystem is the operational database while the ssTransactionSystem_dbss20211017 is the snapshot database name and this is the database name that you will be calling. FILENAME provides the path for the sparse file. As soon as the above script is executed, there empty spare file will be created as shown in the following figure.

Read operation before 1st page copied to snapshot

Source: MSDN

When there are modifications to the source operational database, the original data page will be moved to the sparse file before the modification as shown in the below figure.

Copy-on-write operation in database snapshot

Source: MSDN

When you read from the database snapshot before modified data is read from the sparse file while the non-modified data will be read from the original data source. By doing so, you can get the data as per the last time of the creation of the database snapshot. These snapshots are available under the snapshot node.

List of database snapshots in SQL Server to perform ETL using database snapshots.

As you can see in the above figure, database snapshots are available and we can query from this database as you do from a typical database without understanding the architect of the database snapshots.

The above query shows the list of databases along with what are the snapshot databases and their source databases.

List of databases including database snapshots.

As you can see, two database snapshots are created from the TransactionSystem database.

Implementation of ETL using Database Snapshots

As in most cases, ETL is a batch mode and often it is executed at midnight. In the proposed method, the ETL process is started after the creation of the database snapshot and the following is the script for it.

The above script will drop if the database snapshot is available and create a new database snapshot. In case you need to schedule ETL more frequently, you need to modify the above query slightly.

Next is to find out how to extract the data for ETL using Database Snapshots. Since we have the snapshot of the previous day and today, it is a matter of comparing two databases as shown in the below script.

As you can see, you can easily detect the inserts and deletes as shown in the following figure.

ETL using Database Snapshots

Similarly, you can find the modifications as well. This data can be moved to a staging environment and then perform necessary Transform and Load operations.

Advantages

One of the major advantages of ETL using Database snapshots is that the duration of the database snapshot creation is extremely negligible as you are simply creating an empty file. This means that whether it is a large or small operational database, the time taken to create a database snapshot is very minimum. Therefore, when we used ETL using Database snapshots do not have an impact on the time duration.

Another important factor with the database snapshot is that, while reading data from the database snapshot, there won’t be any locking on the source data. This is a very vital feature as ETL deals with a large volume of data and typically there will be an additional load on the source data if you directly read them.

Since database snapshots can be created on top of mirror databases, ETL using Database Snapshots can be used on the mirror databases. This is a great option as we can utilize the idle mirror databases for ETL.

Limitations

Though there are important features for ETL using database snapshots, there are a few limitations that you need to worry about. One of the important limitations is that the database snapshot is tightly attached to the source database. Hence you cannot detach the database snapshot from the operational database. This can be visible from the architecture of the database snapshots. This means that while reading data from the database snapshot will lead to additional server performance that will impact the source system.

This mechanism needs at least two database snapshots. In case you did not drop the previously created database snapshot, there will be a performance impact on the operational system as when there are data changes source system has to update all the sparse files of the database snapshots.

Database snapshots have more limitations where they cannot be applied when file stream is enabled in the database. Further, if your database is configured with Full-Text search, again you are unable to configure the database snapshots.

If you are using a database snapshot before the SQL Server 2016 version, you have to use SQL Server enterprise edition. However, from SQL Server 2016 to SQL Server 2019, database snapshots are supported not only in the enterprise edition but also standard, web and even for the Express edition of SQL Server.

Conclusion

In this article, we have discussed a new method to extract data from operational databases. In this article, we have discussed how to use ETL using Database snapshots in order to improve the ETL process. Though you are not able to remove the database snapshot from the source database, in order to extract modified data between the last batches, database snapshots can be used.

Dinesh Asanka
ETL

About Dinesh Asanka

Dinesh Asanka is MVP for SQL Server Category for last 8 years. He has been working with SQL Server for more than 15 years, written articles and coauthored books. He is a presenter at various user groups and universities. He is always available to learn and share his knowledge. View all posts by Dinesh Asanka

168 Views