Two methods for restoring a data warehouse/data mart environment

Implementing best data warehouse designs and practices such as data lineage reduces the need to ever have to restore an entire relational data warehouse. However, sometimes there are instances whereby you have inherited poorly designed data warehouse environments that leaves you with no other options but to perform an entire database restore in an event of a sudden disaster. I recently found myself in a similar situation of having to recover one of my data mart following a data integrity issue wherein all data of a type 1 dimension was updated/overwritten using an incorrect source file. In this article I take a look at how different approaches can be utilised to restore the compromised SQL Server-based data mart back to its “good state”.

Background

Likewise, such a topic is better discussed through practical examples. Thus, below are the details of a fictitious data warehouse that I will refer to throughout this article. Figure 1 indicates my fictitious data warehouse, selectSIFISOBlogs2014, as well as a recently restored backup of the same data warehouse labelled selectSIFISOBlogs2014_BAK.

Figure 1: Fictitious Data Warehouse

Both selectSIFISOBlogs2014 and selectSIFISOBlogs2014_BAK contains dimensions and fact objects that relate to each other in a multidimensional star schema, as shown in Figure 2.

Figure 2: Multidimensional Start Schema

The customer dimension is the compromised object that has recently been updated with an incorrect source. For instance, Figure 3-4 compares the two versions of customer dimension, as it can be seen that whilst the version in the selectSIFISOBlogs2014_BAK contains correct customer name values the version in selectSIFISOBlogs2014 has been updated using method of payment (MOP) source file in such a way that customer name values are now set to either CASH or ELECTRONIC.

Figure 3: Customer Dimension in selectSIFISOBlogs2014_BAK

Figure 4: Customer Dimension in selectSIFISOBlogs2014

Another significant change between the data in the two versions of the data warehouse is the amount of data they each possess. Script 1 makes use of an EXCEPT T-SQL clause to identify differences in the fruit dimension between the two data warehouse environments.

SELECT [FruitKey]

,[FruitName]

,[InsertDate]

FROM [selectSIFISOBlogs2014].[DIM].[Fruit]

EXCEPT

SELECT [FruitKey]

,[FruitName]

,[InsertDate]

FROM [selectSIFISOBlogs2014_BAK].[DIM].[Fruit]

Script 1: Comparison of Fruit Dimensions

The results of Script 1 execution are shown in Figure 5 and they indicate that the fruit dimension in the backup database is missing an entry that contains an apricot fruit.

Figure 5: Script 1 Execution Results

Not surprisingly, as shown in Figure 6 the results of comparing the FruitSales fact table indicates that the selectSIFISOBlogs2014 database has one more transaction relating to the sale of apricots than its counterpart in the backup database.

Figure 6: Comparison of Fact Tables

Approach 1: Full Data Warehouse Restore

Given the discrepancies observed in the two versions of customer dimensions, the simplest way of restoring this dimension to a previous “good state” is by conducting a full database restore of. Figure 7 shows a restore of selectSIFISOBlogs2014 data warehouse environment using the Restore Database wizard in SQL Server Management Studio (SSMS).

Figure 7: SQL Server Restore Database Wizard

Pros and Cons of a Full Data Warehouse Restore

An obvious benefit of this approach is that it is so simple that you don’t need to depend on a data warehouse team to run with the restore operation as even your typical SQL Server DBA can easily perform this operation. Unfortunately, simplicity of this approach can be its very own disadvantage. For instance, with this approach you are likely to lose any changes (i.e. data changes, table structure changes, etc.) that you would have made after the backup was created. This means that in the case of our fictitious data warehouse we are going to lose the information relating to the apricot transaction.

Therefore, although this approach is easy to implement, it may result into some undesirable outcomes (i.e. loss of data). It is therefore recommended that prior to its implementation that you consult with consumers of your data warehouse environment on the potential consequences of this approach.

Approach 2: Partial Data Warehouse Restore

Instead of restoring the entire database, we can also just restore only the affected customer dimension which will ensure that we avoid data loss in the fruit dimension and fruit sales fact table. Unlike its counterpart, the partial data restore approach involves several steps:

Ensure that both the current version and backup version of databases are online

In our case we need to ensure that selectSIFISOBlogs2014 and selectSIFISOBlogs2014_BAK databases are online.
Copy data to replace compromised object(s)

In this step you have to restore the selected table from the backup database. This can be done using SQL Server Integration Services (SSIS) packages, or T-SQL. Another method would be to drop the current version of the customer dimension and recreate it using the SSMS Import and Export wizard as shown in Figure 8-9.

Figure 8: Import Data Wizard

Figure 9: Choosing to copy Customer dimension

Pros and Cons of a Partial Data Warehouse Restore

The benefit of this approach is that the restore will only affect selected data warehouse objects and as a result it will not overwrite the entire environment which could be costly. Unfortunately, this approach has several drawbacks too:

Import data wizard creates heaps

To illustrate this point let’s look at the original definition of the customer dimension in Script 2, as it can be seen, it has several constraints that are crucial to the data consistency and integrity of this dimension.

CREATE TABLE [DIM].[Customer](

[CustomerKey] [int] IDENTITY(1,1) NOT NULL,

[CustomerCode] [varchar](50) NOT NULL,

[CustomerName] [varchar](250) NOT NULL,

[InsertDate] [datetime] NOT NULL DEFAULT (getdate()),

CONSTRAINT [PK_CustomerKey] PRIMARY KEY CLUSTERED

(

[CustomerKey] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

Script 2: Customer dimension script with constraints

However, the import wizard led to a creation of a dimension that is missing default constraints and primary key, as shown in Script 3.

CREATE TABLE [DIM].[Customer](

[CustomerKey] [int] NOT NULL,

[CustomerCode] [varchar](50) NOT NULL,

[CustomerName] [varchar](250) NOT NULL,

[InsertDate] [datetime] NOT NULL

) ON [PRIMARY]

Script 3: Customer dimension script without constraints

An obvious work around to this would be to modify the table after the restore and add those missing constraints back to the dimension or fact table. Another workaround, instead of using the import wizard, would be to restore the selectSIFISOBlogs2014_BAK database on the same/linked server as selectSIFISOBlogs2014; drop and recreate the customer dimension and then use T-SQL (which will include temporarily disabling identity insert) to populate the compromised dimension as shown in Script 4.

SET IDENTITY_INSERT [selectSIFISOBlogs2014].[DIM].[Customer] ON

INSERT INTO [selectSIFISOBlogs2014].[DIM].[Customer]

([CustomerKey]

,[CustomerCode]

,[CustomerName]

,[InsertDate])

SELECT ([CustomerKey]

,[CustomerCode]

,[CustomerName]

,[InsertDate]

FROM [selectSIFISOBlogs2014_BAK].[DIM].[Customer]

SET IDENTITY_INSERT [selectSIFISOBlogs2014].[DIM].[Customer] OFF

Script 4: Insert statement with identity insert

Different resources required

As mentioned earlier this approach is complex and unlike the full restore approach you cannot just rely on a DBA – instead a data warehouse developer will have to be involved to write T-SQL statements or SSIS packages.

Conclusion

In this article we have demonstrated, by using a fictitious data warehouse, the different approaches available to successfully restore a data warehouse environment. The benefits and disadvantages of such approaches were discussed with a recommendation that whenever possible you should inform consumers of your data warehouse environment about any planned changes to the environment.

Sifiso Ndlovu

Sifiso is Data Architect and Technical Lead at SELECT SIFISO – a technology consulting firm focusing on cloud migrations, data ingestion, DevOps, reporting and analytics. Sifiso has over 15 years of across private and public business sectors, helping businesses implement Microsoft, AWS and open-source technology solutions. He is the member of the Johannesburg SQL User Group and also hold a Master’s Degree in MCom IT Management from the University of Johannesburg.

Sifiso's LinkedIn profile

View all posts by Sifiso W. Ndlovu

Background

Approach 1: Full Data Warehouse Restore

Approach 2: Partial Data Warehouse Restore

Conclusion

Reference

Related posts: