Implementing best data warehouse designs and practices such as data lineage reduces the need to ever have to restore an entire relational data warehouse. However, sometimes there are instances whereby you have inherited poorly designed data warehouse environments that leaves you with no other options but to perform an entire database restore in an event of a sudden disaster. I recently found myself in a similar situation of having to recover one of my data mart following a data integrity issue wherein all data of a type 1 dimension was updated/overwritten using an incorrect source file. In this article I take a look at how different approaches can be utilised to restore the compromised SQL Server-based data mart back to its “good state”.
Likewise, such a topic is better discussed through practical examples. Thus, below are the details of a fictitious data warehouse that I will refer to throughout this article. Figure 1 indicates my fictitious data warehouse, selectSIFISOBlogs2014, as well as a recently restored backup of the same data warehouse labelled selectSIFISOBlogs2014_BAK.
Both selectSIFISOBlogs2014 and selectSIFISOBlogs2014_BAK contains dimensions and fact objects that relate to each other in a multidimensional star schema, as shown in Figure 2.
The customer dimension is the compromised object that has recently been updated with an incorrect source. For instance, Figure 3-4 compares the two versions of customer dimension, as it can be seen that whilst the version in the selectSIFISOBlogs2014_BAK contains correct customer name values the version in selectSIFISOBlogs2014 has been updated using method of payment (MOP) source file in such a way that customer name values are now set to either CASH or ELECTRONIC.
Another significant change between the data in the two versions of the data warehouse is the amount of data they each possess. Script 1 makes use of an EXCEPT T-SQL clause to identify differences in the fruit dimension between the two data warehouse environments.
The results of Script 1 execution are shown in Figure 5 and they indicate that the fruit dimension in the backup database is missing an entry that contains an apricot fruit.
Not surprisingly, as shown in Figure 6 the results of comparing the FruitSales fact table indicates that the selectSIFISOBlogs2014 database has one more transaction relating to the sale of apricots than its counterpart in the backup database.
Approach 1: Full Data Warehouse Restore
Given the discrepancies observed in the two versions of customer dimensions, the simplest way of restoring this dimension to a previous “good state” is by conducting a full database restore of. Figure 7 shows a restore of selectSIFISOBlogs2014 data warehouse environment using the Restore Database wizard in SQL Server Management Studio (SSMS).
Pros and Cons of a Full Data Warehouse Restore
An obvious benefit of this approach is that it is so simple that you don’t need to depend on a data warehouse team to run with the restore operation as even your typical SQL Server DBA can easily perform this operation. Unfortunately, simplicity of this approach can be its very own disadvantage. For instance, with this approach you are likely to lose any changes (i.e. data changes, table structure changes, etc.) that you would have made after the backup was created. This means that in the case of our fictitious data warehouse we are going to lose the information relating to the apricot transaction.
Therefore, although this approach is easy to implement, it may result into some undesirable outcomes (i.e. loss of data). It is therefore recommended that prior to its implementation that you consult with consumers of your data warehouse environment on the potential consequences of this approach.
Approach 2: Partial Data Warehouse Restore
Instead of restoring the entire database, we can also just restore only the affected customer dimension which will ensure that we avoid data loss in the fruit dimension and fruit sales fact table. Unlike its counterpart, the partial data restore approach involves several steps:
Ensure that both the current version and backup version of databases are online
In our case we need to ensure that selectSIFISOBlogs2014 and selectSIFISOBlogs2014_BAK databases are online.
Copy data to replace compromised object(s)
In this step you have to restore the selected table from the backup database. This can be done using SQL Server Integration Services (SSIS) packages, or T-SQL. Another method would be to drop the current version of the customer dimension and recreate it using the SSMS Import and Export wizard as shown in Figure 8-9.
Figure 8: Import Data Wizard
Figure 9: Choosing to copy Customer dimension
Pros and Cons of a Partial Data Warehouse Restore
The benefit of this approach is that the restore will only affect selected data warehouse objects and as a result it will not overwrite the entire environment which could be costly. Unfortunately, this approach has several drawbacks too:
Import data wizard creates heaps
To illustrate this point let’s look at the original definition of the customer dimension in Script 2, as it can be seen, it has several constraints that are crucial to the data consistency and integrity of this dimension.123456789101112CREATE TABLE [DIM].[Customer]([CustomerKey] [int] IDENTITY(1,1) NOT NULL,[CustomerCode] [varchar](50) NOT NULL,[CustomerName] [varchar](250) NOT NULL,[InsertDate] [datetime] NOT NULL DEFAULT (getdate()),CONSTRAINT [PK_CustomerKey] PRIMARY KEY CLUSTERED([CustomerKey] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]) ON [PRIMARY]
Script 2: Customer dimension script with constraints
However, the import wizard led to a creation of a dimension that is missing default constraints and primary key, as shown in Script 3.12345678CREATE TABLE [DIM].[Customer]([CustomerKey] [int] NOT NULL,[CustomerCode] [varchar](50) NOT NULL,[CustomerName] [varchar](250) NOT NULL,[InsertDate] [datetime] NOT NULL) ON [PRIMARY]
Script 3: Customer dimension script without constraints
An obvious work around to this would be to modify the table after the restore and add those missing constraints back to the dimension or fact table. Another workaround, instead of using the import wizard, would be to restore the selectSIFISOBlogs2014_BAK database on the same/linked server as selectSIFISOBlogs2014; drop and recreate the customer dimension and then use T-SQL (which will include temporarily disabling identity insert) to populate the compromised dimension as shown in Script 4.12345678910111213141516SET IDENTITY_INSERT [selectSIFISOBlogs2014].[DIM].[Customer] ONINSERT INTO [selectSIFISOBlogs2014].[DIM].[Customer]([CustomerKey],[CustomerCode],[CustomerName],[InsertDate])SELECT ([CustomerKey],[CustomerCode],[CustomerName],[InsertDate]FROM [selectSIFISOBlogs2014_BAK].[DIM].[Customer]SET IDENTITY_INSERT [selectSIFISOBlogs2014].[DIM].[Customer] OFF
Script 4: Insert statement with identity insert
Different resources required
As mentioned earlier this approach is complex and unlike the full restore approach you cannot just rely on a DBA – instead a data warehouse developer will have to be involved to write T-SQL statements or SSIS packages.
In this article we have demonstrated, by using a fictitious data warehouse, the different approaches available to successfully restore a data warehouse environment. The benefits and disadvantages of such approaches were discussed with a recommendation that whenever possible you should inform consumers of your data warehouse environment about any planned changes to the environment.
- Temporarily disable identity column of a table
- Top 10 Best Practices for Building a Large Scale Relational Data Warehouse
- EXCEPT T-SQL Clause