Dinesh Asanka
Sample data output for Slowly Changing Dimension of DimEmployee

Testing Type 2 Slowly Changing Dimensions in a Data Warehouse

May 30, 2022 by

This article will cover testing or verification aspects of Type 2 Slowly Changing Dimensions in a Data Warehouse.

Introduction

As we discussed in a previous article, Implementing Slowly Changing Dimensions (SCDs) in Data Warehouses, there are three main types of slowly changing dimensions, such as Type 1, Type 2, and Type 3. Out of these Type 1 is the simple dimension where you will simply maintain only the latest version of the attribute. For example, if the employee got promoted to Senior Software Engineer from Software Engineer, you will simply overwrite the existing value to the new value so that the historical aspect is lost.

Type 2 Slowly Changing Dimensions are used to track historical data in a data warehouse. This is the most common approach in dimension. This article uses a sample database of AdventureworksDW which is the sample database for the data warehouse.

Following is the schema of the DimEmployee dimension table in the AdventureworksDW database.

Table structure of DimEmployee Dimension table.

The following figure shows the sample dataset for Type 2 Slowly Changing Dimensions.

Sample data output for Slowly Changing Dimension of DimEmployee

If you further examine record number 4 and 5, you will observe that it is the same employee who has two records. This is due to the fact any attribute has changed and to track the historical data, in type 2 Slowly Changing Dimensions, the new record is added with the status current, and the previous record is set to NULL.

Having understood the basics of Type 2 Slowly Changing Dimensions, now let us look at quality assurance techniques for type 2 Slowly Changing Dimensions.

Scenario 1: Only one record of the current status record for each employee. This can be achieved by running the following query.

In the above query, Status is labeled as Current but this can be different in different implementations. EmployeeNationalIDAlternateKey is the business key for the Employee. Depending on the label you need to modify the above query.

Ideally, you should not receive any records for the above query and if there are records, the following query will provide you with the details of those records.

Scenario 2: Should not have the same version for successive records of each business key.

As we discussed at the start, a new record will be added to the dimension table when the status is changed. This scenario is only valid for successive records. This is an important scenario though this is not visible to the end-user. However, if there are unnecessary records in the dimension table, there can be performance issues. Therefore, it is important to identify these records are remove them. However, in every dimension there are internal columns such as surrogate keys, status, start date, end date, etc, that should not be matched. The following script is used to achieve the above scenario.

In the above script, the ranking function is used to match successive records. Each column is matched and if there are any matching records they will be listed out as below.

Duplicated versions for DimeEmployee

The above figure shows that there are six records with duplicate versions which should be corrected. However, this script is not feasible for practical dimensions as some dimensions will have more than 100 columns. It is not easy to type all those 100 columns. Therefore, the following script is written to find duplicate successive dimension records in a dimension bypassing a few parameters.

In the following script, you need to pass the Dimension Name, its schema name, business key, surrogate key, and the internal columns.

The above script will generate a script for the validation.

Scenario 3: Finding mismatched records that should not be modified

In some dimensions, some attributes should not be changed. For example, in the Employee dimension Names, date of birth, and joined dates should not be changed. Ideally, if there are changes, they should be overwritten. The following script can be used to achieve the above objectives.

Conclusion

In this article, we looked at how to perform testing on Type 2 Slowly Changing Dimensions to improve the performance of the data warehouse queries and other operations. This article looked at three scenarios and scripts were identified.

Dinesh Asanka
168 Views