This article helps you to understand the usage of the Fuzzy Lookup Transformation in SQL Server Integration Services (SSIS).
SQL Server Integration Services (SSIS) is said to be a zero-code tool that can be used to integrate data from multiple sources. If you are working with multiple data sources, there can be cases where there are issues with data integration.
Let us look at this by means of an example. There was a requirement to integrate two systems to generate reports. Those two systems are Production and Human Resource Management. Due to the fact that this was built by two different vendors, these two cannot be integrated technically. The production system has details of the 150 workers and the human resource management system has all the details of all the employees. Since Employee ID cannot be used to match employee, employee name was selected to match these employees. However, due to the fact that these were differently managed systems, out of the 150 workers, only 125 workers were able to match. Those employees’ names were matched with the exact name. The rest of the 25 employees cannot be matched because there are small spelling, space issues. However, for the successful analysis, it is important to match these employees.
In SSIS, typically, we use Lookup transformation for exact matching. However, we need more than a standard lookup transformation to match these partial matching columns. To match these types of partial matching scenarios, there are two controls is SSIS toolset. There are Fuzzy Lookup and Fuzzy Grouping transformations in SSIS to support these types of scenarios. In this article, let us see how we can use Fuzzy Lookup in SSIS to solve the above problem.
I assume you are familiar with the basics of SSIS and also know how to create an SSIS package. In case you are not, you can check these tutorials here on Microsoft Documentation before proceeding with this article.
First, let us see how we can use standard Lookup transformation. Let us see the following as the sample reference data set.
Let us see the following as the input data.
This data has to be matched with the reference data. Let us create an SSIS project from SQL Server Data Tools (SSDT). Lookup transformation was configured as the following screenshot.
In the above SSIS package, Data Flow control was configured in the SSIS control flow as above. The Flat file was configured to a Flat File source. Since Lookup needs the same data type, data from the Flat file is converted to nvarchar data type in the Data Conversion task.
In the Lookup control, the configuration is done so no match records will be set to a different path. The following records will not be matched.
However, if you closely analyze mismatch records with the reference data set, you will see that they are not matching, purely because there are few typing mistakes rather than they are completely missing. Therefore, it is better if we can match them using any other techniques such as Fuzzy Lookup in SSIS.
Fuzzy Lookup in SSIS
Let us see how fuzzy lookup can be configured.
First, it has to be a reference table. Though you can use a query in a lookup, in a fuzzy lookup in SSIS, it has to be either a table or a view.
An index is used to store the tokenization values. We will discuss the concept of tokenization later. If it is a large table, better to store an index in a table. If the reference table is a static table, which is not changing, you can use an existing index. However, since you are not sure about data changes, it is better to configure, as shown in the above screenshot.
If you have selected Store new index option above, you would have to specify whether you also want SQL Server to maintain stored index. To do that, make sure to enable CLR using the following code.
EXEC sp_configure 'clr enabled', 1;
The next step is to define the matching columns from the below screen.
Since this is fuzzy matching, these columns have to be either varchar or nvarchar data types. However, both columns should be of the same data type. In this example, since FullName is nvarchar, the input column data type should be converted to nvarchar.
The advanced tab offers further configurations to the Fuzzy Lookup.
Since these are not full match attributes, there can be multiple columns that are matching with the reference table. You can define the Maximum number of matches to output per lookup setting. By default, this is 1.
Similarity Threshold defined how well it is matching. Typically this threshold is set to a value between 0.75 to 0.80. If this value is close to 1, the attribute is closely matching. Token delimiters will define the tokenization process. Tokenization is a step that splits the text into smaller words or tokens. Larger strings can be tokenized into words, so that fuzzy lookup is done on the tokenized text words.
Now let us execute this SSIS package and verify the results as shown in the below screenshot.
If we enable the data viewer after the fuzzy lookup, you will see the following data viewer.
In this data viewer, you will see that the input Name and reference table column, in this case, it is Full Name.
When the _Similarity is 1, it is exactly a match. Confidence defines the quality of the match.
Next is to work with these confidence levels. We can use a Conditional Split Transformation to define a few rules, as shown in the below screenshot.
Here we have defined, it is fuzzy matched when the confidence is more than 0.5 and if the confidence is between 0.25 to 0.5, it is not sure, but it is likely a match. Similarly, we have defined that if the Confidence is less than 0.25, it is not a match. These values may change depending on the data set that you will be using. It is better to set these values as package configurations so that it can be changed when necessary.
Following is the total implementation of the Fuzzy Lookup and Conditional Split transformation in SSIS, as shown in the below screenshot.
Depending on the condition, you may choose different operations. For example, if it is a fuzzy match, you can consider it as a correct match. If it is a no match, it will be an error. Likely, Match means that manual intervention is needed.
Since the fuzzy lookup is a performance incentive task, it is better to use the transformation for only the required data. Therefore, it is better if we can use this fuzzy lookup transformation only for the data, which is not possible with the standard lookup transformation, as shown in the following screenshot.
As you can see in the above screenshot, a standard lookup is done on the data set initially. Only the no matched data is processed through the Fuzzy Lookup is SSIS. Then both streams of data were combined using the Union ALL task.
Fuzzy Grouping Transformation
Fuzzy Grouping transformation is used to group the data within the same data set rather than as a matching technique. For example, if you get a list of employees in text files, within the text files, there can be the same name duplicated but with different spellings. Fuzzy Grouping technique can be used to find the same name in the same list.
- We have used the only attribute for comparison. Since this is partial matching, it is better to use as much as attributes possible. You will get the similarities for each column
- When deciding what columns to match, make sure to choose columns that have a number of distinct values rather than more common values
- Though in standard queries, NULL is not equal to another NULL, in case of fuzzy matching, it is equal. Therefore, it is important to avoid NULL values in both the input and output
Data Cleansing is an important technique in data analysis as non-cleaned data may result in wrong decisions. The Fuzzy Lookup in SSIS is a sophisticated method that can be used to clean data. However, there are performance issues with Fuzzy Lookup that has to be used accordingly.
- Incremental Data Extraction for ETL using Database Snapshots - January 10, 2022
- Use Replication to improve the ETL process in SQL Server - November 4, 2021
- Text Classification in Azure Machine Learning using Word Vectors - October 1, 2021