Hadi Fadlallah

An overview of DIFFERENCE and SOUNDEX SQL functions

January 18, 2022 by

This article gives an overview of DIFFERENCE and SOUNDEX SQL Server built-in system functions. It explains how to use those functions and how do they work.

Introduction

Soundex is a phonetic algorithm developed by Robert C. Russell and Margaret King Odell in the early 1900s. This algorithm is used to index names as they are pronounced in English. The main goal of such an algorithm is to encode homophones to the same representation to be matched even if there are some slight spelling differences. As an example, consider the names “Smith” and “Smyth”, or “Mohamad” and “Mouhammad”. Soundex mainly encodes consonants and only encodes a vowel if it is the first letter of the name.

Being one of the most popular phonetic algorithms, Soundex was implemented in multiple database engines such as Oracle, SQL Server, MySQL, SQLite, and PostgreSQL.

In the following sections, we will illustrate the Soundex SQL Server functions.

SOUNDEX SQL Server built-in function

In SQL Server, SOUNDEX() is a scalar function that takes a string value as input and returns a four-character string. As mentioned before, the returned string by the Soundex SQL function is generated based on the way the input string is spoken in English.

  • The first character of the code is the first character of the input string, converted to upper case.
  • The remaining characters of the code are numbers that represent the letters in the expression.
  • Vowels and the “H” and “W” characters are ignored except if they are the first letter of the input string.
  • If the input string length is less than four, the Soundex function adds additional zeros to the returned value.

Soundex SQL function is mainly used as a fuzzy matching technique for data integration purposes. It is a collation-sensitive function.

As mentioned in the official documentation, before SQL Server 2012, the Soundex SQL function only applied a subset of the Soundex algorithm rules. In SQL Server 2012, more rules were implemented. Besides, the following restrictions will be applied once using a database compatibility level greater than or equal to 110 (SQL Server 2012):

  • A heap (table without a clustered index) that contains a persisted computed column defined with SOUNDEX cannot be queried until the heap is rebuilt by running the statement ALTER TABLE <table> REBUILD.
  • CHECK constraints defined with SOUNDEX are disabled upon upgrade. To enable the constraint, we should run the statement ALTER TABLE <table> WITH CHECK CHECK CONSTRAINT ALL.
  • Indexes (including indexed views) that contain a persisted computed column defined with SOUNDEX cannot be queried until the index is rebuilt by running the statement ALTER INDEX ALL ON <object> REBUILD.

Examples

To illustrate how the Soundex SQL functions work, we will run some examples. First, let us run the following query:

The image below shows that the Soundex function returns the same value (H000) for both “H” and “Ha” input strings. As we can see, the “a” is ignored, and three additional zeros were added to the result.

Besides, “Hd”, “Had”, and “Hadi” produce the same output (H300) since the “a” and “I” vowels as ignored, and two additional zeros are added.

Soundex SQL function example

Figure 1 – Soundex function example

Let us try to pass multiple words to the Soundex function:

The image below shows that “Hadi” and “Hadi Fadlallah” produce the same output value (H300) and characters after the space is ignored. While removing the space from the name will change the output value to “H313”. Besides, the output value of “HadiFadlallah” is the same as the output value of the first four consonants (“Hdf”).

Another example of the Soundex SQL function

Figure 2 – Another example of the Soundex function

DIFFERENCE

DIFFERENCE is a built-in scalar function used to measure the similarity of two strings using the Soundex SQL function. First, SOUNDEX() is applied to each input, and then a similarity check is done over these results. This function returns an integer value between 0 and 4. When this value is closer to 4, then inputs are very similar.

Examples

To illustrate how the Difference() SQL functions work, we will run some examples. First, let us run the following query:

The image below shows that values having the same Soundex output such as “H” and “Ha” or “Hd”, “Had”, and “Hadi” has a similarity score of 4. While “Ha” and “Had” produce a lower similarity score.

Difference SQL function example

Figure 3 – Difference SQL function example

Besides, the following example shows that adding spaces will make the Difference() function ignore all characters to the right.

As we can note, “Hadi” and “Hadi Fadlallah” have a similarity score of 4, while removing the space from “Hadi Fadlallah” will decrease the score to 2.

Second Difference SQL function example

Figure 4 – Second example of the Difference SQL function

Now, let us try to compare two of the “MSSQL” word with a similar word like “MYSQL” and another word like “POSTGRES”:

The image below shows how the similarity score is high for both “MYSQL” and “MSSQL” since only one character differs between both words, while “MSSQL” and “POSTGRES” have a very low score since they are very different.

Third example of the Difference SQL function

Figure 5 – Third example of the Difference SQL function

Summary

This article briefly explained the Soundex algorithm and how it is implemented in SQL Server. Besides, it illustrated another built-in function named Difference which uses the Soundex SQL function to calculate the similarity between two different input strings.

Hadi Fadlallah
SQL commands, String functions, T-SQL

About Hadi Fadlallah

Hadi is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He's one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com . Also, he published several article series about Biml, SSIS features, Search engines, Hadoop, and many other technologies. Besides working with SQL Server, he worked with different data technologies such as NoSQL databases, Hadoop, Apache Spark. He is a MongoDB, Neo4j, and ArangoDB certified professional. On the academic level, Hadi holds two master's degrees in computer science and business computing. Currently, he is a Ph.D. candidate in data science focusing on Big Data quality assessment techniques. Hadi really enjoys learning new things everyday and sharing his knowledge. You can reach him on his personal website. View all posts by Hadi Fadlallah

4,676 Views