Rahul Mehta
Sample Adventureworks customer table

Finding Duplicates in SQL

February 7, 2024 by

This article will provide an understanding of identifying duplicate values in SQL.

Introduction

SQL Server is one of the most widely used databases at the current time. Many huge organizations have been using it for decades now. Many Datawarehouse banks on storing data in SQL and processing it for advanced analytics and insights. Even visualization layers like Power BI also supports the greatest number of features and functionalities with SQL server. One of the common issues that come with any kind of data storage is the “Unnormalized form” of data. In simpler words, it deals with raw and unmanaged data which can have a lot of DUPLICATES in it. Let’s understand it with the below examples:

Sample Table with multiple duplicate values

How to Identify Duplicates:

Now let’s say we have an informative table of “People Records”. We need to find duplicates with a combination of first name and last name. As you can see in the above table, we have two people with the name “Mike White”. However, they are NOT duplicates as we can see from age and birthplace.

However, they are duplicates in the following ways:

  1. Number of occurrences of more than 1
  2. Number of occurrences of more than 1 for a limited set of metadata
  3. Number of occurrences of more than 1 for a limited set of metadata for a single or combination of multiple columns

Let us understand point number 3 a little more elaboratively.

Single Column Duplicates:

Sample Table with multiple duplicate values

In the above image, let’s take the “First Name” column as an example. There are a total of five records in which records with “Mike” has three occurrences. At times, there would be a need to find out several records with many occurrences like:

Summary table showing occurrences of First Name

These can be called “Single Column Duplicates” with their occurrences count.

Multiple Column Duplicates:

Sample Table with multiple duplicate values

Taking the same example shown above, there will be a need at times to find duplicates with a combination of multiple columns. In the above example though there are three occurrences of “Mike”, however, there are only two occurrences of “Mike White” with a combination of “First Name” and “Last Name”.

SQL provides multiple ways to find out duplicates in a single column or multiple records. Below are the three ways:

DISTINCT and COUNT:

Distinct is a function provided by SQL to get the distinct values of any given column in SQL tables. COUNT is a function that gives the count of the number of records of a single or combination of columns. When DISTINCT and COUNT are used together, you can find out whether there are duplicates in the column or not.

Pros:

Easier ways to find out duplicates in column

Cons:

Lack of options to show it with the number of occurrences.


Lack of options to show other metadata of records with occurrences.

Let’s take an example. I have created a database and I am using the “AdventureWorks” sample database as shown below (a few records):

Sample Adventureworks customer table

Now let’s say, I want to find whether FirstName has duplicate records or not, then below steps must be followed:

  1. Find the Total Count of FirstName

    To do so, we can use the below query:

    Showing total number of Firstname in customer table

    As shown, in the above image shows 847 as the count of records

  2. Find a Distinct Count of FirstName

    Before we get a distinct count, I would like to show how the DISTINCT function gives unique values of a column. To do so, use the below query:

    Showing unique firstnames in customer table

    In the above image, we can see various values being pulled in. These all are unique “FirstNames”. However, we aim to find out a total of such unique names. To do so, you can write the below query:

    Showing count of unique first names in customer table

    We can see there are only 315 unique First Names.

  3. Subtract the Total Count of FirstName from the Distinct Count of FirstName from

    Now the last step is the easiest to subtract the total from distinct. To do so, write a query as below:

    Showing total number of duplicates first names in customer table

    As per the above result, we can see there are 532 duplicate records in total. However, the challenge is in case of we need to know how many duplicates exist in each duplicate record, then it would be a little complex to write a query. There is where our second options come into the picture

GROUP BY AND COUNT

Group By is a function that groups all the unique values of single or multiple columns and gets the result. It is a way of consolidating and summarizing unique data to get a unique summary. COUNT function as well saw earlier provides the number of occurrences; however, it differs a little when used with the Group By function. When Count is used with the Group By function, it provides the number of occurrences individually for Each unique value.

Pros:

Easily able to get unique and duplicate records with individual occurrence count

Cons:

A little complex to get the additional metadata of duplicate records

Let’s take the sample example we were using earlier. We will use the same AdventureWorks database and customer table to fetch all unique “FirstName” along with their duplicate counts. To do so, please write below the query

Showing all unique first name with total number of occurrences

Below is a subset of all the unique “FirstNames” along with their count. We can see many records have more than 1 total occurrence. However, there are records like “Alberto” which has only 1 occurrence which means it doesn’t have any duplicate records. To exclude such records for us to see the actual duplicate records, we need to use the Having keyword with Group By clause. Having a keyword is as equivalent to the Where Clause in which you mention your criteria. In our current example, we will mention criteria as the count of occurrences should be greater than 1. To do so, please update the query to:

Showing only duplicate first names with total number of occurrences

We can see now in the above results that only the records which have multiple occurrences appear in the results. This will be the way to find the duplicate values for a single column. In case you want to see unique records, then all you need to do is change the having clause to meet the count of records to 1.

At times, architects use multiple columns to define a sub-primary key or unique identifier of a table. For example, let’s say in the below table a combination of “FirstName + LastName” makes unique records. So sometimes, you need to find duplicates with a combination of FirstName and LastName. To do so, simply extend the Group By clause query as below:

Showing only duplicate first names and last anmes with total number of occurrences


Few things to observe in the above results:

  • The duplicate records are considered with a combination of “FirstName” and “LastName”
  • In the query, we must mention all the column names in the “Group By” clause which needs to be in combination
  • When Count(*) is mentioned in the Having clause, it means it will take a count of a combination of all the columns mentioned in the Group By clause.

Thus, in this way we can use Group By clause could be used to find duplicates in multiple columns. However, still, let’s say we need to get additional metadata along with duplicate values, it will be a little hard in Group By clause. That is where ROW_NUMBER functions come into the picture.

ROW_NUMBER

The ROW_NUMBER() is a window function that assigns a sequential integer to each row within the partition of a result set. In our scenario, we need to use this function to bifurcate the columns we want to find duplicates for and mention the rest of the columns as additional metadata to provide more information. To do so, use the below query

Showing only duplicate first names and last names with total number of occurrences and additional columns

Few points to observe:

  1. The result set shows additional metadata along with the number of occurrences in it.
  2. We have used ROW_NUMBER Function to get the partition data where the combination of the FirstName and LastName columns has more than 1 occurrence
  3. ROW_NUMBER function should have:
    1. An Over clause mentioning the partition
    2. An Order BY clause mentioning the column names
    3. An Alias name for the entire subset to match the count

Please note that we have used all the options independently. But we can use it with different combinations in more complex scenarios.

Conclusion

In this article, we have learned:

  • Different ways to find duplicate values in SQL.
  • How to use the DISTINCT, GROUP BY, COUNT, and ROW_NUMBER functions.
  • How to use these functions in query, functions, and other objects to get unique values.
Rahul Mehta
Latest posts by Rahul Mehta (see all)
T-SQL

About Rahul Mehta

Rahul Mehta is a Software Architect with Capgemini focusing on cloud-enabled solutions. He works on various cloud-based technologies like AWS, Azure, and others. He has worked internationally with Fortune 500 clients in various sectors and is a passionate author. View all posts by Rahul Mehta

168 Views