Frank Solomon
The T-SQL PARTITION BY clause in a ROW_NUMBER() function.

Lever T-SQL to handle duplicate rows in SQL Server database tables

April 24, 2020 by

Duplicate rows in a SQL Server database table can become a problem. We will see how we can find and handle those duplicate rows using T-SQL in this article.

Introduction

We might discover duplicate rows in a SQL Server database table. This article will describe T-SQL techniques to find them, and if necessary, remove them.

Sample Database

For this article, we’ll focus on a sample database called OFFICE_EQUIPMENT_DATABASE, Built-in SQL Server 2014 Standard Edition, on an updated Windows 10 PC. The database has this overall structure:

The OFFICE_EQUIPMENT_DATABASE, as seen in the SQL Server Object Explorer.

This database has no stored procedures, no user-defined functions, etc. It has one table, called OFFICE_EQUIPMENT as shown here:

The OFFICE_EQUIPMENT table columns.

The following script will build the OFFICE_EQUIPMENT_DATABASE database, its table, and the rows in that table:

This script places the database LDF and MDF files in the C:\OFFICE_EQUIPMENT_DATABASE directory. We can manually create this directory, but the screenshot below shows the T-SQL call to an extended stored procedure xp_create_subdir at line 19. We can create the directory if we manually “paint” and execute this line. As an extended stored procedure, Microsoft could scrap xp_create_subdir at any time, but the above script includes it as an option. Of course, we can uncomment line 19 before running the script.

In the OFFICE_EQUIPMENT_DATABASE T-SQL creation script, line 19 can create the C:\OFFICE_EQUIPMENT_DATABASE directory

The Table Data

Run the below statements to see all of the OFFICE_EQUIPMENT table rows, as seen in this screenshot:

The OFFICE_EQUIPMENT table rows.

Every row has a unique OFFICE_EQUIPMENT_ID value. In a way, the table has unique rows because the OFFICE_EQUIPMENT_ID column operates as a primary key field in this table. If we ignore this column, however, we can clearly see that many rows have duplicate data shared across other columns. For example, this code returns rows with duplicate non-ID values:

This screenshot shows the result set:

Duplicate rows in the OFFICE_EQUIPMENT table.

This code returns another result set with similar behavior:

This screenshot shows the second result set:

T-SQL query to show duplicate rows in the OFFICE_EQUIPMENT table.

Simply looking for duplicate rows by eye might work for a sixteen-row table, but it won’t work for a table with thousands, or millions, of rows. We need a better technique.

Identify Duplicate Data Rows

In an earlier SQL Shack article Overview of the SQL ROW_NUMBER function, Prashanth Jayaram explored the SQL Server ROW_NUMBER() function. This function will become an important part of the solution we’ll build. When we use the T-SQL ROW_NUMBER() function, we must also add a certain SQL clause to separate, or partition, a query result set. This partition separates the result set rows into at least one defined subset of rows, only for use by the ROW_NUMBER() function. The partition process might produce one subset that covers all the rows – that’s OK. ROW_NUMBER() needs this separation to support its row numbering operations. The partitioning does not change the data in any way. Prashanth explained that the ROW_NUMBER() function will set a sequential number, in a separate column, for every row in the row subset partition. The OVER() clause placed right after the ROW_NUMBER() function handles the partitioning task for ROW_NUMBER(). In this T-SQL query, the ROW_NUM column sequentially numbers every row:

This screenshot shows the result set:

The OFFICE_EQUIPMENT table rows, with ROW_NUMBER() values in the ROW_NUM column.

Microsoft explains that the OVER() clause “ . . . determines the partitioning and ordering of a rowset . . .” Its syntax offers many options, but to solve our specific problem, we’ll use the ORDER BY clause with the ROW_COUNT() function. The ORDER BY clause as used here expects at least one column, and we’ll use the OFFICE_EQUIPMENT_NAME column, as seen at line 6 in the above screenshot. Any specific table column, or any combination of table columns, will work here. The T-SQL ROW_NUMBER() function will get the complete OVER() clause it wants, and it can then set a ROW_NUM value for each row. As a side effect, this ORDER BY clause will also determine the row order of the final query result set. We made some progress, but we must define the rules that make rows identical. Then, we must gather those defined duplicate rows together. The PARTITION BY clause will solve this part of the problem.

Placed in the OVER() clause, the T-SQL PARTITION BY clause divides, or separates, the result set rows into subsets, or partitions. The OVER clause can use these subsets in its own operations that support other SQL Server functions. We can use the OVER() clause here to build a definition of duplicate rows.

In the below query, the OVER clause has a PARTITION BY clause, in addition to its required ORDER BY clause. As seen in a new query tab, this screenshot highlights the PARTITION BY clause:

The T-SQL PARTITION BY clause in a ROW_NUMBER() function.

As used in the OVER() clause of this example, the PARTITION BY clause has a list of columns. In this case, PARTITION BY has columns from the OFFICE_EQUIPMENT table. When the ROW_NUMBER() function separates, or partitions, the result set rows, it uses the PARTITION BY column list to decide which columns have the values that define duplicate rows. This screenshot shows the query result set:

Identified duplicate rows in the OFFICE_EQUIPMENT table.

In this query, PARTITION BY looks at the values in the columns and if it finds result set rows that have identical values in all of these columns, it places these rows into a separate partition.

If it can build more than one different partition from the same result set, it will do so. The highlighted result set rows have identical values in these columns, so PARTITION BY brought them together in a separate partition. Then, the T-SQL ROW_NUMBER() function numbered these rows in the ROW_NUM column, starting at 1 and ending at 3. At rows 11 through 14, PARTITION BY brought another set of rows together, as shown in this screenshot:

Identified duplicate rows in the OFFICE_EQUIPMENT table.

ROW_NUMBER(), OVER(), and PARTITION BY make no changes to the original data in any source database tables. Also, the ORDER BY clause in the T-SQL OVER() clause handles one, some, or all of the OFFICE_EQUIPMENT table columns, as seen in this screenshot:

The ORDER BY clause in the OVER() clause.

In the OFFICE_EQUIPMENT table, the OFFICE_EQUIPMENT_ID column has unique values in all of the rows. If we add this column to the PARTITION BY column list, the ROW_NUM will have values of 1 in all rows, as shown in this screenshot:

The OFFICE_EQUIPMENT table rows, with T-SQL  ROW_NUMBER() function values in the ROW_NUM column.

This makes sense, because PARTITION BY now looks at a column set with unique values in at least one column – in this case, OFFICE_EQUIPMENT_ID. Therefore, every row that PARTITION BY sees has a unique set of values, and it will place every row in a separate partition. Those partitions will have only one row. Therefore, every row, in every partition, will have a ROW_NUM value of 1.

The PARTITION BY clause offers fine-grained control over the columns we use to define the duplicate rows that form the partitions themselves. This query modifies an earlier query, commenting out the OFFICE_EQUIPMENT_DESCRIPTION column in the PARTITION BY clause:

This screenshot shows the result set:

Identified duplicate rows in the OFFICE_EQUIPMENT table.

The T-SQL PARTITION BY clause gathered the highlighted rows into one separate partition, similar to the highlighted rows shown three screenshots above. In this case, however, row 15, with OFFICE_EQUIPMENT_ID value of 16, has an OFFICE_EQUIPMENT_DESCRIPTION value of “35 LB. PRINTER PAPER (ONE REAM)” which differs from the values of this column in the other rows in this subset of rows. Additionally, row 14, with OFFICE_EQUIPMENT_ID value of 14, has a PURCHASE_DATE value of 2018-03-17 00:00:00.000 which differs from the values of this column in the other rows in this subset of rows.

The query gathered and numbered these rows in the same row partition because the T-SQL PARTITION BY clause in this query defines duplicate rows based on the columns.

It ignores columns with potentially different values, as it builds the partitions of rows it defines as identical.

Handling Duplicate Data Rows

Now that we can define and identify duplicate rows, we can decide what to do with them. Before we begin, this script will delete and reinsert the original rows in the OFFICE_EQUIPMENT table:

We saw this query earlier in this article:

This query partitions the OFFICE_EQUIPMENT table on these columns:

Duplicate rows identified in the OFFICE_EQUIPMENT table, identified in the ROW_NUM column.

In this screenshot, the ROW_NUM column values greater than 1 flag the duplicate rows, as we defined them in the PARTITION BY clause. We can try to return only those duplicate rows, with this query:

However, this won’t work, as this screenshot shows:

The T-SQL WHERE clause does not work in a SELECT query with a ROW_NUMBER() expression.

Microsoft explains that the WHERE clause executes before the SELECT clause. As a result, the WHERE clause in this query will never see the ROW_NUM column, and the query returns an error. However, the Microsoft source also explains that the FROM clause executes first. With this in mind, we can rebuild the query. We’ll use the above query as a subquery in the FROM clause of a parent query, as seen here:

This screenshot shows that the WHERE clause filter worked:

The WHERE clause does work in a subquery of a SELECT query that has a T-SQL ROW_NUMBER() expression.

Expanding on this approach, this query will delete the duplicate rows:

This screenshot shows that the query worked:

Place the ROW_NUMBER() function in a subquery to support a DELETE query.

The information from Microsoft helped out. In a parent query, we placed the original ROW_NUMBER() query as a subquery in the FROM clause. We aliased this subquery as DRT at line 15, and joined DRT to the OFFICE_EQUIPMENT table at lines 15 and 16. The parent query WHERE clause at line 17 filtered the duplicate rows found in the DRT subquery.

We can use this query to test the DELETE:

Using this test query, this screenshot verifies the DELETE query:

Instead of deleting the duplicate rows, we might want to identify them first, and decide what to do later. We can modify the DELETE query we saw earlier to build a new query as seen here and in the screenshot below.

The T-SQL ROW_NUMBER() function helps build a list of duplicate row OFFICE_EQUIPMENT_ID values

Similar to the earlier DELETE query, we placed the original ROW_NUMBER() query as a subquery in the FROM clause. The FROM clause at line 5 executes first – before the line 16 WHERE clause. This means that the T-SQL WHERE clause can see the ROW_NUMBER() data, and filter on it. Now that we have the specific OFFICE_EQUIPMENT_ID values that we need in a list, we can place these values in another table, use them in other queries, etc.

Conclusion

As database resources grow in size and importance, we need to measure their data quality, to identify and fix problems. As part of that, we should identify duplicate database table rows. Then, we can delete them. We can move them to other tables, other databases, or even other servers. Once we find those duplicate rows, we can decide what to do next. This article shows how to lever T-SQL to find those duplicate rows.

Frank Solomon
168 Views