Gerald Britton

How to compare tables in SQL Server

September 26, 2017 by

Introduction

If you’ve been developing in SQL Server for any length of time, you’ve no doubt hit this scenario: You have an existing, working query that produces results your customers or business owners say are correct. Now, you’re asked to change something, or perhaps you find out your existing code will have to work with new source data or maybe there’s a performance problem and you need to tune the query. Whatever the case, you want to be sure that whatever changes have been made (whether in your code or somewhere else), the changes in the output are as expected. In other words, you need to be sure that anything that was supposed to change, did, and that anything else remains the same. So, how can you easily do that in SQL Server?

In short, I’m going to look at an efficient way to just identify differences and produce some helpful statistics along with them. Along the way, I hope you learn a few useful techniques.

Setting up a test environment

We’ll need two tables to test with, so here is some simple code that will do the trick:

This code creates the tables Original and Revised that hold customer data. At the moment they are completely different, which you can see since they are small. But what if these tables had thousands or millions of rows? Eyeballing them wouldn’t be possible. You’d need a different approach. Enter set-based operations!

Set-based operations

If you remember your computer science classes, you’ll no doubt recall studying sets as mathematical objects. Relational databases combine set theory with relational calculus. Put them together and you get relational algebra, the foundation of all RDBMS’s (Thanks and hats-off to E.F. Codd). That means that we can use set theory. Remember these set operations?

A ∪ B   Set union: Combine two sets into one

A ∩ B   Set intersection: The members that A and B have in common

A − B   Set difference: The members of A that are not in B

These have direct counterparts in SQL:

A ∪ B : UNION or UNION ALL (UNION eliminates duplicates, UNION ALL keeps them)

A ∩ B : INTERSECT

A − B : EXCEPT

We can use these to find out some things about our tables:

Will show us what rows these two tables have in common (none, at the moment)

Will show us all the rows of the Original table that are not in the Revised table (at the moment, that’s all of them).

Using these two queries, we can see if the tables are identical or what their differences may be. If the number of rows in the first query (INERSECT) is the same as the number of rows in the Original and Revised tables, they are identical, at least for tables having keys (since there can be no duplicates). Similarly, if the results from the second query (EXCEPT) are empty and the results from a similar query reversing the order of the selects is empty, they are equal. Saying it another way, if both sets have the same number of members and all members of one set are the same as all the members of the other set, they are equal.

Challenges with non-keyed tables

The tables we are working with are keyed so we know that each row must be unique in each table, since duplicate keys are not allowed. What about non-keyed tables? Here’s a simple example:

The last query, using EXCEPT, returns an empty result. But the tables are different! The reason is that EXCEPT, INTERSECT and UNION eliminate duplicate rows. Now this query:

Returns 3 rows:

These are the three rows that the two tables have in common. However, since each table has 4 rows, you know they are not identical. Checking non-keyed tables for equality is a challenge I’ll leave for a future article.

Giving our tables something in common

Let’s go back to the first example using keyed tables. To make our comparisons interesting, let’s give our tables something in common:

Here, we take about half the rows of the Original table and insert them into the Revised table. Using ORDER BY NEWID() makes the selection pseudo-random.

This query takes some of the rows from the Revised table and inserts them into the Original table using a similar technique, while avoiding duplicates.

Now, the EXCEPT query is more interesting. Whichever table I put first, I should get 5 rows output. For example:

Returns:

Now the two tables also have 10 rows in common:

Returns:

Depending on the change being implemented, these results may be either good or bad. But at least now you have something to show for your efforts!

Row-to-row changes

So far, we’ve only considered changes in whole rows. What if only certain columns are changing? For example, what if in the Revised table, for some customer id, the name or phone number changed? It would be great to be able to report the rows that changed and also to provide a summary of the number of changes by column and also some way to see what changed between two rows, not just visually, but programmatically. These sample tables are small and narrow. Imagine a table with 40 columns, not 4 and 1 million rows, not 10. Computing such a summary would be very tedious. I’m thinking about something like this:

This shows me that there are 8 rows with the same customer id but different contents and that four of them have different phone numbers, two have different names and two have different addresses.

I’ll also want to produce a table of these differences that can be joined back to the Original and Revised tables.

Checking comparability with sys.columns

Just because two tables look the same at first glance, that doesn’t mean that they are the same. To do the kind of comparison I’m talking about here, I need to be sure that really are the same. To that end I can use the system catalog view sys.columns. This view returns a row for each column of an object that has columns, such as views or tables. Each row contains the properties of a single column, e.g. the column name, datatype, and column_id. See the references section for a link to the official documentation where you can find all the details.

Now, there are at least two columns in sys.columns that will likely be different: object_id, which is the object id of the table or view to which the column belongs, and default_object_id, which is the id of a default object for that column, should one exist. There are other id columns that may be different as well. When using these techniques in your own work, check the ones that apply.

How can we use sys.columns to see if two tables have the same schema? Consider this query:

If the two tables really are identical, the above query would return no results. However, we need to think about the columns that may differ because they refer to other objects. You can eliminate the problem this way:

This script will create a temporary table from sys.columns for the Original table. Note that the “SELECT INTO” in this snippet just copies the schema of sys.columns to a new, temporary table. Now, we can populate it like this:

I’ve commented out the check for an identity column. You might want to exclude identity columns since they are system generated and are likely to differ between otherwise-identical tables. In that case, you’ll want to match on the business keys, not the identity column. In the working example though, I explicitly set the customer ids so this does not apply, though it very well might in the next comparison. Now, so we don’t compare the columns that we know will be different:

Repeat the above logic for the target table (The Revised table in the working example). Then you can run:

In the working example, this will return no results. (In case you are wondering, SELECT * is fine in this case because of the way the temporary tables are created – the schema and column order will be the same for any given release of SQL Server.) If the query does return results, you’ll have to take those differences into account. For the purpose of this article, however, I expect no results.

Creating differences

Starting with the Original and Revised tables, I’ll create some differences and leave some things the same:

This query mixes up the data in the columns of the Revised table in randomly-selected rows by simple transpositions and arithmetic. I use ORDER BY NEWID() again to perform pseudo-random selection.

Computing basic statistics

Now that we know the tables are comparable, we can easily compare them to produce some basic difference statistics. For example:

Returns:

for the working example. However, I want to go deeper!

Using SELECT … EXCEPT to find column differences

Since SQL uses three-value logic (True, False and Null), you might have written something like this to compare two columns:

To check if columns from two tables are different. This works of course, but here is a simpler way!

This is much easier to write, is DRYer (DRY = Don’t Repeat Yourself) and takes care of the complicated logic in the original WHERE clause. On top of that, this does not cause a performance problem or make for a suddenly-complicated execution plan. The reason is that the sub query is comparing columns from two rows that are being matched, as in a JOIN for example. You can use this technique anywhere you need a simple comparison and columns (or variables) are nullable.

Now, applying this to the working example, consider this query:

At its heart, this query joins the Original and Revised tables on customer id. For each pair of rows joined, the query returns a new column (called ‘cols’) that contains a concatenated list of column names if the columns differ. The query uses the technique just described to make things compact and easy to read. Since the query creates a new temporary table, let’s look at the contents:

The eight rows that differ, differ in specific columns. For each row that differs, we have a CSV list of column names. (The leading and trailing commas make it easy to pick out column names, both visually and programmatically). Although there is only one column listed for each customer id, there could be multiple columns listed and in a real-world scenario likely would be. Since the newly-created temp table has the customer id in it, you can easily do a three-way join between the Original and Revised tables to see the context and content of these changes, e.g.

Returns:

For a very-wide table, this makes it easy to zero in on the differences, since for each row, you have a neat list of the columns that differ and you can select them accordingly.

Generating detail difference statistics

There’s one other thing I can do with the temp table we just created. I can produce the table of differences I wanted. Here, I’m using those commas I inserted in the CSV column list to make it easy to search using a LIKE operator.

This is an interesting query because I am not returning anything from the temp #diffcols table. Instead I use that table to create the sums of the differences then arrange the finally result using CROSS APPLY. You could do the same thing with UNPIVOT, but the CROSS APPLY VALUES syntax is shorter to write and easier on the eyes. Readability is always important, regardless of the language. There’s an interesting web site dedicated to writing obfuscated C. It’s interesting to see how much you can get done with a write-only program (one that you can’t read and make sense of). Don’t write obfuscated SQL, though!

This query returns:

Just what I wanted!

Creating a stored procedure to help

If you’ve been following along, you’ve probably already realized that this will be tedious to write for anything but a trivial example. Help is on the way! I’ve included a stored procedure that constructs the queries discussed in this article, using nothing more than the table names and a list of key columns to use for the join predicate, that you are free to use or modify to suit.

Summary

Comparing tables or query results is a necessary part of database development. Whether you are modifying an application that should change the result or making a change that should not change the result, you need tools to do this efficiently.

Downloads

References

Gerald Britton
Latest posts by Gerald Britton (see all)
Comparison

About Gerald Britton

Gerald Britton is a Senior SQL Server Solution Designer, Author, Software Developer, Teacher and a Microsoft Data Platform MVP. He has many years of experience in the IT industry in various roles. Gerald specializes in solving SQL Server query performance problems especially as they relate to Business Intelligence solutions. He is also a co-author of the eBook "Getting Started With Python" and an avid Python developer, Teacher, and Pluralsight author. You can find him on LinkedIn, on Twitter at twitter.com/GeraldBritton or @GeraldBritton, and on Pluralsight View all posts by Gerald Britton

168 Views