Prashanth Jayaram

How to do a quick estimated compare of data in two large SQL Server databases to see if they are equal

August 29, 2018 by

Bringing impactful analysis into a data always comes with challenges. In many cases, we rely on automated tools and techniques to overcome many of these challenges.

In this article, we’ll describe a simple task to validate the table (row count only) between the databases on different SQL instances. For our use case, let us consider a scenario involving transactional replication, as a good practice; it is required to perform data validation test between Publisher and Subscriber databases to validate the integrity of our replication process. It is quite evident that a simple validation process helps to identify the state of the subscriber. If there is a difference in the data, it’s pretty much advised to re-initialize the subscription with the latest snapshot.

Introduction

There are several reasons why we might need to consider table comparison. In a robust replication system, one has to audit the integrity of the data. The audit process ensures that the data produced from various automation tools or batch jobs are correct. Also, when you pass the values on the subscription database, it should return the near-production values.

SQL Server Transactional replication has an option to validate the data at the Subscriber with data at the Publisher. Validation can be performed for specific subscriptions or for all subscriptions to a publication. We can specify the ‘Row count only’ option as the validation type and the Distribution Agent will validate data when it runs for the next time. But if we create our own replication scenario, independent of SQL Server replication, we will want to duplicate this feature to ensure our databases are in sync

In this article, we will simulate the “Row count only” process of Microsoft’s SQL Server database replication solution. In this validation process, the tables are compared for a number of rows between Publisher and Subscriber but it does not validate the content of the data. It is considered as a lightweight approach to validate the number of rows between publisher and subscriber.

Objective

Implement “Row count only” process to compare a number of rows between Publisher and subscriber database and send out an alert email when the process finds a difference in the numbers of the rows between the tables.

Overview of the implementation

Automation

The steps to setup automation are as follows:

  1. Enable “Ad Hoc Distributed Queries” advanced configuration option

  2. Use OPENROWSET to query a remote SQL instance. To use the OPENROWSET ad hoc method, provide all the data source information in the connection string. For example, to run a remote query on XYZ server

    1. Use SQLNCLI provider
    2. Pass the <ServerName> as an input value to Server parameter
    3. Set the connection type. In this case, it’s trusted. So, its set to ‘yes’
    4. Pass T-SQL string

  3. Getting the row count of tables in the source database

    There are several different ways get the row-count of the table in a database, but probably the fastest technique to get row count is by using system view sys.partitions.

    Using the following SQL, the row count of the table is listed.

  4. Getting the row count of tables in the remote database

    In order to do this, the OpenRowSet method is used. You can refer step2 for more information.

  5. Compare the number of rows between source and Target

    The step 3 sets up first part of the data set and step 4 defines the second data set. Both the result sets are captured in a table variable named @sourcedatabase and @targetDatabase.

  6. Join the two table variable using schema and tablename columns

    You can see the difference in number of rows in the following output

  7. Prepare email alert

    In this section, Let us make the necessary arrangement to wrap the values in the table data <TD> tag. This is the format needed to build HTML table tags. In this case, td is used as the column alias for all columns and specified by the ‘tr’ as the root element for each row. Then the concatenated output is sent as an HTML body using sp_send_dbmail option.

Output

Wrap Up

So far, we’ve seen the steps to simulate “Row count only” option. In reality, there is no ideal method can be used to compare the results. If you’re working one of the very few tables, then tablediff is also an option. Also, you could rely on 3rd party tools such as ApexSQL Diff data compare option to validate a number of rows between the databases. I would leave this option open. What are your favorite methods? Please leave the feedback input in comments…

That’s all for now…

Appendix

Another solution we can use is tablediff utility. For example, if you want to compare two tables (ExportALLCities) from two databases on two different servers, you run the following command

C:\Program Files\Microsoft SQL Server\140\COM>tablediff -sourceserver hqdbt01 -sourcedatabase WideWorldImporters -sourcetable ExportALLCities -destinationserver HQDBT01\SQL2017 -destinationdatabase WideWorldImporters -destinationtable ExportALLCities

T-SQL Script

Table of contents

SQL Server replication: Overview of components and topography
SQL Replication: Basic setup and configuration
How to Add/Drop articles from existing publications in SQL Server
How to do a quick estimated compare of data in two large SQL Server databases to see if they are equal
SQL Server transactional replication: How to reinitialize a subscription using a SQL Server database backup
How to setup a custom SQL Server transaction replication model with a Central Subscriber and Multiple Publisher databases
How to setup custom SQL Server transactional replication with a central publisher and multiple subscriber databases
How to set up a DDL and DML SQL Server database transactional replication solution
How to setup cross-platform transactional SQL Server replication for database reporting on Linux
SQL Server database migrations with zero data loss and zero downtime
Using transactional data replication to replay and test production loads on a staging server
How to setup SQL Server database replication for a reporting server
SQL Server transactional replication: How to reinitialize a subscription using a “Replication support only” –TBA
SQL Server Replication Monitoring and setting alerts using PowerShell –TBA
Prashanth Jayaram
Comparison, Replication

About Prashanth Jayaram

I’m a Database technologist having 11+ years of rich, hands-on experience on Database technologies. I am Microsoft Certified Professional and backed with a Degree in Master of Computer Application. My specialty lies in designing & implementing High availability solutions and cross-platform DB Migration. The technologies currently working on are SQL Server, PowerShell, Oracle and MongoDB. View all posts by Prashanth Jayaram

168 Views