Derik Hammer

Measuring Availability Group synchronization lag

August 9, 2016 by

With all of the high-availability (HA) and disaster recovery (DR) features, the database administrator must understand how much data loss and downtime is possible under the worst case scenarios. Data loss affects your ability to meet recovery point objectives (RPO) and downtime affects your recovery time objectives (RTO). When using Availability Groups (AGs), your RTO and RPO rely upon the replication of transaction log records between at least two replicas to be extremely fast. The worse the performance, the more potential data loss will occur and the longer it can take for a failed over database to come back online.

Availability Groups must retain all transaction log records until they have been distributed to all secondary replicas. Slow synchronization to even a single replica will prevent log truncation. If the log records cannot be truncated your log will likely begin to grow. This becomes a maintenance concern because you either need to continue to expand your disk or you might run out of capacity entirely.

Availability modes

There are two availability modes, synchronous commit and asynchronous commit. Selecting a mode is equivalent to selecting whether you want to favor data protection or transaction performance. Both availability modes follow the same work flow, with one small yet critical difference.

With synchronous commit mode, the application does not receive confirmation that the transaction committed until after the log records are hardened (step 5) on all synchronous secondary replicas. This is how AGs can guarantee zero data loss. Any transactions which were not hardened before the primary failed would be rolled back and an appropriate error would be bubbled up to the application for it to alert the user or perform its own error handling.

With asynchronous commit mode, the application receives confirmation that the transaction committed after the last log record is flushed (step 1) to the primary replica’s log file. This improves performance because the application does not have to wait for the log records to be transmitted but it opens up the AG to the potential of data loss. If the primary replica fails before the secondary replicas harden the log records, then the application will believe a transaction was committed but a failover would result in the loss of that data.

Measuring potential data loss

Thomas Grohser once told me, “do not confuse luck with high-availability.” A server may stay online without ever failing or turning off for many years but if that server has no redundancy features then it is not highly-available. That same server staying up for the entire year does not mean that you can meet five nines as a service level agreement (SLA).

Policy based management is one method of verifying that you can achieve your RTOs and RPOs. I will be covering the dynamic management view (DMV) method because I find it is more versatile and very useful when creating custom alerts in various monitoring tools. If you would like to read more on the policy based management method, review this BOL post.

Calculations

There are two methods of calculating data loss. Each method has its own quirks which are important to understand and put into context.

Log send queue

Tdata_loss = log_send_queue / log_generation_rate

Your first thought might be to look at the send rate rather than the generation rate but it is important to remember that we are not looking for how long it will take to synchronize, we are looking for what window of time will we lose data in. Also, it is measuring data loss by time rather than quantity.

This calculation can be a bit misleading if your write load is inconsistent. I once administered a system which used filestream. The database would have a very low write load until a 4 MB file was dropped in it. The instant after the transaction was committed the log send queue would be very large while the log generation rate was still showing very low. This made my alerts trigger even though the 4 MB of data was synchronized extremely fast and the next poll would show that we were within our RPO SLAs.

If you chose this calculation you will need to trigger alerts after your RPO SLAs have been violated for a period of time, such as after 5 polls at 1 minute intervals. This will help cut down on false positives.

Last commit time

Tdata_loss = last_commit_timeprimary last_commit_timesecondary

The last commit time method is easier to understand. The last commit time on your secondary replica will always be equal to or less than the primary replica. Finding the difference between these values will tell you how far behind your replica lags.

Similar to the log send queue method, the last commit time can be misleading on systems with an inconsistent work load. If a transaction occurs at 02:00am and then the write load on the database goes idle for one hour, this calculation will be misleading until the next transaction is synchronized. The metric would declare a one-hour lag even though there was no data to be lost during that hour.

While misleading, the hour lag is technically accurate. RPO measures the time period where data may be lost. It does not measure the quantity of data which would be lost during that time frame. The fact that there was zero data to be lost does not alter the fact that you would lose the last hours’ worth of data. It being accurate still skews the picture, though, because if there was data flowing you would not have had a one hour lag indicated.

RPO metric queries

Log send queue method


Last commit time method

NOTE: This query is a bit simpler and does not have to calculate cumulative performance monitor counters.


Recovery time objective

Your recovery time objective involves more than just the performance of the AG synchronization.

Calculation

Tfailover = Tdetection + Toverhead + Tredo

Detection

From the instant that an internal error or timeout occurs to the moment that the AG begins to failover is the detection window. The cluster will check the health of the AG by calling the sp_server_diagnostics stored procedure. If there is an internal error, the cluster will initiate a failover after receiving the results. This stored procedure is called at an interval that is 1/3rd the total health-check timeout threshold. By default, it polls every 10 seconds with a timeout of 30 seconds.

If no error is detected, then a failover may occur if the health-check timeout is reached or the lease between the resource DLL and SQL Server instance has expired (20 seconds by default). For more details on these conditions review this book online post.

Overhead

Overhead is the time it takes for the cluster to failover plus bring the databases online. The failover time is typically constant and can be tested easily. Bringing the databases online is dependent upon crash recovery. This is typically very fast but a failover in the middle of a very large transaction can cause delays as crash recovery works to roll back. I recommend testing failovers in a non-production environment during operations such as large index rebuilds.

Redo

When data pages are hardened on the secondary replica SQL Server must redo the transactions to roll everything forward. This is an area that we need to monitor, particularly if the secondary replica is underpowered when compared to the primary replica. Dividing the redo_queue by the redo_rate will indicate your lag.

Tredo = redo_queue / redo_rate

RTO metric query


Synchronous performance

Everything discussed thus far has revolved around recovery in asynchronous commit mode. The final aspect of synchronization lag that will be covered is the performance impact of using synchronous commit mode. As mentioned above, synchronous commit mode guarantees zero data loss but you pay a performance price for that.

The impact to your transactions due to synchronization can be measured with performance monitor counters or wait types.

Calculations

Performance monitor counters

Tcost = Ttransaction delay /Tmirrored_write_transactions

Simple division of the sec and transaction delay counters / mirrored write transactions will provide you with your cost of enabling synchronous commit in units of time. I prefer this method over the wait types method that I will demonstrate next because it can be measured at the database level and calculate implicit transactions.I prefer this method over the wait types method that I will demonstrate next because it can be measured at the database level and calculate implicit transactions. What I mean by that is, if I run a single INSERT statement with one million rows, it will calculate the delay induced on each of the rows. The wait types method would see the single insert as one action and provide you with the delay caused to all million rows. This difference is moot for the majority of OLTP systems because they typically have larger quantities of smaller transactions.

Wait type – HADR_SYNC_COMMIT

Tcost = Twait_time / Twaiting_tasks_count

The wait type counter is cumulative which means that you will need to extract snapshots in time and find their differences or perform the calculation based on all activity since the SQL Server instance was last restarted.

Synchronization metric queries

Performance monitor counters method

NOTE: This script is much longer than the previous ones. That was because I chose to demonstrate how you would sample the performance counters and calculate off of a recent period of time. This metric could be accomplished with the up-time calculation demonstrated above as well.


Wait types method

NOTE: For brevity I did not use the above two-check method to find the recent wait types but the method can be implemeneted, if you chose to use this method.

Take-away

At this point, you should be ready to select a measurement method for your asynchronous or synchronous commit AGs and implement baselining and monitoring. I prefer the log send queue method for checking on potential data loss and the performance monitor counter method of measuring the performance impact of your synchronous commit replicas.

Derik Hammer
Always On Availability Groups, Recovery

About Derik Hammer

Derik is a data professional focusing on Microsoft SQL Server. His passion focuses around high-availability, disaster recovery, continuous integration, and automated maintenance. His experience has spanned database administration, consulting, and entrepreneurial ventures. Derik thanks our #sqlfamily for plugging the gaps in his knowledge over the years and actively continues the cycle of learning by sharing his knowledge with all and volunteering as a PASS User Group leader. View all posts by Derik Hammer

168 Views