Rajendra Gupta

Lease Timeouts and Health Checks in SQL Server Always On Availability Groups

March 8, 2021 by

In the 42nd article for SQL Server Always On Availability Groups, we are going to explore lease timeouts and health checks.

Introduction

SQL Server Always On Availability Groups provides resilience for high availability and disaster recovery solution in a multi-node architecture. In the article, Session timeouts in SQL Server Always On Availability Groups, we explained the session timeout configurations to safeguard AG from soft errors.

Suppose you have an availability group configured for automatic failover. In this synchronized availability group, we can have the following types of failover.

  • Automatic
  • Manual
  • You can do forced failover with possible data loss

In the case of automatic failover, the availability group fails over to a synchronized secondary replica without any data loss. However, an automatic failover requires the satisfied conditions defined in the flexible failover policy.

The flexible failover policy depends on the Health-check timeout threshold, Failure-Condition Level and Cluster timeouts.

It is an important aspect to determine the auto-failover causes in a production database environment. The logs are useful factors for monitoring and investigating the failures.

  • SQL Server error logs
  • Windows cluster logs
  • Cluster event logs
  • SQL Server failover diagnostics (sp_server_diagnostics) logs
  • AlwaysOn_Health extended events output
  • System_health extended events
  • Application and system logs

The availability group feature requires an underlying Windows Failover Clusters. The cluster requires cluster services on all participating nodes.

The resource host acts as an interface between the cluster resource and service. SQL Server uses resource DLLs for the resource monitor. These DLLs are responsible for health monitoring and offline-online resource management.

In the below image, we see two levels of monitoring in an AG configuration.

  • Looks-Alive
  • Is-Alive

Looks-Alive mechanism

SQL Server Always On Availability Groups levels of monitoring

In a Windows failover cluster, the cluster service continuously transmits heartbeat to other cluster nodes and waits for the acknowledgment. If it does not receive a heartbeat acknowledgment after a series of heartbeats, it assumes that the corresponding node is down. Subsequently, it broadcasts a message to other cluster nodes, and a node takes over the cluster ownership based on the quorum votes and majority configuration. If due to any reason, the cluster could not determine the quorum majority, all nodes in a cluster go into the resolving state and all cluster resources are taken offline. This process is known as Looks-Alive and runs every 5 seconds. The Windows failover cluster uses the Looks-Alive mechanism for an integration between the cluster resource host and SQL process. On the primary replica, SQL Server uses a dedicated lease worker thread.

dedicated lease worker thread

Go to Windows Failover Cluster and right-click on the cluster resource. In the properties, you get configured values for LeaseTimeout. By default, it is set to 20000 milliseconds (20 seconds).

Lease Timeout

The lease worker thread and resource host work circularly, as shown in the below image.

lease worker thread and resource host

Image reference: Microsoft docs

In the SQL Server logs on the primary replica, you get a message from the lease worker. The lease worker thread and resource host maintains a time-to-leave (TTL), and it gets updated each time threads wait up after a signal from another resource.

SQL Server Always On Availability Groups SQL Server logs

If the lease timeout period elapsed while waiting for the signal, the AG resource goes into the resolving state. In the case of AG failover, SQL Server logs another entry to stop the lease renewal.

lease timeout period

Is-Alive mechanism

The SQL Server Always on resource DLL uses the sp_server_diagnostics stored procedure for the health of SQL Service. The stored procedure reports the status for the system, resource, IO subsystem, query processing and events. It uses a FAILURE_CONDITION_LEVEL for implementing defining a condition in which automatic failover can happen.

Failover Condition Level

Description

1

Automatic failover occurred due to server down or lease expiry.

2

No data returned from sp_server_diagnostics. It can be related to the hung server issue. The automatic failover can happen due to SQL Server is not connected to the cluster or due to user-defined health check timeout.

3 (Default level)

System error ( stack dumps, orphaned spinlocks, serious write-access)

4

Resource error (For example- persistent out-of-memory condition)

5

Query processing (for example – Scheduler deadlock or unresolvable deadlock)

A failover condition encompasses conditions lower than configured conditions as well. For example, the default condition 3 also uses condition levels 1 and 2.

You can use the ALTER AVAILABILITY GROUP to set a specific failover condition level.

But how long resource DLL with wait for the output from the stored procedure sp_server_diagnostics? The resource DLL uses the timeout properties for this purpose. By default, the timeout is 30,000 milliseconds, i.e. 30 seconds. You can set the value for a timeout from 15 seconds to 4294967 seconds.

You can modify the health check timeout from the following ALTER AVAILABILITY GROUP command.

We can query sys.dm_xe_session to check the extended events for the sp_server_diagnostics session.

Is-Alive mechanism

The output of these events is stored in the default log directory. For example, in my environment, these files are available in the C:\Program Files\Microsoft SQL Server\MSSQL15.INST1\MSSQL\Log directory.

default log directory

Lease Time out in SQL Server Always On Availability Groups

As shown below, the lease timeout is 20 seconds. By default, the Windows failover cluster renewals the lease every 5 seconds, i.e. ¼ time of the lease timeout.

SQL Server Always On Availability Groups Lease Timeout

If lease timeout occurs ( >20 seconds), the rhs.exe reports an error in the Windows cluster, and it starts taking preventive action.

You can get the following error messages for a least timeout issues:

error messages

SQL Server Always On Availability Groups Health Check timeout

SQL Server Always On Availability Group performs a health check of the primary replica using the sp_server_diagnostics stored procedure. The sp_server_diagnostics executes every 10 seconds. The health check timeout is 30 seconds.

Therefore, the stored procedure returns the result on 1/3 * health checks current threshold. If the SP does not return any results, AG refers to the previous state for determining the instance health until the health-check timeout threshold. The next Is-Alive determines that the primary replica is unresponsive, and it initiates the automatic failover.

WSFC configuration and cluster timeouts for SQL Server Always On Availability Groups

In a Windows failover cluster, the following parameters control the cluster timeout values:

  • SameSubnetDelay
  • SameSubnetThreshold
  • CrossSubnetDelay
  • CrossSubnetThreshold

You can check these parameter values using the Windows PowerShell cmdlet get-cluster().

WSFC configuration and cluster timeouts

In my lab environment, the servers are located in the same subnets. Therefore, the following table shows the default values for Windows Server 2012 and 2016.

Parameter

Windows Server 2012 R2

Windows Server 2016

SameSubnetDelay

1 second

1 Second

SameSubnetThreshold

5 heartbeats

10 heartbeats

The SameSubnetDelay defines the wait time between the heartbeats. It is 1 second for both Windows Server 2012 and 2016.

The SameSubnetThreshold defines the number of missing heartbeats a cluster can tolerate, and if it does not receive the acknowledgement, the target node is declared dead.

Recap of LeaseTimeout, Session-Timeout and Health check timeouts

  • Lease timeout:
    • Default value: 20000 milliseconds, i.e. 20 seconds
    • It is required to prevent a split-brain scenario in the Windows failover cluster
    • It can trigger an AG failover or offline-online
    • It is used in both Is-Alive and the Looks-alive mechanism
  • Session Timeout:
    • It is used to safeguard against soft errors between AG replicas
    • default value: 10000 milliseconds or 10 seconds
    • It is not part of the Is-Alive or the Looks-alive mechanism
    • Secondary replica becomes DISCONNECTED status due to session timeout
    • You can configure the session timeouts in the availability group properties from the primary replica instance
  • Health Check timeout
    • Default value: 30000 milliseconds, i.e. 30 seconds
    • It is the timeout if the sp_server_diagnostics does not return any data or reports errors in the health check
    • It is used in both Is-Alive and the Looks-alive mechanism
    • It also depends on the failover condition levels defined from 1-5, as explained earlier. The default configuration is level 3

Conclusion

In this article, we figured SQL Server Always On Availability Groups flexible failover policy using the lease, health check timeout and Failover Condition Levels. You should review the logs to determine the scenarios where the automatic failover occurred or failed.

Table of contents

A comprehensive guide to SQL Server Always On Availability Groups on Windows Server 2016
Configure Domain Controller and Active Directory for SQL Server Always On Availability Groups
Configure failover clusters, storage controllers and quorum configurations for SQL Server Always On Availability Groups
Install SQL Server 2019 on Windows Server 2016 with SQL Server Always On Availability Groups
Add a new node into existing SQL Server Always On Availability Groups
Configure Managed Service Accounts for SQL Server Always On Availability Groups
Add or remove a node from SQL Server Always On Availability Groups using T-SQL scripts
Database-level health detection in SQL Server Always On Availability Groups
Automatic Page Repair in SQL Server Always On Availability Groups
Deploy a domain-independent Windows Failover Cluster for SQL Server Always On Availability Groups
Configure a SQL Server Always On Availability Group on the domain-independent Failover Cluster
An overview of distributed SQL Server Always On Availability Groups
Deploy a distributed SQL Server Always On Availability Group
Monitor and failover a Distributed SQL Server Always On Availability Group
Transparent Data Encryption for SQL Server Always On Availability Groups
Configure SQL Server replication for a database in SQL Server Always On Availability Groups
Configuring SQL Server replication for distribution databases in SQL Server Always On Availability Groups
Explore Cross-database MSDTC for distributed transactions in SQL Server Always On Availability Groups
Deploy MSDTC for distributed transactions in SQL Server Always On Availability Groups
Restore an existing availability group database participating in SQL Server Always On Availability Groups
Exploring AG dashboards for monitoring SQL Server Always On Availability Groups
Backup compression in TDE enabled databases in SQL Server Always On Availability Groups
Impact of dropping a login in the active directory tied to SQL Server Always On Availability Groups
Refresh SQL Server Always On Availability Group databases using DBATools PowerShell
Configure SQL Server Reporting Services databases in SQL Server Always On Availability Groups
Suspend and Resume Data Movement in SQL Server Always On Availability Groups
Explore failover types in SQL Server Always On Availability Groups
Explore SQL Server PolyBase External Tables in SQL Server Always On Availability Groups
SQL Server Always On Availability Groups for SQL Server Linux instances
Column-level SQL Server encryption with SQL Server Always On Availability Groups
Make the most of secondary replicas in SQL Server Always On Availability Groups
Apply SQL Server patches or cumulative updates in SQL Server Always On Availability Groups
Monitor SQL Server Always On Availability groups using extended events
The Hub and Spoke model of policy-based management for SQL Server Always On Availability Groups
Custom policies for AG dashboards of SQL Server Always On Availability Groups
Explore dynamic management views for monitoring SQL Server Always On Availability Groups
Dynamic management views for monitoring availability replicas and databases for SQL Server Always On Availability
Configure SQL Server Always On Availability Groups using Windows PowerShell scripts
Configure Integration Services Catalog Database SSISDB in SQL Server Always On Availability Groups
Synchronize logins between Availability replicas in SQL Server Always On Availability Groups
Session timeouts in SQL Server Always On Availability Groups
Lease Timeouts and Health Checks in SQL Server Always On Availability Groups

Rajendra Gupta
Always On Availability Groups, Installation, setup and configuration, Monitoring

About Rajendra Gupta

As an MCSA certified and Microsoft Certified Trainer in Gurgaon, India, with 13 years of experience, Rajendra works for a variety of large companies focusing on performance optimization, monitoring, high availability, and disaster recovery strategies and implementation. He is the author of hundreds of authoritative articles on SQL Server, Azure, MySQL, Linux, Power BI, Performance tuning, AWS/Amazon RDS, Git, and related technologies that have been viewed by over 10m readers to date. He is the creator of one of the biggest free online collections of articles on a single topic, with his 50-part series on SQL Server Always On Availability Groups. Based on his contribution to the SQL Server community, he has been recognized with various awards including the prestigious “Best author of the year" continuously in 2020 and 2021 at SQLShack. Raj is always interested in new challenges so if you need consulting help on any subject covered in his writings, he can be reached at rajendra.gupta16@gmail.com View all posts by Rajendra Gupta

4,313 Views