In this 41st article for SQL Server Always On Availability Groups, we will explore the session timeout configurations for AG replicas.
You should follow articles in the series, SQL Server Always On Availability Groups and prepare the following environment.
- You require a minimum of two-node synchronous commit AG replicas
- AG replicas must be in a synchronous and healthy state
- You should get a basic understanding of the availability group, its components and functionality
Overview of the Hard and Soft errors
In a SQL AG configuration, we configure two or more replicas in either synchronous or asynchronous commit mode. These replicas are part of a Windows failover cluster configuration. We might face errors such as hardware issues, operating system issues, networking or storage issues, DNS issues, packet drops between replicas. We usually classify these errors as Hard errors or Soft errors that could cause issues in database synchronizations. The error detection speed and reaction time for a failure depends on hard or soft errors. For example, a server or network failure is reported immediately. The timeout for specific components is not handled at the SQL Server level. SQL Server handles the soft error as per its internal session-timeout mechanism.
What is a hard error or soft error? Let’s explore it in the upcoming part.
The availability replica could cause a session failure in case of any errors on the hardware or the operating system level. For example, if the log drive goes unresponsive on the primary replica, SQL Server gets notified about the critical error.
The example of hard errors can be:
- Network issues
- Firewall issues
- operating system issues
- Log drive issues on the primary replica.
- DNS related issues
In case of a hard error, the affected components notify SQL Server about the failures.
SQL Server Always On Availability Groups implements a session time mechanism to detect soft errors. Suppose you have a synchronous secondary replica, but it went into the hung state. You cannot RDP to secondary replica or connect remotely to SQL instance. In this case, OS would not notify the SQL Server process about the error, and the primary replica might wait for infinite for the server to respond. In the case of the synchronous commit, a long queue might build-up for the transaction to commit first on the secondary replica.
To prevent the above scenario, AG implements a session-timeout mechanism. By default, the session timeout is set for 10 seconds. It sends out a ping request response from each connected replica at a specific interval. The successful ping response reveals that AG replicas can communicate with each other. It also resets the timer on a successful ping response.
If another replica does not give a ping response in a configured session-timeout period, you get a connection timeout message in the error log. The replica goes into the DISCONNECTED state. If you have the synchronous replica, it automatically converts internally into the asynchronous mode that means the transaction does not wait for transaction acknowledgement from the secondary replica.
The timeout due to soft errors might happen due to issues such as:
- Network packet drops, corrupted or incorrect order packet
- Hung status of replica or database
- Server timeout issues
- Insufficient resources such as high CPU utilization, Disk latency, transaction log space
- Thread exhaustion
Verify Session time configuration in SQL Server Always On Availability Groups
View the configuration using SSMS
To verify the session-timeout configuration, connect to the primary replica and open the AG replica properties. In the availability replicas, you get session timeout (seconds) as shown below.
View the configuration of SQL Server Always On Availability Groups using DMV
You can also query the sys.availabilty_replicas to check the current value of session timeout for SQL Server Always On availability group.
select replica_server_name,endpoint_url, availability_mode_desc,session_timeout
Validate the behavior of Session timeout for synchronous replicas
We explained earlier that if there is no ping response from the other replica in configured session timeout, AG automatically works as an asynchronous replica whether you have configured it for synchronous commit. In this section, let’s validate and observe this behavior practically.
To see the session timeout configuration in action, we will intentionally generate a timeout scenario for the secondary replica. First, verify that SQL Server Always On Availability Groups is in healthy status.
In the below image, we have the following configurations.
- Primary AG replica: SQLNode2\INST1
- Secondary AG replica: SQLNode1\INST1
- AG database: MyNewDB
- Session timeout: Default 10 seconds
- No data loss failover mode
Create the following [AGSessionTimeOut] table on primary replica in availability group database [MyNewDB]. Next, we use a while loop to insert the records into the [AGSessionTimeOut] table and wait for 1 second (using the WAITFOR DELAY command) before inserting the next record.
Create table MyNewDB.dbo.AGSessionTimeOut
ID int IDENTITY(1,1),
INSERT INTO MyNewDB.dbo.AGSessionTimeOut(RecordTimestamp)
WAITFOR DELAY '00:00:01'
We have a synchronous AG secondary replica, so it waits for transaction acknowledgement and commits records on the primary replica. After some time, stop the secondary replica SQL instance. It causes connection timeouts for the secondary replica. Now, stop insert the statement on the primary replica after 20-30 seconds.
To troubleshoot the connection timeout, check the SQL Server error log on the primary replica. In the below image, we see that it recorded a connection timeout error at 2020=11-24 11:34:13 AM.
exec xp_readerrorlog 0,1,N'Connection timeout'
Now, query the [AGSessionTimeOut] table. We can notice the following events in the below image.
- SQL query inserts records in the [AGSessionTimeOut] table every second till 11:34:03 AM
- We observe a gap of 10 seconds for the next data insert (between rows 9 & 10)
- The next record was inserted at 11:34:14 AM
The reason for the delay is as below:
SQL Service stopped at the secondary replica at 11:34:03 AM, as shown in the following event viewer entry. Till this point, the AG replica is working in the synchronous commit mode
- Since the SQL Service is not down on the secondary replica, the primary replica could not receive a ping response, and it waits for 10 seconds as per the configured session timeout value
- SQL Server error logs record connection timeout after completing the session period, i.e. 11:34:13
- Query inserts next records every second starting from 11:34:14 AM. At this point, despite synchronous mode, AG works as an asynchronous commit and inserts records without waiting for any acknowledgement from the secondary replica. You do not get any prompt for switching synchronous to asynchronous mode
Modify session timeouts in SQL Server Always On Availability Groups
We should set an appropriate session timeout in the availability group depending upon our infrastructure requirements. The session timeout value should set session timeout greater than or equal to 10 seconds. It is not recommended to use session timeouts below 10 seconds because we might have the possibility of missing ping response in a busy system, and it could cause a false alarm.
To modify the default value, change the session time in the availability group properties from the primary replica.
Alternatively, you use the ALTER AVAILABILITY GROUP statement to modify the session_timeout values. The below scripts modifies the session timeout value to 20 seconds.
ALTER AVAILABILITY GROUP [AGusingPowerShell]
MODIFY REPLICA ON N'SQLNode1\INST1' WITH (SESSION_TIMEOUT = 20)
ALTER AVAILABILITY GROUP [AGusingPowerShell]
MODIFY REPLICA ON N'SQLNode2\INST1' WITH (SESSION_TIMEOUT = 20)
You cannot set a value for SESSION_TIMEOUT in less than 5 seconds. If you try to do so, you get the following error message.
If there is any delay in data commits on the primary replica due to soft errors, you can notice a high value of HADR_SYNC_COMMIT wait on the primary instance.
In this article, we explored the session timeout in SQL Server Always On Availability Groups for handling soft errors. It enables SQL Server to continuously writing data into the primary replica even if AG is in synchronous mode and the secondary replica does not respond for acknowledgement. It helps to understand the AG related errors, their causes and steps to tackle those errors.