Automatic Page Repair in SQL Server Always On Availability Groups

This is the 9^th article in the series of SQL Server Always On Availability Groups.

Introduction

SQL Server Always On is a disaster and high availability feature. In the previous articles, we explored the following features.

AG failover in case of primary replica goes down
Manual failover
Database level health detection that initiates a failover in case any AG database changes its state from online and SQL Server not able to write in the transaction log

In the above points, we safeguard SQL Server for the instance and DB failover. Suppose your AG replica is healthy, but one of your AG databases becomes corrupt. You get a logical consistency error if you execute the DBCC CHECKDB command. You might have an excellent copy of that database page in another participating replica. Let’s find the answer to a few questions in this article.

How does SQL Server Always On help us to recover the corrupt page in this scenario?
What happens if a page corrupts in the AG database on the primary replica?

Prerequisites

You can follow the series articles and prepare a two-node SQL Server Always On Availability Group, as shown below.

Current Primary replica: SQLNode1\INST1
Current Secondary replica: SQLNode2\INST1
AG Mode: Synchronous commit

Automatic Page Repair in SQL Server Always On Availability Groups

SQL Server supports automatic page repair for the databases participating in AG configuration. If SQL detects specific corruption errors (Error id 823,824 & 829) for an AG database, SQL Server tries to restore the page from the corresponding replica(primary or secondary).

It works for the following consistency errors.

Error 823: We get this error if the operating system performs a cyclic redundancy check that failed on the data
Error 824: Logical consistency errors such as bad page checksum, torn page detection
Error 829: SQL Server raises error 829 if a page is marked as restore pending

Resolve a page corruption issues on the Primary replica in SQL Server Always On Availability Groups

Let us consider that my AG database is in the synchronized state and communicating with the secondary replica. Now, you get the IO error in an AG database on the primary replica. The below diagram shows the high-level steps for the automatic page repair if any page corrupts at the primary replica.

It inserts a row in the suspect_pages table in the MSDB system database. This table logs the corruption information such as the database id, file id, page id, error id
It broadcasts a request to all secondary replica
It might get a response from the multiple secondary replicas. It gets the page from the replica that responds first. This request also specifies the page ID and LSN that is currently at the end of the log
SQL Server marks the specified page as restore pending. If any user tries to access the page status restore pending, he gets error 829
Once the secondary replica receives a page request, it waits until it has redone the log up to the LSN primary replica specified in the request
Secondary replica access the required page and sends it to the primary replica. If the secondary replica could not access the page, it returns an error to the primary replica. It causes the failure of automatic page repair
If the automatic page repair is successful, it marks the status of the page in the suspect_pages table as restored (event_type 5)
SQL Server tries to resolve the deferred transactions after the successful page repair

Let’s simulate the automatic page repair in the SQL Server Always On Availability Group.

Note: For this article, we corrupt a page in the primary replica intentionally. DO NOT EVER perform these steps in a production database. It is for educational purposes only.

Create a [Employee] table in the [MyNewDB] database and insert sample records in it.

CREATE TABLE [Employee]

(

[ID] [int] IDENTITY(1, 1) NOT NULL ,

[FirstName] [varchar](50) NOT NULL ,

[LastName] [varchar](50) NOT NULL

)

INSERT INTO [Employee]

( FirstName, LastName )

VALUES ( 'Rajendra', 'Gupta' ),

( 'Mohan', 'Sharma' ),

( 'Roy', 'John' ),

( 'Ram', 'Krishna' )

Use the DBCC IND command to check the database pages in the [MyNewDB] database.

1 2	DBCC IND (N'MyNewDB', N'Employee', -1); GO

Verify if the AG dashboard is healthy.

Stop SQL Services on both Primary and Secondary AG replicas using SQL Server Configuration Manager. You should stop SQL Services on secondary replica before the primary replica.

We use a Hex Editor in this article to open the database file. You can download it from the URL.

Launch Hex Editor and open the MyNewDB.MDF file. You might get the permission error. Make sure you permit your account on the database file.

It opens the [MyNewDB.MDF] content, along with the decoded text.

Now, search for an employee name that we entered initially. Let’s search for it for the employee name Roy.

Modify the numbers on the left-hand side and save the MDF file changes.

Start the SQL Services for the primary replica and secondary replica.

If we try to select the records from the [Employee] table, it gives you logical consistency error, as shown below. It reports an incorrect checksum for the page 1:336 in the [MyNewDB] database:

As highlighted earlier, SQL Server puts an entry in the suspect_pages table in the MSDB database. You can see an entry for the page_id 336. Here, we see event_type 5 that shows the page is restored from the secondary replica:

We can also validate it using the DMV sys.dm_hadr_auto_page_repair:

In the DMV output, we get the following useful columns.

Page_id: It is the page id that was restored from the secondary replica
Error_type:
- 1 = 824 error
- 2 = Bad checksum
- 3 = Torn Page
- -1 = All hardware 823 errors
Page_status: you might get the following page status in this column
- 2 = It is queued for a request from the secondary replica
- 3 = request is sent to the partner
- 4 = Page is successfully repaired
- 5 = Page could not be repaired. Automatic page repair will try to repair the page again

In our case, we get event_id 5 and page_status 4 that shows page corruption was due to torn page, but automatic database repair restored it successfully. You can query the [employee] table again. You can notice here that we still have employee Roy details because SQL Server replaced the modified (corrupted) page in the primary replica from the secondary replica.

Resolve a page corruption issues on the secondary replica in SQL Server Always On Availability Groups

Previously, we intentionally corrupted a page in the primary replica. Let’s perform the same demonstration in the secondary replica.

Start the SQL Services and try to access the [Employee] table in the secondary replica database. You might get the following error message.

By default, the secondary replica is not allowed for a connection. You can open the availability group properties from the primary replica and modify the value for Readable Secondary to Yes.

Now, go back to the secondary replica and access the employee table. It reports a logical consistency I/O error. It has an issue to read the page 1:336.

Wait for some time and check the entry in the suspect_pages table of the MSDB database in the secondary replica. It has an entry for page 1:336 with event_type 5.

You can query the DMV sys.dm_hadr_auto_page_repair and verify that the page is repaired successfully (page_status 4).

You can read the page in the secondary replica successfully.

Your AG dashboard is healthy as well.

The overall process of automatic page repair if page corrupts in a secondary replica are as below.

SQL Server Always On Availability group enters into the suspended state for the secondary replica
It enters the corrupt page information in the suspect_page table of the secondary replica
Secondary replica puts a request to get a page copy from the primary replica
If the primary replica can access the page, it sends the page to the secondary replica
Secondary replica repairs the page, and AG is in the synchronized state again

At a high-level, we can summarize the automatic page repair if page corrupts at the secondary replica as shown below:

Restrictions of automatic page repair in SQL Server Always On Availability Groups

SQL Server cannot recover or restore the following pages as part of automatic page repair functionality.

A file header page
A database boot page
Global Allocation Map (GAM) Pages
Shared Allocation Map (SGAM) pages
Page Free Space (PFS) pages

Conclusion

In this article, we explored that SQL Server Always On Availability Groups give us additional flexibility of automatic page repair in case of any corruption issues. It is an asynchronous process and automatically fixes the corruption for you.

A comprehensive guide to SQL Server Always On Availability Groups on Windows Server 2016

Configure Domain Controller and Active Directory for SQL Server Always On Availability Groups

Configure failover clusters, storage controllers and quorum configurations for SQL Server Always On Availability Groups

Install SQL Server 2019 on Windows Server 2016 with SQL Server Always On Availability Groups

Add a new node into existing SQL Server Always On Availability Groups

Configure Managed Service Accounts for SQL Server Always On Availability Groups

Add or remove a node from SQL Server Always On Availability Groups using T-SQL scripts

Database-level health detection in SQL Server Always On Availability Groups

Automatic Page Repair in SQL Server Always On Availability Groups

Deploy a domain-independent Windows Failover Cluster for SQL Server Always On Availability Groups

Configure a SQL Server Always On Availability Group on the domain-independent Failover Cluster

An overview of distributed SQL Server Always On Availability Groups

Deploy a distributed SQL Server Always On Availability Group

Monitor and failover a Distributed SQL Server Always On Availability Group

Transparent Data Encryption for SQL Server Always On Availability Groups

Configure SQL Server replication for a database in SQL Server Always On Availability Groups

Configuring SQL Server replication for distribution databases in SQL Server Always On Availability Groups

Explore Cross-database MSDTC for distributed transactions in SQL Server Always On Availability Groups

Deploy MSDTC for distributed transactions in SQL Server Always On Availability Groups

Restore an existing availability group database participating in SQL Server Always On Availability Groups

Exploring AG dashboards for monitoring SQL Server Always On Availability Groups

Backup compression in TDE enabled databases in SQL Server Always On Availability Groups

Impact of dropping a login in the active directory tied to SQL Server Always On Availability Groups

Refresh SQL Server Always On Availability Group databases using DBATools PowerShell

Configure SQL Server Reporting Services databases in SQL Server Always On Availability Groups

Suspend and Resume Data Movement in SQL Server Always On Availability Groups

Explore failover types in SQL Server Always On Availability Groups

Explore SQL Server PolyBase External Tables in SQL Server Always On Availability Groups

SQL Server Always On Availability Groups for SQL Server Linux instances

Column-level SQL Server encryption with SQL Server Always On Availability Groups

Make the most of secondary replicas in SQL Server Always On Availability Groups

Apply SQL Server patches or cumulative updates in SQL Server Always On Availability Groups

Monitor SQL Server Always On Availability groups using extended events

The Hub and Spoke model of policy-based management for SQL Server Always On Availability Groups

Custom policies for AG dashboards of SQL Server Always On Availability Groups

Explore dynamic management views for monitoring SQL Server Always On Availability Groups

Dynamic management views for monitoring availability replicas and databases for SQL Server Always On Availability

Configure SQL Server Always On Availability Groups using Windows PowerShell scripts

Configure Integration Services Catalog Database SSISDB in SQL Server Always On Availability Groups

Synchronize logins between Availability replicas in SQL Server Always On Availability Group

Session timeouts in SQL Server Always On Availability Groups

Lease Timeouts and Health Checks in SQL Server Always On Availability Groups

Author
Recent Posts

Rajendra Gupta

Hi! I am Rajendra Gupta, Database Specialist and Architect, helping organizations implement Microsoft SQL Server, Azure, Couchbase, AWS solutions fast and efficiently, fix related issues, and Performance Tuning with over 14 years of experience.

I am the author of the book "DP-300 Administering Relational Database on Microsoft Azure". I published more than 650 technical articles on MSSQLTips, SQLShack, Quest, CodingSight, and SeveralNines.

I am the creator of one of the biggest free online collections of articles on a single topic, with his 50-part series on SQL Server Always On Availability Groups.

Based on my contribution to the SQL Server community, I have been recognized as the prestigious Best Author of the Year continuously in 2019, 2020, and 2021 (2nd Rank) at SQLShack and the MSSQLTIPS champions award in 2020.

Personal Blog: https://www.dbblogger.com
I am always interested in new challenges so if you need consulting help, reach me at rajendra.gupta16@gmail.com

View all posts by Rajendra Gupta

Latest posts by Rajendra Gupta (see all)

How to install PostgreSQL on Ubuntu - July 13, 2023
How to use the CROSSTAB function in PostgreSQL - February 17, 2023
Learn the PostgreSQL COALESCE command - January 19, 2023

Introduction

Prerequisites

Automatic Page Repair in SQL Server Always On Availability Groups

Resolve a page corruption issues on the Primary replica in SQL Server Always On Availability Groups

Resolve a page corruption issues on the secondary replica in SQL Server Always On Availability Groups

Restrictions of automatic page repair in SQL Server Always On Availability Groups

Conclusion

Table of contents

Related posts: