This article discusses useful considerations for SQL Server Disaster Recovery.
An overview of RPO, RTO and SLA
In the digital era, data availability and security is a critical task for every organization. It is a database administrator’s duty to create a fail-safe mechanism to ensure your database is available as per the defined RPO, RTO and SLA.
- Recovery Point Objective (RPO): It refers to the maximum amount of data you can afford to lose. It is aligned with your database backups and recovery strategies. For example, if you take database backup every hour, you can lose a maximum of one hour of data. If your RPO is 10 minutes, you are not in a good position from a data recovery point of view
In the below image, we see the RTO and RPO corresponding to SQL Server Disaster Recovery.
Recovery Time Objective (RTO)
Once a disaster occurs, DBA’s primary responsibility is to recover your database asap for business continuity. Therefore, the RTO is a measure of how long your organization can afford the downtime of databases before things come back to normal. In the above image, we see the recovery time objective after a disaster occurs. Usually, we should define our system and DR process in such a way to recover asap in less than the RTO.
Service Level Agreements (SLA)
The third useful parameter is the Service Level Agreement between the customer and the vendors. It covers the service quality, availability and responsibilities. It is defined based on the RPO and RTO of organization requirements for SQL Server Disaster Recovery.
There are several parameters you should consider meeting the recovery objectives (RPO, RTO) and SLA’s regarding the databases for SQL Server Disaster Recovery.
Define Database backups policy
You must define the backup policy for both critical and non-critical database systems. In SQL Server, we combine the full, differential, transaction log backups for restoring databases in case of any issues.
- Combine different backup mechanism to ensure you minimize the downtime and effort in the restoration. For example, you use a combination of full, differential and transaction log backups for large databases while for smaller databases, you can take full backups and regular log backups. You can also leverage filegroup backups, piecemeal restores, the smart transaction log and differential backup as well to take backup depending upon data changes
- Store the backups on a different media such as SAN, Tape, Cloud URL. You should not store the backups on the drives where your data and log file exists. In that particular drives goes down, you lose both data and backups
- You must configure the backups even if you have high availability infrastructure such as VM snapshots, Storage level replication, Windows failover clusters or SQL Server Always On Availability Groups
Regular test the database backups
Sometimes, database professional feels good that they have 100% compliance on database backup. You also report 100% backup compliance to your regular reports to management. But, someday system crashes, and your backup would not work.
To avoid this humiliating scenario, database professionals must conduct regular database restoration drives. In these drives, you can choose a random restore point ( RPO) and restore your backups on a test environment and validate that you meet the RPO for SQL Server Disaster Recovery. It ensures your backup policy is aligned with your organization requirement as well as it boosts your confidence in database backups for any unforeseen situation.
Importance of System databases
Recently, I saw one scenario in which a SQL instance had a few critical databases for one of my customers. They use a third-party tool for their database backups, and it was configured to take a backup of all databases. All looks good at this point.
One day, due to a storage issue, the instance crashed. After all efforts, they could not recover the instance, so they built a new instance and tried to restore all databases. All user databases were restored but they realized that the backup tool was configured for backup of only user databases. As you did not take the system database backups, you can lose logins, credentials, proxies, jobs. It is a difficult situation, right.
Therefore, to avoid any such issues, always ensure to take database backups for the system databases. The system databases size is small, and you can take regular full backups quickly.
Database Consistency Check
You should add a database consistency check for your weekly or monthly maintenance schedule. It ensures the databases do not report any logical or physical consistency issues. If you perform regular restoration drills( as suggested above), run the consistency check on the restored database as well. It ensures your existing database backup does not hold consistency issues.
DR exercises for SQL Server Disaster Recovery
It is like a dry run to validate your preparation for the database recovery. Although it might not replicate the existing DR scenario, it is similar to a mock test for your strategies, preparations for handling any unforeseen issues. If you have configured the high availability scenarios such as SQL Server Always On Availability Groups, Failover clusters, then you should make sure you can failover for the secondary site. It would be best if you also asked your application teams to participate in the DR drills. They can also validate that the application works fine once you restore or recover the database.
Know your database infrastructure
A database professional should document the supporting infrastructure. Suppose after a disaster occurs, you prepared another SQL instance and restored the database backups. Your databases are online, but it might need additional configurations such as a linked server, Alias, port ( static\dynamic), MAX memory, MAXDOP, Parallelism, encryption, SQL Server edition, version, replication. For example, if you use replication to another production server, you should have details of the replication articles, subscribers, agent security, object filters. If you have documented the infrastructure, it becomes convenient for you to refer the documentation and prepare the things. You can schedule an agent job to script out the login security and permissions regularly at a shared location.
Using this documentation, you can take the right approach for database availability. For example, in the case of a TDE enabled database, you need a different approach for database backup and restores using the SQL Server certificates.
Use High availability and SQL Server Disaster Recovery solutions
You can use the high availability and disaster recovery solutions for your critical databases. These solutions could be SQL Server Always On Availability Groups, log shipping. Here, you maintain multiple copies of the existing databases and allows you to configure the standby copies of databases as well. It helps you to quickly redirect applications to the stand by or secondary database in case the primary database has issues. In the case of the AG group, you should ensure having a secondary synchronous copy of your critical databases. AG allows you to create multiple secondary replicas in your primary and secondary sites. Make sure you monitor the latency between the primary and secondary database copies. Simultaneously, you can create a DR database copy using the asynchronous data commit.
A checklist is also essential for your SQL Server Disaster Recovery solutions. These checklists can have the following information.
- You should have the Contact owner list for databases and applications. You should not look for the contact persons once a disaster strikes
- Be ready with your organization escalation matrix, application, vendor support checklist
- Critical contacts for Vendor support such as Microsoft premier support number to raise a high priority incident
- Always note down the learning from your DR drills and recovery scenarios
- Application dependency checklists
In this article, we explained several considerations for your SQL Server Disaster Recovery for SQL Server databases. The DR requirement might be different for different infrastructure. You should evaluate your requirements, recovery objectives and plan the solution that works best for you.