Preparing for SQL Server Database Corruption; initial reaction and analysis

Corruption is a looming thought through every administrator’s mind, from sysadmins to database administrators there is always a monster hiding in the shadows. SQL Server Database corruption happens all the time around the world and while most of us have been lucky enough to avoid it, we should still be prepared. While there is no way to prevent the corruption from happening, we must work hard to implement practices that minimize damage caused by the corruption. This means good backups and of course running DBCC CHECKDB.

However, this is not enough to keep you out of the dog house. Many administrators do not notice they have corruption until it is to late. SQL Server does checking automatically but its only on pages that have been read into memory, if you have a large Database or do not hit those pages until year end reports it is possible your backups will not go far enough back to allow you to quickly recover that corrupted page (talked about in How to perform a page level restore in SQL Server article) or table. In short, if you notice you have corruption sooner it increases you odds of recovery with minimal to no data loss along with decreasing the amount of downtime required.

In this series, we will discuss what to do when you encounter corruption, how to know you have a corrupted database, some basic recovery techniques and some other pointers.

Do not panic

When people first find out they have corruption they instinctively start to freak out. This can cause you to make some chaotic decisions that put your company in a worse position. Such as running a restore or repair with data loss as the first option, these can cause massive data and monetary losses for the company and should be only done in an absolute necessity. I invite you to read the section on running repair allow data loss on the Microsoft docs, their warning is a good enough reason to make sure you are properly monitoring your database. Describing, “no full ACID guarantee” to logical consistency and business logic needs to be validated manually after running. Additionally, according to this infographic from Small business trends, 60% of small businesses that lose data will go out of business within 6 months.

^[1]

When first receiving that corruption notification people will try to do a slew of items that will not help them.

Restarting SQL Server
- This just delays the problem and causes the system to run through crash recovery on the databases. Not to mention, in most systems, you will not be able to do this right away and will delay the problem further
Clearing the procedure cache
Detaching and moving the SQL server to a new server
- This one will cause you a world of hurt, because SQL Server will fail to attach on the second server and on your primary. You will need to “hackattach” SQL server at this point and even then, that can be a painful experience
- This dba.stackexhange.com post discusses this method along with the limitations. Though, I am not sure why the LDF was deleted

Knowing what will or will not help in this or any situation requires that you are prepared and have been through it before. This means you will need to create corrupt databases and try to recover the data with minimal to no data loss. Later we will discuss how this can be done and what can be done to recover. However, if you are impatient, I recommend the database corruption challenge by Steve Stedman. Be warned, he dumps you into it right away with the first error being:

I will stop here for a moment though, because while training and practice can go a long way, you MUST know when you might make things worse. A bit of knowledge can be a dangerous thing, while you can understand what the data errors are and what to do next, it is a better idea to get someone involved that is more knowledgeable to correct the problem if you are unsure of a solution or have hesitations. In this scenario you will give up the potential for the high praise you will get from management and team, however, it is a better idea to open a $500 ticket with Microsoft to get some senior assistance.

In fact, even when knowing what the problem is, I always suggest opening a ticket with Microsoft because they will not only give an extra set of eyes on the issue but also their expertise on the subject. Additionally, Microsoft can and will assist you with next steps to help find the root cause of the problem and where the corruption originated from.

Root cause analysis

Root cause analysis is a critical part of this process and must not be overlooked no matter how you recover from the database. This is an important step in preventing the problem from occurring again and potentially sooner than you think. In my experience, when corruption happens, it is bound to happen again if no actions are taken to remediate the problem. Additionally, this is likely to be worse the second time.

Now, I would suggest, that even if you think you know the cause of the corruption (E.G. power failure with no UPS) investigate the following sources anyways. Maybe the outage was just a facilitator and there were warning signs occurring. To begin, I always suggest these places to look.

Memory and disk diagnostics to make sure there are no problems with the existing hardware
SQL Server error logs
Windows event viewer
While rare, check with your vendors to see if they have had problems with the firmware you are using
Software bugs, believe it or not, Microsoft has been known to cause corruption. See KB2969896. This is where opening tickets with Microsoft are also beneficial

The event viewer and SQL server error logs can be viewed together.

But, I suggest splitting these out to the system administrators as they typically have more man power on their team to review these.

By the time you finish this series, the goal will be to that when you find out you have corruption, it is coming from your alerts, not an end user, and you will have an action plan to let your managers know where you sit and what the next steps are. This will not only help your company remain calm but also allow you to work without having someone breathing down your neck constantly.

^[1] Image source: https://smallbiztrends.com/2017/04/not-prepared-for-data-loss.html

Author
Recent Posts

James Rhoat

I am a healthcare information IT professional with a passion for SQL Server and other data technologies.I have two bachelor’s degrees, the first being business administration and the second in management information systems with a specialty in business intelligence. I have grown from a support specialist for an electronic medical record company to a cloud engineer who is the certified system administrator of the business intelligence platform (Qlik Sense). However, my heart still lies with SQL Server as it is what I polished my skills on.
My curious nature leads to me learn about different methodologies for accomplishing tasks more efficiently without compromising on the quality. This does tend to lead one down the rabbit hole, but it often ends in valuable experience that I enjoy sharing with anyone willing to take the time. You can find me on LinkedIn

Do not panic

Root cause analysis

Related posts: