Disaster Recovery, High Availability, Recovery, SQL Server

SQL Server Database Corruption – Basic Guidelines for Successful Recovery

An SQL Server corruption event can be a uniquely stressful problem for support staff. Unfortunately it frequently entails an extremely high pressure problem falling to a support engineer who is not specifically experienced in SQL Server but who is expected to perform a potentially quite sophisticated recovery with little or no notice.

This unhappy combination can lead to very negative outcomes for customers. It is very common for even an experienced SQL Server DBA acting under pressure to make the problem substantially worse than it needed to be. This usually occurs because in the absence of prior experience we will usually default to our inbuilt support instincts. Usually our support instincts serve us very well, but as we will see in the case of corruption, blindly acting on instinct will often lead to significant data loss or downtime.

This quick guide is aimed at providing some very rapid guidance on what to do and more importantly what not to do in a corruption scenario. It’s a work in progress and simply represents the items I immediately think of when a corruption occurs. Feel free to post your suggestions for additional items.

A quick note on the applicability of this advice:

  1. It assumes that you are not an expert in SQL Server. You may be considered an “accidental” DBA or someone who has been drafted in to assist simply because you have more knowledge than the next guy/gal. Each guideline below has exceptions but you really need to be quite knowledgeable in SQL Server recovery to know when it’s appropriate to deviate from the general advice below.
  2. It assumes the data in question is valuable. If you are dealing with a database where uptime is more important than minimising data loss, then consider going straight to backups now. Each customer has different needs so you need to know what’s important to them before acting.

The Guidelines – Checklist

The DON’Ts

What not to do is substantially more important that what you should do. If you remember nothing else from this document, remember the following DON’T’s. Full explanations are provided later.

  • DON’T panic – your system may well still be online
  • DON’T just wade in – you will make things worse
  • DON’T restart the server – it won’t help and will likely make things worse
  • DON’T detach the database – again, it will likely make things worse
  • DON’T run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS
  • DON’T delete or rebuild the transaction log

The Dos

Now the Do’s. These are almost always sensible steps and in a time critical situation should be done in parallel as far as is practical. Draft in help if you need to as it could considerably shorten your overall recovery time.

  • DO start a DBCC CHECKDB with ALL_ERRORMSGS and NO_INFOMSGS immediately
  • DO get Microsoft Support involved immediately
  • DO isolate and prepare the last known good backups immediately

The Guidelines – Detailed explanations

The Don’ts of database corruption

DON’T panic – your system may still be online

Why?

In probably about 80% of database corruption scenarios I’ve been involved in, we were actually only aware of the corruption due to nightly corruption checks. If we hadn’t been running those checks we probably could have gone months without noticing the corruption from the end user perspective.

The fact is, many databases these days are big and the vast majority of the information is not touched on a daily basis. You could have a 400GB database with 64KB of corruption and it not matter a jot if its in an area of the database that doesn’t routinely get access.

The point is – just because your db is corrupt, does not mean you are dead in the water. If your systems are still up, take your time and plan the best course of action.

DON’T just wade in

Why? When a corruption event occurs, particularly in a mission critical database your first instinct is going to be to act. You can be under a lot of pressure to do something, anything that will get the customer back online. You need to ignore this pressure and your instincts because if you don’t you could make the situation worse.

Formulate a plan. The plan may be as simple as getting on the phone to Microsoft or getting in touch with a colleague that has SQL Server experience. But don’t just start acting without knowing what you’re doing, otherwise you’ll end up doing some variation of one of the DON’Ts listed below

DON’T restart the server

Why? If you know you have genuine database file corruption, the chances that rebooting the server will help matters are vanishingly small. SQL Server is not stupid – if it’s telling you that it has found corruption on disk, there is corruption on disk and far from helping matters restarting the server (or even the SQL Server service) may make matters substantially worse.

The key to understanding this is to understand that just because your database is corrupt does not mean that it is offline. The scale of corruption could be as simple as a single 8K index page out of millions meaning the vast majority of your users may be completely unaffected by the problem. It is entirely possible to recover from some types of corruption without any down time whatsoever.

When you restart the server (or service) SQL Server will perform certain consistency checks as it tries to bring the database online and if these fail, the database will be marked as suspect. If this happens you will have taken a poorly but working database and put it beyond use for all users until you can effect a repair. This is not a good outcome for the customer!

DON’T detach the database

Why? You may be tempted to start doing all sorts of things involving detaching and reattaching the database, perhaps to create an additional copy to work with. Again – don’t do this unless you know exactly what you are doing. The reasoning is similar to the previous guideline but the end result of ignoring this one is actually worse.

Again, even with corruption the database may be perfectly usable for the vast majority of users. However, SQL Server will not let you attach a database it believes to be corrupt. So if you detach it and attempt to reattach it – you may end up with a bigger problem on your hands than if you had left it alone. The way round this is known as the “hack-attach” method and is not something you want to have to figure out how to do in an emergency, especially considering all it will achieve is get you back to where you were in the first place.

DON’T run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS

Why? You are most likely to try this if you are frantically looking around the internet for advice on what to do. DO NOT follow this advice – at least not yet.

The Repair_Allow_Data_Loss option of DBCC CHECKDB is usually a last resort. It is what you do when you have no backups and are truly in dire straits. I would go as far as to say that even if you know the implications of what this command does, you should only do it under the advice of someone pretty senior at Microsoft.

The implications of running a repair with data loss are both obvious and non-obvious. The obvious implication is that you may lose data, which may or may not be a problem in itself.

The less obvious implication is that your database may now be inconsistent.

This means inter-table data dependencies may be corrupt, programmatic assumptions your line of business applications may make may no longer hold true, FK constraints may be completely violated and depending on the nature of the data this inconsistency may be substantially more problematic than an actual loss of data.

Recovery from backup is usually preferable to this option, so if anyone tells you to use it and they don’t work for Microsoft – ignore them, unless you truly understand the implications of this option.

DON’T delete or rebuild the transaction log

Why? This is another example where you’ll see an internet “guru’s” giving “advice” to people who are freaking out about log file corruption. To be sure, log file corruption isn’t great – but deleting the good transactions in the log file is probably not what we’d call “Plan A”.

If you fully rebuild the log file there is a reasonable chance that your data will be left in an inconsistent state. Any transactions that were in flight at the time of the corruption will be ripped out of the database with little regards to data consistency. This is the more benign scenario.

More problematic would be where SQL Server was performing some form of internal structural change (for example moving or allocating an index page), all of which are logged in the same way as a data change. In this scenario ripping out the log file could leave the actual database structure in a very bad way.

In most cases here the solution will be to go back to your transaction log backups.

Even if you don’t have transaction log backups, speak to MS directly to get their advice on options because rebuilding the log can be quite similar to CHECKDB REPAIR_ALLOW_DATA_LOSS in terms of its undesirable effects.

The DO’s of database corruption

The list of what you actually should do during a serious corruption event is actually reassuringly short. If you do these next three items along with none of the Don’ts listed above you should feel reassured that you have done a good job and making the most out of a pretty bad hand.

If at all possible start doing these three things in parallel. Draft in help if you can because having these actions underway simultaneously will considerably shorten your overall downtime.

DO run DBCC CHECKDB with ALL_ERRORMSGS and NO_INFOMSGS

Why? Unless you already have the output from a recent CHECKDB run, this is probably the first thing you should do. This is because it will be essential in figuring out the extent of the damage and can take a long time to run. Microsoft will very likely want to see the output of a full run of this so you might as well start it now.

One thing to be aware of is that CHECKDB is very IO intensive and may adversely affect the performance of other databases. That said – if you need to try and do an in-place repair of the database you are going to need the output so you might as well get it over and done with. Alternatively, if the database is online and not causing widespread issues, consider leaving the CHECKDB process until it can be done out of hours.

Note that running CHECKDB without any of the repair options is completely non-destructive. It will only collect diagnostic information so is perfectly safe and should not be confused with the Repair_Allow_Data_Loss option.

DO get Microsoft Support involved

Why? The reasoning behind this should be fairly obvious. With a mission critical system you want the best possible outcome for the customer. Many system outages or data loss events can imply a  very real loss in terms of revenue, productivity or customer confidence and the cost of involving Microsoft should be seen as trivial when set against those sorts of numbers.

Even if you are confident in what you are doing and what your recovery options are, I would strongly recommend talking to a senior escalations engineer at Microsoft before committing to a recovery approach that would lose data. In a system down scenario Microsoft can usually be assisting you within 10 – 20 minutes and will give you the confidence that you are taking the most appropriate steps given the customer’s particular circumstances.

DO isolate and localise backups

Why? Many, but not all corruption events require a recovery of some sort from a recent backup. This does not imply immediately going to last night’s full backup and swallowing a day’s data loss. If someone immediately starts talking about restoring from last nights backup without quantifying the exact nature of the corruption – ignore them.  It bears repeating that under many circumstances with sufficient transaction log backups, a recovery of a small number of pages can be implemented with no down time or data loss whatsoever.

The key to minimising the impact and recovery time is to isolate the most recent backups and prepare them for use. In particular, if your backups are on tape or would need to be retrieved across a 100Mb LAN – if possible try and get them “closer” to the server. The time you are running CHECKDB and speaking to MS on the phone is potentially “dead” time in which you could be doing this. This could be as simple as moving the backups from a slow NAS or tape device to a local drive on the server. Every customer environment is slightly different but think about where the backups are and whether there is any way you could make a potential restore faster by pre-positioning the backups in a better location.

Summary

The range of possible variables during a corruption event are very broad and the choices you make can have a very real impact on the outcome.

The first thing to understand is that some corruptions can be really quite benign and if handled correctly would result in no data loss and no downtime. An example of this would be if the corruption existed only in a non-clustered index and not within an actual data page. In this case the index could be dropped and rebuilt and the impact on the customer would be very limited.

Equally, corruption of a small number of data pages could potentially be repaired online with little or no data loss by doing a selective page restore.

It is certainly true that serious data corruptions can occur and need to be taken very seriously. The key to a successful outcome in all cases though is first and foremost to do no harm. Don’t act instinctively and wherever possible involve Microsoft or an SQL Server expert at the earliest opportunity.

Update: Brent Ozar recently provided a mailshot with a couple of usefull additional tips in it. You can read it here: What to Do When DBCC CHECKDB Reports Corruption

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s