Disaster Recovery, High Availability, Recovery, SQL Server

SQL Server Database Corruption – Basic Guidelines for Successful Recovery

An SQL Server corruption event can be a uniquely stressful problem for support staff. Unfortunately it frequently entails an extremely high pressure problem falling to a support engineer who is not specifically experienced in SQL Server but who is expected to perform a potentially quite sophisticated recovery with little or no notice.

This unhappy combination can lead to very negative outcomes for customers. It is very common for even an experienced SQL Server DBA acting under pressure to make the problem substantially worse than it needed to be. This usually occurs because in the absence of prior experience we will usually default to our inbuilt support instincts. Usually our support instincts serve us very well, but as we will see in the case of corruption, blindly acting on instinct will often lead to significant data loss or downtime.

This quick guide is aimed at providing some very rapid guidance on what to do and more importantly what not to do in a corruption scenario. It’s a work in progress and simply represents the items I immediately think of when a corruption occurs. Feel free to post your suggestions for additional items.

A quick note on the applicability of this advice:

  1. It assumes that you are not an expert in SQL Server. You may be considered an “accidental” DBA or someone who has been drafted in to assist simply because you have more knowledge than the next guy/gal. Each guideline below has exceptions but you really need to be quite knowledgeable in SQL Server recovery to know when it’s appropriate to deviate from the general advice below.
  2. It assumes the data in question is valuable. If you are dealing with a database where uptime is more important than minimising data loss, then consider going straight to backups now. Each customer has different needs so you need to know what’s important to them before acting.

The Guidelines – Checklist

The DON’Ts

What not to do is substantially more important that what you should do. If you remember nothing else from this document, remember the following DON’T’s. Full explanations are provided later.

  • DON’T panic – your system may well still be online
  • DON’T just wade in – you will make things worse
  • DON’T restart the server – it won’t help and will likely make things worse
  • DON’T detach the database – again, it will likely make things worse
  • DON’T run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS
  • DON’T delete or rebuild the transaction log

The Dos

Now the Do’s. These are almost always sensible steps and in a time critical situation should be done in parallel as far as is practical. Draft in help if you need to as it could considerably shorten your overall recovery time.

  • DO start a DBCC CHECKDB with ALL_ERRORMSGS and NO_INFOMSGS immediately
  • DO get Microsoft Support involved immediately
  • DO isolate and prepare the last known good backups immediately

The Guidelines – Detailed explanations

The Don’ts of database corruption

DON’T panic – your system may still be online

Why?

In probably about 80% of database corruption scenarios I’ve been involved in, we were actually only aware of the corruption due to nightly corruption checks. If we hadn’t been running those checks we probably could have gone months without noticing the corruption from the end user perspective.

The fact is, many databases these days are big and the vast majority of the information is not touched on a daily basis. You could have a 400GB database with 64KB of corruption and it not matter a jot if its in an area of the database that doesn’t routinely get access.

The point is – just because your db is corrupt, does not mean you are dead in the water. If your systems are still up, take your time and plan the best course of action.

DON’T just wade in

Why? When a corruption event occurs, particularly in a mission critical database your first instinct is going to be to act. You can be under a lot of pressure to do something, anything that will get the customer back online. You need to ignore this pressure and your instincts because if you don’t you could make the situation worse.

Formulate a plan. The plan may be as simple as getting on the phone to Microsoft or getting in touch with a colleague that has SQL Server experience. But don’t just start acting without knowing what you’re doing, otherwise you’ll end up doing some variation of one of the DON’Ts listed below

DON’T restart the server

Why? If you know you have genuine database file corruption, the chances that rebooting the server will help matters are vanishingly small. SQL Server is not stupid – if it’s telling you that it has found corruption on disk, there is corruption on disk and far from helping matters restarting the server (or even the SQL Server service) may make matters substantially worse.

The key to understanding this is to understand that just because your database is corrupt does not mean that it is offline. The scale of corruption could be as simple as a single 8K index page out of millions meaning the vast majority of your users may be completely unaffected by the problem. It is entirely possible to recover from some types of corruption without any down time whatsoever.

When you restart the server (or service) SQL Server will perform certain consistency checks as it tries to bring the database online and if these fail, the database will be marked as suspect. If this happens you will have taken a poorly but working database and put it beyond use for all users until you can effect a repair. This is not a good outcome for the customer!

DON’T detach the database

Why? You may be tempted to start doing all sorts of things involving detaching and reattaching the database, perhaps to create an additional copy to work with. Again – don’t do this unless you know exactly what you are doing. The reasoning is similar to the previous guideline but the end result of ignoring this one is actually worse.

Again, even with corruption the database may be perfectly usable for the vast majority of users. However, SQL Server will not let you attach a database it believes to be corrupt. So if you detach it and attempt to reattach it – you may end up with a bigger problem on your hands than if you had left it alone. The way round this is known as the “hack-attach” method and is not something you want to have to figure out how to do in an emergency, especially considering all it will achieve is get you back to where you were in the first place.

DON’T run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS

Why? You are most likely to try this if you are frantically looking around the internet for advice on what to do. DO NOT follow this advice – at least not yet.

The Repair_Allow_Data_Loss option of DBCC CHECKDB is usually a last resort. It is what you do when you have no backups and are truly in dire straits. I would go as far as to say that even if you know the implications of what this command does, you should only do it under the advice of someone pretty senior at Microsoft.

The implications of running a repair with data loss are both obvious and non-obvious. The obvious implication is that you may lose data, which may or may not be a problem in itself.

The less obvious implication is that your database may now be inconsistent.

This means inter-table data dependencies may be corrupt, programmatic assumptions your line of business applications may make may no longer hold true, FK constraints may be completely violated and depending on the nature of the data this inconsistency may be substantially more problematic than an actual loss of data.

Recovery from backup is usually preferable to this option, so if anyone tells you to use it and they don’t work for Microsoft – ignore them, unless you truly understand the implications of this option.

DON’T delete or rebuild the transaction log

Why? This is another example where you’ll see an internet “guru’s” giving “advice” to people who are freaking out about log file corruption. To be sure, log file corruption isn’t great – but deleting the good transactions in the log file is probably not what we’d call “Plan A”.

If you fully rebuild the log file there is a reasonable chance that your data will be left in an inconsistent state. Any transactions that were in flight at the time of the corruption will be ripped out of the database with little regards to data consistency. This is the more benign scenario.

More problematic would be where SQL Server was performing some form of internal structural change (for example moving or allocating an index page), all of which are logged in the same way as a data change. In this scenario ripping out the log file could leave the actual database structure in a very bad way.

In most cases here the solution will be to go back to your transaction log backups.

Even if you don’t have transaction log backups, speak to MS directly to get their advice on options because rebuilding the log can be quite similar to CHECKDB REPAIR_ALLOW_DATA_LOSS in terms of its undesirable effects.

The DO’s of database corruption

The list of what you actually should do during a serious corruption event is actually reassuringly short. If you do these next three items along with none of the Don’ts listed above you should feel reassured that you have done a good job and making the most out of a pretty bad hand.

If at all possible start doing these three things in parallel. Draft in help if you can because having these actions underway simultaneously will considerably shorten your overall downtime.

DO run DBCC CHECKDB with ALL_ERRORMSGS and NO_INFOMSGS

Why? Unless you already have the output from a recent CHECKDB run, this is probably the first thing you should do. This is because it will be essential in figuring out the extent of the damage and can take a long time to run. Microsoft will very likely want to see the output of a full run of this so you might as well start it now.

One thing to be aware of is that CHECKDB is very IO intensive and may adversely affect the performance of other databases. That said – if you need to try and do an in-place repair of the database you are going to need the output so you might as well get it over and done with. Alternatively, if the database is online and not causing widespread issues, consider leaving the CHECKDB process until it can be done out of hours.

Note that running CHECKDB without any of the repair options is completely non-destructive. It will only collect diagnostic information so is perfectly safe and should not be confused with the Repair_Allow_Data_Loss option.

DO get Microsoft Support involved

Why? The reasoning behind this should be fairly obvious. With a mission critical system you want the best possible outcome for the customer. Many system outages or data loss events can imply a  very real loss in terms of revenue, productivity or customer confidence and the cost of involving Microsoft should be seen as trivial when set against those sorts of numbers.

Even if you are confident in what you are doing and what your recovery options are, I would strongly recommend talking to a senior escalations engineer at Microsoft before committing to a recovery approach that would lose data. In a system down scenario Microsoft can usually be assisting you within 10 – 20 minutes and will give you the confidence that you are taking the most appropriate steps given the customer’s particular circumstances.

DO isolate and localise backups

Why? Many, but not all corruption events require a recovery of some sort from a recent backup. This does not imply immediately going to last night’s full backup and swallowing a day’s data loss. If someone immediately starts talking about restoring from last nights backup without quantifying the exact nature of the corruption – ignore them.  It bears repeating that under many circumstances with sufficient transaction log backups, a recovery of a small number of pages can be implemented with no down time or data loss whatsoever.

The key to minimising the impact and recovery time is to isolate the most recent backups and prepare them for use. In particular, if your backups are on tape or would need to be retrieved across a 100Mb LAN – if possible try and get them “closer” to the server. The time you are running CHECKDB and speaking to MS on the phone is potentially “dead” time in which you could be doing this. This could be as simple as moving the backups from a slow NAS or tape device to a local drive on the server. Every customer environment is slightly different but think about where the backups are and whether there is any way you could make a potential restore faster by pre-positioning the backups in a better location.

Summary

The range of possible variables during a corruption event are very broad and the choices you make can have a very real impact on the outcome.

The first thing to understand is that some corruptions can be really quite benign and if handled correctly would result in no data loss and no downtime. An example of this would be if the corruption existed only in a non-clustered index and not within an actual data page. In this case the index could be dropped and rebuilt and the impact on the customer would be very limited.

Equally, corruption of a small number of data pages could potentially be repaired online with little or no data loss by doing a selective page restore.

It is certainly true that serious data corruptions can occur and need to be taken very seriously. The key to a successful outcome in all cases though is first and foremost to do no harm. Don’t act instinctively and wherever possible involve Microsoft or an SQL Server expert at the earliest opportunity.

Update: Brent Ozar recently provided a mailshot with a couple of usefull additional tips in it. You can read it here: What to Do When DBCC CHECKDB Reports Corruption

Disaster Recovery, Recovery, SQL Server

Sending A Custom Email on DBCC CHECKDB Failure

With SQL Server Maintenance Plans it’s quite possible to run scheduled DBCC CHECKDB consistency checks via an agent job and be emailed when the job fails. Unfortunately the message that gets sent is the generic agent notification which doesn’t necessarily lend itself to understanding quickly what’s going on:

JOB RUN:             ‘Meaningless Job Name’ was run on 25/04/2014 at 14:26:11
DURATION:        0 hours, 0 minutes, 10 seconds
STATUS:               Failed
MESSAGES:        The job failed.  The Job was invoked by User TSG\aUser.  The last step to run was step 1 (Meaningless Step Name).

I wanted to be able to send a much more helpful email message whenever a DBCC CHECKDB failed. In particular I wanted a message that was:

  • Entirely custom  so that I could have it jump out at the receiver as being unusually urgent. The risk in our environment is that with so many emails flying about – something like this could be ignored with disastrous consequences
  • Able to offer guidance on what the to do if you receive the email. This is because the receiver may well not be a SQL Server specialist and may not know how to deal with an urgent corruption issue
  • Able to contain rich formatting to highlight important information

The resulting email is shown below, but you could get much fancier with your formatting if you wish:

Email on DBCC Failure

I haven’t tested the code extensively yet but we are trialling it in production without issue at the moment. I thought I would share the code in case anyone would find it useful. There are a variety of very clever DBCC CHECKDB scripts out there already, but none of the ones I found were written to allow the sending of a custom email body.

Instructions are in line. I’d recommend you run this as an SQL Agent job and have the job also email you on failure to protect against any bugs in the code.


-- Overview
-- Performs a full DBCC CHECKDB on all databases specified below and will email any failures to the email address specified
-- How to use
-- 1. Ensure the mail profile is configured in SQL Server
-- 2. Provide a valid "to" email address
-- 3. Add the databases to check to the @databasesToCheck table variable
-- 4. Schedule this script to run during an appropriate maintenance window

-- Future Improvement
-- Log the output of DBCC to a semi-permenant location to avoid the need to re-run it after the error is detected. This would be important in
-- very large db setups or those with very high availability requirement

-- SETUP STEP
-- Ensure the following Database Mail profile is configured and an appropriate email address specified
DECLARE @mailProfileName VARCHAR(250) = '<INSERT MAIL PROFILE NAME>'
DECLARE @toEmailAddresses VARCHAR(500) = '<INSERT THE TO EMAIL ADDRESS>'

DECLARE @databasesToCheck TABLE (
	ID INT IDENTITY (1, 1),				-- This allows use to select and process databases in a specified order
	databaseName VARCHAR(250)
)

-- SETUP STEP
-- Add a row for each database you wish to check
INSERT INTO @databasesToCheck
VALUES
('DATABASE TO CHECK 1'),
('DATABASE TO CHECK 2'),

BEGIN TRY
	DECLARE @dynamicSQL NVARCHAR(500)
	DECLARE @currentDatabaseName VARCHAR(250)

	-- Define a cursor to allow us to process each database in turn
	DECLARE curDatabases CURSOR FOR
	SELECT databaseName
	FROM @databasesToCheck
	ORDER BY ID ASC

	OPEN curDatabases

	FETCH NEXT FROM curDatabases
	INTO @currentDatabaseName

	-- Loop through each database we want to check
	WHILE @@FETCH_STATUS = 0
		BEGIN

			DECLARE @currentProcessingResult INT

			-- The following statements are really for the benefit of showing something useful in the job history window:
			PRINT ('Processing database ' + @currentDatabaseName)

			-- Not ideal but the only way I know of to dynamically build adhoc DBCC commands
			-- NOTE: We surpress info messages when things are going well. We only care when things go wrong
			SET @dynamicSQL = 'DBCC CHECKDB ([' + @currentDatabaseName + ']) WITH NO_INFOMSGS'

			PRINT ('  Executing DBCC CHECKDB on target database ')

			-- NOTE: Running a straight DBCC CHECKDB will not raise an exception even if corruption is found. The only way to determine if failure occured
			-- is to use the return value and determine if we need to raise our own exception. See here for further details:
			-- http://support.microsoft.com/kb/953503
			EXEC @currentProcessingResult = sp_executesql @dynamicSQL

			-- If the last run detected an error, raise an error and stop
			IF @currentProcessingResult <> 0
				BEGIN
					-- Weird limitation in RAISERROR - it can't take a string concatentation so we need to do it like this
					DECLARE @errorMessage VARCHAR (1000) = 'DBCC CHECK DB failed for database ' + @currentDatabaseName
					RAISERROR (@errorMessage, 11, 1)
				END

			PRINT ('  DBCC CHECKDB completed successfully with no errors on database ' + @currentDatabaseName)

			FETCH NEXT FROM curDatabases
			INTO @currentDatabaseName

		END

	CLOSE curDatabases
	DEALLOCATE curDatabases

END TRY

BEGIN CATCH
	DECLARE @emailBody NVARCHAR(4000)
	DECLARE @catchErrorMessage VARCHAR (1000) = ERROR_MESSAGE()

	PRINT('Critical Error Detected: ' + @catchErrorMessage)

	-- Generate a formatted email using HTML. Trying to format in plain text is a bit of a pain. HTML allows you to create a template in Visual Studio
	-- and pretty much copy it in here verbatim
	SET @emailBody = N'
	        <style type="text/css">
                #divContent {
                    font-family: Tahoma, Arial, Verdana, sans-serif;
                    font-size: 0.8em;
                }
            </style>

        <div id="divContent">

		    <p><strong>Warning: A potentially serious issue has been detected during overnight corruption checks on SQL Server. Please do not ignore this message as our ability to recover data may be effected</strong>
            </p>

            <p>The error message provided by the agent job was <em>"' + @catchErrorMessage + '"</em> </p>' +

            '<p><strong>Why has this message been sent?</strong>
                <br />
                This message has been sent because corruption has been detected in the following database: <br />
            </p>

            <ul><li><strong>' + @currentDatabaseName + '</strong> </li></ul>

            <p>Please note that the overnight DBCC CHECKDB process will have stopped once the first database corruption was detected. Other databases may also be corrupt and should be checked manually</p>
            <p>
                <strong>Why is it important that we respond quickly?</strong> <br />
                If one of the live databases has been corrupted, the best chance of recovering data is via a previous backup. Because backups are only retained for a short period of time it is essential that we isolate the backup most likely to help us recover as quickly as possible <br />
            </p>

            <p>
                <strong>What should happen next?</strong>
            </p>

            <ul>
                <li>Run DBCC CHECKDB on the problem database again to retrieve the full details of the corruption. This may take a significant amount of time to complete so consider saving the results.</li>
                <li>Isolate all backups from the affected database as they may be needed to effect a repair</li>
                <li>Contact Microsoft Technical Support to get their assistance with the repair operation</li>
                <li><strong>DO NOT</strong> run DBCC CHECKDB with "repair_allow_data_loss" before speaking to Microsoft first! This may cause data loss unnecessarily and will quite possibly make the situation worse</li>
                <li>If possible find a location to restore at least one backup to</li>
                <li>Investigate the underlying cause of the corruption. It is important to understand why the corruption occured before it occurs again</li>
            </ul>

                 </div>
            '

	EXEC msdb.dbo.sp_send_dbmail
		@profile_name = @mailProfileName,
		@recipients = @toEmailAddresses,
		@body_format = 'HTML',
		@subject = 'WARNING: A potentially serious issue has been detected during overnight database consistency checks',
		@body = @emailBody

	-- This is minging but depending on how the code above fails, we could still have an open cursor at this point.
	-- Close it and ignore the possibility that its already closed
	BEGIN TRY
        --clean it up
		CLOSE curDatabases
		DEALLOCATE curDatabases
    END TRY
    BEGIN CATCH
        -- Would happen if the cursor is already closed. Wishing TSQL had a TRY CATCH FINALLY construct here
    END CATCH

	-- Raise an additional exception to cause any calling job to also fail. This will increase the chance of detection if there is a problem
	-- with database mail
	RAISERROR (@catchErrorMessage, 11, 1)

END CATCH

Disaster Recovery, High Availability, Performance, Recovery, SQL Server, Troubleshooting

How to halve your database recovery time in 60 seconds

Introduction

Picture the scene. Your sitting at you’re desk one morning and a user bursts in in a blind panic and tells you your company’s most critical system is down and your business is currently dead in the water.

Hundreds, maybe thousands of users are now sitting idle wondering what’s going on. Scores of customers are also getting increasingly irate with each passing second. Senior management are asking what the hell is going on.

After some frantic phone calls and emergency investigation work you realise that something catastrophic has happened to the single SQL Server database that underpins your entire operation. The damage to the database is so bad that the only way to recover it is a full database restore from last nights backup.

Oh, and the database is 500 GB+

  • Lets leave aside the time it takes you to figure out roughly what’s going on
  • Lets leave aside the time it takes for you to decide that a restore is the only option
  • Lets also leave aside the  fact that you have no replacement hardware and assume you have a perfectly sized location ready and waiting for you (really?)
  • Lets leave aside any time that may be required to find and transfer backups to where they need to be
  • Lets assume we don’t have the added complexity of needing to figure out how to do a transaction log restore because you’ve never actually needed to do one

All we’re interested in is how long it would take from the moment you ran the RESTORE command, for the database to be fully recovered?

Or put another way, how long would your business be well and truly up shit creek?

Instant File Initialisation

If you would be the poor happless schmuck who has to paddle the business out of said creek using only your hands, you owe it to yourself to take a look at whether your databases could take advantage of Instant File Initialisation.

If you deal in “large” databases, you should check this out right now because this feature may turn out to be the difference between a shit day and a very shit day.

Instant File Initialisation is a feature introduced in SQL Server 2005 that allows SQL Server to allocate MDF space instantly without having to zero initialise it first.

There are a range of scenarios that this improves (creating and autogrowing databases amongst them) but by far the most important scenario it improves is during a database recovery.

The concept of zero initialisation probably doesnt mean that much to you if you are dealing with small-ish databases. However if you deal in databases in the hundreds of gigabyte range or larger then you really need to understand the impact it will have.

In the scenario I described above, as part of a recovery of 500 GB SQL Server would have to write ~500 GB of zeros to the recovered data file before even starting to restore actual data.

I’m not going to explain how to switch Instant File Initialisation on or provide much background material on the feature. Far smarter people than I have already done this.

Check out the following couple of articles for the necessary details:

  • Kimberley Tripp did a great article on this back in the day. She also gave the most understandable description I’ve seen about why some hyper security concious setups might not want to do this (but in fact probably still should)
  • You can also look here for the basic details on how to enable this feature.

In the rest of this post I’m going to show a quick test I did on a 260 GB database to demonstrate the scale of improvement this simple change can have.

The Test

Equipment

  • HP Elite Book
  • Quad Core I7
  • 32GB RAM
  • C Drive: 7200 RPM disk with 20GB SSD cache
  • E Drive: 7200 RPM external disk over USB 3.0

Database to Recover:

  • 123 GB compressed backup
  • 267 GB Data file
  • 65 GB Log file

To avoid disk contention as much as possible, the backup file is located on E and the database will be recovered to C. Not exactly server hardware but its the best I could do!

Test 1: Instant File Initialisation Off

  • Total Time to Recover: 2 hours 15 Minutes

Test: 2 Instant File Initialisation On

  • Total Time to Recover: 1 hour 19 minutes

Conclusions

In a recovery crisis, having Instant File Initialisation enabled could reduce your restore time by approaching half.

In the scenario we have above, we could improve the situation even further by attempting to minimise the size of the log file. At 65 GB we still need to zero initialise a substantial amount of data. This is because unlike data files the log file must be zero initialised before use. If we could control the size of the log file (perhaps through changing bulk import or indexing strategies) we could potentially minimise this cost and reduce the time a bit further.

I say a bit because as far as I can tell SQL Server seems to recover both the data file and log file in parallel rather than sequentially. I’ve never looked in detail at what exactly SQL Server is doing whilst writing to both the data file and log file at the same time, but at a guess it could be recovering the data file whilst simultaneously zeroing out the log.

This means it’s not as though the recovery is completely stalled whilst zero initialisation takes place on the log.

If we extrapolate the numbers up a bit, if we were dealing with a 500 GB database we might expect that to restore in about 4 – 5 hours. With Instant File Initialisation enabled that number might be around the 2 – 2.5 hour mark.

This probably won’t be much comfort the day this happens to you for real, but at least you’ll know that you’ve made the best  you could have out of a pretty bad hand

I hope this helps someone!

Disaster Recovery, High Availability, Performance, Recovery, SQL Server

Monitoring Dirty Pages in the Buffer Pool

A few people have commented that they found a post I did a while back on per database RAM usage quite helpful. You can see that article here:

https://rule30.wordpress.com/2014/03/31/sql-server-determining-which-database-is-using-the-most-memory-right-now/

I didn’t mention it at the time but I was actually trying to do something a little bit more subtle than just seeing which databases were using the most RAM.

The additional part of what I was trying to do was to determine how much of each database is currently “dirty” and waiting to be flushed to disk at the next database checkpoint.

This might be of interest if you are looking at what volume of changes would need to be reapplied during the REDO phase of a database recovery or failover. It would be of particular interest if you are looking at how quickly a failover cluster would take to failover from one host to another as the amount of REDO work has a direct bearing on how quickly this can occur.

The script is as follows:


-- Generate various metrics on how much RAM various databases are taking up via the buffer pool along with how many pages are dirty and will be flushed at the next
-- checkpoint
SELECT 

	DB_NAME (database_id),

	-- The total number of 8K data and index pages used by the database
	COUNT(*) AS TotalNumberOfPages,

    -- Count the total number of 8K pages in memory with no known changes
	SUM(CASE WHEN Is_Modified = 0 THEN 1 ELSE 0 END) AS CleanPages,

	-- Count the total number of 8K pages in memory that have changes and will be flushed at the next checkpoint
	SUM(CASE WHEN Is_Modified = 0 THEN 0 ELSE 1 END) AS DirtyPages,

	-- The total amount of the buffer pool used by the database by all index and data pages (in megabytes)
	COUNT(*) / 128 AS TotalNumberOfPagesInMegabytes,

	-- Count the total amount of memory used by pages with no known changes - in megabytes
	SUM((CASE WHEN Is_Modified = 0 THEN 1 ELSE 0 END)) / 128 AS CleanPagesInMegabytes,

	-- Count the total amount of memory used by dirty pages that will be flushed at the next checkpoint - in megabytes
	(SUM(CASE WHEN Is_Modified = 0 THEN 0 ELSE 1 END)) / 128 AS DirtyPagesInMegabytes

-- NOTE: Will return 1 row for *every page* currently sitting in the buffer pool.
-- On large systems this can return a large numbers of rows
FROM
	sys.dm_os_buffer_descriptors WITH (NOLOCK)

GROUP BY
	database_id	

ORDER BY
	COUNT(*) DESC

This should give results along the lines of:

DirtyPages

For most users this information is probably fairly academic, but if you have a very write-heavy system – the amount of data getting checkpointed out over time might of some interest to you, particularly if you are responsible for HA/DR SLAs.

Best Regards

High Availability, Recovery, SQL Server

Rookie SQL Server Mistake: Check your editions BEFORE doing a massive restore

A quick tip to anyone doing a large restore of SQL Server databases – check the two editions BEFORE starting. By edition, I don’t mean 2012 -> 2014, I mean Enterprise Edition-> Whatever

I recently had to backup, transfer then restore a 300GB database. Checked the versions – 2008-> 2008 no problem. So off it went.

An hour and a half passes and then I get:

Database ‘<Database Name>’ cannot be started in this edition of SQL Server because it contains a partition function ‘AuditPFN’. Only Enterprise edition of SQL Server supports partitioning. 

SQL Server Editions Gotcha

Can’t really argue with the facts there. Partitioning is indeed an Enterprise Edition only feature and I was indeed trying to restore to a Standard Edition install. I’m normally pretty good about checking my assumptions before starting but must have been having a senior moment today.

My only real beef with the error message, and the reason for this post is to warn people that you have to wait for the restore to fully complete before it will tell you that it was never going to work. This is much more annoying when restoring larger databases.

In my case it isn’t a big deal, just some wasted time. I’ll just have to install Developer Edition on my machine, which is something I should have done originally anyway.

The scenario that is a bit scarier is when you are doing a huge restore for real – in full on crisis mode. In this case – make absolutely certain that you are restoring to an appropriate edition before starting. In an emergency you won’t have an additional X number of hours to discover your restore was never going to work in the first place.

Disaster Recovery, High Availability, Recovery, SQL Server

High Availability != Disaster Recovery – SQL Server Edition

Its quite common to hear even seasoned IT pro’s use the terms High Availability and Disaster Recovery as though they are synonymous. With SQL Server in particular I think part of the confusion stems from the fact that many of the technologies involved can be employed in both areas with varying degrees of success. Further complicating matters is the fact that some of these technologies can also be used in areas such as scale out and reporting that have nothing to do with HA/DR at all.

High Availability

High Availability is the art of ensuring a system has as close to 100% uptime as is possible under normal operating conditions.

The main strategy for achieving high levels of uptime is redundancy – having 2 or more of everything. 2 or more servers, 2 or more NICs, 2 or more power supplies, 2 or more switches…. You get the idea.

The other part of this redundnacy strategy is having components, servers and software that can fail over to the redundant equipment as quickly and transparently as possible.

High Availability “events”, such as a disk or NIC failure will normally be much more common than a Disaster Recovery event. They are so common in fact that they should be considered routine in even a small-medium sized business. Disks, power supplies, switches and even entire servers fail every day but with the correct levels of redundancy users should normally be completely unaware of the failure.

Disaster Recovery

Disaster Recovery on the other hand is concerned with handling the truly extraordinary, and unless you have spent some serious coinage – it is very unlikely to be painless.

Disaster Recovery events should be the complete opposite of High Availability events. Whereas power supply, switches and even whole servers failing should be managed as routine day to day events, Disaster Recovery situations should be exceptionally rare.

In my experience Disaster Recovery has always fallen into one of two scenarios:

Scenario 1: Serious Data Loss or Corruption

This can occur through some sort of horrendous user error or some sort of system failure. Power cuts/spikes in particular seem to have a nasty habit of borking SQL Server databases. Either that or I’m just unlucky.

Scenario 2: The “Godzilla” Event

A Godzilla event is where a significant amount of equipment or even an entire site has been put out of action for a prolonged period.

Examples here would be fire, flooding, an air conditioning unit leaking fluid onto your only server cabinet or workmen cutting through a key piece of fibre. A giant rampaging lizard would also fall into this category.

Massive data loss and rampaging lizards are both most definitely disasters, but will likely require different types of response.

In the case of serious data loss or corruption we are probably looking at some sort of recovery from backups. Thankfully your primary site is still in action so total recovery time will be a function of how quickly you can find, transfer and restore from backup. There can be all sorts of issues here in practice including remembering how to perform the restore, the correct recovery point to go to and storage capacity issues whilst you do the recovery.

If you want to mitigate against the “Godzilla” event you basically need to be in the business of being able to fail key systems over to a completely different geographic site. As you might imagine the cost of this can be significant in terms of hardware, WAN links, software and technical skills. The quicker and more transparently you want this to occur, the harder it becomes.

Which technology sits where?

Below is a simple list of SQL Server and associated technologies and where I think they sit in the HA/DR landscape. Notice that some technologies sit quite comfortably in both camps, hence the confusion described earlier.

Synchronous technologies are better placed in the HA camp and Async in the DR camp. That said though – there are no hard and fast rules and I’ve seen Synchronous mirroring and replication used for DR more than once.

High Availability

Disaster Recovery

Redundant Hardware and Comms
SQL Server Failover Clustering
Automatic Page Repair
Server and Database Backups
Log Shipping

Both

Sync/Async Database Mirroring
Sync/Async SQL Server Replication
Sync/Async AlwaysOn Availability Groups
Sync/Async SAN/VM Replication

Let me know if you think I’ve missed a key technology or if you think I have any tech in the wrong place!

Cheers

Performance, Recovery, SQL Server

Delayed Durability in SQL Server 2014 – Part 3 – When should I use it?

In Part 1 of this series we introduced the concept of transaction Durability and its importance in modern relational databases. We also saw that for the first time in SQL Server 2014 Microsoft are going to allow us to control or at least influence transaction durability through its new Delayed Durability feature. In Part 2 we saw that taking advantage of the Delayed Durability feature may offer write-heavy workloads a substantial performance improvement.

In this final post I’m going to offer some brief thoughts on the questions you should ask when considering use of Delayed Durability.

Is my workload suitable?

This is the single most important question we need to answer when considering using Delayed Durability. All other factors are academic if you answer “no” to this question.

So when might a workload be suitable? The short answer to this is that your workload is suitable if you can tolerate data loss. So perhaps the question here should be, when might I be able to tolerate data loss?

When data can be reconstituted from other sources (easily)

It is not uncommon to have data being transferred between systems in a way that is somewhat repeatable. This might be some sort of import into a reporting system or overnight download of transactions. I’ve highlighted the word easily above because just because data can be reconstituted theoretically does not mean it can be done in practice. You don’t want to be figuring out the distance between theory and practice whilst you are trying to recover from some sort of data loss event!

That said, if the mechanism for reconstituting the data is relatively simple , rapid and well understood you may well be able to cope with data loss.

So when else might you be able to tolerate data loss?

When losing data in one system does not automatically imply corruption in another

When considering whether you can deal with data loss you need to think very carefully about how your system interacts with other related systems. Many modern IT systems involve multiple co-dependent databases.

If you consider a web application that is supported by 3 separate databases, is it acceptable for a transaction to go missing in one of them, but not the other two? It’s certainly possible, but you’re going to want to see the small print on that one. The devil is going to be in the detail.

Is Delayed Durability really the only option I have?

It takes a lot to saturate a well tuned database running on decent hardware. In fact, I have only ever come across 2 systems that were being clogged up by write traffic.  It is much more common to see systems struggling with read performance. In the first case the issue was a single poorly structured stored procedure causing system wide havoc. In the second case the workload was actually quite modest – it was just being run against vastly underpowered hardware (think 4GB of RAM for a 130GB database with data, logs and OS all on the C drive!).

The point is – good database design and appropriately specified hardware can easily get you into the realms of tens of thousands of transactions per second. Because Delayed Durability implies data loss it should probably be well down your list of things to look into if you need to squeeze some more performance out of a system.

Bottom Line

Enabling Delayed Durability is a little bit like disabling the air bags in your car. For 99% of us that would be a pretty stupid and thankfully unnecessary thing to do. For the remaining 1% of systems that really need the performance we need to think carefully about how we would deal with the implied data loss in practice. We would also want to make sure that we can’t meet our performance requirements through design changes or more appropriate hardware – after all, why settle for data loss if we don’t need to?

The addition of Delayed Durability in SQL Server 2014 is a welcome one. It brings SQL Server more into line with both Oracle and MySQL in terms of giving the DBA/Developer the flexibility to control how strictly SQL Server enforces the Durability aspect of ACID compliance.