Disaster Recovery, High Availability, SQL Server, SQL Server 2016

Temporal Tables Coming in SQL Server 2016

A while back I wrote a piece about how Oracle Flashback was the only remaining Oracle feature that I would love to see in SQL Server. This came after many years early in my career when I was hit with occasional bouts of database envy.

You can see the full details of why I hanker for this feature here, https://rule30.wordpress.com/2014/04/23/the-only-oracle-feature-i-actually-still-want/ – but in short it's because it lets you do this:

SELECT * FROM Orders
AS OF TIMESTAMP -- Say Wwwwhat!!
TO_TIMESTAMP('2014-04-23 09:30:00', 'YYYY-MM-DD HH:MI:SS')

The great news is that Satya Nadella has clearly been checking out the blog here and “flashback” queries are going to be included in SQL Server 2016 under the name Temporal Tables or System Versioned tables. Full details here:

Temporal Tables in SQL Server 2016

The key syntax is shown below:

-- To the DeLorean!
SELECT * FROM Orders 
FOR SYSTEM_TIME AS OF '2014-04-23 09:30:00' 

Fantastic! The key use case I currently have is providing a recovery option for managing VLDBs, but there are also very obvious applications in BI and auditing, given that the feature has been implemented as a simple first-class concept in standard T-SQL.

A couple of observations:

  • As of CTP 2 Management Studio tooling doesn't seem to have any support for enabling temporal/system versioned tables. Opting to create a System Versioned table throws you out to a standard coding window where you'll have to get your DDL on (see the sketch after this list). I'm guessing the tooling will catch up before release.
  • I'm really liking the simplicity of the implementation. Enabling and managing Flashback in Oracle seemed like a bit of an ordeal to me so MS are really playing to their strengths. Once created in SQL Server a temporal table is really just a standard table with a secondary hidden table being maintained for you behind the scenes – it's that simple.
  • I'm not clear which versions of SQL Server will come with temporal tables. It "feels like" an Enterprise-only feature given its high availability/big data applications but I haven't seen anything on that yet. Even at SQL Server Enterprise pricing though, it's almost certainly going to be a bargain compared to Oracle.
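
For reference, enabling system versioning from scratch is a single CREATE TABLE statement. A minimal sketch, with table and column names of my own invention rather than anything taken from the CTP:

-- A system-versioned (temporal) table: SQL Server maintains the period columns and the history table for you
CREATE TABLE dbo.Orders
(
    OrderId      INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    CustomerId   INT NOT NULL,
    OrderTotal   DECIMAL(10, 2) NOT NULL,
    SysStartTime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.OrdersHistory))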
Corruption, SQL Server

Handling real world corruption of a SharePoint Content Database

You never want to come in to work to find that one of your critical SQL Server databases is corrupt, but that's exactly what happened today due to a SAN issue. I've got a fair amount of SQL Server experience, but not so much that a failed SQL Server corruption check doesn't instinctively cause me to wonder out loud how up to date my CV is  😉

That said, we did manage to repair the corruption with no downtime and no data loss, so I thought I'd do a quick blog post describing what I did to resolve the issue in case it helps anyone facing something similar.

It's important to understand that there are a huge variety of corruption scenarios that can occur in the real world and the lessons here may well not apply to you. Our DB failed in a very specific way that lent itself to a relatively quick repair with no downtime. You may be able to take some of the ideas here and run with them though, so I thought it worth a share.

What did SQL Server think was wrong?

We picked up on the problem when one of our custom maintenance jobs emailed me the results of a failed DBCC CHECKDB command. These corruption checks are crucial as they tell you about problems whilst you still have enough backups in rotation to be able to do something about them. I've actually gone as far as running them nightly using some hand-crafted scripts that will email step-by-step guidance to the IT team on how to handle the problem in case I'm on holiday. If you aren't doing frequent DBCC CHECKDBs, you really need to start now.

The first thing we want to do is find out what SQL Server thinks is wrong. The key thing to do here is not panic. The SharePoint system in question was clattering along quite happily at this point, so there was plenty of time to gather diagnostics.

To do that, we run a full CHECKDB. This will give us SQL Server's assessment of what's wrong and would likely be the first thing MS would ask for if you called them:

DBCC CHECKDB ('<database>') WITH NO_INFOMSGS

This gave us the following output, which provided our initial handle on how severe the problem was (apologies for the formatting):

Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373408), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373409), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373410), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373411), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373412), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373413), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373414), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373415), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8929, Level 16, State 1, Line 1
Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594094747648 (type In-row data): Errors found in off-row data with ID 1311506432 owned by data record identified by RID = (1:2277271:5)
Msg 8929, Level 16, State 1, Line 1
Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594094747648 (type In-row data): Errors found in off-row data with ID 333774848 owned by data record identified by RID = (1:2282679:0)
CHECKDB found 0 allocation errors and 10 consistency errors in table 'AllDocStreams' (object ID 373576369).
CHECKDB found 0 allocation errors and 10 consistency errors in database 'WSS_Content_CustomerDocs'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (WSS_Content_CustomerDocs).

Seems fairly scary but I actually started to relax a little once I read this, for reasons that will become apparent.

The first thing to understand about this database is that it is a large SharePoint content database.

This is actually great news because it means that the *vast* majority of the data in the db is not actually *row* data; it's going to be BLOB data representing the various files in SharePoint. I say good because in a 100GB content database, sheer probability suggests that if we did get a small corruption it would very likely fall within the BLOB data of a single file rather than in row data or the intrinsic data structures of SQL Server. Actual corruption to rows or intrinsic data structures is far more problematic, and I'd gladly write off a Word document or PowerPoint presentation if it meant not having to deal with that 😉

Reading through the output we can see that SQL Server is having an issue reconciling the row data stored on a single 8K page with the binary LOB data it stores “off row”. This tells me the following:

  • The root issue is with a single 8K page
  • The broader issue with that 8K page is that some pointers to the "off row" data seem a bit messed up.
  • Given the number of “off row” pages involved (8), the corruption seems to be very limited in scope – probably about 64KB given an 8K page size. It’s quite likely in fact that if we could pin it down, we might only be dealing with one or two corrupt files

In the grand scheme of things, this isn’t actually that bad.

Finding out what’s actually damaged

We now have a fairly good assessment of the scale of the corruption but we don't actually have a great understanding of what that corruption corresponds to in the real world. Given that we're talking about a SharePoint content database here, that binary data could correspond to a system-critical .aspx page or the sales guys sharing a humorous gif:

[Image: the dancing baby gif]

The key parts of the error message I'm now interested in are the two lines near the end, which say something along the lines of:

Errors found in off-row data with ID 1311506432 owned by data record identified by RID = (1:2277271:5)

We’re also told that the only errors were found in the AllDocStreams table.

The RID referenced above is crucial because if I can determine which actual row RID (1:2277271:5) corresponds to, I have a very good chance of determining what the real-world file is.

Using the undocumented sys.fn_PhysLocFormatter function we can do just that – take a low-level RID and use it to identify the human-readable row. The full (SharePoint-specific) query I used is as follows:

-- Identify the corrupt rows by the physical location reported by DBCC CHECKDB
-- Exclude the content column as this will probably cause a crash
SELECT [Id]
,[SiteId]
,[InternalVersion]
, sys.fn_PhysLocFormatter (%%physloc%%) AS [Physical RID]

FROM
[AllDocStreams]

WHERE
sys.fn_PhysLocFormatter (%%physloc%%) LIKE ('%(1:2282679:0)%') OR sys.fn_PhysLocFormatter (%%physloc%%) LIKE ('%(1:2277271:5)%')

The results returned 2 rows indicating that as expected, at most 2 document streams were involved.

We then did a query on the document IDs against SharePoint's AllDocs table to get the actual locations and file names of the corrupt files.


-- Check the problem rows are readable from the corrupt table
SELECT 
	AllDocStreams.Id, AllDocStreams.SiteId
FROM AllDocStreams

WHERE Id = 'E55D0C17-239F-4F9A-8BBE-2D7B89A5FAAD' OR Id = '749E1AA8-A96F-48F8-8EEF-B40AF72477B9'


-- SharePoint Specific
-- Now we've identified the problem document *stream*, use the same ID to find out what the actual document name and location is in SharePoint
SELECT * 
FROM AllDocs

WHERE Id = 'E55D0C17-239F-4F9A-8BBE-2D7B89A5FAAD' OR Id = '749E1AA8-A96F-48F8-8EEF-B40AF72477B9'

The results can be seen below. I’ve had to obscure some of it but you can hopefully make out that the two problem files are effectively 2 Word documents.

[Image: query results – reverse engineering the corrupt rows]

So – we’ve managed to confirm that our relatively scary corruption scenario is actually just a problem with 2 files.

The structure of the database is fine and there is no nasty low level row data involvement that would lead us to be worried about row inconsistencies if we tried a repair.

With the two document locations in hand I then went into SharePoint and attempted to open the documents in question. Sure enough the documents were corrupt, but crucially SharePoint didn't have any issues attempting to open or download them. This was a very strong indicator that the corruption was contained entirely in the files themselves and that both SharePoint and SQL Server itself were either oblivious to or unconcerned by the corruption within.

OK – so this next bit was a bit of a gamble as there was no real guarantee over what would happen. It was better than the alternatives I had on offer though, so it was worth a whirl.

After checking that the documents were non-critical or could be retrieved from secondary sources, I simply deleted them from SharePoint. I figured there was a chance that SharePoint would crash at this point, but I had a strong hunch that we would get away with it. My strong feeling was that the corruption was solely in the payload of the documents, and that even SQL Server was unaware of it at a runtime level. It is only when you ask it to go looking via a CHECKDB that it reports the problem.

I had to delete the documents from the document libraries, the site recycle bin and the site collection recycle bin, but once that was done all seemed well. I ran another quick DBCC CHECKDB and it passed without issue.

Corruption begone!

Conclusion

I’ve got to be honest, the sort of corruption that we had was at the more benign end of the spectrum. No row data or internal data structures were actually involved in the corruption. We also got lucky that a straight delete of the data from SharePoint even worked.

That said, if there is a lesson to be learned from the episode it would be – “don’t overreact”.

Check what the actual corruption is using DBCC CHECKDB and think logically about what is likely to be damaged given the sort of data you're storing.

If the system is still online (which it often will be) don't feel compelled to act – take your time, gather information and you may well find that you can do a complete repair without a moment's downtime.

HA/DR, SQL Server

Fixed: "The SQL Server failover cluster services is not online" on SQL Server 2008 R2/Windows Server 2012 R2

When attempting to install an SQL Server 2008 R2 cluster on Windows Server 2012 R2 you may receive the error message "The SQL Server failover cluster services is not online" when running setup on the first node.

If you are certain that your cluster is running at the Windows level and appears healthy, then your issue could be that a deprecated Windows Clustering feature – the Failover Cluster Automation Server – is not installed by default from Windows Server 2012 onwards. SQL Server 2008 R2 failover cluster setup seems to depend on this feature.

To remedy the situation open an elevated powershell window and run:

Install-WindowsFeature -Name RSAT-Clustering-AutomationServer

More information can be found here: http://blogs.msdn.com/b/clustering/archive/2012/04/06/10291601.aspx

Disaster Recovery, Performance, SQL Server

Reporting DBCC SHRINKFILE Progress in SQL Server

I've been doing some large-scale data deletion from SQL Server after some of our SharePoint databases got blown out of all proportion due to a simple versioning misconfig.

We've been deleting hundreds of gigs out of 2 content databases, and this leaves us with the inevitable problem of needing to shrink the MDF files to get the space back. The immense fragmentation problems this will cause us are well documented, but today we have the more immediate problem of trying to tell how much longer the DBCC command is going to take.

Well, we can't actually tell that precisely. The best we can do is make an educated guess based on how much movement is occurring in the data files via Resource Monitor. There is one other tool you can use in your guesstimate, in that certain types of command will report their estimated completion via the sys.dm_exec_requests DMV. If you know how long you've been running so far, and how far through processing SQL Server thinks it is, you can guess fairly accurately when you'll be done.

The following query will tell you SQL Server's best guess at how far through processing your DBCC command is:

-- SELECT all active requests across all databases. 
-- NOTE: percent_complete is only reported for certain command types (e.g. backups, restores, DBCCs etc.)
SELECT
	txt.text AS CommandText,
	req.status AS Status,
	req.command AS CommandType,
	db_name(req.database_id) AS DatabaseName,
	req.cpu_time AS CPUTime,
	req.total_elapsed_time AS ElapsedTime,
	req.percent_complete AS PercentageComplete

FROM
	sys.dm_exec_requests AS req

CROSS APPLY
	sys.dm_exec_sql_text(req.sql_handle) AS txt

Note that standard DML commands (INSERT/UPDATE/DELETE etc.) do not report progress as far as I'm aware, but there are a number of DBCC, INDEX and BACKUP/RESTORE commands that do, and this can be useful during long-running maintenance operations.

I'm not sure how accurate the percent_complete column is overall. For example, during SHRINKFILE operations it seems to have a strong urge to tell me everything is around 69% complete within a few minutes, then spend another hour grinding. I've had more luck with accuracy when using it for backups, so your mileage may vary.

NOTE: This query can also act as a basic "what's running right now" query if a server seems to be struggling. This can help identify long-running queries or backups/index rebuilds that have overrun the maintenance window.

I hope that helps someone

Disaster Recovery, High Availability, Recovery, SQL Server

SQL Server Database Corruption – Basic Guidelines for Successful Recovery

An SQL Server corruption event can be a uniquely stressful problem for support staff. Unfortunately it frequently entails an extremely high pressure problem falling to a support engineer who is not specifically experienced in SQL Server but who is expected to perform a potentially quite sophisticated recovery with little or no notice.

This unhappy combination can lead to very negative outcomes for customers. It is very common for even an experienced SQL Server DBA acting under pressure to make the problem substantially worse than it needed to be. This typically occurs because, in the absence of prior experience, we default to our inbuilt support instincts. Usually those instincts serve us very well, but as we will see, in the case of corruption blindly acting on instinct will often lead to significant data loss or downtime.

This quick guide is aimed at providing some very rapid guidance on what to do and more importantly what not to do in a corruption scenario. It’s a work in progress and simply represents the items I immediately think of when a corruption occurs. Feel free to post your suggestions for additional items.

A quick note on the applicability of this advice:

  1. It assumes that you are not an expert in SQL Server. You may be considered an “accidental” DBA or someone who has been drafted in to assist simply because you have more knowledge than the next guy/gal. Each guideline below has exceptions but you really need to be quite knowledgeable in SQL Server recovery to know when it’s appropriate to deviate from the general advice below.
  2. It assumes the data in question is valuable. If you are dealing with a database where uptime is more important than minimising data loss, then consider going straight to backups now. Each customer has different needs so you need to know what’s important to them before acting.

The Guidelines – Checklist

The DON’Ts

What not to do is substantially more important than what you should do. If you remember nothing else from this document, remember the following DON'Ts. Full explanations are provided later.

  • DON’T panic – your system may well still be online
  • DON’T just wade in – you will make things worse
  • DON’T restart the server – it won’t help and will likely make things worse
  • DON’T detach the database – again, it will likely make things worse
  • DON’T run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS
  • DON’T delete or rebuild the transaction log

The Dos

Now the Dos. These are almost always sensible steps and in a time-critical situation should be done in parallel as far as is practical. Draft in help if you need to, as it could considerably shorten your overall recovery time.

  • DO start a DBCC CHECKDB with ALL_ERRORMSGS and NO_INFOMSGS immediately
  • DO get Microsoft Support involved immediately
  • DO isolate and prepare the last known good backups immediately

The Guidelines – Detailed explanations

The Don’ts of database corruption

DON’T panic – your system may still be online

Why?

In probably about 80% of database corruption scenarios I’ve been involved in, we were actually only aware of the corruption due to nightly corruption checks. If we hadn’t been running those checks we probably could have gone months without noticing the corruption from the end user perspective.

The fact is, many databases these days are big and the vast majority of the information is not touched on a daily basis. You could have a 400GB database with 64KB of corruption and it not matter a jot if it's in an area of the database that doesn't routinely get accessed.

The point is – just because your db is corrupt, does not mean you are dead in the water. If your systems are still up, take your time and plan the best course of action.

DON’T just wade in

Why? When a corruption event occurs, particularly in a mission-critical database, your first instinct is going to be to act. You can be under a lot of pressure to do something, anything, that will get the customer back online. You need to ignore this pressure and your instincts, because if you don't you could make the situation worse.

Formulate a plan. The plan may be as simple as getting on the phone to Microsoft or getting in touch with a colleague that has SQL Server experience. But don’t just start acting without knowing what you’re doing, otherwise you’ll end up doing some variation of one of the DON’Ts listed below

DON’T restart the server

Why? If you know you have genuine database file corruption, the chances that rebooting the server will help matters are vanishingly small. SQL Server is not stupid – if it's telling you that it has found corruption on disk, there is corruption on disk, and far from helping matters, restarting the server (or even the SQL Server service) may make things substantially worse.

The key to understanding this is to understand that just because your database is corrupt does not mean that it is offline. The scale of corruption could be as simple as a single 8K index page out of millions meaning the vast majority of your users may be completely unaffected by the problem. It is entirely possible to recover from some types of corruption without any down time whatsoever.

When you restart the server (or service) SQL Server will perform certain consistency checks as it tries to bring the database online and if these fail, the database will be marked as suspect. If this happens you will have taken a poorly but working database and put it beyond use for all users until you can effect a repair. This is not a good outcome for the customer!

DON’T detach the database

Why? You may be tempted to start doing all sorts of things involving detaching and reattaching the database, perhaps to create an additional copy to work with. Again – don’t do this unless you know exactly what you are doing. The reasoning is similar to the previous guideline but the end result of ignoring this one is actually worse.

Again, even with corruption the database may be perfectly usable for the vast majority of users. However, SQL Server will not let you attach a database it believes to be corrupt. So if you detach it and attempt to reattach it – you may end up with a bigger problem on your hands than if you had left it alone. The way round this is known as the “hack-attach” method and is not something you want to have to figure out how to do in an emergency, especially considering all it will achieve is get you back to where you were in the first place.

DON’T run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS

Why? You are most likely to try this if you are frantically looking around the internet for advice on what to do. DO NOT follow this advice – at least not yet.

The Repair_Allow_Data_Loss option of DBCC CHECKDB is usually a last resort. It is what you do when you have no backups and are truly in dire straits. I would go as far as to say that even if you know the implications of what this command does, you should only do it under the advice of someone pretty senior at Microsoft.

The implications of running a repair with data loss are both obvious and non-obvious. The obvious implication is that you may lose data, which may or may not be a problem in itself.

The less obvious implication is that your database may now be inconsistent.

This means inter-table data dependencies may be broken, programmatic assumptions made by your line-of-business applications may no longer hold true, and FK constraints may be completely violated. Depending on the nature of the data, this inconsistency may be substantially more problematic than an actual loss of data.

Recovery from backup is usually preferable to this option, so if anyone tells you to use it and they don’t work for Microsoft – ignore them, unless you truly understand the implications of this option.

DON’T delete or rebuild the transaction log

Why? This is another example where you'll see internet "gurus" giving "advice" to people who are freaking out about log file corruption. To be sure, log file corruption isn't great – but deleting the good transactions in the log file is probably not what we'd call "Plan A".

If you fully rebuild the log file there is a reasonable chance that your data will be left in an inconsistent state. Any transactions that were in flight at the time of the corruption will be ripped out of the database with little regard to data consistency. This is the more benign scenario.

More problematic would be where SQL Server was performing some form of internal structural change (for example moving or allocating an index page), all of which are logged in the same way as a data change. In this scenario ripping out the log file could leave the actual database structure in a very bad way.

In most cases here the solution will be to go back to your transaction log backups.

Even if you don’t have transaction log backups, speak to MS directly to get their advice on options because rebuilding the log can be quite similar to CHECKDB REPAIR_ALLOW_DATA_LOSS in terms of its undesirable effects.

The DO’s of database corruption

The list of what you actually should do during a serious corruption event is reassuringly short. If you do these next three items, and none of the Don'ts listed above, you should feel reassured that you have done a good job and made the most out of a pretty bad hand.

If at all possible start doing these three things in parallel. Draft in help if you can because having these actions underway simultaneously will considerably shorten your overall downtime.

DO run DBCC CHECKDB with ALL_ERRORMSGS and NO_INFOMSGS

Why? Unless you already have the output from a recent CHECKDB run, this is probably the first thing you should do. This is because it will be essential in figuring out the extent of the damage and can take a long time to run. Microsoft will very likely want to see the output of a full run of this so you might as well start it now.

One thing to be aware of is that CHECKDB is very IO intensive and may adversely affect the performance of other databases. That said – if you need to try and do an in-place repair of the database you are going to need the output so you might as well get it over and done with. Alternatively, if the database is online and not causing widespread issues, consider leaving the CHECKDB process until it can be done out of hours.

Note that running CHECKDB without any of the repair options is completely non-destructive. It will only collect diagnostic information so is perfectly safe and should not be confused with the Repair_Allow_Data_Loss option.
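
For reference, the diagnostic-only form I'd start with looks like this (the database name is a placeholder):

-- Diagnostic only: no repair options are specified, so nothing gets modified
DBCC CHECKDB ('YourDatabase') WITH ALL_ERRORMSGS, NO_INFOMSGS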

DO get Microsoft Support involved

Why? The reasoning behind this should be fairly obvious. With a mission-critical system you want the best possible outcome for the customer. Many system outages or data loss events imply a very real loss in terms of revenue, productivity or customer confidence, and the cost of involving Microsoft should be seen as trivial when set against those sorts of numbers.

Even if you are confident in what you are doing and what your recovery options are, I would strongly recommend talking to a senior escalations engineer at Microsoft before committing to a recovery approach that would lose data. In a system down scenario Microsoft can usually be assisting you within 10 – 20 minutes and will give you the confidence that you are taking the most appropriate steps given the customer’s particular circumstances.

DO isolate and localise backups

Why? Many, but not all, corruption events require a recovery of some sort from a recent backup. This does not imply immediately going to last night's full backup and swallowing a day's data loss. If someone immediately starts talking about restoring from last night's backup without quantifying the exact nature of the corruption – ignore them. It bears repeating that under many circumstances, with sufficient transaction log backups, a recovery of a small number of pages can be implemented with no downtime or data loss whatsoever.

The key to minimising the impact and recovery time is to isolate the most recent backups and prepare them for use. In particular, if your backups are on tape or would need to be retrieved across a 100Mb LAN – if possible try and get them “closer” to the server. The time you are running CHECKDB and speaking to MS on the phone is potentially “dead” time in which you could be doing this. This could be as simple as moving the backups from a slow NAS or tape device to a local drive on the server. Every customer environment is slightly different but think about where the backups are and whether there is any way you could make a potential restore faster by pre-positioning the backups in a better location.
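
While you are staging the backups it is also worth a quick sanity check that they are actually readable before you build a recovery plan around them. A minimal sketch, with illustrative paths:

-- Confirm what the backup file contains and that the media is readable
RESTORE HEADERONLY FROM DISK = N'D:\Staging\YourDatabase_Full.bak'
RESTORE VERIFYONLY FROM DISK = N'D:\Staging\YourDatabase_Full.bak'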

Summary

The range of possible variables during a corruption event is very broad and the choices you make can have a very real impact on the outcome.

The first thing to understand is that some corruptions can be really quite benign and if handled correctly would result in no data loss and no downtime. An example of this would be if the corruption existed only in a non-clustered index and not within an actual data page. In this case the index could be dropped and rebuilt and the impact on the customer would be very limited.
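
As a rough sketch of that scenario (substitute the object ID and index ID from your own CHECKDB output – the index and table names here are hypothetical):

-- Translate the object ID and index ID reported by CHECKDB into real names
SELECT OBJECT_NAME(i.object_id) AS TableName, i.name AS IndexName
FROM sys.indexes AS i
WHERE i.object_id = 123456789  -- object ID from the CHECKDB output
AND i.index_id = 2             -- an index ID greater than 1 is a non-clustered index

-- Rebuilding reads the base table, not the damaged index pages
ALTER INDEX IX_SomeIndex ON dbo.SomeTable REBUILD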

Equally, corruption of a small number of data pages could potentially be repaired online with little or no data loss by doing a selective page restore.
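
A minimal sketch of what a page restore looks like, assuming Enterprise Edition (for online restore), an unbroken log backup chain, and illustrative database, file and page names:

-- Restore just the damaged page(s) from the last good full backup
RESTORE DATABASE YourDatabase
PAGE = '1:373408'
FROM DISK = N'D:\Backups\YourDatabase_Full.bak'
WITH NORECOVERY

-- Roll forward through the existing log backups
RESTORE LOG YourDatabase FROM DISK = N'D:\Backups\YourDatabase_Log_1.trn' WITH NORECOVERY

-- Take a fresh tail-of-log backup and restore it to bring the pages fully up to date
BACKUP LOG YourDatabase TO DISK = N'D:\Backups\YourDatabase_Tail.trn'
RESTORE LOG YourDatabase FROM DISK = N'D:\Backups\YourDatabase_Tail.trn' WITH RECOVERY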

It is certainly true that serious data corruptions can occur and need to be taken very seriously. The key to a successful outcome in all cases though is first and foremost to do no harm. Don’t act instinctively and wherever possible involve Microsoft or an SQL Server expert at the earliest opportunity.

Update: Brent Ozar recently provided a mailshot with a couple of useful additional tips in it. You can read it here: What to Do When DBCC CHECKDB Reports Corruption

SQL Server

The only Oracle feature I actually still want

Introduction

I really like SQL Server. Early in my career I had more than one opportunity to take the Oracle path, but there was always something about SQL Server’s ease of use and “good enough” approach to performance and high availability that I liked. That and the fact that most of our customers didn’t have 250K to spend on a “basic” database setup pretty much settled matters.

Although it has become increasingly rare, I am very occasionally hit by an acute episode of database envy.

As of SQL Server 2014 I think there is really only one Oracle feature that I would absolutely love to see in SQL Server. There are other features I think are cool (genuine scale-out clustering for example), but this is the only feature that could seriously make me more relaxed about using SQL Server in high availability setups.

Oracle Flashback Technology

Oracle Flashback Technology is best demonstrated via some very simple code:


SELECT * FROM Orders
AS OF TIMESTAMP        -- SAY WHAT!!!!
TO_TIMESTAMP('2014-04-23 09:30:00', 'YYYY-MM-DD HH:MI:SS')

 

Now – allow that code to just detonate in your brain for a second… It’s allowing you to query the live system as it was at some point in the past.

In day to day development use that code could probably be described as “quite handy”.

In a VLDB, 24/7 environment that code is absolute feature dynamite.

Why does this matter?

To picture why this feature is bordering on essential in a VLDB 24/7 environment, imagine what you would be faced with doing in the following scenario:

  • You have a 1 terabyte OLTP system that needs to be available 24/7
  • A power user or developer advises you that they have accidentally modified a substantial number of extremely important records
  • Your primary 24/7 system is now offline or at best severely impaired

What are your options in SQL Server land?

  • You have a highly available system using some clever combination of clustering, synchronous mirroring and maybe even asynchronous mirroring to a secondary data centre. However, by the time you find out about the problem the corruption has already been sitting replicated at the secondary location for half an hour – so you’re bang out of luck there.
  • You have a policy of using database snapshots when applying critical database updates internally. However, this was not a scheduled update. It was a power user gone rogue and so there is no useful snapshot that can be used to help extricate us from the situation.
  • You are backed up to the hilt and could theoretically identify the exact moment the corruption occurred. You could even theoretically recover to just before that point. But it’s a 24/7 system – you can’t just stop the business whilst you perform a restore over the live database.
  • You could recover to a secondary location and repair the live system by hand using T-SQL. But we are dealing with 1000 gigabytes of data. Do you have a terabyte of SQL Server storage sitting around for just this eventuality? Assuming you do, do you have an abundance of time on your hands where you can sit back and wait for a restore to complete? It's going to take many, many hours so I hope the answer is yes!

The bottom line is that SQL Server does not have any good options for this sort of scenario.

Off the top of my head, to have any sort of solution at all to this problem you could look at something like log-shipping-with-delay to give yourself at least a fighting chance. I’ve also heard of people taking hourly snapshots using schedules and scripting, just in case someone deletes something. That doesn’t make me very happy either.
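
For what it's worth, the scheduled snapshot approach only takes a few lines of T-SQL. A rough sketch, assuming Enterprise Edition and a single data file whose logical name is Orders_Data (all names and paths here are illustrative):

-- Create a point-in-time snapshot of the Orders database (typically run on a schedule)
CREATE DATABASE Orders_Snapshot_0900
ON (NAME = Orders_Data, FILENAME = N'D:\Snapshots\Orders_0900.ss')
AS SNAPSHOT OF Orders

-- If someone mangles data after 09:00 you can query the snapshot (or copy rows back from it)
SELECT * FROM Orders_Snapshot_0900.dbo.Orders WHERE OrderId = 12345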

Unfortunately the options available do not even come close to the simplicity and immediacy of Oracle’s Flashback feature.

I don't know the particulars of how Flashback is implemented in Oracle. But however it's done, it's smart as hell. It's also the one remaining feature I secretly covet for SQL Server and I really hope MS do something similar one day.

High Availability, Recovery, SQL Server

Rookie SQL Server Mistake: Check your editions BEFORE doing a massive restore

A quick tip to anyone doing a large restore of SQL Server databases – check the two editions BEFORE starting. By edition, I don't mean 2012 -> 2014, I mean Enterprise Edition -> Whatever.

I recently had to back up, transfer and then restore a 300GB database. Checked the versions – 2008 -> 2008, no problem. So off it went.

An hour and a half passes and then I get:

Database ‘<Database Name>’ cannot be started in this edition of SQL Server because it contains a partition function ‘AuditPFN’. Only Enterprise edition of SQL Server supports partitioning. 

[Image: SQL Server editions gotcha]

Can't really argue with the facts there. Partitioning is indeed an Enterprise Edition-only feature and I was indeed trying to restore to a Standard Edition install. I'm normally pretty good about checking my assumptions before starting, but must have been having a senior moment today.

My only real beef with the error message – and the reason for this post – is that you have to wait for the restore to fully complete before it will tell you it was never going to work. This is much more annoying when restoring larger databases.

In my case it isn’t a big deal, just some wasted time. I’ll just have to install Developer Edition on my machine, which is something I should have done originally anyway.

The scenario that is a bit scarier is when you are doing a huge restore for real – in full-on crisis mode. In this case, make absolutely certain that you are restoring to an appropriate edition before starting. In an emergency you won't have an additional X number of hours to discover your restore was never going to work in the first place.
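
Happily, the check itself only takes seconds. Running the query below against the source database (before you even start the backup) lists any edition-restricted features, such as partitioning, that would block a restore to a lesser edition. The database name is a placeholder:

-- Run in the context of the source database
USE YourDatabase
SELECT feature_name, feature_id
FROM sys.dm_db_persisted_sku_features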