.net, Troubleshooting, Visual Studio

Stopping vshub Requests Spamming Fiddler

I’ve been having some issues lately when debugging ASP.net web applications using Visual Studio and Fiddler. If you use Fiddler whilst running the app in debug mode, you’ll find it gets spammed with traffic sent to Visual Studio as your app runs.

The key symptom is you have hundreds of requests that look something like:

http://localhost:49161/vshub/bb195f2e0d5c4765b9411f7bf479d012/DataWarehouseModule/dataWarehouse/getResult/

Fiddler flooded with vshub requests

Aside from being slightly irritating, the volume of traffic is sufficient to make Fiddler borderline useless as all genuine traffic will be drowned out.

I didn’t have much luck applying standard Fiddler Filters for some reason. Most likely I was just wrong on the syntax for the exclusion.

Digging in a bit further, the traffic is being sent to the Diagnostic Tools window in Visual Studio and is used to populate those kinda cool/kinda horrendously distracting memory and processor charts.

Given that I don’t spend my entire Visual Studio day preoccupied with memory usage and CPU load it seems a little overkill to be sending that data by default. As such, to stop the spamming, open the Diagnostic Tools window (Ctrl + Alt + F2) and deselect the Memory Usage and CPU Usage tools as shown below:

Diagnostic Tools Window

This should stop the spam traffic, but you will need to remember to manually re-enable the Diagnostic Tools if you are doing some profiling that requires them.


sharepoint, sharepoint 2013

Recursively Copy Document Libraries from SharePoint 2010 to SharePoint 2013

We’ve recently been looking at migrating quite a large amount of content from our legacy SharePoint 2010 Farm to a new clean SharePoint 2013 installation. For a variety of reasons we didn’t want to go the whole content database->site collection upgrade approach, not least of all because we really wanted a clean break from what has been quite a naughty SharePoint 2010 farm. The thing is creaking, it hates all forms of human life and is quite possibly haunted. Because of this we wanted to bring the binary file data across and essentially nothing else.

Migrating documents around would seem to be one of those things that an Enterprise class document management platform should have nailed. However, it turns out it’s not quite as straightforward as you might think.

Perhaps I’m asking too much, but there didn’t seem to be any simple way of saying, “recursively copy all that, and place it there”.

I came across scores of roll-your-own scripts, C# apps and seriously expensive bits of software that would charge me by the gigabyte. For various reasons most of the approaches I tried didn’t pan out. We either had to spend a ton of cash on a pro piece of software to (presumably) do what we needed, or spend a lot of time writing our own recursive code in either PowerShell or C# and work around the fact that there is no real recursive copy function in the SharePoint APIs – at least not one that I found.

The approach I ended up going with is undoubtedly brute force, but it has been extremely effective. The process is essentially a PowerShell script that does the following:

  1. Loop through all the target document libraries at the source site
  2. Dynamically create a *mapped drive* to both the Source and Destination locations
  3. Robocopy the files between the two mapped folders – taking advantage of the fact that robocopy doesn’t care that this is sharepoint behind the scenes and handles recursive copy like a walk in the park
  4. Wonder why that wasn’t considerably easier

There is one caveat here – because we were rearranging our library structure, I had already pre-created the destination document libraries. You may well not be in this scenario, in which case you will need to tweak the script to potentially create a document library at the correct location as you go. This would probably be a one line change.
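
If you do need to create the libraries on the fly, a minimal sketch of that change is below. It reuses the $destinationWeb, $currentSourceLibrary and $listTemplate variables from the script that follows, is untested in my environment, and the exact placement inside the loop is up to you:

# Hypothetical addition inside the foreach loop: create the destination library if it doesn't already exist
$destinationLibrary = $destinationWeb.Lists.TryGetList($currentSourceLibrary.Title)
if ($destinationLibrary -eq $null) {
	# Lists.Add returns the GUID of the new list; pipe to Out-Null to keep the console output tidy
	$destinationWeb.Lists.Add($currentSourceLibrary.Title, $currentSourceLibrary.Description, $listTemplate) | Out-Null
}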

The following script will need to be tweaked to get your source and destination library locations lining up. It won’t work right off the bat, but I did want to provide it as a sample to demonstrate that PowerShell + robocopy can be used to migrate a large amount of content, as it took me waaaayy too long to get to this point (thanks MS).


cls

$currentSourceDriveMapping = ""
$currentDestinationDriveMapping = ""

$sourceWeb = Get-SPWeb "http://mysourcelocation"
$destinationWeb =  Get-SPWeb "http://mydestinationsite"

Write-Host "Connected to $sourceWeb"
Write-Host "Connected to $destinationWeb"

# Release any stale drive mappings left over from a previous run (errors here can be ignored if nothing is mapped)
Write-Host "Releasing any existing mapped drives"
net use s: /delete 2>$null
net use t: /delete 2>$null

$sourceLibraries = $sourceWeb.Lists

Write-Host "$($sourceLibraries.Count) document libraries found at source location"

$listTemplate = [Microsoft.SharePoint.SPListTemplateType]::DocumentLibrary

$processedCounter = 1

foreach($currentSourceLibrary in $sourceLibraries){
	
	Write-Host "Mapping Source Library to S drive"
	
	$currentSourceDriveMapping = "http://mysourcelocation" + $currentSourceLibrary.RootFolder.ServerRelativeUrl
		
	Write-Host $currentSourceDriveMapping
	
	# net use will create a mapped drive using the sharepoint location provided 
	net use s: "$currentSourceDriveMapping"
	
	Write-Host "Mapping Destination Library to T drive"
		
	# NOTE: Voodoo here that won't apply to you - this made sense in my environment due to restructuring that we were doing
	$currentDestinationDriveMapping = $destinationWeb.Url + "/" + $currentSourceLibrary.Description + "/Tech Notes"
	
	Write-Host $currentDestinationDriveMapping
	
	net use t: "$currentDestinationDriveMapping"
	
	Write-Host "Copying files from S: to T: using robocopy"
	
	# Robocopy all folders apart from the sharepoint specific Forms folder at the root of the source library
	# Note that robocopy is ideal for this as it already implicitly handles recursive copying
	robocopy S: T: /e /xd s:\Forms
	
	Write-Host "Releasing mapped drives"
	net use s: /delete
	net use t: /delete
	
	Write-Host "$processedCounter libraries processed successfully"
	
	$processedCounter++

}

I’ve used this script to migrate about 25GB worth of content in 2 hours. It’s not lightning fast, but it got the job done.

There are likely numerous caveats that may apply to you but didn’t to me. Amongst them is the fact that permissions, version history and metadata properties will almost certainly not be copied over.

If you just need a fairly straight recursive dump of content though, this may give you some pointers.

Disaster Recovery, High Availability, SQL Server, SQL Server 2016

Temporal Tables Coming in SQL Server 2016

A while back I wrote a piece about how Oracle Flashback was the only Oracle feature remaining that I would love to see in SQL Server. This comes after many years early in my career in which I was hit by occasional bouts of database envy.

You can see the full details of why I hanker for this feature here, https://rule30.wordpress.com/2014/04/23/the-only-oracle-feature-i-actually-still-want/ – but in short it’s because it lets you do this:

SELECT * FROM Orders
AS OF TIMESTAMP -- Say Wwwwhat!!
TO_TIMESTAMP('2014-04-23 09:30:00', 'YYYY-MM-DD HH:MI:SS')

The great news is that Satya Nadella has clearly been checking out the blog here and “flashback” queries are going to be included in SQL Server 2016 under the name Temporal Tables or System Versioned tables. Full details here:

Temporal Tables in SQL Server 2016

The key syntax is shown below:

-- To the DeLorean!
SELECT * FROM Orders 
FOR SYSTEM_TIME AS OF '2014-04-23 09:30:00' 

Fantastic! The key use case I have currently is in providing a recovery option for managing VLDBs, but there are very obvious applications for this in BI and auditing, considering the feature has been implemented as a simple first-class concept in standard T-SQL.
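
For illustration, creating a system versioned table looks something like the snippet below. This is based on the CTP 2 syntax, the table and column names are just examples, and the details may well shift before RTM:

-- Example only: a system versioned (temporal) table per the SQL Server 2016 CTP syntax
CREATE TABLE dbo.Orders
(
	OrderId			INT				NOT NULL PRIMARY KEY CLUSTERED,
	CustomerName	NVARCHAR(100)	NOT NULL,
	OrderTotal		DECIMAL(18,2)	NOT NULL,
	ValidFrom		DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
	ValidTo			DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
	PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.OrdersHistory))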

A couple of observations:

  • As of CTP 2, the Management Studio tooling doesn’t seem to have any support for enabling temporal/system versioned tables. Opting to create a System Versioned table throws you out to a standard coding window where you’ll have to get your DDL on. I’m guessing the tooling will catch up before release
  • I’m really liking the simplicity of the implementation. Enabling and managing Flashback in Oracle seemed like a bit of an ordeal to me, so MS are really playing to their strengths. Once created in SQL Server, a temporal table is really just a standard table with a secondary hidden table being maintained for you behind the scenes – it’s that simple
  • I’m not clear which versions of SQL Server will come with temporal tables. It “feels like” an Enterprise-only feature given its high availability/big data applications, but I haven’t seen anything on that yet. Even at SQL Server Enterprise pricing though, it’s almost certainly going to be a bargain compared to Oracle.

Corruption, SQL Server

Handling real world corruption of a SharePoint Content Database

You never want to come in to work to find that one of your critical SQL Server databases is corrupt but that’s exactly what happened today due to a SAN issue. I’ve got a fair amount of SQL Server experience but not so much so that a failed SQL Server corruption check doesn’t instinctively cause me to wonder out loud how up to date my CV is  😉

That said we did manage to repair the corruption with no downtime and no data loss so I thought I’d do a quick blog to describe what I did to resolve the issue in case it helps anyone who has a similar issue.

It’s important to understand that there are a huge variety of corruption scenarios that can occur in the real world and the lessons here may well not apply to you. Our DB failed in a very specific way that lent itself to a relatively quick repair with no downtime. You may be able to take some of the ideas here and run with them though, so I thought it worth a share.

What did SQL Server think was wrong?

We picked up on the problem when one of our custom maintenance jobs emailed me the results of a failed DBCC CHECKDB command. These corruption checks are crucial as they tell you about problems whilst you still have enough backups in rotation to be able to do something about them. I’ve actually gone as far as running them nightly using some hand-crafted scripts that will email step-by-step guidance to the IT team on how to handle the problem in case I’m on holiday. If you aren’t doing frequent DBCC CHECKDBs, you really need to start now.

The first thing we want to do is find out what SQL Server thinks is wrong. The key thing to do here is not panic. The SharePoint system in question was clattering along quite happily at this point, so there was plenty of time to gather diagnostics.

To do that, we run a full CHECKDB. This will give us SQL Server’s assessment of what’s wrong and it would likely be the first thing MS would ask for if you called them:

DBCC CHECKDB ('<database>') WITH NO_INFOMSGS

This gave us the following output, giving us our initial handle on how severe the problem was (apologies for the formatting):

Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373408), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373409), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373410), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373411), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373412), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373413), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373414), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8961, Level 16, State 1, Line 1
Table error: Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594040549376 (type LOB data). The off-row data node at page (1:373415), slot 0, text ID 333774848 does not match its reference from page (1:46982), slot 0.
Msg 8929, Level 16, State 1, Line 1
Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594094747648 (type In-row data): Errors found in off-row data with ID 1311506432 owned by data record identified by RID = (1:2277271:5)
Msg 8929, Level 16, State 1, Line 1
Object ID 373576369, index ID 1, partition ID 72057594081902592, alloc unit ID 72057594094747648 (type In-row data): Errors found in off-row data with ID 333774848 owned by data record identified by RID = (1:2282679:0)
CHECKDB found 0 allocation errors and 10 consistency errors in table 'AllDocStreams' (object ID 373576369).
CHECKDB found 0 allocation errors and 10 consistency errors in database 'WSS_Content_CustomerDocs'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (WSS_Content_CustomerDocs).

Seems fairly scary but I actually started to relax a little once I read this, for reasons that will become apparent.

The first thing to understand about this database is that it is a large SharePoint content database.

This is actually great news because it means that the *vast* majority of the data in the db is not actually *row* data, it’s going to be BLOB data that represents the various files in SharePoint. I say good because in a 100GB content database, sheer probability suggests that if we did get a small corruption it would very likely fall within the BLOB data of a single file rather than in row data or the intrinsic data structures of SQL Server. Actual corruption to rows or intrinsic data structures is far more problematic and I’d gladly write off a Word document or PowerPoint presentation if it meant not having to deal with that 😉

Reading through the output we can see that SQL Server is having an issue reconciling the row data stored on a single 8K page with the binary LOB data it stores “off row”. This tells me the following:

  • The root issue is with a single 8K page
  • The broader issue with that 8K page is that some pointers to the “off row” data seem a bit messed up.
  • Given the number of “off row” pages involved (8), the corruption seems to be very limited in scope – probably about 64KB given an 8K page size. It’s quite likely in fact that if we could pin it down, we might only be dealing with one or two corrupt files

In the grand scheme of things, this isn’t actually that bad.

Finding out what’s actually damaged

We now have a fairly good assessment of the scale of the corruption but we don’t actually have a great understanding of what that corruption corresponds to in the real world. Given that we’re talking about a SharePoint content database here that binary data could correspond to a system critical .aspx page or the sales guys sharing a humorous gif:

The dancing baby gif

The key areas of the error message I’m now interested in are the two lines near the end, which say something along the lines of:

Errors found in off-row data with ID 1311506432 owned by data record identified by RID = (1:2277271:5)

We’re also told that the only errors were found in the AllDocStreams table.

The RID referenced above is crucial because if I can determine which actual row RID (1:2277271:5) corresponds to, I have a very good chance of determining what the real world file is.

Using the undocumented sys.fn_PhysLocFormatter function we can do just that – take a low level RID and use it to identify the human readable row. The full (SharePoint specific) query I used is as follows:

-- Identify the corrupt rows by the physical location reported by DBCC CHECKDB
-- Exclude the content column as this will probably cause a crash
SELECT [Id]
,[SiteId]
,[InternalVersion]
, sys.fn_PhysLocFormatter (%%physloc%%) AS [Physical RID]

FROM
[AllDocStreams]

WHERE
sys.fn_PhysLocFormatter (%%physloc%%) LIKE ('%(1:2282679:0)%') OR sys.fn_PhysLocFormatter (%%physloc%%) LIKE ('%(1:2277271:5)%')

The results returned 2 rows indicating that as expected, at most 2 document streams were involved.

We then did a query on the document IDs in SharePoint’s AllDocs table to get the actual locations and file names of the corrupt files.


-- Check the problem rows are readable from the corrupt table
SELECT 
	AllDocStreams.Id, AllDocStreams.SiteId
FROM AllDocStreams

WHERE Id = 'E55D0C17-239F-4F9A-8BBE-2D7B89A5FAAD' OR Id = '749E1AA8-A96F-48F8-8EEF-B40AF72477B9'


-- SharePoint Specific
-- Now we've identified the problem document *stream*, use the same ID to find out what the actual document name and location is in SharePoint
SELECT * 
FROM AllDocs

WHERE Id = 'E55D0C17-239F-4F9A-8BBE-2D7B89A5FAAD' OR Id = '749E1AA8-A96F-48F8-8EEF-B40AF72477B9'

The results can be seen below. I’ve had to obscure some of it but you can hopefully make out that the two problem files are effectively 2 Word documents.

Reverse engineering corrupt rows

So – we’ve managed to confirm that our relatively scary corruption scenario is actually just a problem with 2 files.

The structure of the database is fine and there is no nasty low level row data involvement that would lead us to be worried about row inconsistencies if we tried a repair.

With the two document locations in hand I then went into SharePoint and attempted to open the document in question. Sure enough the documents were corrupt, but crucially SharePoint didn’t have any issues attempting to open them or download them. This was a very strong indicator that the corruption was contained entirely in the files themselves and that both SharePoint and the SQL Server itself were either oblivious or unconcerned with the corruption within.

OK – so this next bit was a bit of a gamble as there was no real guarantee over what would happen. It was better than the alternatives I had on offer though so was worth a whirl.

After checking that the documents were non-critical or could be retrieved from secondary sources, I simply deleted the documents from SharePoint. I figured there was a chance that SharePoint would crash at this point, but I had a strong hunch that we would get away with it. My strong feeling was that the corruption was solely in the payload of the documents, and that even SQL Server was unaware of it at a runtime level. It is only when you ask it to go looking via a CHECKDB that it reports the problem.

I had to delete the documents from the document libraries, the site recycle bin and the site collection recycle bin, but once that was done all seemed well. I ran another quick DBCC CHECKDB and it passed without issue.

Corruption begone!

Conclusion

I’ve got to be honest, the sort of corruption that we had was at the more benign end of the spectrum. No row data or internal data structures were actually involved in the corruption. We also got lucky that a straight delete of the data from SharePoint even worked.

That said, if there is a lesson to be learned from the episode it would be – “don’t overreact”.

Check what the actual corruption is using DBCC CHECKDB and think logically about what is likely to be damaged given the sort of data you’re storing.

If the system is still online (which it often will be) don’t feel compelled to act – take your time, gather information and you may well find that you can do a complete repair without a moment’s downtime.

azure, virtual machines

Assigning a Static IP Address on an Azure VM

Quick tip: if you need to set a static IP address on a VM running in Azure you can use the following snippets in the Azure PowerShell console:


# Determine whether the target IP address is in use on Virtual Network "SPnet"
Test-AzureStaticVNetIP -VNetName SPnet -IPAddress 10.0.0.5

# Give machine "sp1" running on azure service "XXX" a static IP address of 10.0.0.5
# Note that I had the VM switched off when I did this. I'm not clear on whether that's necessary - the machine may need a reboot if it's on
Get-AzureVM -ServiceName XXX -Name sp1 | Set-AzureStaticVNetIP -IPAddress 10.0.0.5 | Update-AzureVM

I’m not actually sure if it’s possible to just switch DHCP off on the machine directly. It occurs to me that it might work, but because the Azure network infrastructure won’t know that the machine is bagging its own IPs, it might get upset.
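
For completeness, the companion cmdlets for checking and releasing a static assignment are below. I haven’t needed to use these in anger, so verify them against the current Azure PowerShell documentation before relying on them:

# Show the static IP (if any) currently assigned to the VM
Get-AzureVM -ServiceName XXX -Name sp1 | Get-AzureStaticVNetIP

# Release the static assignment so the VM goes back to a dynamic address
Get-AzureVM -ServiceName XXX -Name sp1 | Remove-AzureStaticVNetIP | Update-AzureVM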

azure, sharepoint

Creating a SharePoint 2013 Farm in Azure in Under 15 Minutes

This is going to be a very quick post that I hope helps a few people who want to be able to spin up a new SharePoint 2013 farm without it becoming an almighty ordeal involving SQL Servers, Domain Controllers and dodgy powershell scripts downloaded from the internet that don’t actually work 😉

One of the coolest features of the new Azure Portal Microsoft has been trialling is the ability to generate entire multi-server environments, preconfigured from a template. Unless I’ve missed it, this wasn’t a feature of the old Azure portal – at least not as far as SharePoint was concerned. We could absolutely spin up individual machines quite quickly, but they were very much isolated and all the networking, DC configuration and product installations needed to be done by hand. This wasn’t much fun and was pretty much missing out on a huge potential win for IaaS.

Check out the new portal here: portal.azure.com

I’ll let you figure out the UI. It took a bit of clicking around but I now much prefer it to the old portal.

You want to find the new “SharePoint Server Farm” option. This simple menu item belies an incredible time saving in setting up a typical SharePoint OnPrem instance for test or demo purposes.

The new SharePoint Server Farm option

The only thing I want to highlight specifically is the “Enable High Availability” option. That option is both very cool and very expensive. Tick it and it will create a 9 server, fully resilient farm in under 15 minutes, complete with multiple domain controllers, front ends and SQL Servers. I’m not shitting you – just leave your credit card behind the bar though.

Leave it unticked and you will get the still perfectly serviceable but much more cost effective 3 server option: SP FE + DC + SQL.

Whip through the wizard and you will very quickly have a full SharePoint 2013 setup that is fully accessible via the internet. The DC, DNS, SQL Server and Private/Public networking will be preconfigured for you and both Central Admin and a single web app will be waiting for you when you log in for the first time.

A couple of tips if you are trying it for the first time:

  • Some of the machine sizings seem a little off to me. I’d encourage you to go through the wizard options carefully and tweak the size of the machines to what you actually need. Some of the machines seemed a little small to me, whilst others seemed overspec’d. Choose carefully as it will have cost implications.
  • The first run through of the wizard actually failed for me. Something to do with a timeout when creating the SQL Server. I got the impression there is a lot of Azure Powershell flying about to create and configure these machines. Perseverance paid off however because once I had nuked the first set of machines, it worked like a charm on the second go.

Really looking forward to more complex templates coming to Azure in the future!

Disaster Recovery, Performance, SQL Server

Reporting DBCC SHRINKFILE Progress in SQL Server

I’ve been doing some large scale data deletion from SQL Server after some of our SharePoint databases got blown out of all proportion due to a simple versioning misconfig.

We’ve been deleting hundreds of gigs out of 2 content databases and this leaves us with the inevitable problem of needing to shrink the MDF files to get the space back. The immense fragmentation problems this will cause us are well documented, but today we have the more immediate problem of trying to tell how much longer the DBCC command is going to take.
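
(For reference, the shrink we were waiting on was essentially the command below, run against each content database in turn. The logical file name and target size in MB here are placeholders, so substitute your own.)

USE WSS_Content_CustomerDocs
GO

-- Shrink the data file down towards a target size of roughly 100GB (the value is in MB)
DBCC SHRINKFILE (N'WSS_Content_CustomerDocs', 100000)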

Well, we can’t actually tell that precisely. The best we can do is make an educated guess based on how much movement is occurring in the data files via Resource Monitor. There is one other tool you can use in your guesstimate, in that certain types of command will report their estimated completion via the dm_exec_requests DMV. If you know how long you’ve been running so far, and how far through processing SQL Server thinks it is, you can guess fairly accurately when you’ll be done.

The following query will tell you SQL Server’s best guess at how far through processing your DBCC command is:

-- SELECT all active requests across all databases. 
-- NOTE: Percentage complete is only reported for certain command types (e.g. backups, restores, DBCCs etc)
SELECT	
	sys.dm_exec_sql_text.text AS  CommandText,
	sys.dm_exec_requests.Status AS Status,
	sys.dm_exec_requests.Command AS CommandType, 
	db_name(sys.dm_exec_requests.database_id) AS DatabaseName,
	sys.dm_exec_requests.cpu_time AS CPUTime, 
	sys.dm_exec_requests.total_elapsed_time AS ElapsedTime, 
	sys.dm_exec_requests.percent_complete AS PercentageComplete
	
FROM	
	sys.dm_exec_requests 

CROSS APPLY 
	sys.dm_exec_sql_text(sys.dm_exec_requests.sql_handle) 
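
If you want SQL Server to do the arithmetic for you, a rough estimate of time remaining can be derived from the same DMV. This assumes progress is linear which, as noted below, it often isn’t, so take it with a pinch of salt:

-- Rough estimate of minutes remaining, assuming progress continues at the current rate
SELECT
	sys.dm_exec_requests.Command AS CommandType,
	sys.dm_exec_requests.percent_complete AS PercentageComplete,
	sys.dm_exec_requests.total_elapsed_time / 60000.0 AS ElapsedMinutes,
	(sys.dm_exec_requests.total_elapsed_time / 60000.0)
		* (100 - sys.dm_exec_requests.percent_complete) / sys.dm_exec_requests.percent_complete AS EstimatedMinutesRemaining

FROM
	sys.dm_exec_requests

WHERE
	sys.dm_exec_requests.percent_complete > 0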

Note that standard DML commands (INSERT/UPDATE/DELETE etc) do not report progress as far as I’m aware. But there are a number of DBCC, INDEX and BACKUP/RESTORE commands that do and this can be useful during long running maintenance operations.

I’m not sure how accurate the percent_complete column is overall. For example, during SHRINKFILE operations it seems to have a strong urge to tell me everything is around 69% complete within a few minutes, then spend another hour grinding. I’ve had more luck with accuracy when using it for backups, so your mileage may vary.

NOTE: This query can also act as a basic “What’s running right now” query if a server seems to be struggling. This can help identify long running queries or backups/index rebuilds that have overrun the maintenance window.

I hope that helps someone.

Disaster Recovery, High Availability, Recovery, SQL Server

SQL Server Database Corruption – Basic Guidelines for Successful Recovery

An SQL Server corruption event can be a uniquely stressful problem for support staff. Unfortunately it frequently entails an extremely high pressure problem falling to a support engineer who is not specifically experienced in SQL Server but who is expected to perform a potentially quite sophisticated recovery with little or no notice.

This unhappy combination can lead to very negative outcomes for customers. It is very common for even an experienced SQL Server DBA acting under pressure to make the problem substantially worse than it needed to be. This usually occurs because in the absence of prior experience we will usually default to our inbuilt support instincts. Usually our support instincts serve us very well, but as we will see in the case of corruption, blindly acting on instinct will often lead to significant data loss or downtime.

This quick guide is aimed at providing some very rapid guidance on what to do and more importantly what not to do in a corruption scenario. It’s a work in progress and simply represents the items I immediately think of when a corruption occurs. Feel free to post your suggestions for additional items.

A quick note on the applicability of this advice:

  1. It assumes that you are not an expert in SQL Server. You may be considered an “accidental” DBA or someone who has been drafted in to assist simply because you have more knowledge than the next guy/gal. Each guideline below has exceptions but you really need to be quite knowledgeable in SQL Server recovery to know when it’s appropriate to deviate from the general advice below.
  2. It assumes the data in question is valuable. If you are dealing with a database where uptime is more important than minimising data loss, then consider going straight to backups now. Each customer has different needs so you need to know what’s important to them before acting.

The Guidelines – Checklist

The DON’Ts

What not to do is substantially more important than what you should do. If you remember nothing else from this document, remember the following DON’Ts. Full explanations are provided later.

  • DON’T panic – your system may well still be online
  • DON’T just wade in – you will make things worse
  • DON’T restart the server – it won’t help and will likely make things worse
  • DON’T detach the database – again, it will likely make things worse
  • DON’T run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS
  • DON’T delete or rebuild the transaction log

The Dos

Now the Do’s. These are almost always sensible steps and in a time critical situation should be done in parallel as far as is practical. Draft in help if you need to as it could considerably shorten your overall recovery time.

  • DO start a DBCC CHECKDB with ALL_ERRORMSGS and NO_INFOMSGS immediately
  • DO get Microsoft Support involved immediately
  • DO isolate and prepare the last known good backups immediately

The Guidelines – Detailed explanations

The Don’ts of database corruption

DON’T panic – your system may still be online

Why?

In probably about 80% of database corruption scenarios I’ve been involved in, we were actually only aware of the corruption due to nightly corruption checks. If we hadn’t been running those checks we probably could have gone months without noticing the corruption from the end user perspective.

The fact is, many databases these days are big and the vast majority of the information is not touched on a daily basis. You could have a 400GB database with 64KB of corruption and it wouldn’t matter a jot if it’s in an area of the database that doesn’t routinely get accessed.

The point is – just because your db is corrupt, does not mean you are dead in the water. If your systems are still up, take your time and plan the best course of action.

DON’T just wade in

Why? When a corruption event occurs, particularly in a mission critical database, your first instinct is going to be to act. You can be under a lot of pressure to do something, anything, that will get the customer back online. You need to ignore this pressure and your instincts, because if you don’t you could make the situation worse.

Formulate a plan. The plan may be as simple as getting on the phone to Microsoft or getting in touch with a colleague that has SQL Server experience. But don’t just start acting without knowing what you’re doing, otherwise you’ll end up doing some variation of one of the DON’Ts listed below

DON’T restart the server

Why? If you know you have genuine database file corruption, the chances that rebooting the server will help matters are vanishingly small. SQL Server is not stupid – if it’s telling you that it has found corruption on disk, there is corruption on disk and far from helping matters restarting the server (or even the SQL Server service) may make matters substantially worse.

The key to understanding this is to understand that just because your database is corrupt does not mean that it is offline. The scale of corruption could be as simple as a single 8K index page out of millions meaning the vast majority of your users may be completely unaffected by the problem. It is entirely possible to recover from some types of corruption without any down time whatsoever.

When you restart the server (or service) SQL Server will perform certain consistency checks as it tries to bring the database online and if these fail, the database will be marked as suspect. If this happens you will have taken a poorly but working database and put it beyond use for all users until you can effect a repair. This is not a good outcome for the customer!

DON’T detach the database

Why? You may be tempted to start doing all sorts of things involving detaching and reattaching the database, perhaps to create an additional copy to work with. Again – don’t do this unless you know exactly what you are doing. The reasoning is similar to the previous guideline but the end result of ignoring this one is actually worse.

Again, even with corruption the database may be perfectly usable for the vast majority of users. However, SQL Server will not let you attach a database it believes to be corrupt. So if you detach it and attempt to reattach it – you may end up with a bigger problem on your hands than if you had left it alone. The way round this is known as the “hack-attach” method and is not something you want to have to figure out how to do in an emergency, especially considering all it will achieve is get you back to where you were in the first place.

DON’T run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS

Why? You are most likely to try this if you are frantically looking around the internet for advice on what to do. DO NOT follow this advice – at least not yet.

The Repair_Allow_Data_Loss option of DBCC CHECKDB is usually a last resort. It is what you do when you have no backups and are truly in dire straits. I would go as far as to say that even if you know the implications of what this command does, you should only do it under the advice of someone pretty senior at Microsoft.

The implications of running a repair with data loss are both obvious and non-obvious. The obvious implication is that you may lose data, which may or may not be a problem in itself.

The less obvious implication is that your database may now be inconsistent.

This means inter-table data dependencies may be corrupt, programmatic assumptions your line of business applications may make may no longer hold true, FK constraints may be completely violated and depending on the nature of the data this inconsistency may be substantially more problematic than an actual loss of data.

Recovery from backup is usually preferable to this option, so if anyone tells you to use it and they don’t work for Microsoft – ignore them, unless you truly understand the implications of this option.

DON’T delete or rebuild the transaction log

Why? This is another example where you’ll see internet “gurus” giving “advice” to people who are freaking out about log file corruption. To be sure, log file corruption isn’t great – but deleting the good transactions in the log file is probably not what we’d call “Plan A”.

If you fully rebuild the log file there is a reasonable chance that your data will be left in an inconsistent state. Any transactions that were in flight at the time of the corruption will be ripped out of the database with little regard to data consistency. This is the more benign scenario.

More problematic would be where SQL Server was performing some form of internal structural change (for example moving or allocating an index page), all of which are logged in the same way as a data change. In this scenario ripping out the log file could leave the actual database structure in a very bad way.

In most cases here the solution will be to go back to your transaction log backups.

Even if you don’t have transaction log backups, speak to MS directly to get their advice on options because rebuilding the log can be quite similar to CHECKDB REPAIR_ALLOW_DATA_LOSS in terms of its undesirable effects.

The DO’s of database corruption

The list of what you actually should do during a serious corruption event is reassuringly short. If you do these next three items, and none of the Don’ts listed above, you should feel reassured that you have done a good job of making the most out of a pretty bad hand.

If at all possible start doing these three things in parallel. Draft in help if you can because having these actions underway simultaneously will considerably shorten your overall downtime.

DO run DBCC CHECKDB with ALL_ERRORMSGS and NO_INFOMSGS

Why? Unless you already have the output from a recent CHECKDB run, this is probably the first thing you should do. This is because it will be essential in figuring out the extent of the damage and can take a long time to run. Microsoft will very likely want to see the output of a full run of this so you might as well start it now.

One thing to be aware of is that CHECKDB is very IO intensive and may adversely affect the performance of other databases. That said – if you need to try and do an in-place repair of the database you are going to need the output so you might as well get it over and done with. Alternatively, if the database is online and not causing widespread issues, consider leaving the CHECKDB process until it can be done out of hours.

Note that running CHECKDB without any of the repair options is completely non-destructive. It will only collect diagnostic information so is perfectly safe and should not be confused with the Repair_Allow_Data_Loss option.
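
For reference, the command is simply the following (substitute your own database name):

-- Non-destructive: gathers full diagnostic output without attempting any repair
DBCC CHECKDB ('<database name>') WITH ALL_ERRORMSGS, NO_INFOMSGS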

DO get Microsoft Support involved

Why? The reasoning behind this should be fairly obvious. With a mission critical system you want the best possible outcome for the customer. Many system outages or data loss events can imply a very real loss in terms of revenue, productivity or customer confidence, and the cost of involving Microsoft should be seen as trivial when set against those sorts of numbers.

Even if you are confident in what you are doing and what your recovery options are, I would strongly recommend talking to a senior escalations engineer at Microsoft before committing to a recovery approach that would lose data. In a system down scenario Microsoft can usually be assisting you within 10 – 20 minutes and will give you the confidence that you are taking the most appropriate steps given the customer’s particular circumstances.

DO isolate and localise backups

Why? Many, but not all, corruption events require a recovery of some sort from a recent backup. This does not imply immediately going to last night’s full backup and swallowing a day’s data loss. If someone immediately starts talking about restoring from last night’s backup without quantifying the exact nature of the corruption – ignore them. It bears repeating that under many circumstances, with sufficient transaction log backups, a recovery of a small number of pages can be implemented with no downtime or data loss whatsoever.

The key to minimising the impact and recovery time is to isolate the most recent backups and prepare them for use. In particular, if your backups are on tape or would need to be retrieved across a 100Mb LAN – if possible try and get them “closer” to the server. The time you are running CHECKDB and speaking to MS on the phone is potentially “dead” time in which you could be doing this. This could be as simple as moving the backups from a slow NAS or tape device to a local drive on the server. Every customer environment is slightly different but think about where the backups are and whether there is any way you could make a potential restore faster by pre-positioning the backups in a better location.

Summary

The range of possible variables during a corruption event is very broad and the choices you make can have a very real impact on the outcome.

The first thing to understand is that some corruptions can be really quite benign and if handled correctly would result in no data loss and no downtime. An example of this would be if the corruption existed only in a non-clustered index and not within an actual data page. In this case the index could be dropped and rebuilt and the impact on the customer would be very limited.
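
As a sketch of that scenario, assuming CHECKDB showed the errors confined to a non-clustered index (index ID greater than 1), the fix could be as simple as the following. The index and table names here are made up for illustration:

-- Disabling the index throws away its pages (including the corrupt ones);
-- the rebuild then recreates it from the underlying table data
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders DISABLE
GO
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REBUILD
GO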

Equally, corruption of a small number of data pages could potentially be repaired online with little or no data loss by doing a selective page restore.
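
A page restore follows roughly the sequence below. This is only a sketch with placeholder database, page and file names, and the exact steps depend on your backup chain, so check the documentation (or Microsoft) before running anything like it against a production system:

-- Restore just the damaged pages from the most recent full backup
RESTORE DATABASE MyDatabase
	PAGE = '1:57, 1:202'
	FROM DISK = N'E:\Backups\MyDatabase_Full.bak'
	WITH NORECOVERY

-- Roll forward using the existing transaction log backups
RESTORE LOG MyDatabase
	FROM DISK = N'E:\Backups\MyDatabase_Log1.trn'
	WITH NORECOVERY

-- Take a fresh tail log backup and restore it to bring the recovered pages fully up to date
BACKUP LOG MyDatabase TO DISK = N'E:\Backups\MyDatabase_Tail.trn'

RESTORE LOG MyDatabase
	FROM DISK = N'E:\Backups\MyDatabase_Tail.trn'
	WITH RECOVERY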

It is certainly true that serious data corruptions can occur and need to be taken very seriously. The key to a successful outcome in all cases though is first and foremost to do no harm. Don’t act instinctively and wherever possible involve Microsoft or an SQL Server expert at the earliest opportunity.

Update: Brent Ozar recently provided a mailshot with a couple of useful additional tips in it. You can read it here: What to Do When DBCC CHECKDB Reports Corruption

Disaster Recovery, Recovery, SQL Server

Sending A Custom Email on DBCC CHECKDB Failure

With SQL Server Maintenance Plans it’s quite possible to run scheduled DBCC CHECKDB consistency checks via an agent job and be emailed when the job fails. Unfortunately the message that gets sent is the generic agent notification which doesn’t necessarily lend itself to understanding quickly what’s going on:

JOB RUN:             ‘Meaningless Job Name’ was run on 25/04/2014 at 14:26:11
DURATION:        0 hours, 0 minutes, 10 seconds
STATUS:               Failed
MESSAGES:        The job failed.  The Job was invoked by User TSG\aUser.  The last step to run was step 1 (Meaningless Step Name).

I wanted to be able to send a much more helpful email message whenever a DBCC CHECKDB failed. In particular I wanted a message that was:

  • Entirely custom, so that I could have it jump out at the receiver as being unusually urgent. The risk in our environment is that with so many emails flying about, something like this could be ignored with disastrous consequences
  • Able to offer guidance on what to do if you receive the email. This is because the receiver may well not be a SQL Server specialist and may not know how to deal with an urgent corruption issue
  • Able to contain rich formatting to highlight important information

The resulting email is shown below, but you could get much fancier with your formatting if you wish:

Email on DBCC Failure

I haven’t tested the code extensively yet but we are trialling it in production without issue at the moment. I thought I would share the code in case anyone would find it useful. There are a variety of very clever DBCC CHECKDB scripts out there already, but none of the ones I found were written to allow the sending of a custom email body.

Instructions are in line. I’d recommend you run this as an SQL Agent job and have the job also email you on failure to protect against any bugs in the code.


-- Overview
-- Performs a full DBCC CHECKDB on all databases specified below and will email any failures to the email address specified
-- How to use
-- 1. Ensure the mail profile is configured in SQL Server
-- 2. Provide a valid "to" email address
-- 3. Add the databases to check to the @databasesToCheck table variable
-- 4. Schedule this script to run during an appropriate maintenance window

-- Future Improvement
-- Log the output of DBCC to a semi-permanent location to avoid the need to re-run it after the error is detected. This would be important in
-- very large db setups or those with very high availability requirement

-- SETUP STEP
-- Ensure the following Database Mail profile is configured and an appropriate email address specified
DECLARE @mailProfileName VARCHAR(250) = '<INSERT MAIL PROFILE NAME>'
DECLARE @toEmailAddresses VARCHAR(500) = '<INSERT THE TO EMAIL ADDRESS>'

DECLARE @databasesToCheck TABLE (
	ID INT IDENTITY (1, 1),				-- This allows us to select and process databases in a specified order
	databaseName VARCHAR(250)
)

-- SETUP STEP
-- Add a row for each database you wish to check
INSERT INTO @databasesToCheck
VALUES
('DATABASE TO CHECK 1'),
('DATABASE TO CHECK 2')

BEGIN TRY
	DECLARE @dynamicSQL NVARCHAR(500)
	DECLARE @currentDatabaseName VARCHAR(250)

	-- Define a cursor to allow us to process each database in turn
	DECLARE curDatabases CURSOR FOR
	SELECT databaseName
	FROM @databasesToCheck
	ORDER BY ID ASC

	OPEN curDatabases

	FETCH NEXT FROM curDatabases
	INTO @currentDatabaseName

	-- Loop through each database we want to check
	WHILE @@FETCH_STATUS = 0
		BEGIN

			DECLARE @currentProcessingResult INT

			-- The following statements are really for the benefit of showing something useful in the job history window:
			PRINT ('Processing database ' + @currentDatabaseName)

			-- Not ideal but the only way I know of to dynamically build adhoc DBCC commands
			-- NOTE: We suppress info messages when things are going well. We only care when things go wrong
			SET @dynamicSQL = 'DBCC CHECKDB ([' + @currentDatabaseName + ']) WITH NO_INFOMSGS'

			PRINT ('  Executing DBCC CHECKDB on target database ')

			-- NOTE: Running a straight DBCC CHECKDB will not raise an exception even if corruption is found. The only way to determine if failure occurred
			-- is to use the return value and determine if we need to raise our own exception. See here for further details:
			-- http://support.microsoft.com/kb/953503
			EXEC @currentProcessingResult = sp_executesql @dynamicSQL

			-- If the last run detected an error, raise an error and stop
			IF @currentProcessingResult <> 0
				BEGIN
					-- Weird limitation in RAISERROR - it can't take a string concatenation so we need to do it like this
					DECLARE @errorMessage VARCHAR (1000) = 'DBCC CHECKDB failed for database ' + @currentDatabaseName
					RAISERROR (@errorMessage, 11, 1)
				END

			PRINT ('  DBCC CHECKDB completed successfully with no errors on database ' + @currentDatabaseName)

			FETCH NEXT FROM curDatabases
			INTO @currentDatabaseName

		END

	CLOSE curDatabases
	DEALLOCATE curDatabases

END TRY

BEGIN CATCH
	DECLARE @emailBody NVARCHAR(4000)
	DECLARE @catchErrorMessage VARCHAR (1000) = ERROR_MESSAGE()

	PRINT('Critical Error Detected: ' + @catchErrorMessage)

	-- Generate a formatted email using HTML. Trying to format in plain text is a bit of a pain. HTML allows you to create a template in Visual Studio
	-- and pretty much copy it in here verbatim
	SET @emailBody = N'
	        <style type="text/css">
                #divContent {
                    font-family: Tahoma, Arial, Verdana, sans-serif;
                    font-size: 0.8em;
                }
            </style>

        <div id="divContent">

		    <p><strong>Warning: A potentially serious issue has been detected during overnight corruption checks on SQL Server. Please do not ignore this message as our ability to recover data may be affected</strong>
            </p>

            <p>The error message provided by the agent job was <em>"' + @catchErrorMessage + '"</em> </p>' +

            '<p><strong>Why has this message been sent?</strong>
                <br />
                This message has been sent because corruption has been detected in the following database: <br />
            </p>

            <ul><li><strong>' + @currentDatabaseName + '</strong> </li></ul>

            <p>Please note that the overnight DBCC CHECKDB process will have stopped once the first database corruption was detected. Other databases may also be corrupt and should be checked manually</p>
            <p>
                <strong>Why is it important that we respond quickly?</strong> <br />
                If one of the live databases has been corrupted, the best chance of recovering data is via a previous backup. Because backups are only retained for a short period of time it is essential that we isolate the backup most likely to help us recover as quickly as possible <br />
            </p>

            <p>
                <strong>What should happen next?</strong>
            </p>

            <ul>
                <li>Run DBCC CHECKDB on the problem database again to retrieve the full details of the corruption. This may take a significant amount of time to complete so consider saving the results.</li>
                <li>Isolate all backups from the affected database as they may be needed to effect a repair</li>
                <li>Contact Microsoft Technical Support to get their assistance with the repair operation</li>
                <li><strong>DO NOT</strong> run DBCC CHECKDB with "repair_allow_data_loss" before speaking to Microsoft first! This may cause data loss unnecessarily and will quite possibly make the situation worse</li>
                <li>If possible find a location to restore at least one backup to</li>
                <li>Investigate the underlying cause of the corruption. It is important to understand why the corruption occurred before it occurs again</li>
            </ul>

                 </div>
            '

	EXEC msdb.dbo.sp_send_dbmail
		@profile_name = @mailProfileName,
		@recipients = @toEmailAddresses,
		@body_format = 'HTML',
		@subject = 'WARNING: A potentially serious issue has been detected during overnight database consistency checks',
		@body = @emailBody

	-- This is minging but depending on how the code above fails, we could still have an open cursor at this point.
	-- Close it and ignore the possibility that it's already closed
	BEGIN TRY
        --clean it up
		CLOSE curDatabases
		DEALLOCATE curDatabases
    END TRY
    BEGIN CATCH
        -- Would happen if the cursor is already closed. Wishing TSQL had a TRY CATCH FINALLY construct here
    END CATCH

	-- Raise an additional exception to cause any calling job to also fail. This will increase the chance of detection if there is a problem
	-- with database mail
	RAISERROR (@catchErrorMessage, 11, 1)

END CATCH

SQL Server

The only Oracle feature I actually still want

Introduction

I really like SQL Server. Early in my career I had more than one opportunity to take the Oracle path, but there was always something about SQL Server’s ease of use and “good enough” approach to performance and high availability that I liked. That and the fact that most of our customers didn’t have 250K to spend on a “basic” database setup pretty much settled matters.

Although it has become increasingly rare, I am very occasionally hit by an acute episode of database envy.

As of SQL Server 2014 I think there is really only one Oracle feature that I would absolutely love in SQL Server. There are other features I think are cool (genuine scale-out clustering for example). But this is the only feature I want that could seriously make me more relaxed about using SQL Server in high availability setups.

Oracle Flashback Technology

Oracle Flashback Technology is best demonstrated via some very simple code:


SELECT * FROM Orders
AS OF TIMESTAMP        -- SAY WHAT!!!!
TO_TIMESTAMP('2014-04-23 09:30:00', 'YYYY-MM-DD HH:MI:SS')


Now – allow that code to just detonate in your brain for a second… It’s allowing you to query the live system as it was at some point in the past.

In day to day development use that code could probably be described as “quite handy”.

In a VLDB, 24/7 environment that code is absolute feature dynamite.

Why does this matter?

To picture why this feature is bordering on essential in a VLDB 24/7 environment, imagine what you would be faced with doing in the following scenario:

  • You have a 1 terabyte OLTP system that needs to be available 24/7
  • A power user or developer advises you that they have accidentally modified a substantial number of extremely important records
  • Your primary 24/7 system is now offline or at best severely impaired

What are your options in SQL Server land?

  • You have a highly available system using some clever combination of clustering, synchronous mirroring and maybe even asynchronous mirroring to a secondary data centre. However, by the time you find out about the problem the corruption has already been sitting replicated at the secondary location for half an hour – so you’re bang out of luck there.
  • You have a policy of using database snapshots when applying critical database updates internally. However, this was not a scheduled update. It was a power user gone rogue and so there is no useful snapshot that can be used to help extricate us from the situation.
  • You are backed up to the hilt and could theoretically identify the exact moment the corruption occurred. You could even theoretically recover to just before that point. But it’s a 24/7 system – you can’t just stop the business whilst you perform a restore over the live database.
  • You could recover to a secondary location and repair the live system by hand using T-SQL. But we are dealing with 1000 gigabytes of data. Do you have a terabyte of SQL Server storage sitting around for just this eventuality? Assuming you do, do you have an abundance of time on your hands where you can sit back and wait for a restore to complete? It’s going to take many, many hours so I hope the answer is yes!

The bottom line is that SQL Server does not have any good options for this sort of scenario.

Off the top of my head, to have any sort of solution at all to this problem you could look at something like log-shipping-with-delay to give yourself at least a fighting chance. I’ve also heard of people taking hourly snapshots using schedules and scripting, just in case someone deletes something. That doesn’t make me very happy either.
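
As an example of the snapshot approach, the scripting involved isn’t complicated. Something along these lines on a schedule (the database, logical file name and paths here are illustrative) at least gives you a point you could query or pull individual rows back from:

-- Create a point-in-time database snapshot (assumes a single data file with the logical name 'Orders_Data')
CREATE DATABASE Orders_Snapshot_0900 ON
	(NAME = Orders_Data, FILENAME = 'S:\Snapshots\Orders_Snapshot_0900.ss')
AS SNAPSHOT OF Orders

-- Accidentally modified rows can then be read back from the snapshot, e.g.
-- SELECT * FROM Orders_Snapshot_0900.dbo.Orders WHERE OrderId = 12345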

Unfortunately the options available do not even come close to the simplicity and immediacy of Oracle’s Flashback feature.

I don’t know the particulars of how Flashback is implemented in Oracle, but however it’s done it’s smart as hell. It’s also the one remaining feature I secretly covet for SQL Server and I really hope MS do something similar one day.