Disaster Recovery, High Availability, Recovery, SQL Server

High Availability != Disaster Recovery – SQL Server Edition

Its quite common to hear even seasoned IT pro’s use the terms High Availability and Disaster Recovery as though they are synonymous. With SQL Server in particular I think part of the confusion stems from the fact that many of the technologies involved can be employed in both areas with varying degrees of success. Further complicating matters is the fact that some of these technologies can also be used in areas such as scale out and reporting that have nothing to do with HA/DR at all.

High Availability

High Availability is the art of ensuring a system has as close to 100% uptime as is possible under normal operating conditions.

The main strategy for achieving high levels of uptime is redundancy – having 2 or more of everything. 2 or more servers, 2 or more NICs, 2 or more power supplies, 2 or more switches…. You get the idea.

The other part of this redundnacy strategy is having components, servers and software that can fail over to the redundant equipment as quickly and transparently as possible.

High Availability “events”, such as a disk or NIC failure will normally be much more common than a Disaster Recovery event. They are so common in fact that they should be considered routine in even a small-medium sized business. Disks, power supplies, switches and even entire servers fail every day but with the correct levels of redundancy users should normally be completely unaware of the failure.

Disaster Recovery

Disaster Recovery on the other hand is concerned with handling the truly extraordinary, and unless you have spent some serious coinage – it is very unlikely to be painless.

Disaster Recovery events should be the complete opposite of High Availability events. Whereas power supply, switches and even whole servers failing should be managed as routine day to day events, Disaster Recovery situations should be exceptionally rare.

In my experience Disaster Recovery has always fallen into one of two scenarios:

Scenario 1: Serious Data Loss or Corruption

This can occur through some sort of horrendous user error or some sort of system failure. Power cuts/spikes in particular seem to have a nasty habit of borking SQL Server databases. Either that or I’m just unlucky.

Scenario 2: The “Godzilla” Event

A Godzilla event is where a significant amount of equipment or even an entire site has been put out of action for a prolonged period.

Examples here would be fire, flooding, an air conditioning unit leaking fluid onto your only server cabinet or workmen cutting through a key piece of fibre. A giant rampaging lizard would also fall into this category.

Massive data loss and rampaging lizards are both most definitely disasters, but will likely require different types of response.

In the case of serious data loss or corruption we are probably looking at some sort of recovery from backups. Thankfully your primary site is still in action so total recovery time will be a function of how quickly you can find, transfer and restore from backup. There can be all sorts of issues here in practice including remembering how to perform the restore, the correct recovery point to go to and storage capacity issues whilst you do the recovery.

If you want to mitigate against the “Godzilla” event you basically need to be in the business of being able to fail key systems over to a completely different geographic site. As you might imagine the cost of this can be significant in terms of hardware, WAN links, software and technical skills. The quicker and more transparently you want this to occur, the harder it becomes.

Which technology sits where?

Below is a simple list of SQL Server and associated technologies and where I think they sit in the HA/DR landscape. Notice that some technologies sit quite comfortably in both camps, hence the confusion described earlier.

Synchronous technologies are better placed in the HA camp and Async in the DR camp. That said though – there are no hard and fast rules and I’ve seen Synchronous mirroring and replication used for DR more than once.

High Availability

Disaster Recovery

Redundant Hardware and Comms
SQL Server Failover Clustering
Automatic Page Repair
Server and Database Backups
Log Shipping

Both

Sync/Async Database Mirroring
Sync/Async SQL Server Replication
Sync/Async AlwaysOn Availability Groups
Sync/Async SAN/VM Replication

Let me know if you think I’ve missed a key technology or if you think I have any tech in the wrong place!

Cheers

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s