It’s not the sexiest topic in the world, but the Page Verify setting on your most important databases might save your bacon one day. It’s probably one of those settings that you would recognise but would likely breeze right past without a second glance.
The value directly affects how quickly SQL Server will detect corruption in your database for you, and trust me – you want that to be as close to real time as possible.
The longer corruption is present in your database, the harder it can be to repair. Perhaps more importantly – you need to figure out where the corruption is coming from, because you may have a pretty nasty problem on your hands.
In this post I’m going to briefly explain what Page Verify is about, and hopefully convince you that you need to pay attention to it – particularly if you’ve upgraded a server instance recently.
PAGE VERIFY has three options that control how far SQL Server will go to verify the integrity of your data:
- NONE – Don’t use this one. Like seriously dude, just don’t
- TORN_PAGE_DETECTION – You might be on this without realising
- CHECKSUM – The one you want under nearly all circumstances
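Before we dig into what each option means, here’s a quick way to see which one you’re actually on, and how to switch. The database name below is just a placeholder – substitute your own:

```sql
-- List the Page Verify option for every database on the instance
SELECT name, page_verify_option_desc
FROM sys.databases;

-- Move a database onto CHECKSUM ([MyDatabase] is a placeholder name)
ALTER DATABASE [MyDatabase] SET PAGE_VERIFY CHECKSUM;
```

The ALTER DATABASE itself is just a metadata change – more on why that alone isn’t enough later in the post.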
Torn Page Detection
Torn page detection helps with the question, “Did my entire page make it to disk?”
SQL Server has had the ability to detect certain types of page corruption since at least SQL Server 2000, which introduced torn page detection with the intention of allowing SQL Server to detect when an 8K data page was only partially written out.
The key scenario here would be if SQL Server sent a page to the IO subsystem but a power cut or perhaps a disk failure resulted in the page only half making it to disk. Clearly this is something we would want to know about as quickly as possible.
The way torn page detection works is extremely simple and very lightweight – so back in the SQL Server 2000 days it should almost always have been switched on.
A single 8K data page is actually a series of 16 × 512-byte disk sectors. As a page is written out of the buffer pool, each disk sector is stamped with a very simple alternating 2-bit pattern. I don’t know if the pattern is as simple as alternating 00/11 pairs, but you hopefully get the idea: if we only get halfway through writing an 8K page, we should be able to tell by checking that the actual bit pattern matches the expected pattern. After recovering from our power cut we will be able to identify the (hopefully) small number of pages that tragically didn’t make it to disk cleanly.
This is clearly a lot better than suffering a power cut and not actually being sure if your data made it to disk.
Checksum

If torn page detection helps answer the question “Did my entire page make it to disk?”, the checksum option helps answer the question “Do the contents of the page still look right?”
Unfortunately, even when data is written to disk successfully, there is still a range of corruption scenarios we can fall victim to. And because torn page detection doesn’t really concern itself with the contents of the data page, it won’t help in these situations – in many cases it would be entirely oblivious to the problem.
In order to make sure the contents of a page are still sensible we need to use a broader integrity mechanism – one that actually sense checks the contents of the pages directly. This is where the third of our Page Verify options comes in – CHECKSUM.
With the checksum option selected, each time the buffer pool flushes a page to disk the contents are examined and a checksum is added to the page header. The next time the page is read back in, the checksum is recalculated and compared against the stored value, and we can immediately tell whether our page contents still look right.
This protects us from situations where the page was output correctly and would pass a torn page check – but where something else happened sometime later that damaged our data. This could be anything from failing hardware, a faulty RAID driver or some errant process that came and crapped up our data after it was successfully written.
Because CHECKSUM verifies the contents of our pages every time they are read back from disk, it can detect a far broader range of corruptions. This is very important because the sooner we detect corruption the better, for two reasons:
- We need to find the root cause of the corruption before even more of our database gets borked
- The quicker we find we have a problem, the easier a recovery is going to be using backups or some form of page repair
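On the detection side, it’s worth remembering that checksums are only validated when a page is actually read – so a regular DBCC CHECKDB is what forces every allocated page through that check. Any checksum failures SQL Server has already hit during normal reads are also recorded in msdb (the database name below is a placeholder):

```sql
-- Read and verify every allocated page, including its checksum
DBCC CHECKDB ('MyDatabase') WITH NO_INFOMSGS;

-- Pages that have already failed a checksum (or torn page) check
-- during normal reads are recorded here
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages;
```

A scheduled CHECKDB plus an occasional glance at suspect_pages gets you most of the way to the “near real time” detection we’re after.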
The Upgrade Problem
So I hope I’ve convinced you that under nearly all day-to-day scenarios CHECKSUM is the option we want to be using. The good news is that this has been the default for all databases created on SQL Server 2005 or later.
The problem comes when we have a database that was at some point running on a version of SQL Server prior to 2005. A database keeps its existing Page Verify setting when it is upgraded – so you might be surprised to check the setting and find it still sitting on TORN_PAGE_DETECTION (or even NONE).
This is surprisingly common, particularly in larger businesses, so if you have recently inherited ownership of some databases and you’re not sure of their full life history – do yourself a favour and check the Page Verify setting.
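If you want to audit an inherited instance in one go, something along these lines will flag any database that isn’t on CHECKSUM:

```sql
-- Flag every database not using CHECKSUM page verification
SELECT name, page_verify_option_desc
FROM sys.databases
WHERE page_verify_option_desc <> 'CHECKSUM';
```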
One additional thing to be aware of: because checksums are generated and written only when a page is read into memory and then written back out, it is not enough to change the setting and assume you are now protected. There is sadly no background process that will squirrel around updating all the pages in your database with correct checksums.
The recommended way to address this is actually just to make sure you have a decent index rebuild plan in place. The next time you do a rebuild, the operation will write each data page out again, and this will stamp every page with the desired checksum value.
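A rebuild along these lines does the trick – it rewrites every page in the table’s indexes, and each page picks up a checksum as it goes back out to disk (the table name is a placeholder):

```sql
-- Rebuilding rewrites every page in the table's indexes; each page
-- is stamped with a CHECKSUM as it is written back to disk
ALTER INDEX ALL ON dbo.MyTable REBUILD;
```

One caveat: pages in a heap (a table with no clustered index) won’t be touched by an index rebuild, so those will only pick up checksums as they are naturally modified and rewritten.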
Hope this helps