If you’ve been into SQL Server performance tuning for a while, you’ve probably come across the stock advice that if your query returns too many rows, your indexes may not be used. Although it’s sometimes skimmed over, this logic generally only applies to non-covering, non-clustered indexes.
Before long various stats are flying around, usually in the form of a quote like:
“If your query returns more than X% of the rows in a table, your index won’t be used”
Where X = 1%, 3%, 5%, 10%, 15%, 30% or any other number that seems to be plucked out of the ether somewhere.
But what if I told you something even more outlandish than that 1% figure? What if I told you the number could be as low as ~0.15% of a table before your non-clustered index becomes effectively useless?
I don’t know about you but that seems fairly improbable to me. We’re basically saying that even though the index could be used to eliminate 99.85% of the table from consideration in the blink of an eye, we’re going to scan the entire table anyway. What the what now!?
Don’t take my word for it though. Let’s take a look at a couple of examples and then I’ll explain why this makes sense.
Scenario 1: Small Rows
Let’s create a sample database and see if we can recreate this scenario:
USE Master
GO
IF DATABASEPROPERTY('Selectivity', 'Version') > 0
    DROP DATABASE Selectivity;
GO
CREATE DATABASE Selectivity
GO
USE Selectivity
GO
CREATE TABLE tblData(
    ID INT IDENTITY(1, 1),
    Data VARCHAR(50), -- NOTE: We are deliberately creating small rows to fit as many individual rows on a page as possible
    NonClusteredColumn INT
)
This database and table structure is unremarkable in all respects except one. The rows are very small – unusually so in fact, but not beyond the realms of possibility. Even if this table were pressing 1 million+ rows, it’s not going to take up very much room and, as we’ll see later, this is quite important.
Let’s hoof some rows in there and build both a clustered and a non-clustered index on two different columns. Remember, we’re going to be basing our queries on the non-clustered index, as the behaviour of the clustered index will be quite different.
-- Suppress the "rows affected" messages for a small performance boost
SET NOCOUNT ON
GO
DECLARE @counter AS INT = 0
BEGIN TRAN
WHILE @counter < 1000000
BEGIN
    INSERT INTO tblData (Data, NonClusteredColumn)
    VALUES ('Data', @counter + 1)
    SET @counter = @counter + 1
END
COMMIT
GO
-- Create indexes once the table is built. This should guarantee that our statistics are accurate
CREATE CLUSTERED INDEX PK_tblData ON tblData(ID)
GO
CREATE NONCLUSTERED INDEX IX_tblData_NonClusteredColumn ON tblData(NonClusteredColumn)
GO
Again, nothing remarkable there except the fact that we build the indexes at the end, simply to make sure that the index statistics will be up to date.
So – on with the testing. Give the following query a go:
SELECT * FROM tblData WHERE tblData.NonClusteredColumn > 998485
That query will return 0.151% of the rows in the table and will cause a non-clustered index seek. I don’t know about you, but that’s exactly what I was expecting given the search predicate is perfectly suited to using the non-clustered index.
So let’s modify the query to return just one more row:
SELECT * FROM tblData WHERE tblData.NonClusteredColumn > 998484
And the result is a clustered index scan:
In other words, once we started returning 0.152% of the rows in the table, SQL Server went “bugger it” and started scanning the whole table, even though it could have used the non-clustered index to eliminate 99.848% of the table at a stroke.
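As a quick sanity check of those percentages, here’s the arithmetic (plain Python, nothing SQL Server specific – the loop above inserts the values 1 to 1,000,000 into NonClusteredColumn):

```python
# The insert loop produces NonClusteredColumn values 1..1,000,000.
total_rows = 1_000_000

rows_seek = total_rows - 998_485  # rows matching WHERE NonClusteredColumn > 998485
rows_scan = total_rows - 998_484  # rows matching WHERE NonClusteredColumn > 998484

print(rows_seek, f"{rows_seek / total_rows:.4%}")  # 1515 rows -> 0.1515%
print(rows_scan, f"{rows_scan / total_rows:.4%}")  # 1516 rows -> 0.1516%
```

A single extra row – roughly one and a half thousandths of the table – is all it takes to tip the plan over.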
Now, just to prove that this was a deliberate decision on the part of SQL Server, take a look at the row estimates used by the plan:
The Estimated Number of Rows entry for the plan proves that SQL Server knew exactly what it was doing when it chose to scan the table. Those row estimates could barely have been any better, which shows that SQL Server was not confused or in some way misled by its statistics as to what was in the table.
So what accounts for that behaviour? Let’s take a look at another example, this time with more typical row sizes.
Scenario 2: Large Rows
I’m not going to give you a blow-by-blow description this time. It’s the same setup as before, the one difference being that we have artificially made each row substantially larger than in the first scenario. This means fewer rows will fit on each 8K data page and the table will be physically much larger than before:
USE Master
GO
IF DATABASEPROPERTY('Selectivity', 'Version') > 0
    DROP DATABASE Selectivity;
GO
CREATE DATABASE Selectivity
GO
USE Selectivity
GO
CREATE TABLE tblData(
    ID INT IDENTITY(1, 1),
    Data CHAR(2000), -- NOTE: Using a largeish CHAR data type to simulate larger rows. This table will be substantially larger than before
    NonClusteredColumn INT
)
-- Suppress the "rows affected" messages for a small performance boost
SET NOCOUNT ON
GO
DECLARE @counter AS INT = 0
-- Use an explicit transaction to speed the individual inserts up
BEGIN TRAN
WHILE @counter < 1000000
BEGIN
    INSERT INTO tblData (Data, NonClusteredColumn)
    VALUES ('Data', @counter + 1)
    SET @counter = @counter + 1
END
COMMIT
GO
-- Create indexes once the table is built. This should guarantee that our statistics are accurate
CREATE CLUSTERED INDEX PK_tblData ON tblData(ID)
GO
CREATE NONCLUSTERED INDEX IX_tblData_NonClusteredColumn ON tblData(NonClusteredColumn)
GO
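Before running it, we can estimate how big this version of the table should be. This is strictly back-of-envelope – the per-row overhead figure below (~11 bytes) is an assumption for illustration, not an exact SQL Server number:

```python
# Rough size estimate for the CHAR(2000) version of tblData.
# NOTE: the 11-byte row overhead is an assumed ballpark figure.
row_bytes = 4 + 2000 + 4 + 11           # INT + CHAR(2000) + INT + row overhead
usable_page_bytes = 8096                # 8KB page minus the 96-byte page header
rows_per_page = usable_page_bytes // row_bytes
pages = -(-1_000_000 // rows_per_page)  # ceiling division for 1M rows
size_mb = pages * 8 / 1024              # each page is 8KB
print(rows_per_page, pages, round(size_mb))  # 4 rows/page, 250000 pages, ~1953 MB
```

Roughly four rows per page instead of dozens or hundreds, and around a quarter of a million pages – which squares with the ~2GB figure we’ll see when we measure it.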
With the larger rows, we actually get up to around the 6.5% mark before the optimiser switches over to a clustered index scan, as you can see below:
Flipping over to a scan around the 6.5% mark:
So that’s a little more in keeping with the conventional advice of around the 5% mark.
But what causes the difference?
Unlike some SQL Server query plan behaviour – this situation is really easy to explain. Take a look at the following screen shot of how big the “small” 1 million row table is:
Even though it holds 1,000,000 rows, it’s still pretty small in memory terms. So small, in fact, that it would fit fairly comfortably in memory, and at 26MB, SQL Server would be through that in a flash. No clumsy bookmark lookups from the separate non-clustered index. No fuss.
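To put that 26MB in page terms (a rough conversion I’ve added, assuming standard 8KB pages):

```python
# The "small" table expressed in 8KB pages.
size_mb = 26
pages = size_mb * 1024 // 8        # number of 8KB pages in 26MB
rows_per_page = 1_000_000 / pages  # average rows packed onto each page
print(pages, round(rows_per_page))  # 3328 pages, ~300 rows per page
```

Three hundred-odd rows per page means a scan chews through an enormous number of rows for every single page it reads.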
Now let’s take a look at the larger table:
Hmmm. That’s the exact same 1,000,000 rows, but in memory terms the table is much larger.
SQL Server could still very well hold 2GB of table in memory – but it doesn’t take a genius to realise that scanning 2GB of data to get a small percentage of rows is a different proposition when compared to scanning 26MB.
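One way to see the trade-off in numbers is to compare page reads. This is a deliberate simplification I’m adding: assume each key lookup back into the clustered index costs roughly one page read, while a scan reads every page once (the page counts below are derived from the 26MB and ~2GB sizes above):

```python
# Rough page-read comparison at each observed tipping point.
# Assumption: one clustered-index page read per key lookup; a scan
# touches every page of the table exactly once.
def compare(rows_returned, table_pages):
    lookup_reads = rows_returned  # ~one page read per bookmark/key lookup
    scan_reads = table_pages      # sequential read of the whole table
    return lookup_reads, scan_reads

# Small table: ~1,516 lookups vs a 3,328-page scan -- the same order of
# magnitude, so the cheap sequential scan wins almost immediately.
print(compare(1_516, 3_328))
# Large table: ~65,000 lookups are tolerated before a 250,000-page scan wins.
print(compare(65_000, 250_000))
```

The tipping point in *rows* tracks the table’s *page count*, which is why the percentage looks absurdly low on a densely packed table.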
That’s not to say that SQL Server won’t scan the whole 2GB – it will (once we reach the 6.5% rows returned mark). But Microsoft have clearly set the bar a bit higher with larger (and perhaps more typical) tables.
It would be nice if there were some sort of simple formula that I could share with you to tell you what the percentages are at different table sizes. If there is such a formula, I certainly don’t know it, and it would likely be pretty complicated given the number of factors the Query Optimiser needs to take into account. It almost certainly considers far more than just how big the table is or how many pages it contains.
The objective of this post was really just to point out that on small tables, your non-clustered indexes may well be useless even with relatively high row counts. At first sight this seems pretty counter-intuitive. But when you consider what is involved in doing a straight read of say, a 26MB table, versus having to jump in and out of a separate non-clustered index structure it begins to make sense.
The thing to understand is that small tables behave very differently to large tables in this respect, and the only way to know the exact percentage is to try it yourself.
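For what it’s worth, the heuristic most often quoted (it comes from Kimberly Tripp’s writing on the “tipping point”, not from my tests here, and it’s only approximate) puts the switch somewhere between a quarter and a third of the table’s page count, expressed as a row count:

```python
# Commonly quoted "tipping point" heuristic: the scan takes over
# somewhere between 1/4 and 1/3 of the table's PAGE count, in rows.
def tipping_point_rows(table_pages):
    return table_pages / 4, table_pages / 3

print(tipping_point_rows(3_328))    # 26MB table: roughly 832 to 1,109 rows
print(tipping_point_rows(250_000))  # ~2GB table: roughly 62,500 to 83,333 rows
```

The large-table prediction lines up neatly with the ~65,000 rows (6.5%) we observed; the small-table prediction is in the right ballpark of the ~1,500 rows we saw, which is about as much as you can expect from a rule of thumb.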
Update: Gail Shaw provided some additional info on the formula that SQL Server uses to flip between scan or seek in the comments below