SQL Server: Are your non-clustered indexes useless?

Introduction

If you’ve been doing SQL Server performance tuning for a while, you’ve probably come across the stock advice that if your query returns too many rows, your indexes may not be used. Although it’s sometimes skimmed over, this logic generally only applies to non-covering, non-clustered indexes.

Before long various stats are flying around, usually in the form of a quote like:

“If your query returns more than X% of the rows in a table, your index won’t be used”

Where X = 1%, 3%, 5%, 10%, 15%, 30% or any other number that seems to be plucked out of the ether somewhere.

But what if I told you something even more outlandish than that 1% figure? What if I told you the number could be as low as ~0.15% of a table before your non-clustered index becomes effectively useless?

I don’t know about you but that seems fairly improbable to me. We’re basically saying that even though the index could be used to eliminate 99.85% of the table from consideration in the blink of an eye, we’re going to scan the entire table anyway. What the what now!?

Don’t take my word for it though. Let’s take a look at a couple of examples and then I’ll explain why this makes sense.

Scenario 1: Small Rows

Let’s create a sample database and see if we can recreate this scenario:

USE Master

GO

IF DATABASEPROPERTY('Selectivity', 'Version') > 0
    DROP DATABASE Selectivity;

GO

CREATE DATABASE Selectivity

GO

USE Selectivity

GO

CREATE TABLE tblData(
	ID INT IDENTITY(1, 1),
	Data VARCHAR(50),					-- NOTE: We are deliberately creating small rows to fit as many individual rows on a page as possible
	NonClusteredColumn INT
)

This database and table structure is unremarkable in all respects except one. The rows are very small – unusually so in fact, but not beyond the realms of possibility. Even if this table were pressing 1 million+ rows, it’s not going to take up very much room and, as we’ll see later, this is quite important.

Let’s hoof some rows in there and build both a clustered and non-clustered index on two different columns. Remember we’re going to be basing our queries on the non-clustered index, as the behaviour of the clustered index will be quite different.


-- Suppress the 'rows affected' messages for each insert for a small performance boost
SET NOCOUNT ON 

GO

DECLARE @counter AS INT = 0

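-- Use an explicit transaction to speed the individual inserts up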
BEGIN TRAN

	WHILE @counter < 1000000
		BEGIN

			INSERT INTO tblData (Data, NonClusteredColumn)
			VALUES ('Data', @counter + 1)

			SET @counter = @counter + 1
		END

COMMIT

GO
-- Create indexes once the table is built. This should guarantee that our statistics are accurate
CREATE CLUSTERED INDEX PK_tblData ON tblData(ID)
GO
CREATE NONCLUSTERED INDEX IX_tblData_NonClusteredColumn ON tblData(NonClusteredColumn)
GO

Again, nothing remarkable there except that we build the indexes at the end, simply to make sure the index statistics are up to date.
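As an optional sanity check, you can confirm how physically compact this table ends up. A query along these lines against sys.dm_db_index_physical_stats will report the average record size and page count for each index (the 'DETAILED' scan mode is needed before avg_record_size_in_bytes is populated):

-- Optional: check how small the rows are and how few pages the table occupies
SELECT index_id,
       index_type_desc,
       page_count,
       avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID('Selectivity'), OBJECT_ID('tblData'), NULL, NULL, 'DETAILED')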

So – on with the testing. Give the following query a go:


SELECT * FROM tblData
WHERE tblData.NonClusteredColumn > 998485

Small Rows Seek

That query will return 0.151% of the rows in the table and will cause a non-clustered index seek. I don’t know about you, but that’s exactly what I was expecting given the search predicate is perfectly suited to using the non-clustered index.

So let’s modify the query to return just 1 more row:


SELECT * FROM tblData
WHERE tblData.NonClusteredColumn > 998484

And the result is a clustered index scan:

Small Rows Scan

In other words, once we started returning 0.152% of the rows in the table, SQL Server went “bugger it” and started scanning the whole table, even though it could have used the non-clustered index to eliminate 99.848% of the table at a stroke.
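If you want to verify those percentages for yourself rather than trusting my arithmetic, a quick count does the job (using the same threshold as the query above):

-- What fraction of the table does the predicate actually return?
SELECT COUNT(*) AS MatchingRows,
       COUNT(*) * 100.0 / (SELECT COUNT(*) FROM tblData) AS PercentOfTable
FROM tblData
WHERE tblData.NonClusteredColumn > 998484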

Now, just to prove that this was a deliberate decision on the part of SQL Server, take a look at the row estimates used by the plan:

Accurate Statistics

The Estimated Number of Rows entry in the plan proves that SQL Server knew exactly what it was doing when it chose to scan the table. Those row estimates could barely have been any better, which proves that SQL Server was not confused or in some way misled by its statistics as to what was in the table.
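If you want to see exactly what the optimiser was working from, DBCC SHOW_STATISTICS will dump the statistics for the non-clustered index, including when they were last updated, how many rows were sampled and the histogram those row estimates come from:

-- Show the statistics behind the non-clustered index
DBCC SHOW_STATISTICS ('dbo.tblData', 'IX_tblData_NonClusteredColumn')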

So what accounts for that behaviour? Let’s take a look at another example, this time with more typical row sizes.

Scenario 2: Large Rows

I’m not going to give you a blow-by-blow description this time. It’s the same setup as before, the one difference being that we have artificially made each row substantially larger than in the first scenario. This means fewer rows will fit on each 8K data page and the table will be physically much larger than before:

USE Master

GO

IF DATABASEPROPERTY('Selectivity', 'Version') > 0
    DROP DATABASE Selectivity;

GO

CREATE DATABASE Selectivity

GO

USE Selectivity

GO

CREATE TABLE tblData(
	ID INT IDENTITY(1, 1),
	Data CHAR(2000),					-- NOTE: Using a largeish CHAR data type to simulate larger rows. This table will be substantially larger than before
	NonClusteredColumn INT
)

-- Suppress the 'rows affected' messages for each insert for a small performance boost
SET NOCOUNT ON 

GO

DECLARE @counter AS INT = 0

-- Use an explicit transaction to speed the individual inserts up

BEGIN TRAN

	WHILE @counter < 1000000
		BEGIN

			INSERT INTO tblData (Data, NonClusteredColumn)
			VALUES ('Data', @counter + 1)

			SET @counter = @counter + 1
		END

COMMIT

GO
-- Create indexes once the table is built. This should guarantee that our statistics are accurate
CREATE CLUSTERED INDEX PK_tblData ON tblData(ID)
GO
CREATE NONCLUSTERED INDEX IX_tblData_NonClusteredColumn ON tblData(NonClusteredColumn)
GO

With the larger rows, we actually get up to around the 6.5% mark before we get switched over to a clustered index scan, as you can see below:

Large Rows Seek

Flipping over to a scan around the 6.5% mark:

Large Rows Scan

So that’s a little more in keeping with the conventional advice of around the 5% mark.
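If you want to hunt for the tipping point yourself, queries along the lines of the following will do it. The threshold below returns roughly 6.5% of the rows and is purely illustrative; the exact value at which the plan flips will vary between systems, so nudge it up and down while comparing the actual execution plans:

-- Roughly 65,000 rows (~6.5% of the table) - an illustrative threshold only
SELECT * FROM tblData
WHERE tblData.NonClusteredColumn > 935000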

But what causes the difference?

Unlike some SQL Server query plan behaviour – this situation is really easy to explain. Take a look at the following screen shot of how big the “small” 1 million row table is:

Small table size

Even though it holds 1,000,000 rows, it’s still pretty small in memory terms. So small, in fact, that it would fit fairly comfortably in memory, and at 26MB, SQL Server would be through that in a flash. No clumsy bookmark lookups from the separate non-clustered index. No fuss.

Now lets take a look at the larger table:

Large Rows

Hmmm. That’s the exact same 1,000,000 rows, but in memory terms, the table is much larger.

SQL Server could still very well hold 2GB of table in memory – but it doesn’t take a genius to realise that scanning 2GB of data to get a small percentage of rows is a different proposition when compared to scanning 26MB.

That’s not to say that SQL Server won’t scan the whole 2GB – it will (once we reach the 6.5% rows returned mark). But Microsoft have clearly set the bar a bit higher with larger (and perhaps more typical) tables.
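Incidentally, if you want to reproduce the size figures shown above on your own copies of the two tables, sp_spaceused is the quickest way to get them:

-- Report the reserved, data and index sizes for the table
EXEC sp_spaceused 'tblData'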

Conclusion

It would be nice if there were some simple formula I could share with you to tell you what the percentages are at different table sizes. If there is such a formula, I certainly don’t know what it is, and it would likely be pretty complicated given the number of factors the Query Optimiser has to take into account. It almost certainly considers far more than just how big the table is or how many pages it contains.

The objective of this post was really just to point out that on small tables, your non-clustered indexes may well be useless even with relatively high row counts. At first sight this seems pretty counter-intuitive, but when you consider what is involved in doing a straight read of, say, a 26MB table, versus having to jump in and out of a separate non-clustered index structure, it begins to make sense.

The thing to understand is that small tables behave very differently to large tables in this respect, and the only way to know what the exact percentage is for your table is to try it yourself.
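If you do want to experiment, the two numbers worth watching are how many rows your predicate returns and how many pages the table occupies. A rough sketch along these lines pulls the row and page counts from sys.dm_db_partition_stats so you can compare them against whatever threshold you find:

-- Rows in the table vs. in-row data pages in the clustered index
SELECT row_count,
       in_row_data_page_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('dbo.tblData')
AND index_id = 1		-- index_id 1 = the clustered index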

Update: Gail Shaw provided some additional info on the formula that SQL Server uses to flip between a seek and a scan in the comments below.


3 thoughts on “SQL Server: Are your non-clustered indexes useless?”

  1. Minor comment, your conclusion about it being related to memory size and amount of data needing scanning is wrong.

    It’s got nothing to do with memory sizes or table sizes. It’s got to do with the calculated cost of the key lookups. Each key lookup will read at least one page, more commonly 2 or more (they’re singleton seeks against the clustered index). These individual seeks are expensive and there’s a point where a read of the entire clustered index is more efficient by the query optimiser’s calculations than a large number of key lookups.

    There is a formula, it’s the ratio of rows being returned by the seek to the number of pages in the table, which is why small rows tip at much lower % than large rows.

    Doesn’t go into much detail, but: http://sqlinthewild.co.za/index.php/2009/01/09/seek-or-scan/
