Tuesday 25 May 2010

Date Columns at the end of Indexes

A generally good rule of thumb is that date columns should appear at the end of a multi-column index and not at the start or the middle. Why? Because generally date columns are used in range comparisons in queries a lot more often than equality comparisons. The effect of putting such a date column in the middle of an index is to distribute values of the following columns across the different dates within the index.

What this means is that if you have an index on 3 columns - A, D, Z say - where D is a date and A and Z are not (either numeric or a character string) and you execute a query constraining on all 3 columns but with a range constraint on D, the database server will have to read and check a great many index entries

select ... from T where A = 123 and D > to_date ('20090101', 'YYYYMMDD') and Z = 456 ;

An index is a tree like structure organized by the columns in it. So at the highest levels we branch out for different values of A. Then lower down, for each value of A it branches out for each value of D within that particular value of A. And underneath that we have each value of Z that occurs for each value of D. The value 456 of Z may only occur a few times in the whole table, but it could occur on any value date of D, and so could be in different parts of the index.

This means that when executing the query, the database server will traverse the index, find the branch for A with the value 123 and then all the values for D underneath it. Due to the date range constraint, it will have to check many different values of D to see which ones also have Z with a value of 456.

If the index was on A, Z, D instead, then fewer index entries would need to be checked when executing the query. First it would traverse the index to 123 for A. Then it would traverse down the next level to 456 for Z. Then it would traverse down the next level to D and all values greater than the specified date (1/1/2009). At this point all such entries in the index match the 3 constraints, and so can be retrieved.

This alternate index leads to far fewer index entries being read and checked when executing this particular query.

I am not saying that dates should always go at the end of indexes. If Z was not constrained in the query, only A and D, then the second index would not be much use, and the first index would be much better. There may also be cases where dates are used with equality constraints rather than ranges. But the general rule is still a good one. Put date columns at the end of an index, unless you know the queries being executed would benefit from that date column appearing earlier in the index. As ever the best indexes are those that are most useful to the queries you execute and cover the referenced columns.

An example of this I have just come across resulted in a query execution time coming down from 10,000 ms (10 seconds) by an order of magnitude to 20 ms - a major improvement I think you'll agree. To me this just shows the greater "efficiency" of the index - far fewer index blocks need to be visited to satisfy the query, and those fewer blocks are more likely to be cached in memory too. Hence the orders of magnitude reduction in the elapsed time of the query.

Furthermore this query is executed in a loop some 200+ times within the application I was looking at. So the net effect is not just 10 seconds once down to less than 1 second once, but really 2,000 seconds or 33 minutes down to less than 33 seconds across the overall job each day it is run. A worthwhile improvement from having an alternate index with the date column at the end of it.