Wednesday 27 May 2009

Optimizer Costing of Index Scans

Previously I gave a brief overview of the fact that the optimizer costs each potential execution plan for a SQL query and then chooses the one with the lowest cost, and described how it costs a Full Table Scan. Now I will carry on and look at the cost calculations for using an Index to access data in a table. Again, most of this is based on the information in A Look Under The Hood Of CBO: The 10053 Event by Wolfgang Breitling, and various tests I have done to confirm the details of this. Now we will look at the calculation of costs for table access using an Index.

Note that I am concentrating here on the I/O cost as that is the largest component of the cost of an execution plan. However, there is also a CPU cost, which is not shown here. Typically this is far less than the I/O cost, and is so in the cases discussed so far. It is possible for some execution plans that the CPU cost is significant, and this will increase the total cost of such an execution plan.

Just as Oracle can scan all blocks in a table, so it can scan all leaf blocks in an index. This scan can be performed in two different ways potentially - as a scan using single block reads, or a scan using multi-block reads as for a table scan. The former is termed an Index Full Scan, while the latter is termed an Index Fast Full Scan. The cost calculations of each are based on the number of Leaf Blocks in that particular index.

For a Full Scan the cost formula seems to be:
  • #Levels + #Leaf Blocks
For a Fast Full Scan the cost formula uses MBRC as for a Full Table Scan:
  • #Levels + (#Leaf Blocks * MREADTIM / (MBRC * SREADTIM))
Testing with different queries and different indexes shows this to be the case.

Note that I have seen Oracle report both access methods as an "INDEX FAST FULL SCAN" operation in an EXPLAIN PLAN (DBMS_STATS.DISPLAY_CURSOR). I wonder whether this is a bug within the labelling of operations within the Plan Table used?

Normally however indexes are used to implement a predicate condition, and restrict the number of rows being returned. To do this the index is traversed from the root block down through the levels of the index to the leaf blocks that contain the matching values. Not all leaf blocks will be visited - only those containing a matching entry for the predicate condition. The Oracle Optimizer terms this restriction on the number of matching entries the Filter Factor. It is the proportion of rows in the whole table that will match the predicate condition i.e. a value always less than 1. Thus the cost of using only the index itself is:
  • #Levels + (#Leaf Blocks * Filter Factor)
In many situations some data needed for the query is not present in the index, and so the data row needs to be read from the table itself. Thus the cost increases by the cost of reading these data rows, which is essentially the number of rows matching the predicate condition:
  • #Rows * Filter Factor
There are however several refinements which need to be made to these formulae to be reflect how the Optimizer actually costs the access paths.

First, it is possible that matching data rows are stored in the same data block. In such cases the data block is only read once, even though multiple data rows are used from it. Such clustering of similar values will reduce the number of data blocks that need to be read i.e. the read of one data block from disk can satisfy multiple entries from the same leaf block. Oracle calculates this clustering as a statistic for each index, and uses this to report the number of data blocks that would need to be read – the Clustering Factor. This is visible in the CLUSTERING_FACTOR column in the USER_INDEXES and USER_IND_STATISTICS data dictionary views.

For many indexes it will be the case that the Clustering Factor is the same value as the number of Rows in the table – no co-location of data values for that particular column. But for some indexes, the Clustering Factor will be much lower, such as an index on the date the data was loaded, or an incrementing sequential account number. The Clustering Factor Statistic factors in the benefits of the reduced disk I/O from this co-location of Index and Data entries, to better estimate the real disk I/O that would actually take place.

The final full index cost formula is thus:
  • #Levels + (#Leaf Blocks * Filter Factor) + (Clustering Factor * Filter Factor)
Second the Filter Factor is calculated in different ways for different types of predicate conditions within a query. For now I will restrict myself to equality conditions (=), as they are the most common. Please read Wolfgang's Under The Hood paper for information on the Filter Factor for other conditions.

For a column with no Histograms (this is important), the Filter Factor for an equality condition will be the density of data values within that column - a measure of the proportion of data rows that have the same value for that column. This is true whether the condition uses a literal value or a bind variable. In turn the density value is simply the inverse of the number of distinct values in that column i.e. 1 / NDV. This count is available in the DISTINCT_KEYS column of USER_IND_STATISTICS and USER_INDEXES.

The Filter Factor can also be used to calculate the number of matching rows produced by the predicate condition i.e. the Cardinality of the result set. This is used by the Optimizer when multiple conditions are combined, either on the same table or joined tables.

Filter Factors on multiple conditions in a query are combined together according to the standard rules for probabilities:
  • c1 AND c2 ==> FF1 * FF2
  • c1 OR c2 ==> FF1 + FF2 – (FF1 * FF2) i.e. exclude duplicate matches
If the column has a Histogram on it, the Density value will be calculated from the Density values of each bucket within the Histogram that fall within the query range. This might be close to 1 / NDV for the whole table, or it might be different. That is one way in which Histograms can have such a big impact on query execution plans.

Examples


Using the index cost formula of:
  • #Levels + (#Leaf Blocks * Filter Factor) + (Clustering Factor * Filter Factor)
For a query with a condition on column C3, which has an index on it with a Density of 0.001002, 250 Leaf Blocks, 1 Level and 100,000 rows and the same value for Clustering Factor the cost should be:
  • 1 + 250 * 0.001002 + 100000 * 0.001002 = 1 + 0.2505 + 100.2 = 101.4505
The reported I/O cost by Oracle was 102, which agrees very closely with our calculated value. The difference is probably due to rounding errors. The expected cardinality is 100,000 * 0.001002 = 100.2.

No comments: