Monday, 22 June 2009

Nested Loop Join Costing

As described previously, the optimizer has 3 main join methods available when joining 2 data sets together:
  • Nested Loop Join
  • Sort Merge Join
  • Hash Join
Each method has a different associated formula for its cost of execution. Here I will look at the Nested Loop Join, which is the most straightforward in some respects. Some of this information, such as the costing formula, are given in the Oracle Performance Tuning Guide Manual, in the chapter on the Query Optimizer.

A Nested Loop Join reads data from one data set - the Outer data set - and for each data row retrieves corresponding rows from the other data set - the Inner data set. Clearly the Inner data set is accessed separately for each data row in the Outer data set. And the number of rows in a data set is termed its Cardinality. Hence the cost formula for a Nested Loop Join is as follows, where "Outer" and "Inner" refer to their respective data sets, and square brackets used to enclose each individual element:
  • [Cost of Outer Access] + ([Cardinality of Outer] * [Cost of Inner Access])
The Outer and Inner costs will have been determined earlier by the Optimizer, either as a single table access cost or as the cost of a previous join between data sets. Thus these component costs do not have to be recalculated by the Optimizer.

Example


Consider the following query using a table with 100,000 rows in it over 1252 * 8 KB blocks:

select it1.i1
from insert_test_1 it1, insert_test_1 it2
where it1.i1 = it2.i4
and it1.i3 = 99

I am using the same table twice in the query, with an alias for each occurrence of it.

There are indexes on each column i.e. on i1, i2 (not used here), i3 and i4. The i1 index is unique, but the others are not.

Remember that the Optimizer starts with the data set with the lowest cardinality (smallest number of data rows). In this case this is "it1", because the "i3" column has 100 rows for each possible value. This gives a cardinality of 100 for "it1" as the Outer data set. The lowest cost access to this is using the index on this, which has a cost of 102 (costed using index range scan described in an earlier post).

To join to "it2" in this case, the index on "i4" can be used. In this case no other data columns are needed from the "it2" table, and thus the cost is really that of reading the index leaf block with that particular value. For this index all entries for one value fit within one leaf block (leaf blocks per key value is 1), so the access cost is just 1 to read that particular leaf block.

The Nested Loop Cost is therefore:-
  • 102 + (100 * 1) = 102 + 100 = 202
This was verified by examining a 10053 event trace file when this SQL was executed. The costs reported differ only after the decimal point, due to a combination of rounding of values within Oracle and additional minor costs such as CPU cost (in 10g).

The 10053 trace file also showed that the Optimizer also costed other variations of the Nested Loop Join. This agrees with the statement that the Optimizer costs all possible access and join plans to determine the cheapest one.

Also considered are a full table scan of it2, at a cost of 243.1. The Nested Loop cost of this should be:
  • 102.1 + (100.2 * 243.1) = 102.1 + 24358.2 = 244460.3
The 10053 trace file actually reported a cost of 24580.26 total. Again this would be different due to rounding of results and additional CPU costs. Nevertheless, it is very close to our calculated value for the I/O cost only.

Monday, 15 June 2009

Optimizer Costing of Joins

Previously I have described how the Oracle Cost Based Optimizer costs all possible execution plans for a SQL statement to arrive at the one with the lowest cost (cheapest), and how it costs access to single tables using access methods of Full Table Scan and Index Range Scans. There are a few other potential access methods to a single table, but these are the primary ones that occur most if not all of the time. I will now describe how the Optimizer goes about costing joins between tables.

Although we often say that a query involves processing data from tables, it is important to realise that internally the query processes a series of data sets. Logically each step in the execution plan produces a result set of data rows, that is in turn consumed by its parent step within the execution plan. So although we might talk about joining "tables" together, the Optimizer is actually always joining "data sets" together. Some of these data sets do come from tables, while others may come from other steps within a more complex execution plan e.g. from a join of other data sets.

The Optimizer normally only joins two data sets together at a time. There are some exceptions, such as using Bit Map Indexes on Star schemas in Data Warehouses. But for most transaction processing systems, the joins will be between two data sets at a time. Joining more than two tables together simply involves the Optimizer first joining two of the tables together to produce a resultant data set, and then joining this to the third table, and so on to join together all the tables in the query.

First, how does the Optimizer determine the lowest cost join execution plan? In simple terms it costs each and every possible join method between the pair of data sets being joined, and chooses the cheapest one from these. That is: it considers different join methods for joining Data Set 1 to Data Set 2, and also considers the join the other way around - Data Set 2 to Data Set 1.

When there are more than two tables to join together, the Optimizer simply iterates over all the possible join combinations between them and the different join methods for each. It picks one data set as the first one, another as the second, costs the joins between them to get a cheapest, then costs the joins of this to the third data set. Then the next iteration does the first data set to the third, and in turn to the second. Then it repeats for all the other combinations - second to first to third, second to third to first, and so on. Thus all possible join paths between the data sets are costed, and the cheapest one chosen.

Second, what join methods are available? There are 3 main join methods:
  • Nested Loop Join
  • Sort Merge Join
  • Hash Join
Each method has a different associated formula for its cost of execution. Generally this is the cost of accessing each data set, already calculated previously by the Optimizer, and the cost of the join itself.

Note that when I say "cost of accessing each data set" for a join between two data sets, the actual access method can be different for each join method, and so have a different cost. It is not required that all three join methods use the same access method to get to the data sets they need to join. The Optimizer will cost the join methods with the appropriate data access and choose the cheapest one.

Thus for 2 tables and 3 join methods the Optimizer must cost 6 separate possible join combinations. For 3 tables this produces 6 different join combinations between them, each costed for the 3 join methods, for a total of 18 different possible join costs. For 4 tables it becomes 72 different joins to cost, and so on.

Fully costing each join combination properly would be very time consuming, so the Optimizer has a number of improvements to reduce this. For the first join combination it starts with the data set with the lowest cardinality - smallest estimated number of rows in it. In turn it joins this to the data set with the next lowest cardinality, and so on. This is because this is likely to produce a relatively low cost plan due to it processing less data rows initially. This is not guaranteed to be the lowest cost, but will generally be relatively low. This first join combination is then remembered as the currently lowest join cost.

When costing subsequent join combinations, if the partial cost so far of a join plan is greater than the currently lowest join cost, then it is abandoned (pruned). This makes sense, because costs can only increase for an execution plan as other costs are accumulated within it. Furthermore, sometimes the cost for just accessing the first data set may be greater than the current lowest join cost, and such a join can be pruned without any further consideration or costing of different join methods. Remember that single table access paths are costed first by the Optimizer for each table referenced in the query. So it already knows what these costs are, with no further calculations required.

Thus the Optimizer is able to cost each join combination and each join method between the data sets, to arrive at the cheapest join cost. The internal optimizations attempt to minimise the total effort involved in costing each of these, by picking a relatively low cost join combination as the first one to be costed, and by immediately pruning out any alternative join plan as soon as its partial cost exceeds the current lowest cost join.