Monday 15 June 2009

Optimizer Costing of Joins

Previously I have described how the Oracle Cost Based Optimizer costs all possible execution plans for a SQL statement to arrive at the one with the lowest cost (cheapest), and how it costs access to single tables using access methods of Full Table Scan and Index Range Scans. There are a few other potential access methods to a single table, but these are the primary ones that occur most if not all of the time. I will now describe how the Optimizer goes about costing joins between tables.

Although we often say that a query involves processing data from tables, it is important to realise that internally the query processes a series of data sets. Logically each step in the execution plan produces a result set of data rows, that is in turn consumed by its parent step within the execution plan. So although we might talk about joining "tables" together, the Optimizer is actually always joining "data sets" together. Some of these data sets do come from tables, while others may come from other steps within a more complex execution plan e.g. from a join of other data sets.

The Optimizer normally only joins two data sets together at a time. There are some exceptions, such as using Bit Map Indexes on Star schemas in Data Warehouses. But for most transaction processing systems, the joins will be between two data sets at a time. Joining more than two tables together simply involves the Optimizer first joining two of the tables together to produce a resultant data set, and then joining this to the third table, and so on to join together all the tables in the query.

First, how does the Optimizer determine the lowest cost join execution plan? In simple terms it costs each and every possible join method between the pair of data sets being joined, and chooses the cheapest one from these. That is: it considers different join methods for joining Data Set 1 to Data Set 2, and also considers the join the other way around - Data Set 2 to Data Set 1.

When there are more than two tables to join together, the Optimizer simply iterates over all the possible join combinations between them and the different join methods for each. It picks one data set as the first one, another as the second, costs the joins between them to get a cheapest, then costs the joins of this to the third data set. Then the next iteration does the first data set to the third, and in turn to the second. Then it repeats for all the other combinations - second to first to third, second to third to first, and so on. Thus all possible join paths between the data sets are costed, and the cheapest one chosen.

Second, what join methods are available? There are 3 main join methods:
  • Nested Loop Join
  • Sort Merge Join
  • Hash Join
Each method has a different associated formula for its cost of execution. Generally this is the cost of accessing each data set, already calculated previously by the Optimizer, and the cost of the join itself.

Note that when I say "cost of accessing each data set" for a join between two data sets, the actual access method can be different for each join method, and so have a different cost. It is not required that all three join methods use the same access method to get to the data sets they need to join. The Optimizer will cost the join methods with the appropriate data access and choose the cheapest one.

Thus for 2 tables and 3 join methods the Optimizer must cost 6 separate possible join combinations. For 3 tables this produces 6 different join combinations between them, each costed for the 3 join methods, for a total of 18 different possible join costs. For 4 tables it becomes 72 different joins to cost, and so on.

Fully costing each join combination properly would be very time consuming, so the Optimizer has a number of improvements to reduce this. For the first join combination it starts with the data set with the lowest cardinality - smallest estimated number of rows in it. In turn it joins this to the data set with the next lowest cardinality, and so on. This is because this is likely to produce a relatively low cost plan due to it processing less data rows initially. This is not guaranteed to be the lowest cost, but will generally be relatively low. This first join combination is then remembered as the currently lowest join cost.

When costing subsequent join combinations, if the partial cost so far of a join plan is greater than the currently lowest join cost, then it is abandoned (pruned). This makes sense, because costs can only increase for an execution plan as other costs are accumulated within it. Furthermore, sometimes the cost for just accessing the first data set may be greater than the current lowest join cost, and such a join can be pruned without any further consideration or costing of different join methods. Remember that single table access paths are costed first by the Optimizer for each table referenced in the query. So it already knows what these costs are, with no further calculations required.

Thus the Optimizer is able to cost each join combination and each join method between the data sets, to arrive at the cheapest join cost. The internal optimizations attempt to minimise the total effort involved in costing each of these, by picking a relatively low cost join combination as the first one to be costed, and by immediately pruning out any alternative join plan as soon as its partial cost exceeds the current lowest cost join.

1 comment:

Anonymous said...

Thus for 2 tables and 3 join methods the Optimizer must cost 6 separate possible join combinations. For 3 tables this produces 6 different join combinations between them, each costed for the 3 join methods, for a total of 18 different possible join costs. For 4 tables it becomes 72 different joins to cost, and so on.

when you calculate 3 tables join, the count may be not right, it seems 3*2*1*3*3=54, so 4 tables
4*3*2*1*3*3*3=24*27=648.