Monday, 31 August 2009

Hash Join Costing

As already discussed in previous posts, Oracle has several join methods available to it, and will choose the one with the lowest cost for any specific join. So far I have described how the Nested Loop and Sort Merge Join methods work. Now I will describe the Hash Join method.

It shares some similarities with the Sort Merge Join method - both data sets are accessed separately, and then joined together - but differs in how it performs the join. The cost formula for this is:
  • [Outer Access Cost] + [Inner Access Cost] + [Hash Join Cost]
As with the Sort Merge method, this uses an area of memory and so its cost is affected by whether the controlling data set is smaller than the available memory or not. If it is smaller, then the Hash Join will take place in memory, and the Hash Join Cost component is negligible, being only a CPU-based cost. If the data set is larger, then the join takes place in stages, with intermediate results being written to disk. Hence the Hash Join Cost component will increase significantly due to the disk I/Os involved in writing out the overflow data, and reading it back in again later on.
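As a rough illustration of the formula, the small sketch below uses made-up component values (not taken from a real 10053 trace) to show how the third term is what separates the in-memory and overflow cases:

    # Hypothetical illustration of the cost formula above; all numbers are invented.
    outer_access_cost = 120   # cost of accessing the outer (build) data set
    inner_access_cost = 450   # cost of accessing the inner (probe) data set

    # In-memory case: the hash join component is a small, CPU-only cost.
    hash_join_cost_in_memory = 3
    total_in_memory = outer_access_cost + inner_access_cost + hash_join_cost_in_memory

    # Overflow case: the hash join component grows with the disk I/O needed to
    # write out the overflow data and read it back in again.
    hash_join_cost_overflow = 210
    total_overflow = outer_access_cost + inner_access_cost + hash_join_cost_overflow

    print(total_in_memory, total_overflow)   # 573 780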

The memory available for the Hash Join is the same as for the Sort Merge join, being derived from the pga_aggregate_target initialization parameter.

The main difference between the Sort Merge and Hash Join methods is that the Sort Merge wants to have both data sets in memory before joining them together, whereas the Hash Join method only wants one of the data sets in memory before starting the join. This makes a significant difference to the size of the data sets that can be processed by the Hash Join before it overflows to disk, compared to the Sort Merge Join method.

Also the Sort Merge Join wants the data sets sorted before they can be merged, which has an associated cost even when the data can be sorted in memory and the cost is CPU only. The Hash Join, however, does not need the data sets pre-sorted in any way, and so avoids this extra cost. This difference means that the Hash Join is almost always cheaper than a Sort Merge Join, except under exceptional circumstances.

The Hash Join is particularly efficient compared to the Sort Merge Join where a small data set is joined to a much larger data set. A Sort Merge Join requires that both data sets be sorted, which needs memory for each data set. A Hash Join, by contrast, only requires memory proportional to the smaller, first data set and is unaffected by the size of the second data set. This means that a Hash Join can avoid needing to overflow to disk when one of the data sets is much larger, whereas a Sort Merge Join would need to overflow to disk to sort the larger data set.

The Hash Join works by reading all of the rows in one of the data sets - termed the Outer data set, as before - and putting all of them into a table-like structure in memory, assuming that sufficient memory is available. The data rows are placed in the memory table according to the values in the columns used for the join. These values are fed into a hash function - hence the name of the join method - that returns a value within the range of the size of the table in memory. The data row is then placed in this location in the table - subject to some caveats.

Then the second, Inner data set is read, the same hash function applied to the same join columns in this data set, and the corresponding location in the memory table read. Now either there is a matching row in this location or there is not. If there is a matching row, then the join takes place and a joined row is produced. If there is no matching row then the inner data row is simply discarded.
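To make the build and probe phases concrete, here is a minimal Python sketch of the idea. The rows and column values are invented, and a dictionary simply stands in for Oracle's in-memory hash table - it is an illustration of the technique, not Oracle's implementation:

    from collections import defaultdict

    # Invented sample rows of the form (join_value, payload)
    outer_rows = [(10, 'dept 10'), (20, 'dept 20'), (30, 'dept 30')]
    inner_rows = [(10, 'emp A'), (10, 'emp B'), (20, 'emp C'), (40, 'emp D')]

    # Build phase: hash each outer row on its join column and place it in the
    # in-memory table (a dict keyed on the join value stands in for the table).
    hash_table = defaultdict(list)
    for join_value, outer_payload in outer_rows:
        hash_table[join_value].append(outer_payload)

    # Probe phase: hash each inner row on the same join column and read the
    # corresponding location. Matches produce joined rows; non-matches are discarded.
    joined = []
    for join_value, inner_payload in inner_rows:
        for outer_payload in hash_table.get(join_value, []):
            joined.append((join_value, outer_payload, inner_payload))

    print(joined)   # 'emp D' (join value 40) finds no match and is simply discarded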

Thus the Hash Join method can be relatively fast, with little overhead apart from the hash function itself.

The only caveat, as mentioned, is that it is possible for more than one outer data row to hash to the same location in the memory table. When this happens the data row is simply stored in the next available higher location. When the inner data row is hashed, Oracle compares the next locations in the memory table too, for as long as they hold rows with matching join values. This handles the situation of duplicate hash values from different data rows, with minimal overhead.
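The sketch below illustrates that "next available higher location" idea using standard linear probing over a fixed-size table. It is a simplification for illustration only - the table size, rows and hash function are invented, the probe simply scans forward until it reaches an empty slot, and the sketch assumes the table never fills up:

    TABLE_SIZE = 8                      # invented size of the in-memory table
    table = [None] * TABLE_SIZE         # each slot holds (join_value, payload) or None

    def slot_for(join_value):
        # Stand-in hash function returning a location within the table's range.
        return hash(join_value) % TABLE_SIZE

    def insert_outer(join_value, payload):
        # If the home slot is already taken, store the row in the next available
        # higher location, wrapping around at the end of the table.
        pos = slot_for(join_value)
        while table[pos] is not None:
            pos = (pos + 1) % TABLE_SIZE
        table[pos] = (join_value, payload)

    def probe_inner(join_value):
        # Start at the home slot and keep comparing subsequent locations until an
        # empty slot is reached, collecting every entry whose join value matches.
        matches = []
        pos = slot_for(join_value)
        while table[pos] is not None:
            stored_value, payload = table[pos]
            if stored_value == join_value:
                matches.append(payload)
            pos = (pos + 1) % TABLE_SIZE
        return matches

    insert_outer(10, 'dept 10')
    insert_outer(18, 'dept 18')   # collides with 10 (same slot), stored one slot higher
    print(probe_inner(10), probe_inner(18), probe_inner(99))   # ['dept 10'] ['dept 18'] []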

Clearly the size of the hash table in memory is determined by the number of rows in the first or outer data set. The Optimizer is able to swap the data sets around if one is larger than the other, so that it processes the smaller one first. If this produces a significantly lower cost, then this swapped hash join will be used.
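In outline, the swap decision amounts to building the hash table from whichever data set is expected to be smaller. The snippet below is only a sketch of that idea with invented row estimates - the real Optimizer compares full costs, not just row counts:

    def choose_build_side(outer_rows_est, inner_rows_est):
        # Build the in-memory hash table from whichever data set is estimated to be
        # smaller, and probe it with the other (a "swapped" hash join if they differ
        # from the original order).
        if inner_rows_est < outer_rows_est:
            return 'inner', 'outer'    # swapped hash join
        return 'outer', 'inner'

    print(choose_build_side(1_000_000, 500))   # ('inner', 'outer')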

When the size of the first data set is larger than the available memory, Oracle proceeds along the following lines. Note that I have simplified the description, and omitted some of the details, but the gist of it is correct. When hashing the first, outer data set, rows with hash values outside the range covered by the memory table are written out to disk. But those that do hash within the in-memory range are put into the memory table. Then the second, inner data set is processed, and hashed in the same way. Those rows that hash into the in-memory range are processed as normal - join or no join - and those that don't are written out to disk.

The range of hash values covered by the in-memory table is then adjusted to cover the next range of values, and the two pending data sets on disk are joined. The saved first data set is read back in, and either hashed into memory or written out to another temporary area on disk. Then the second data set is read in, hashed and either joined or not joined if it falls within the in-memory range, or written out to another temporary area. This is repeated until the remaining data from the first, outer data set fits in memory, and the second, inner data set is processed one final time.

Note that Oracle has certain optimizations in the implementation of this hash mechanism to both discard rows from the second data set that will not join as early as possible, and to write the temporary pending data sets into pre-hashed sub-sets. These optimizations minimize the total number of disk I/Os that need to be done and minimize multiple passes on the same data rows. Thus the actual Hash Join mechanism differs in detail from my simplified summary of it.
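As noted, the real implementation differs in detail, but the general idea behind writing pre-hashed sub-sets to disk can be sketched as a partitioned hash join: partition both inputs the same way so that rows that could possibly join always land in the same partition, then join each pair of partitions in memory. The sketch below is a generic illustration of that technique with an invented partition count, not Oracle's actual mechanism:

    from collections import defaultdict

    N_PARTITIONS = 4    # invented number of on-disk partitions

    def partition(rows, key_index=0):
        # Split a data set into partitions by hashing the join value; in Oracle these
        # partitions would be written out to temporary disk storage.
        parts = defaultdict(list)
        for row in rows:
            parts[hash(row[key_index]) % N_PARTITIONS].append(row)
        return parts

    def overflow_hash_join(outer_rows, inner_rows):
        # Partition both inputs the same way, then join each matching pair of
        # partitions with an ordinary in-memory hash join, one partition at a time.
        outer_parts = partition(outer_rows)
        inner_parts = partition(inner_rows)
        joined = []
        for p in range(N_PARTITIONS):
            hash_table = defaultdict(list)
            for join_value, outer_payload in outer_parts.get(p, []):
                hash_table[join_value].append(outer_payload)
            for join_value, inner_payload in inner_parts.get(p, []):
                for outer_payload in hash_table[join_value]:
                    joined.append((join_value, outer_payload, inner_payload))
        return joined

    outer = [(i, f'outer {i}') for i in range(10)]
    inner = [(i, f'inner {i}') for i in range(5, 15)]
    print(len(overflow_hash_join(outer, inner)))   # 5 join values (5..9) match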

As ever, the costs can be verified by looking at the trace file output from the 10053 event. As described, when the outer data set is small enough to fit in memory the cost of the Hash Join step itself is very small, being only CPU operations. The majority of the total Hash Join cost is simply the sum of the access costs of the two data sets being joined.
  • [Outer Access Cost] + [Inner Access Cost] + [Hash Join Cost]
Where the data sets are much larger than the available memory, the Hash Join cost step itself increases significantly. Unfortunately I have not yet been able to determine a single rule that can be applied to derive this cost. Clearly it is dependent on the size of the outer data set, whose overflow data that will not fit in memory is written out to disk, and on the size of the inner data set, which is also processed and written out to disk. I have not yet been able to determine how the split between these two affects the reported cost of the Hash Join itself (a lack of time on my part, unfortunately). Rather than delay a posting on this I thought I would post what I have now, and later post anything else I discover about the Hash Join.
