Monday, 20 July 2009

Sort Merge Join Costing

For a Sort-Merge Join, each data set (Outer and Inner) is read separately and sorted by the joining columns. The resultant sorted data sets are then joined together. The cost formula for this is:
  • [Outer Access Cost] + [Sorting Outer Cost] + [Inner Access Cost] + [Inner Sort Cost]
We know how to calculate the two data Access Costs from earlier posts for different tables and the indexes available on them. How does the Optimizer cost the two Sorts?

The two key points are - how big is the data set to be sorted, and how much memory is available for use as a sort area?

If both of the data sets fit into the available memory, then they can be sorted directly into memory as they are read from their sources (previous steps in the execution plan). This avoids additional disk I/O and only has a CPU cost for the Sorts. The Optimizer calculates an approximate cost for this CPU component according to its own internal formula. Although this CPU cost is included by the Optimizer in its costs, for our purposes we can ignore it. This is because it is generally so small compared to the I/O costs involved in getting the data from disk in the first place, that it is negligible and has no impact on the final choice of which execution plan to use.

If you are using Oracle 10g and its new initialization parameters, with "workarea_size_policy" set to auto, then the amount of memory used by Optimizer as the query sort area will be a percentage of the memory specified for "pga_aggregate_target". (If these are not set correctly, then it will use the value of the older "sort_area_size parameter" - but this is now deprecated and should not be used.)

The percentage it will use is not documented anywhere, but there are several references to 5% on the web. However, in the testing that I did I saw nearer 20% being used. This could be due to the fact that nothing else was happening on this system, so all of the PGA memory was free and available. Or that I had a relatively small PGA aggregate target of 32 MB, and there is a minimum allocation size from within the PGA memory area.

Either way, if you are trying to estimate the cost yourself, it is probably best to use 5% of the PGA memory as your value.

To calculate the size of the data being sorted you need to know the following:
  • The columns being retrieved from each table. Oracle only processes those data columns it needs, and not the whole data row. The columns used will include those needed for the query output itself, plus those needed for filter conditions and joins to other data sets.
  • The size of each of these columns, added together for each data set to give a data total.
  • An additional 6 bytes per data row, as a unique identifier for it, internal to the processing by the Optimizer. This is present in all data sets, regardless of the join method.
  • An additional 12 bytes per data row for sort processing, on 32 bit systems. This would be doubled to 24 bytes on 64 bit systems, as they are memory address pointers. [Jonathan Lewis, Cost Based Oracle Fundamentals].
  • The number of rows or cardinality of each data set.
Adding together the row element sizes and multiplying by the cardinality of the data set, gives the total space needed for each data set. If either of these exceeds the available memory, then the Optimizer assumes that the Sort will be done in chunks, or sets, each written out to disk, and then each sorted sub-set merged together into the final sorted result set.

During this Merge phase, Oracle needs to read in blocks of sorted data from all of the sorted sets into memory at the same time, in order to properly merge them together. For efficiency Oracle uses multi-block reads on the sorted sets, because they are already sorted sequentially on disk and are being used in that order. Multi-block reads reduces the total number of read disk I/Os needed to read in all the sorted sets. Thus the number of sorted sub-sets that can be merged simultaneously is limited by the memory available, as mentioned earlier, and the size of the multi-block reads. The Optimizer terms this the "Sort width" in some of its trace output, but it means the number of sort sets that can be merged together at once.

If there is enough memory after the Sort phase to Merge all of the sorted sets together at once, then they are indeed merged this way, and this is termed a "single pass merge" - one processing of the sorted sets to merge them.

If there is not enough memory to Merge all sorted sets at once, then Oracle will merge as many as it can, and produce a single, larger sorted set of data rows from these. It will then repeat this merging across the next group of sorted sets, and so on to iterate over all of the sets from the initial sort. This will result in a smaller number of larger sorted sets, which hopefully can now be merged together. If not, the merging of sets is repeated until the number of sets is reduced to that which can be merged together. This is termed a "multi-pass merge".

Assuming that the data to be sorted is larger than the memory available, then the cost of the initial sort phase into sorted sets is the cost of writing out the data blocks using single block writes, plus the CPU cost to do the sort. Again, as before, the CPU cost is trivial compared to the disk I/O cost, but both are calculated by the Optimizer. The cost of the Merge phase is again the cost to read in the data from disk, plus the CPU cost of the merge.

When reading the data back in to merge, it seems that Oracle wants to use multi-block reads as described, but may not be able to achieve that for all the blocks for some reason or another. The Optimizer seems to base its calculations on achieving multi-block reads for two thirds of the sorted blocks, but not for the other third i.e. single block I/O used for this.

Sort-Merge Example

The calculations here produce results that agree with the costs given in a 10053 event trace file. In fact, it was a number of such trace files for different data sets that was used to derive the formula presented.

The Outer data set has 7 bytes of data, plus the 6 overhead is 13 bytes of data per row, with cardinality of 500,000 rows. The cost to produce this has been calculated by the Optimizer as being 245.84.

The Inner data set has 3 bytes of data, plus 6 overhead gives 9 bytes of data per row, with cardinality of 100,000 rows. The Optimizer has calculated the cost to produce this as 270.83.

Adding on the 12 bytes per row used by the sort operation, gives 25 bytes per Outer data row and 21 bytes per Inner data row. The Outer data volume is therefore 500,000 * 25 = 12,500,00 bytes = 11.92 MB (dividing by 1024 twice). The maximum memory available for the sort is 6.4 MB in this case, so the data will be sorted to disk and then merged back together. The sorts can be done using 2 sort sets, so only a single pass merge is required.

It seems that when the Outer data set is sorted in multiple sets, so is the Inner data set too, even though it could fit in memory in this case.

To calculate the I/Os involved, we must calculate the number of blocks required to store each of these data sets of 25 and 21 bytes per data row each. With an 8 KB block size, the calculations of the number of blocks are:
  • Outer Rows per block = floor (8192 / 25) = 327
  • Outer blocks = ceiling (500,000 / 327) = 1530
  • Inner Rows per block = floor (8192 / 21) = 390
  • Inner blocks = ceiling (100,000 / 390) = 257
To each of these another block is added, presumably some kind of master control block, giving 1531 and 258 blocks per data set.

The sort phase cost is that of writing out the blocks using single block I/O i.e. 1531 and 258. The merge phase cost is two thirds multi-block I/O and one third single block I/O. Remember that multi-block I/O is adjusted by Oracle by the ratio of single block read to multi-block read and the value used for Multi-Block Read Count (MBRC, typically being the db_file_multiblock_read_count initialization parameter), as seen previously in the formula for a Full Table Scan. Using typical values of these of 8 for MBRC and 12 for SREADTIM and 26 for MREADTIM, gives an adjusted MBRC value of 3.69.

However, Oracle does not use the value of "db_file_multiblock_read_count" for sorts, and instead uses an undocumented parameter of "_smm_auto_min_io_size", which defaults to 56 KB. This is 7 blocks in size in this case.
  • 1530 / 3 = 510
  • 510 * 2 = 1020 (multi-block I/O)
  • Multi-Block Cost is (1020 * 26) / (7 * 12) = 26520 / 84 = 315.714 i.e. 316 I/O cost units
The Merge phase cost is the sum of both types of I/O - 510 + 316 = 826. The Sort phase cost is 1020. And the total Sort/Merge I/O cost is 1530 + 826 = 2356.

Likewise for the Inner data set the Sort phase cost is 258. The Merge cost is derived from:
  • 258 / 3 = 86
  • 86 * 2 = 172
  • Multi-Block Cost is (172 * 26) / (7 * 12) = 4472 / 84 = 53.23 i.e. 54 I/O cost units
  • Total Merge cost is 86 + 54 = 140.
  • Total Sort / Merge cost is 258 + 140 = 398.
These results agree very closely with those reported in the 10053 event trace file for this query. This reports the Merge cost separately as "IO Cost / pass", and the "Total IO sort cost" which is actually for the complete Sort / Merge. The Sort cost itself must be deduced by subtracting these two values from each other.

For the Outer data set the 10053 trace file reported 2349 as the total Sort/Merge cost, and 400 for the Inner data set. These differences can be explained as follows:

The Optimizer used a calculated cardinality of 497,392.56 for the Outer data set, instead of 500,000. This resulted in a lower sort block count of 1523 instead of 1530.

After dividing the block count by three into thirds, it seems to add on two more blocks, presumably to account for an uneven division. Thus 1523 / 3 gives 508 blocks rounded up, but the Optimizer uses 510. Likewise instead of 86 for the third of 258, it uses 88.

This gives the modified results of:
  • Outer Merge I/O Cost = 510 + 316 = 826
  • Outer Sort / Merge I/O Cost = 1523 + 826 = 2349 - as reported in the 10053 file
  • Inner Merge I/O Cost = 88 + 55 = 143 - 142 is reported in the 10053 file
  • Inner Sort / Merge I/O Cost = 258 + 143 = 401 - 400 is reported in the 10053 file
Using the values reported in the 10053 file, the total cost for the Sort Merge Join is:
  • [Outer Access Cost] + [Sorting Outer Cost] + [Inner Access Cost] + [Inner Sort Cost]
  • 245.84 + 2349 + 270.83 + 400 = 3265.67
The reported value in the 10053 trace file is 3331.02, of which 3254.00 is the total I/O cost. The missing 65.35 (3331.02 - 3265.67) is presumably the CPU cost of the two sorts and their merges and the production of the Outer and Inner data sets. As predicted this is much lower than the I/O cost of the sorts and merges.

The heuristic presented of the Merge reading two thirds of the sorted blocks using multi-block reads, and single blocks reads for the remaining third, was arrived at by varying the values of different inputs that the Optimizer would use, and observing the impact on the costs reported. Whenever the value of _smm_auto_min_io_size was adjusted there was a corresponding change in the reported value of the Merge cost (IO Cost / pass). From this the splitting into thirds was deduced, and also that the Sort cost was simply the number of blocks to be sorted due to using single block I/O. Once hypothesized, all subsequent tests agreed with this formula, and none disagreed. Increasing the number of rows to be sorted, or the width of the rows, or the PGA Aggregate Target did change the number of sort sets produced, but the formula presented here still held and the Sort and Merge I/O costs reported agreed with those calculated.

No comments: