Tuesday, 6 February 2018

Hash Join Overflow Costing #3 - Simpler Formula

So far I have offered a formula for how a Hash Join that overflows to disk is costed and confirmed that this is only costed in terms of single block disk operations. While that formula produces very accurate results (less than 1% difference to the reported cost for the Hash Join operation) it requires you to obtain size information on each individual column being retrieved from both the Build and Probe data sets. And this could be quite tedious when there are many columns involved, or you don't have an easy way to work out the source columns involved in each data set. There is a simplification we can make to eliminate this detailed per column information and just use the information reported in the execution plan itself.

All of the columns being retrieved for the Build and Probe data sets are actually included within the "Bytes" value reported in the execution plan for each data set. And this value includes both the individual column data and any per-column overhead as well. What it does not include is the extra 12 bytes per row overhead needed in the hash table itself. We can approximate the size of this second part of the data by using the "Rows" value reported in the execution plan.

Examining the "Bytes" and "Rows" values in the execution plans for my test queries (see first post for details of these queries), I can see that the increase in the Hash Join cost is about 0.0485 per KB of increase in either data set (Build or Probe) for the same number of rows. Previously I determined that there was a 12 byte overhead within the hash table per data row expected.

This produces a revised Hash Join cost formula using only values from the execution plan of:
  • ( (Build KBytes + (12 * Build Rows / 1000) ) + (Probe KBytes + (12 * Probe Rows / 1000) ) ) * 0.0485
Note that the "Bytes" values used need to be in "K" units, whereas the "Rows" is not and so is divided by 1000. Checking this formula against the actual Hash Join costs of the test queries I ran I can see that it has an error of about 1% i.e. it is not as accurate as the previous formula, but is still accurate enough I believe.

Lets check how close this from one of the test queries. Here is the execution plan produced:
--------------------------------------------------------------------------------------
| Id  | Operation     | Name   | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      |      |      | 1266 (100)|          |
|   1 |  SORT AGGREGATE     |      |    1 |   30 |      |           |          |
|*  2 |   HASH JOIN     |      | 50000 | 1464K| 1712K| 1266   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| HJ50K  | 50000 | 1123K|      |  365   (1)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| HJ100K |  100K|  683K|      |  728   (1)| 00:00:01 |
--------------------------------------------------------------------------------------
We can see:
  • The Build and Probe data access cost is 365 + 728 = 1093
  • The cost of the Hash Join operation itself is 1266 - 1093 = 173
  • The calculated approximate cost is (1123 + (12 * 50) + 683 + (12 * 100)) * 0.0485
    • = (1123 + 600 + 683 + 1200) * 0.0485
    • = 3606 * 0.0485 
    • = 174.891
  • The difference is +1.891, which is 1.1% of the actual cost
This formula can therefore be a useful check against the cost reported when a Hash Join operation overflows to disk, and for determining which is the biggest cost factor i.e. the Build or the Probe data sets.

No comments: