Introduction
The Hash Join method was introduced in Oracle version 7 (7.3 specifically, I believe), and one of its main goals was to be a join method that lent itself well to parallel execution. However, it is such an efficient join method for larger data volumes, even in serial execution, that the Optimizer often picks it over Nested Loops or Sort Merge because of its lower execution cost. This probably makes the Hash Join the method most frequently used by the Oracle Optimizer, and it appears in the execution plans of many SQL queries.
A Hash Join works by building a table in memory containing the data from the first data set (termed the Build data set), and then reading the second data set (the Probe data set) and looking up each of its rows in the in-memory table for any match (join). The in-memory table is structured and accessed by applying a "hash function" to the relevant join columns. This hashing has various benefits around performance, handling any range of input data values, and distributing the input values evenly within the table (minimising the bunching together of values).
When this hash table fits in memory the additional cost of the Hash Join operation is negligible, because it only involves CPU and memory operations, and these are multiple orders of magnitude faster than disk accesses, i.e. there is often little or no difference between the total reported cost of the Hash Join and the sum of the costs of accessing each source data set. However, when the hash table needed is larger than can fit in the available memory in the PGA, it must overflow to disk, which in turn significantly increases the cost of the Hash Join operation itself.
A question I have had for a long time is "How does Oracle cost this overflowing Hash Join operation?" Can I replicate this cost formula and understand the main factors within the reported cost? Which is the bigger factor: the size of the Build data set or the Probe data set? Knowing such things might offer some insight into ways of tuning such large hash joins. At the least I would know more about how the overflowing hash join actually works in practice.
Jonathan Lewis gives a description in his Cost Based Oracle Fundamentals book of how the overflowing Hash Join operation works, with a formula for the major cost components involved. However, I have always found this formula to be more descriptive than quantitative, and not easy to use to arrive at a value comparable to what the Optimizer has reported.
I would like a more straightforward quantitative formula that I could use myself to estimate whether a Hash Join will overflow to disk or not, and what its cost will be. After some research I believe I have arrived at such a formula, which I will share here. Note that I'm not saying this is a "perfect" formula, just that this is the conclusion I have arrived at so far as a result of the tests I have done, and it seems to fit my results very well. I'll continue to post more details when I refine or revise this in the future.
Why Hash Join Overflows
The limit to the size of a single Hash Table or other "work area" in the PGA is determined by a hidden, internal initialization parameter, "_smm_min_size". If the Hash Table needed would be larger than this, then the Optimizer assumes that it will overflow to disk and costs it accordingly. My notes say that the value of "_smm_min_size" is the larger of 128 KB or 0.1% of PGA_AGGREGATE_TARGET. I cannot find exactly where I got this information from, but my memory tells me that it was from Randolf Geist, possibly in a response to a question on one of the Oracle forums.
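If you want to check the value of this hidden parameter on your own system, a query along the following lines will show it. Note that it uses the X$ fixed tables, so it needs to be run as SYS, and hidden parameters are undocumented and subject to change between versions:
select p.ksppinm  parameter_name
     , v.ksppstvl current_value
     , p.ksppdesc description
from   x$ksppi  p
     , x$ksppcv v
where  p.indx = v.indx
and    p.ksppinm = '_smm_min_size' ;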
The main reason Oracle limits the potential size of a Hash Table is to ensure that a large number of other queries can run concurrently without being starved of memory within the PGA. The Optimizer assumes a worst case scenario to make sure other sessions do not suffer when one session executes a very large query. However, it is possible that the hash table will not overflow to disk when the query is executed, i.e. if at that moment there is enough free memory in the PGA, then Oracle will let that session have a larger "work area" than the value of "_smm_min_size". So even though the Optimizer has costed the Hash Join operation as an overflow to disk, that does not mean it will always overflow to disk when executed.
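One way to see what actually happened at run time is to query V$SQL_WORKAREA after executing the SQL statement. Something like the following (with the SQL_ID supplied as a substitution variable) shows whether the hash join work area last ran in optimal, one-pass or multi-pass mode, and how much memory and temporary space it used:
select operation_type, operation_id, policy, last_execution
     , optimal_executions, onepass_executions, multipasses_executions
     , last_memory_used, last_tempseg_size
from   v$sql_workarea
where  sql_id = '&sql_id'
order  by operation_id ;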
How Overflow Hash Join Works
Jonathan
Lewis gives a description in his book of how the Hash Join works when
it overflows to disk. I won't repeat the details here as they are not
particularly relevant at the end of the day. But a crude summary would
be that:
- The first data set is read once and broken into
"chunks" that are written out to disk (temporary storage), where each
chunk is a contiguous sub-set of the overall hash table
- One or more of these chunks are then kept in memory ready for the pass over the second data set
- The second data set is read in and each row hashed in the normal way:
- If a row hashes to an in-memory chunk then it is matched as normal
- Otherwise it is written out to disk (temporary storage), split into chunks on the same basis as the first data set
- Then the remaining chunks of the first data set are read into memory, and the corresponding chunks of the second data set read back in and matched as normal. This is repeated until all of both data sets have been processed.
Essentially this is saying that there will be an additional pass over both data sets: after the first read of each data set (already costed within the execution plan), the majority of each data set is written out to disk and then read back in again. Extra reads and writes may also be needed during the first pass over each data set, to keep the data in the "chunks" properly grouped together on disk.
It is not clear whether these write and read operations will be single block or multi-block disk operations, or how many of them there will be. Potentially multi-block reads could be used when reading back in the pre-hashed chunks from disk. Luckily, though, this turns out to be irrelevant to the cost formula I have arrived at.
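If you do want to see how much temporary tablespace I/O a particular execution performed, one option is to snapshot the session statistics before and after running the query. Something like the following, run from the same session, shows the relevant counters:
select sn.name, ms.value
from   v$mystat ms, v$statname sn
where  ms.statistic# = sn.statistic#
and    sn.name in ('physical reads direct temporary tablespace'
                 , 'physical writes direct temporary tablespace') ;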
Deriving Hash Join Overflow Cost
Here is how I went about it. I created a set of standard test tables (see later for the SQL DDL), each with a mix of NUMBER and VARCHAR2 columns to pad them out a bit, and populated them using a repeatable "connect by" data generator with a different number of rows in each test table. I then ran a query joining 2 of these tables together (see later for the SQL), immediately examined the execution plan (from dbms_xplan.display_cursor) and noted the costs of each operation.
Without any indexes on any of these tables the execution plan was always 2 Full Table Scans feeding into a Hash Join operation. When the smaller, build table became large enough, the Hash Join operation would overflow to disk, causing its cost to rise significantly and a "TempSpc" column to appear in the execution plan with a reported value.
By varying only one thing at a time between queries I could see how the Hash Join cost changed when it was overflowing to disk. I was not interested in those executions where the Hash Join did not overflow to disk, i.e. where the hash table did fit in memory, only in those where the Optimizer assumed it would overflow to disk. By examining the change in cost of the Hash Join operation for a corresponding change in only one of the joined tables, I could deduce a multiplying factor being used within the underlying Hash Join cost calculation.
My Oracle version is 12.1 on Oracle Linux, so my
results are only guaranteed to be accurate for that version. I would
assume the results should be the same for 11g, as I don't think anything
significant has changed in how the Hash Join operation is costed, but
that would need to be verified.
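For reference, the version banner shown below can be obtained with a query such as:
select banner from v$version ;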
BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
PL/SQL Release 12.1.0.2.0 - Production
CORE 12.1.0.2.0 Production
TNS for Linux: Version 12.1.0.2.0 - Production
NLSRTL Version 12.1.0.2.0 - Production
Hash Join Overflow Formula and its Accuracy
I started by comparing the Hash Join cost when the number of columns in a data set changed, i.e. the size of the data set increased for the same number of rows. Jonathan Lewis states that the memory needed per column is the storage size of the column itself plus 2 additional bytes. I observed that the Hash Join cost changed by a factor of 0.0475 per byte per 1000 rows, and that this multiplying factor was the same under different row counts in either table in the query.
This only involves the columns needed by the query itself, which the Optimizer must extract and process, and not all of the columns in the table. In this case that means the columns referenced in the "SELECT" list and those referenced in the "WHERE" clause. The "storage size" is the number of bytes that Oracle uses to store that data on disk, which is not necessarily 1 byte per value or 1 byte per character or digit. For instance, a NUMBER is stored as 2 digits per byte.
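If you want to check the storage size of particular column values, the VSIZE function reports the number of bytes Oracle uses to store a given value. For example, using values like those in my test tables:
select vsize (1)      pid_bytes
     , vsize (123456) big_number_bytes
     , vsize ('ABCDEFGHIJKLMNOPQRSTUVWXYZ') filler1_bytes
from   dual ;
Remember to add the 2 byte per column overhead on top of these storage sizes when using the formula below.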
My other observation was that when only the row count in one table changed, the Hash Join cost changed by a factor of 0.5675 per 1000 rows. As this was a constant amount per row, I wondered if it was something to do with some extra data per row causing extra disk I/Os. And 0.5675 divided by 0.0475 gives 11.9474, which is almost 12, implying a 12 byte overhead per data row within the Hash Table.
Based on this, I arrived at the following formula for an overflowing Hash Join cost:
- ( ((Build Columns Size + 12) * Build Row Count) + ((Probe Columns Size + 12) * Probe Row Count) ) * 0.0475
Where the "Columns Size" is the sum of the hash table storage for each column used, i.e. data storage + 2 bytes per column, and the Row Counts are expressed in thousands of rows, consistent with the 0.0475 factor being per byte per 1000 rows.
I then checked the calculated costs from this formula against the series of test queries I had been using, and the resultant cost for the overflowing Hash Join came out almost the same in all cases. The percentage difference was under 1% in almost all cases, which I take to be a high degree of accuracy. The only anomalies are when the build data set is only just bigger than can fit in the PGA, but even then the difference is only about 3%. As the size of the build data set increased, the percentage difference decreased.
On the one hand it might not be surprising that my derived formula produces the same results from the same inputs that were used to create the formula in the first place. However, the queries I tested varied the number of columns being selected, the number of columns being joined on, and the number of rows in each table, and these would appear to cover the only variables relevant to this formula. In each of these cases the change in the reported Hash Join cost from Oracle was always a multiple of this fixed constant of 0.0475.
Other Factors
While I do believe that this formula is true and valid for the system I was testing on, it may not be true for all other systems. It is likely that the multiplying factor of 0.0475 will be different on other systems. Given that this additional cost for the overflowing Hash Join is due to the additional disk I/Os involved, it would seem likely that changes to the system statistics inside Oracle for disk read times would result in a change in the value of this multiplying factor. I will investigate this in my next series of tests.
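For reference, the system statistics that record disk read timings can be seen with a query like the following, run as a suitably privileged user:
select sname, pname, pval1
from   sys.aux_stats$
where  sname = 'SYSSTATS_MAIN' ;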
There may or may not
be some small "constant cost value" involved as well within the
formula, for some kind of constant overhead within the overflowing Hash
Join operation. This "constant cost value" would become negligible at
higher data volumes compared to the costs for the build and probe data
sets, but it might explain the slightly larger difference in calculated
cost at the smallest overflowing data set size.
There is also the concept of "one pass" and "multi-pass" hash joins within Oracle, as well as "optimal" hash joins. I don't understand the real difference between these, other than that "optimal" is when the hash table fits in memory and the other two are when it overflows to disk. It is possible that what I've seen has been the cost for "one pass" overflowing hash joins, and that for even larger data sets a "multi-pass" hash join would be used, which would involve a different cost formula.
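The instance-wide counts of how often each of these work area execution modes has occurred can be seen in V$SYSSTAT, which may at least show which modes are actually being used on a system:
select name, value
from   v$sysstat
where  name like 'workarea executions%' ;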
The SQL for the table and query
Here is the SQL to create one example table (they are all the same apart from name and row count) and the query used.
Create Table - run from SQL*Plus with 2 command line arguments, the row count and a table name suffix, e.g. "@crhjtab 1000000 100k".
create table hj&2
tablespace testdata
as
select r pid
, 1 one
, 2 two
, 3 three
, 4 four
, 5 five
, 10 ten
, trunc (r / 10) per10
, trunc (r / 100) per100
, mod (r, 10) mod10
, mod (r, 100) mod100
, mod (r, 1000) mod1000
, mod (r, 10000) mod10000
, mod (r, 100000) mod100000
, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' filler1
, 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' filler2
, 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' filler3
from (select rownum r
from (select rownum r from dual connect by level <= 1000) a,
(select rownum r from dual connect by level <= 1000) b,
(select rownum r from dual connect by level <= 1000) c
where rownum <= &1)
;
exec dbms_stats.gather_table_stats (user, upper ('hj&2') )
Query - run from SQL*Plus with 2 command line arguments, the table name suffixes, e.g. "@hj0 100k 200k".
select /* HashTest0 */ sum (hj1.one) sum_b1
from hj&1 hj1, hj&2 hj2
where hj1.pid = hj2.per10
and hj1.mod10 = hj2.mod100 ;
--
select * from table (dbms_xplan.display_cursor) ;
I've only shown one query here, as the others I used are almost the same but for the columns in the "SELECT" list. The variations of this query had different numbers of columns in the "SELECT" list, to increase the number of columns from the build and/or the probe tables.