It is quite clear that Oracle does not store NULL values within indexes
(specifically B-Tree indexes, the most common type used within Oracle).
This has been the case for many years - I remember reading about this in
the version 7 days over 15 years ago. My understanding was that this
was an optimization on Oracle's part, as they believed that NULL values
would not be used in most queries, and that storing NULL values in an
index might skew it if a great many rows held NULL values. There are
several consequences of this non-storage
of NULL values in an index, as I mentioned in a previous
post on NULL values.
But it seems that some people are still unaware of this issue,
believing that NULLs are stored within indexes and that, as a result,
all indexes behave the same. Not having any hard evidence to refer people to,
I thought I would invent some simple tests to show that NULLs are not
stored in indexes, and the various side effects of this. Also,
assumptions can be dangerous things when taken too far, and memory
plays tricks on what I once read in the past. So providing some real
test cases would either verify any assumptions I had made, or show
them up as being wrong.
Tests Overview
I'll create two simple
tables - one as a master table of codes (reference with 10,000 rows),
and another larger table that refers to the master table as a foreign
key (transaction with 1,000,000 rows). This means that there are 100
rows per Reference Identifier value in Transaction. The Transaction
table will have two such foreign key columns - one without NULLs
(ref_id_nn), and one with NULLs (ref_id_n).
Indexes will be
created on these columns, and some example queries run using either
reference column to show whether the indexes are being used or not. We
can also look at the statistics on the indexes themselves to tell us
something about whether NULL values are stored in them or not. I
provide all the SQL to create and populate these two tables at the end
of this post - I assume most people are interested in the results first.
Index Statistics
Immediately
we can see that NULL values are not stored in an index by looking at
the statistics for the indexes on the Transaction table.
col index_name heading 'Index' format a20
col lb heading 'Leaf|Blocks' format 999,999
col dk heading 'Distinct|Keys' format 999,999,999
col cf heading 'Clustering|Factor' format 999,999,999
col nr heading 'Num Rows' format 999,999,999
--
select i.index_name,
i.leaf_blocks lb,
i.num_rows nr,
i.distinct_keys dk,
i.clustering_factor cf
from user_ind_statistics i
where i.table_name = 'TRANSACTION'
/
The results of this query are:
Leaf Distinct Clustering
Index Blocks Num Rows Keys Factor
-------------------- -------- ------------ ------------ ------------
PK_TRANSACTION 1,875 1,000,000 1,000,000 13,147
IX1_TRANSACTION_NN 2,090 1,000,000 10,000 1,000,000
IX1_TRANSACTION_N 2,090 999,900 9,999 999,900
Look at the "Num Rows" and "Distinct Keys" values for the two indexes
on the two foreign key columns. The index on the column allowing NULLs
has one fewer distinct key value, because the NULL value has not been
stored in the index, whereas the index on the column without any NULLs
has the full 10,000 distinct key values in it. This is also reflected
in the number of rows covered by each index - 100 fewer for the index
on the column with NULL values than for the index on the column
without them. So already we have evidence that NULL values are not
stored within a B-Tree index, i.e. rows with a NULL key value are
simply omitted from the index.
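As a cross-check, the table-level statistics tell the same story. The following is a small sketch relying only on the standard USER_TAB_STATISTICS and USER_TAB_COL_STATISTICS views: the NUM_NULLS recorded for each column should account exactly for the rows missing from its index (0 and 1,000,000 for ref_id_nn, 100 and 999,900 for ref_id_n).
--
-- Rows in the table minus NULLs in the column = rows in that column's index
select c.column_name,
t.num_rows,
c.num_nulls,
t.num_rows - c.num_nulls indexed_rows
from user_tab_statistics t,
user_tab_col_statistics c
where t.table_name = 'TRANSACTION'
and c.table_name = t.table_name
and c.column_name in ('REF_ID_NN', 'REF_ID_N')
/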
Queries
In my previous blog post I made two claims about cases where an index on a column allowing NULL values will not be used:
- Such an index cannot be used to satisfy an "IS NULL" query constraint
- The index also cannot be used to satisfy a "column != value" query constraint
Let's test these claims. In principle we want to run the same query against
each of the two foreign key columns, and see whether the corresponding
index is used. However, this may not be possible for one reason or
another, as I'll explain in each case.
"IS NULL" Constraint
The query is:
select count (*) from transaction where ref_id_n is null ;
When executed the query has the following execution plan (as reported by
dbms_xplan.display_cursor immediately after executing the query):
SQL_ID cwm2cmgn8q09n, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null
Plan hash value: 185056574
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3578 (100)| |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| TRANSACTION | 1 | 4 | 3578 (1)| 00:00:43 |
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("REF_ID_N" IS NULL)
So the index was ignored and a full table scan was done, even though there were only 100 rows to retrieve.
Unfortunately
we cannot run this query using the other column, because it does not
allow NULL values. So the Optimizer will know that this query cannot
return any rows at all. (It actually did an Index Fast Full Scan when I
ran it, and returned a count of 0 rows). Instead we can change the
query to compare against a real value, rather than NULL, and see that
it now uses the index, even though it again retrieves only about 100
rows from the 1,000,000 in the table.
SQL_ID 7burxn278qj8b, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n = 5
Plan hash value: 1807025728
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3 (100)| |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | INDEX RANGE SCAN| IX1_TRANSACTION_N | 101 | 404 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("REF_ID_N"=5)
This shows that an index on a column allowing NULL values cannot be used when executing a query with an "IS NULL" constraint on that column.
"!=" Constraint
I've since realised that for a "not equals" constraint the Optimizer
would be likely to choose a Full Table Scan whether the column allowed
NULLs or not, simply because of the number of rows being returned. If
100 rows match a value, then 999,900 do not match it, and a Full Table
Scan will be the optimal access method regardless of whether the
column allows NULL values.
However, if the query only did a
"count (*)" and did not retrieve any data columns from the table, then
potentially an index could be used - with the Index Fast Full Scan
access method. Consider the following performed on each of the two
foreign key columns:
select count (*) from transaction where ref_id_nn != 5 ;
When I execute this I get 999,900 for the non-NULL column, and
999,800 for the NULL allowed column. Whoops! Something is not right
somewhere. Again, we are back to the point that NULL is not a "normal
value" within the value space of the data type of the column, and a
NULL value is treated by Oracle differently from just another value.
So although NULL is never equal to another real value, it is also
never not equal to another real value. Even if I force a full table
scan via the "full (transaction)" hint, I still get a count of 999,800
for the "ref_id_n != 5" constraint. (The execution plan from
dbms_xplan.display_cursor showed that "TABLE ACCESS FULL" was used on
TRANSACTION.) [This very point about "!=" was also mentioned in a
comment by Narenda on my previous post, and I had independently
stumbled upon it as a result of this series of tests.]
So even though we have not been able to demonstrate one of my claims
about NULL values and indexes, we have instead stumbled upon another
issue with NULL values:
A NULL value is never equal to a real value, and also never not equal to a real value.
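This can be seen in miniature with a one-row query (a sketch using CASE, which treats an UNKNOWN condition the same as FALSE):
--
-- Both comparisons evaluate to UNKNOWN, so both CASEs fall through to ELSE.
select case when cast (null as number) = 5 then 'EQUAL' else 'NOT TRUE' end eq,
case when cast (null as number) != 5 then 'NOT EQUAL' else 'NOT TRUE' end ne
from dual ;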
I am sure that I have seen Jonathan Lewis discuss this in a
presentation when talking about rewriting SQL statements into their
equivalents. If a column allows NULL values then you must be careful
when rewriting equality constraints into equivalent inequality
constraints.
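For this test the missing 100 rows can be brought back by handling the NULLs explicitly. Either of the following sketches should count all 999,900 rows that are genuinely not 5 (the NVL form assumes 0 is never a real ref_id_n value, which is true for the data generated below):
--
-- Spell out the NULL case explicitly ...
select count (*) from transaction where ref_id_n != 5 or ref_id_n is null ;
-- ... or map NULL to a value known never to be used in the column.
select count (*) from transaction where nvl (ref_id_n, 0) != 5 ;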
Further testing threw up another anomaly. Rather than test "!=" I
thought I would test "not in", contrived so that there is only one
value in the "not in" list. For the NULL-allowed column the Optimizer
chose a full table scan:
SQL> select count (*) from transaction where ref_id_n not in (select 5 from dual) ;
COUNT(*)
----------
999800
SQL> @xplastexec
SQL_ID 0f3nh5h11yu6p, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n not in (select 5 from dual)
Plan hash value: 297680891
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 23425 (100)| |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | FILTER | | | | | |
| 3 | TABLE ACCESS FULL| TRANSACTION | 998K| 3900K| 3578 (1)| 00:00:43 |
|* 4 | FILTER | | | | | |
| 5 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter( IS NULL)
4 - filter(LNNVL(:B1<>5))
While for the non-NULL column it used the index:
SQL> select count (*) from transaction where ref_id_nn not in (select 5 from dual) ;
COUNT(*)
----------
999900
SQL> @xplastexec
SQL_ID fw2svuam6vzb1, child number 0
-------------------------------------
select count (*) from transaction where ref_id_nn not in (select 5 from dual)
Plan hash value: 1550919231
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 20422 (100)| |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | FILTER | | | | | |
| 3 | INDEX FAST FULL SCAN| IX1_TRANSACTION_NN | 998K| 3900K| 575 (2)| 00:00:07 |
|* 4 | FILTER | | | | | |
| 5 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter( IS NULL)
4 - filter(LNNVL(:B1<>5))
So I think that this example does show that there are times when
the Optimizer will not use an index on a column allowing NULL values,
but would otherwise use the index if NULLs were not allowed.
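A related trap worth noting in passing, easily verified against the same table: if the "not in" list itself contains a NULL, the predicate can never evaluate to TRUE, and no rows are returned at all.
--
-- "not in (5, null)" expands to "!= 5 and != null"; the second comparison
-- is always UNKNOWN, so the whole predicate is never TRUE for any row.
select count (*) from transaction where ref_id_nn not in (5, null) ;
-- returns a count of 0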
NULL as "Anything"
Another
pattern I have seen is where a NULL value is used to mean "any value",
rather than "no value". So in a system that involves processing
transactions according to type, and each type is processed by a
different handler, then a NULL value means that the transaction can be
processed by any available handler. This leads to a query similar to
the following:
select ... from transaction where (ref_id_n = 5 or ref_id_n is NULL) ...
This query cannot use the index, because the NULL values are not stored in it, and hence you get the following execution plan:
SQL_ID 4tsn8vbt3s3gh, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n = 5 or ref_id_n is null
Plan hash value: 185056574
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3580 (100)| |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | TABLE ACCESS FULL| TRANSACTION | 101 | 404 | 3580 (1)| 00:00:43 |
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(("REF_ID_N"=5 OR "REF_ID_N" IS NULL))
However, if the database design were different, with the value '0'
used as a special case meaning "any handler" and a corresponding row
inserted into the "reference" table, then the query could be changed
to the following and Oracle would use the index on the column:
SQL_ID d2v9fgscm8nrq, child number 0
-------------------------------------
select count (*) from transaction where ref_id_nn = 5 or ref_id_nn = 0
Plan hash value: 1215059433
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 4 (100)| |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
| 2 | INLIST ITERATOR | | | | | |
|* 3 | INDEX RANGE SCAN| IX1_TRANSACTION_NN | 201 | 804 | 4 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access(("REF_ID_NN"=0 OR "REF_ID_NN"=5))
Again, the presence of NULL values results in sub-optimal query
execution, and a different design can provide a better solution,
allowing the Optimizer to use an index for much faster access (cost of 4
versus 3580).
Conclusion
For now I think I have written
enough in one post about NULLs and indexes. My main point was to
provide some real evidence that NULL values are not stored in B-Tree
indexes in Oracle, and that if NULL values are allowed in columns it can
affect whether the Optimizer will use an index or not. And I think I
have done that. I'm sure that people experienced with Oracle already
know this, but I just wanted to provide some proof points for anybody
who doubted it for any reason. As ever, the benefit of some real tests
is that you get to either verify any assumptions you have made or show
them up to be wrong and correct them. In this case I have been able to
correct some weak assumptions I had, and have learnt some more about how
the Oracle Optimizer works, specifically when handling NULL values.
Table Creation SQL
Here
is the SQL to create the tables, populate them with data and index
them. Note the two foreign key columns in "transaction", and that one
value is mapped to NULL when generating the data.
--
-- Create the tables necessary for the NULL values investigation
-- Create tables, load in generated data, index, gather statistics
--
prompt Creating tables
--
-- Reference table - a lookup of master values
--
create table reference (
ref_id number (*,0) not null,
group_id number (*,0) not null,
description varchar2 (512) not null,
constraint pk_reference primary key (ref_id)
) ;
--
-- Transaction table
--
create table transaction (
trans_id number (*,0) not null,
ref_id_nn number (*,0) not null,
ref_id_n number (*,0) ,
location_id number (*,0) not null,
padding varchar2 (512),
constraint pk_transaction primary key (trans_id)
) ;
--
alter table transaction add constraint fk_ref_nn
foreign key (ref_id_nn) references reference (ref_id) ;
--
alter table transaction add constraint fk_ref_n
foreign key (ref_id_n) references reference (ref_id) ;
--
prompt Loading data
--
-- SPECIAL DUMMY ROW!! Keep in sync with any newly added columns!!
insert into reference values (0, 0, 'Unknown - Dummy entry for referential integrity purposes') ;
--
insert into reference
with generator as
(select rownum r
from (select rownum r from dual connect by rownum <= 1000) a,
(select rownum r from dual connect by rownum <= 1000) b,
(select rownum r from dual connect by rownum <= 1000) c
where rownum <= 1000000
)
select r,
mod (r, 1000),
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' || to_char (r, '00000000')
from generator g
where g.r <= 10000
/
commit ;
--
insert into transaction
with generator as
(select rownum r
from (select rownum r from dual connect by rownum <= 1000) a,
(select rownum r from dual connect by rownum <= 1000) b,
(select rownum r from dual connect by rownum <= 1000) c
where rownum <= 1000000
)
select r,
mod (r, 10000),
decode (mod (r, 10000), 0, null, mod (r, 10000) + 1),
mod (r, 1999),
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789' || to_char (r, '000000000')
from generator g
where g.r <= 1000000
/
commit ;
--
prompt Creating indexes
--
create index ix1_transaction_nn on transaction (ref_id_nn) ;
create index ix1_transaction_n on transaction (ref_id_n) ;
--
prompt Gathering Statistics
--
exec dbms_stats.gather_schema_stats ('JB')
--
So there are 10,000 rows in Reference, and 1,000,000 rows in
Transaction. This means that there are 100 rows per Reference
Identifier value in Transaction.
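Finally, a quick sanity check on the generated data. The expected figures follow from the DECODE above, which maps one of the 10,000 values to NULL: 1,000,000 rows in total, 999,900 non-NULL, 100 NULL, and 9,999 distinct values.
--
-- count (ref_id_n) counts only the non-NULL values in the column.
select count (*) total_rows,
count (ref_id_n) non_null_rows,
count (*) - count (ref_id_n) null_rows,
count (distinct ref_id_n) distinct_values
from transaction ;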