Wednesday, 13 June 2012

NULLs and Indexes

It is quite clear that Oracle does not store NULL values within indexes (B-Tree indexes specifically, being the most common type used within Oracle). This has been the case for many years - I remember reading about this in the version 7 days over 15 years ago. My understanding was that this was an optimization on Oracle's part, as they believed that NULL values would not be used in most queries, and that storing NULL values in an index might skew the index somehow if there were a great many NULL values in rows. There are several consequences of this non-storage of NULL values in an index, as I mentioned in a previous post on NULL values. But it seems that some people are still unaware of this issue, believing that NULL's are stored within indexes, and that as a result all indexes are equal. Not having any hard evidence to refer people to, I thought I would devise some simple tests to show that NULL's are not stored in indexes, and the various side effects of this. Also, assumptions can be dangerous things when taken too far, and memory plays tricks over what I once read in the past. So providing some real test cases would either verify any assumptions I had made, or show them up as being wrong.

Tests Overview

I'll create two simple tables - one as a master table of codes (reference with 10,000 rows), and another larger table that refers to the master table as a foreign key (transaction with 1,000,000 rows). This means that there are 100 rows per Reference Identifier value in Transaction. The Transaction table will have two such foreign key columns - one without NULL's (ref_id_nn), and one with NULL's (ref_id_n).

Indexes will be created on these columns, and some example queries run using either reference column to show whether the indexes are being used or not. We can also look at the statistics on the indexes themselves to tell us something about whether NULL values are stored in them or not. I provide all the SQL to create and populate these two tables at the end of this post - I assume most people are interested in the results first.

Index Statistics

Immediately we can see that NULL values are not stored in an index by looking at the statistics for the indexes on the Transaction table.

col index_name heading 'Index'     format a20
col lb heading 'Leaf|Blocks'       format 999,999     
col dk heading 'Distinct|Keys'     format 999,999,999 
col cf heading 'Clustering|Factor' format 999,999,999 
col nr heading 'Num Rows'          format 999,999,999 
--
select i.index_name,
       i.leaf_blocks lb,
       i.num_rows nr,
       i.distinct_keys dk,
       i.clustering_factor cf
  from user_ind_statistics i
 where i.table_name = 'TRANSACTION'
/
The results of this query are:
                         Leaf                  Distinct   Clustering
Index                  Blocks     Num Rows         Keys       Factor
-------------------- -------- ------------ ------------ ------------
PK_TRANSACTION          1,875    1,000,000    1,000,000       13,147
IX1_TRANSACTION_NN      2,090    1,000,000       10,000    1,000,000
IX1_TRANSACTION_N       2,090      999,900        9,999      999,900

Look at the "Number of Rows" values and the "Number of Distinct Keys" values for the two indexes on the two foreign key columns. The index on the column allowing NULL's has one less distinct key value - because the NULL value has not been stored in the index. Whereas the index on the column without any NULL's has the full 10,000 distinct key values in it. This is also reflected in the number of rows covered by the index - 100 less for the index with NULL values than for the index without NULL values. So already we have evidence that NULL values are not stored within a B-Tree index i.e. they are ignored by the index itself.

Queries

In my previous blog post I made two claims about when indexes on columns allowing NULL values cannot be used:
  • Such an index cannot be used to satisfy an "IS NULL" query constraint
  • The index also cannot be used to satisfy a "column != value" query constraint
Let's test these claims. In principle we want to run the same query against each of the two foreign key columns, and see whether the corresponding index is used. However, this may not be possible for one reason or another, as I'll explain in each case.

"IS NULL" Constraint

The query is:
select count (*) from transaction where ref_id_n is null ;
When executed the query has the following execution plan (as from dbms_xplan.display_cursor immediately after executing the query):
SQL_ID  cwm2cmgn8q09n, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 185056574

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |  3578 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION |     1 |     4 |  3578   (1)| 00:00:43 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("REF_ID_N" IS NULL)
So the index was ignored and a full table scan was done, even though there were only 100 rows to retrieve.

Unfortunately we cannot run this query using the other column, because it does not allow NULL values, so the Optimizer knows that the query cannot return any rows at all. (It actually did an Index Fast Full Scan when I ran it, and returned a count of 0.) Instead we can change the query to compare against a real value rather than NULL, and see that it now uses the index, even though it still retrieves only 100 rows from the 1,000,000 in the table.
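The modified query is:
select count (*) from transaction where ref_id_n = 5 ;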
SQL_ID  7burxn278qj8b, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n = 5

Plan hash value: 1807025728

---------------------------------------------------------------------------------------
| Id  | Operation         | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                   |       |       |     3 (100)|          |
|   1 |  SORT AGGREGATE   |                   |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| IX1_TRANSACTION_N |   101 |   404 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("REF_ID_N"=5)
This shows that a single-column index on a column allowing NULL values cannot be used when executing a query with an "IS NULL" constraint on that column, because the NULL rows have no entries in the index.

"!=" Constraint

I've since realised that for a "not equals" constraint the Optimizer would be likely to choose a Full Table Scan whether the column allowed NULL values or not, simply because of the number of rows being returned. If 100 rows match a value, then 999,900 do not match that value, and a Full Table Scan will be the optimal access method regardless of whether the column allows NULL values.

However, if the query only did a "count (*)" and did not retrieve any data columns from the table, then potentially an index could be used - with the Index Fast Full Scan access method. Consider the following performed on each of the two foreign key columns:
select count (*) from transaction where ref_id_nn != 5 ;
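and the equivalent against the NULL-allowed column:
select count (*) from transaction where ref_id_n != 5 ;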
When I execute these I get 999,900 for the non-NULL column, and 999,800 for the NULL-allowed column. Whoops! Something is not right somewhere. Again, we are back to the point that NULL is not a "normal value" within the value space of the data type of the column. A NULL value is treated differently by Oracle; it is not just another value. So although NULL is never equal to another real value, it is also never not equal to another real value. Even if I force a full table scan via the "full (transaction)" hint, I still get a count of 999,800 for the "ref_id_n != 5" constraint. (The execution plan from dbms_xplan.display_cursor showed that "TABLE ACCESS FULL" was used on TRANSACTION.) [This very point about "!=" was also mentioned in a comment by Narenda on my previous post, and I had independently stumbled upon this as a result of this series of tests.]

So even though we have not been able to show one of my claims about NULL values and indexes, we have instead stumbled upon another issue with NULL values. A NULL value is never equal to a real value, and also never not equal to a real value. I am sure that I have seen Jonathan Lewis discuss this in a presentation when talking about rewriting SQL statements into their equivalents. If a column allows NULL values then you must be careful when rewriting equality constraints into equivalent inequality constraints.
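The following pair of queries illustrates the point (I did not run these as part of the original tests, but the counts follow directly from the figures above):
select count (*) from transaction where ref_id_n = 5 or ref_id_n != 5 ;
-- returns 999,900 and not 1,000,000, because the 100 NULL rows satisfy neither predicate
select count (*) from transaction where ref_id_n != 5 or ref_id_n is null ;
-- returns 999,900 - the NULL rows are only included when tested for explicitly with IS NULL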

Further testing threw up another anomaly. Rather than test "!=" I thought I would test "not in", but contrived it so that there is only one value in the "not in" list. For the NULL-allowed column the Optimizer chose a full table scan:
SQL> select count (*) from transaction where ref_id_n not in (select 5 from dual) ;

  COUNT(*)
----------
    999800

SQL> @xplastexec

SQL_ID  0f3nh5h11yu6p, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n not in (select 5 from dual)

Plan hash value: 297680891

-----------------------------------------------------------------------------------
| Id  | Operation           | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |             |       |       | 23425 (100)|          |
|   1 |  SORT AGGREGATE     |             |     1 |     4 |            |          |
|*  2 |   FILTER            |             |       |       |            |          |
|   3 |    TABLE ACCESS FULL| TRANSACTION |   998K|  3900K|  3578   (1)| 00:00:43 |
|*  4 |    FILTER           |             |       |       |            |          |
|   5 |     FAST DUAL       |             |     1 |       |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NULL)
   4 - filter(LNNVL(:B1<>5))

While for the non-NULL column it used the index:
SQL> select count (*) from transaction where ref_id_nn not in (select 5 from dual) ;

  COUNT(*)
----------
    999900

SQL> @xplastexec

SQL_ID  fw2svuam6vzb1, child number 0
-------------------------------------
select count (*) from transaction where ref_id_nn not in (select 5 from dual)

Plan hash value: 1550919231

---------------------------------------------------------------------------------------------
| Id  | Operation              | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                    |       |       | 20422 (100)|          |
|   1 |  SORT AGGREGATE        |                    |     1 |     4 |            |          |
|*  2 |   FILTER               |                    |       |       |            |          |
|   3 |    INDEX FAST FULL SCAN| IX1_TRANSACTION_NN |   998K|  3900K|   575   (2)| 00:00:07 |
|*  4 |    FILTER              |                    |       |       |            |          |
|   5 |     FAST DUAL          |                    |     1 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NULL)
   4 - filter(LNNVL(:B1<>5))
So I think that this example does show that there are times when the Optimizer will not use an index on a column allowing NULL values, but would otherwise use an index if NULL's were not allowed.
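One rewrite worth trying (I have not re-run the test with it, so treat this as a sketch) is to exclude the NULL rows explicitly. The result is unchanged, since the NULL rows never satisfy the "not in" anyway, but the extra predicate tells the Optimizer that every qualifying row has an entry in the index, so the Index Fast Full Scan used for the non-NULL column becomes a legal access path here too:
select count (*) from transaction
 where ref_id_n not in (select 5 from dual)
   and ref_id_n is not null ;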

NULL as "Anything"

Another pattern I have seen is where a NULL value is used to mean "any value", rather than "no value". For example, in a system that processes transactions according to type, where each type is processed by a different handler, a NULL value means that the transaction can be processed by any available handler. This leads to a query similar to the following:
select ... from transaction where (ref_id_n = 5 or ref_id_n is NULL) ...
This query cannot use the index, because the NULL values are not stored in it. Hence you get an execution plan of:
SQL_ID  4tsn8vbt3s3gh, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n = 5 or ref_id_n is null

Plan hash value: 185056574

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |  3580 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION |   101 |   404 |  3580   (1)| 00:00:43 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("REF_ID_N"=5 OR "REF_ID_N" IS NULL))
However, if the database design were different, and the value '0' were used as a special case meaning "any handler", with a corresponding row inserted into the "reference" table, then the query could be changed to the following and Oracle would use the index on the column:
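select count (*) from transaction where ref_id_nn = 5 or ref_id_nn = 0 ;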
SQL_ID  d2v9fgscm8nrq, child number 0
-------------------------------------
select count (*) from transaction where ref_id_nn = 5 or ref_id_nn = 0

Plan hash value: 1215059433

-----------------------------------------------------------------------------------------
| Id  | Operation          | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                    |       |       |     4 (100)|          |
|   1 |  SORT AGGREGATE    |                    |     1 |     4 |            |          |
|   2 |   INLIST ITERATOR  |                    |       |       |            |          |
|*  3 |    INDEX RANGE SCAN| IX1_TRANSACTION_NN |   201 |   804 |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access(("REF_ID_NN"=0 OR "REF_ID_NN"=5))
Again, the presence of NULL values results in sub-optimal query execution, and a different design can provide a better solution, allowing the Optimizer to use an index for much faster access (cost of 4 versus 3580).

Conclusion

For now I think I have written enough in one post about NULL's and indexes. My main point was to provide some real evidence that NULL values are not stored in B-Tree indexes in Oracle, and that if NULL values are allowed in columns it can affect whether the Optimizer will use an index or not. And I think I have done that. I'm sure that people experienced with Oracle already know this, but I just wanted to provide some proof points for anybody who doubted it for any reason. As ever, the benefit of some real tests is that you get to either verify any assumptions you have made or show them up to be wrong and correct them. In this case I have been able to correct some weak assumptions I had, and have learnt some more about how the Oracle Optimizer works, specifically when handling NULL values.

Table Creation SQL

Here is the SQL to create the tables, populate them with data and index them. Note the two foreign key columns in "transaction", and that one value is mapped to NULL when generating the data.
--
-- Create the tables necessary for the NULL values investigation
-- Create tables, load in generated data, index, gather statistics
--
prompt Creating tables
--
-- Reference table - a lookup of master values
--
create table reference (
ref_id number (*,0) not null,
group_id number (*,0) not null,
description varchar2 (512) not null,
constraint pk_reference primary key (ref_id)
) ;
--
-- Transaction table
--
create table transaction (
trans_id number (*,0) not null,
ref_id_nn number (*,0) not null,
ref_id_n number (*,0) ,
location_id number (*,0) not null,
padding varchar2 (512),
constraint pk_transaction primary key (trans_id)
) ;
--
alter table transaction add constraint fk_ref_nn 
foreign key (ref_id_nn) references reference (ref_id) ;
--
alter table transaction add constraint fk_ref_n
foreign key (ref_id_n) references reference (ref_id) ;
--
prompt Loading data
--
-- SPECIAL DUMMY ROW!! Keep in sync with any newly added columns!*!
insert into reference values (0, 0, 'Unknown - Dummy entry for referential integrity purposes') ;
--
insert into reference
with generator as 
(select rownum r
          from (select rownum r from dual connect by rownum <= 1000) a,
               (select rownum r from dual connect by rownum <= 1000) b,
               (select rownum r from dual connect by rownum <= 1000) c
         where rownum <= 1000000
       ) 
select r, 
    mod (r, 1000),
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' || to_char (r, '00000000')
  from generator g
where g.r <= 10000
/
commit ;
--
insert into transaction
with generator as 
(select rownum r
          from (select rownum r from dual connect by rownum <= 1000) a,
               (select rownum r from dual connect by rownum <= 1000) b,
               (select rownum r from dual connect by rownum <= 1000) c
         where rownum <= 1000000
       ) 
select r, 
    mod (r, 10000),
    decode (mod (r, 10000), 0, null, mod (r, 10000) + 1),
    mod (r, 1999),
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789' || to_char (r, '000000000')
  from generator g
where g.r <= 1000000
/
commit ;
--
prompt Creating indexes
--
create index ix1_transaction_nn on transaction (ref_id_nn) ;
create index ix1_transaction_n  on transaction (ref_id_n) ;
--
prompt Gathering Statistics
--
exec dbms_stats.gather_schema_stats ('JB') 
--
So there are 10,000 generated rows in Reference (10,001 including the special dummy row), and 1,000,000 rows in Transaction. This means that there are 100 rows per Reference Identifier value in Transaction.
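As an optional sanity check on the generated data (these queries are not part of the original script), the expected distribution can be confirmed directly:
-- every ref_id_nn value should appear exactly 100 times, so this should return no rows
select ref_id_nn, count (*)
  from transaction
 group by ref_id_nn
having count (*) != 100 ;
--
-- and the NULL-allowing column should have exactly 100 NULL rows
select count (*) from transaction where ref_id_n is null ;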

2 comments:

Anonymous said...

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
| Id  | Operation        | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT |        |       |       |     1 (100)|          |
|*  1 |  INDEX RANGE SCAN| T1_IDX |     4 |    20 |     1   (0)| 00:00:15 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("N1">2 AND "N2" IS NULL)
       filter("N2" IS NULL)

MrTree said...

The previous poster shows that indexes CAN be used to find NULLs in some circumstances. I'll do it here as well using your tables.

SQL> create index ix1_transaction_n_loc on transaction (ref_id_n,location_id);

Index created.

SQL> select count(*) from transaction where ref_id_n is null;

Execution Plan
----------------------------------------------------------
Plan hash value: 1786339511

-------------------------------------------------------------------------------------------
| Id  | Operation          | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                       |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                       |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN | IX1_TRANSACTION_N_LOC |   100 |   400 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("REF_ID_N" IS NULL)