Wednesday 20 June 2012

NULLs & Multi-Column Indexes

Previously I showed some of the issues with allowing NULL values in columns, and how indexes might be ignored by the Optimizer. All of the examples I gave involved single column indexes only, and showed that NULL values are not stored in such an index. As others have pointed out in comments, this is not necessarily true for multi-column indexes. Oracle stores an index entry as long as at least one of the indexed column values is not NULL, so if at least one of the columns in the index does not allow NULL values then an entry is stored with the values of all of the indexed columns. This means that a NULL value can be stored in a multi-column index. Let's test that through an example.

Previously we saw that if we had an index on the ref_id_n column in the transaction table, which allowed NULL values, then the Optimizer would not use the index for an "IS NULL" constraint:
SQL_ID  cwm2cmgn8q09n, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 185056574

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |  3578 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION |     1 |     4 |  3578   (1)| 00:00:43 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("REF_ID_N" IS NULL)
The index on ref_id_n is being ignored, as the NULL values are not stored in that index. However, if we create another index, with ref_id_n as the first column followed by another column that does not allow NULL values, then the index will contain entries where ref_id_n is NULL. The Optimizer can then use this index for an "IS NULL" constraint:
SQL> create index ix2_transaction on transaction (ref_id_n, ref_id_nn) ;

Index created.

SQL> select count (*) from transaction where ref_id_n is null ;

  COUNT(*)
----------
       100

SQL> @xplastexec

SQL_ID  8tubzdty7vdnv, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 176942238

-------------------------------------------------------------------------------------
| Id  | Operation         | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                 |       |       |     2 (100)|          |
|   1 |  SORT AGGREGATE   |                 |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| IX2_TRANSACTION |     1 |     4 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("REF_ID_N" IS NULL)
So multi-column indexes can store NULL values in them, provided at least one column does not allow NULL values. The benefit of such an index can be significant - in this test case the cost came down from 3,578 to 2, simply because the number of NULL rows was so few. But an index on only the ref_id_n column itself is of no use for this query, and is ignored by the Optimizer.
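We can check this directly from the index statistics (a sketch of the expected outcome - the figure is deduced from the test data rather than captured from this run). Because ref_id_nn is NOT NULL, every one of the 1,000,000 rows should have an entry in the new index, including the 100 rows where ref_id_n is NULL:
select i.index_name, i.num_rows, i.distinct_keys
  from user_ind_statistics i
 where i.index_name = 'IX2_TRANSACTION'
/
-- expected: num_rows of 1,000,000, not the 999,900 seen for the single column index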

Potentially you can also gain some benefit from an index where ref_id_n is not the leading column, as the index may be smaller than the table, and the Optimizer may choose an Index Fast Full Scan rather than a Full Table Scan. And that is the case with the test data set I have been using:
SQL> drop index ix2_transaction ;

Index dropped.

SQL> create index ix2_transaction on transaction (ref_id_nn, ref_id_n) ;

Index created.

SQL> select count (*) from transaction where ref_id_n is null;

  COUNT(*)
----------
       100

SQL> @xplastexec

SQL_ID  2j64r2n1nq4xm, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 1095380460

-----------------------------------------------------------------------------------------
| Id  | Operation             | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                 |       |       |   726 (100)|          |
|   1 |  SORT AGGREGATE       |                 |     1 |     4 |            |          |
|*  2 |   INDEX FAST FULL SCAN| IX2_TRANSACTION |     1 |     4 |   726   (1)| 00:00:09 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("REF_ID_N" IS NULL)

The Index Fast Full Scan is costed at 726, compared to the 3,578 of the Full Table Scan. Because the query is only a "count (*)", no other data columns are needed from the table itself, so the smaller index can satisfy the whole query.
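Note that this works only because every column the query references is present in the index. As a sketch (not run as part of these tests), if the query also needed a column that is not in the index, such as location_id, we would expect the Optimizer to revert to a Full Table Scan:
-- location_id is not in ix2_transaction, so the index alone cannot
-- answer this query - a Full Table Scan would be expected instead.
select count (*), max (location_id)
  from transaction
 where ref_id_n is null ;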

So if you are allowing NULL values in columns within your database design, and you want to find those rows that have a NULL value stored, then you cannot use an index on just that column alone. You will need a multi-column index that includes another column that does not allow NULL values. However, it may be that the only columns that do not allow NULL values in your design are the primary key columns themselves, if you simply allow NULLs for every column by default.
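If your design has no suitable NOT NULL column to pair with, one commonly used alternative (a sketch only - not part of the tests above) is to append a literal constant as the second indexed expression. A constant is never NULL, so every row gets an index entry, and the constant adds almost nothing to the index size:
-- The constant 0 guarantees at least one non-NULL value in every index
-- entry, so rows where ref_id_n is NULL are still stored in the index.
create index ix3_transaction on transaction (ref_id_n, 0) ;
The Optimizer should then be able to use such an index for an "IS NULL" constraint on ref_id_n, in the same way as ix2_transaction above.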

Wednesday 13 June 2012

NULLs and Indexes

It is quite clear that Oracle does not store NULL values within indexes (B-Tree indexes specifically, being the most common type used within Oracle). This has been the case for many years - I remember reading about this in the version 7 days, over 15 years ago. My understanding was that this was an optimization on Oracle's part, as they believed that NULL values would not be used in most queries, and that storing NULL values in an index might skew the index somehow if there were a great many NULL values in the rows. There are several consequences of this non-storage of NULL values in an index, as I mentioned in a previous post on NULL values. But it seems that some people are still unaware of this issue, believing that NULLs are stored within indexes, and that as a result all indexes are equal. Not having any hard evidence to refer people to, I thought I would invent some simple tests to show that NULLs are not stored in indexes, and the various side effects of this. Also, assumptions can be dangerous things when taken too far, and memory plays tricks over what was read in the past. So providing some real test cases would either verify any assumptions I had made, or show them up as being wrong.

Tests Overview

I'll create two simple tables - one as a master table of codes (reference, with 10,000 rows), and another, larger table that refers to the master table via a foreign key (transaction, with 1,000,000 rows). This means that there are 100 rows per Reference Identifier value in Transaction. The Transaction table will have two such foreign key columns - one without NULLs (ref_id_nn), and one with NULLs (ref_id_n).

Indexes will be created on these columns, and some example queries run using either reference column to show whether the indexes are being used or not. We can also look at the statistics on the indexes themselves to tell us something about whether NULL values are stored in them or not. I provide all the SQL to create and populate these two tables at the end of this post - I assume most people are interested in the results first.

Index Statistics

Immediately we can see that NULL values are not stored in an index by looking at the statistics for the indexes on the Transaction table.

col index_name heading 'Index'     format a20
col lb heading 'Leaf|Blocks'       format 999,999     
col dk heading 'Distinct|Keys'     format 999,999,999 
col cf heading 'Clustering|Factor' format 999,999,999 
col nr heading 'Num Rows'          format 999,999,999 
--
select i.index_name,
       i.leaf_blocks lb,
       i.num_rows nr,
       i.distinct_keys dk,
       i.clustering_factor cf
  from user_ind_statistics i
 where i.table_name = 'TRANSACTION'
/
The results of this query are:
                         Leaf                  Distinct   Clustering
Index                  Blocks     Num Rows         Keys       Factor
-------------------- -------- ------------ ------------ ------------
PK_TRANSACTION          1,875    1,000,000    1,000,000       13,147
IX1_TRANSACTION_NN      2,090    1,000,000       10,000    1,000,000
IX1_TRANSACTION_N       2,090      999,900        9,999      999,900

Look at the "Number of Rows" values and the "Number of Distinct Keys" values for the two indexes on the two foreign key columns. The index on the column allowing NULL's has one less distinct key value - because the NULL value has not been stored in the index. Whereas the index on the column without any NULL's has the full 10,000 distinct key values in it. This is also reflected in the number of rows covered by the index - 100 less for the index with NULL values than for the index without NULL values. So already we have evidence that NULL values are not stored within a B-Tree index i.e. they are ignored by the index itself.

Queries

In my previous blog post I made two claims about indexes on columns allowing NULL values not being used:
  • Such an index cannot be used to satisfy an "IS NULL" query constraint
  • The index also cannot be used to satisfy a "column != value" query constraint
Let's test these claims. In principle we want to run the same query against each of the two foreign key columns, and see whether the corresponding index is used. However, this may not be possible for one reason or another, as I'll explain in each case.

"IS NULL" Constraint

The query is:
select count (*) from transaction where ref_id_n is null ;
When executed the query has the following execution plan (as from dbms_xplan.display_cursor immediately after executing the query):
SQL_ID  cwm2cmgn8q09n, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 185056574

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |  3578 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION |     1 |     4 |  3578   (1)| 00:00:43 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("REF_ID_N" IS NULL)
So the index was ignored and a full table scan was done, even though there were only 100 rows to retrieve.

Unfortunately we cannot run this query using the other column, because it does not allow NULL values, so the Optimizer knows that the query cannot return any rows at all. (It actually did an Index Fast Full Scan when I ran it, and returned a count of 0 rows.) Instead we can change the query to compare to a real value, rather than NULL, and see that it now uses the index for the execution, even though it again retrieves only 100 rows from the 1,000,000 in the table:
SQL_ID  7burxn278qj8b, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n = 5

Plan hash value: 1807025728

---------------------------------------------------------------------------------------
| Id  | Operation         | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                   |       |       |     3 (100)|          |
|   1 |  SORT AGGREGATE   |                   |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| IX1_TRANSACTION_N |   101 |   404 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("REF_ID_N"=5)
This shows that an index on a column allowing NULL values cannot be used when executing a query with an "IS NULL" constraint on that column.
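One further check worth sketching (not part of the original run): even explicitly hinting the single column index should not help here, because the NULL rows are simply not in that index, and the Optimizer will not use an index that cannot return the correct result:
-- We would still expect a Full Table Scan despite the index hint.
select /*+ index (t ix1_transaction_n) */ count (*)
  from transaction t
 where ref_id_n is null ;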

"!=" Constraint

I've since realised that for a "not equals" constraint the Optimizer would be likely to choose a Full Table Scan whether the column allowed NULLs or not, simply because of the number of rows being returned: if 100 rows match a value, then 999,900 do not match it. A Full Table Scan will be the optimal access method for retrieving that many rows, regardless of whether the column allows NULL values.

However, if the query only did a "count (*)" and did not retrieve any data columns from the table, then potentially an index could be used - with the Index Fast Full Scan access method. Consider the following performed on each of the two foreign key columns:
select count (*) from transaction where ref_id_nn != 5 ;
When I execute this I get 999,900 for the non-NULL column, but 999,800 for the NULL-allowed column. Whoops! Something is not right somewhere. Again, we are back to the point that NULL is not a "normal value" within the value space of the data type of the column, and that a NULL value is treated differently by Oracle, not as just another value. So although NULL is never equal to another real value, it is also never not equal to another real value. Even if I force a full table scan via the "full (transaction)" hint, I still get a count of 999,800 for the "ref_id_n != 5" constraint. (The execution plan from dbms_xplan.display_cursor showed that "TABLE ACCESS FULL" was used on TRANSACTION). [This very point about "!=" was also mentioned in a comment by Narenda on my previous post; I had independently stumbled upon it as a result of this series of tests.]

So even though we have not been able to show one of my claims about NULL values and indexes, we have instead stumbled upon another issue with NULL values. A NULL value is never equal to a real value, and also never not equal to a real value. I am sure that I have seen Jonathan Lewis discuss this in a presentation when talking about rewriting SQL statements into their equivalents. If a column allows NULL values then you must be careful when rewriting equality constraints into equivalent inequality constraints.
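For completeness, here is a sketch of the kind of NULL-safe rewrite needed (not part of the original test run; the expected counts follow from the data). To make a "not equals" on the NULL-allowed column cover all rows that do not have the value 5, the NULL rows must be requested explicitly:
-- NULL rows fail both "= 5" and "!= 5", so they must be included separately.
select count (*) from transaction
 where ref_id_n != 5 or ref_id_n is null ;
-- expected: 999,900 (999,800 non-matching real values plus 100 NULLs)
--
-- lnnvl returns TRUE when its condition is FALSE or unknown, giving an
-- equivalent, more compact form of the same predicate.
select count (*) from transaction
 where lnnvl (ref_id_n = 5) ;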

Further testing threw up another anomaly. Rather than test "!=" I thought I would test "not in", contrived so that there is only one value in the "not in" list. For the NULL-allowed column the Optimizer chose a full table scan:
SQL> select count (*) from transaction where ref_id_n not in (select 5 from dual) ;

  COUNT(*)
----------
    999800

SQL> @xplastexec

SQL_ID  0f3nh5h11yu6p, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n not in (select 5 from dual)

Plan hash value: 297680891

-----------------------------------------------------------------------------------
| Id  | Operation           | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |             |       |       | 23425 (100)|          |
|   1 |  SORT AGGREGATE     |             |     1 |     4 |            |          |
|*  2 |   FILTER            |             |       |       |            |          |
|   3 |    TABLE ACCESS FULL| TRANSACTION |   998K|  3900K|  3578   (1)| 00:00:43 |
|*  4 |    FILTER           |             |       |       |            |          |
|   5 |     FAST DUAL       |             |     1 |       |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NULL)
   4 - filter(LNNVL(:B1<>5))

While for the non-NULL column it used the index:
SQL> select count (*) from transaction where ref_id_nn not in (select 5 from dual) ;

  COUNT(*)
----------
    999900

SQL> @xplastexec

SQL_ID  fw2svuam6vzb1, child number 0
-------------------------------------
select count (*) from transaction where ref_id_nn not in (select 5 from dual)

Plan hash value: 1550919231

---------------------------------------------------------------------------------------------
| Id  | Operation              | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                    |       |       | 20422 (100)|          |
|   1 |  SORT AGGREGATE        |                    |     1 |     4 |            |          |
|*  2 |   FILTER               |                    |       |       |            |          |
|   3 |    INDEX FAST FULL SCAN| IX1_TRANSACTION_NN |   998K|  3900K|   575   (2)| 00:00:07 |
|*  4 |    FILTER              |                    |       |       |            |          |
|   5 |     FAST DUAL          |                    |     1 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NULL)
   4 - filter(LNNVL(:B1<>5))
So I think that this example does show that there are times when the Optimizer will not use an index on a column allowing NULL values, but would otherwise use an index if NULLs were not allowed.

NULL as "Anything"

Another pattern I have seen is where a NULL value is used to mean "any value", rather than "no value". In a system that processes transactions according to type, with each type handled by a different handler, a NULL value then means that the transaction can be processed by any available handler. This leads to a query similar to the following:
select ... from transaction where (ref_id_n = 5 or ref_id_n is NULL) ...
The execution plan for this cannot use the index, because NULL values are not stored in it. Hence you get an execution plan of:
SQL_ID  4tsn8vbt3s3gh, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n = 5 or ref_id_n is null

Plan hash value: 185056574

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |  3580 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION |   101 |   404 |  3580   (1)| 00:00:43 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("REF_ID_N"=5 OR "REF_ID_N" IS NULL))
However, if the database design were different, with the value 0 used as a special case meaning "any handler" and a corresponding row inserted into the "reference" table, then the query could be changed to the following and Oracle would use the index on the column:
SQL_ID  d2v9fgscm8nrq, child number 0
-------------------------------------
select count (*) from transaction where ref_id_nn = 5 or ref_id_nn = 0

Plan hash value: 1215059433

-----------------------------------------------------------------------------------------
| Id  | Operation          | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                    |       |       |     4 (100)|          |
|   1 |  SORT AGGREGATE    |                    |     1 |     4 |            |          |
|   2 |   INLIST ITERATOR  |                    |       |       |            |          |
|*  3 |    INDEX RANGE SCAN| IX1_TRANSACTION_NN |   201 |   804 |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access(("REF_ID_NN"=0 OR "REF_ID_NN"=5))
Again, the presence of NULL values results in sub-optimal query execution, and a different design can provide a better solution, allowing the Optimizer to use an index for much faster access (cost of 4 versus 3580).
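If redesigning the data is not an option, a function-based index is another possible approach (a sketch only - I have not tested this against this data set). Indexing nvl (ref_id_n, 0) stores a real value for every row, and the query must then use the same expression for the index to be usable:
-- Every row gets an index entry because nvl () replaces NULL with 0.
create index ix4_transaction on transaction (nvl (ref_id_n, 0)) ;
--
select count (*) from transaction
 where nvl (ref_id_n, 0) in (0, 5) ;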

Conclusion

For now I think I have written enough in one post about NULLs and indexes. My main point was to provide some real evidence that NULL values are not stored in B-Tree indexes in Oracle, and that if NULL values are allowed in columns it can affect whether the Optimizer will use an index or not. And I think I have done that. I'm sure that people experienced with Oracle already know this, but I just wanted to provide some proof points for anybody who doubted it for any reason. As ever, the benefit of some real tests is that you get to either verify the assumptions you have made, or show them up as wrong and correct them. In this case I have been able to correct some weak assumptions I had, and have learnt some more about how the Oracle Optimizer works, specifically when handling NULL values.

Table Creation SQL

Here is the SQL to create the tables, populate them with data and index them. Note the two foreign key columns in "transaction", and that one value is mapped to NULL when generating the data.
--
-- Create the tables necessary for the NULL values investigation
-- Create tables, load in generated data, index, gather statistics
--
prompt Creating tables
--
-- Reference table - a lookup of master values
--
create table reference (
ref_id number (*,0) not null,
group_id number (*,0) not null,
description varchar2 (512) not null,
constraint pk_reference primary key (ref_id)
) ;
--
-- Transaction table
--
create table transaction (
trans_id number (*,0) not null,
ref_id_nn number (*,0) not null,
ref_id_n number (*,0) ,
location_id number (*,0) not null,
padding varchar2 (512),
constraint pk_transaction primary key (trans_id)
) ;
--
alter table transaction add constraint fk_ref_nn 
foreign key (ref_id_nn) references reference (ref_id) ;
--
alter table transaction add constraint fk_ref_n
foreign key (ref_id_n) references reference (ref_id) ;
--
prompt Loading data
--
-- SPECIAL DUMMY ROW!! Keep in sync with any newly added columns!*!
insert into reference values (0, 0, 'Unknown - Dummy entry for referential integrity purposes') ;
--
insert into reference
with generator as 
(select rownum r
          from (select rownum r from dual connect by rownum <= 1000) a,
               (select rownum r from dual connect by rownum <= 1000) b,
               (select rownum r from dual connect by rownum <= 1000) c
         where rownum <= 1000000
       ) 
select r, 
    mod (r, 1000),
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' || to_char (r, '00000000')
  from generator g
where g.r <= 10000
/
commit ;
--
insert into transaction
with generator as 
(select rownum r
          from (select rownum r from dual connect by rownum <= 1000) a,
               (select rownum r from dual connect by rownum <= 1000) b,
               (select rownum r from dual connect by rownum <= 1000) c
         where rownum <= 1000000
       ) 
select r, 
    mod (r, 10000),
    decode (mod (r, 10000), 0, null, mod (r, 10000) + 1),
    mod (r, 1999),
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789' || to_char (r, '000000000')
  from generator g
where g.r <= 1000000
/
commit ;
--
prompt Creating indexes
--
create index ix1_transaction_nn on transaction (ref_id_nn) ;
create index ix1_transaction_n  on transaction (ref_id_n) ;
--
prompt Gathering Statistics
--
exec dbms_stats.gather_schema_stats ('JB') 
--
So there are 10,000 generated rows in Reference (plus the special dummy row), and 1,000,000 rows in Transaction. This means that there are 100 rows per Reference Identifier value in Transaction.

Sunday 10 June 2012

Avoiding NULLs

On the one hand I understand the added value of the concept of a "NULL" value - meaning that no value has been stored in a column in a row in a database table. On the other hand I have always felt that allowing NULL values in a column leads to more trouble of one form or another down the line. And recent experience with another database design has again reinforced my concerns. For another description of some of the problems with NULL values and what they do or do not mean, you can also read Much Ado About Nothing?, which is quite informative.

If I had to summarise my position on allowing NULL values, I would say I think of them as similar to a loaded gun - not directly harmful in itself, but with a reasonable chance of something going wrong and causing serious damage. If you need to use them, and you have a clear reason why, then use them. But otherwise avoid them as much as possible.

Allowing NULL values in a column can be bad for all kind of direct and indirect reasons.
  • Oracle does not store NULL values in normal B-Tree indexes, which limits when the indexes can be used on columns that allow NULL values
    • Such an index cannot be used to satisfy an "IS NULL" query constraint
    • The index also cannot be used to satisfy a "column != value" query constraint
  • Foreign keys that allow NULL are "optional", but again have index and optimizer side effects.
  • Outer joins may become necessary, which again limit the choices available to the Optimizer.
  • Many programming languages have no concept of NULL, which is by definition a value outside the normal value range.
    • Either NULL values require special programming, or require some kind of shadow boolean variable to indicate if it is really NULL (no real value in the main variable).
  • Many programmers struggle with the difference between NULL and empty or blank, which can all mean different things within a programming language e.g. a zero length string, or the number zero.
  • NULL cannot be used in normal equality and range tests. As it is outside the normal value range it cannot be greater than or less than a real value. And it cannot be equal to any other value in a column, including another NULL. You must use the special constraint "IS NULL" to test for a NULL value, and never "= NULL" (see the sketch after this list).
  • Calculations involving a NULL value evaluate to NULL, i.e. unknown.
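As a small illustration of the comparison and calculation rules in the last two points, the following can be run in any Oracle schema (the expected results follow directly from the three-valued logic described above):
select count (*) from dual where null = null ;    -- expected count: 0
select count (*) from dual where null != null ;   -- expected count: 0 as well
select count (*) from dual where null is null ;   -- expected count: 1
select 1 + null from dual ;                       -- expected result: NULL, not 1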
So NULLs are definitely useful, but each case should be considered carefully before using them. Generally it is easier not to allow NULL values in columns, as all operations are then simpler. Allowing NULL values requires some thought to make sure that it ends up working as you hoped it would.