Wednesday, 20 June 2012

NULLs & Multi-Column Indexes

Previously I showed some of the issues with allowing NULL values in columns and how indexes might be ignored by the Optimizer. All of the examples I gave involved single column indexes only, and showed that NULL values are not stored in such an index. As others have pointed out in comments, this is not necessarily true for multi-column indexes. If at least one of the columns in the index does not allow NULL values then an index entry is stored with the values of all of the indexed columns. This means that a NULL value can be stored in a multi-column index. Lets test that through an example.

Previously we saw that if we had an index on the ref_id_n column in the transaction table, which allowed NULL values, then the Optimizer would not use the index for an "IS NULL" constraint:
SQL_ID  cwm2cmgn8q09n, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 185056574

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |  3578 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION |     1 |     4 |  3578   (1)| 00:00:43 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("REF_ID_N" IS NULL)
The index on ref_id_n is being ignored, as the NULL values are not stored in that index. However, if we create another index, with ref_id_n as the first column followed by another column that does not allow NULL values, then the index will contain entries where ref_id_n is NULL. With such an index, the Optimizer can now use it for an "IS NULL" constraint:
SQL> create index ix2_transaction on transaction (ref_id_n, ref_id_nn) ;

Index created.

SQL> select count (*) from transaction where ref_id_n is null ;

  COUNT(*)
----------
       100

SQL> @xplastexec

SQL_ID  8tubzdty7vdnv, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 176942238

-------------------------------------------------------------------------------------
| Id  | Operation         | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                 |       |       |     2 (100)|          |
|   1 |  SORT AGGREGATE   |                 |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| IX2_TRANSACTION |     1 |     4 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("REF_ID_N" IS NULL)
So multi-column indexes can store NULL values in them, providing at least one column does not allow NULL values. The benefit of such an index can be significant - in this test case the cost came down from 3,578 to 2, simply because the number of NULL rows were so few. But an index on only the ref_id_n column itself is of no use for this query, and is ignored by the Optimizer.

Potentially you can also gain some benefit from an index where the ref_id_n column is not the leading index, as the index may be smaller in size than the table, and the Optimizer may chose an Index Full Scan rather than a Full Table Scan. And that is the case with the test data set I have been using:
SQL> drop index ix2_transaction ;

Index dropped.

SQL> create index ix2_transaction on transaction (ref_id_nn, ref_id_n) ;

Index created.

SQL> select count (*) from transaction where ref_id_n is null;

  COUNT(*)
----------
       100

SQL> @xplastexec

SQL_ID  2j64r2n1nq4xm, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 1095380460

-----------------------------------------------------------------------------------------
| Id  | Operation             | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                 |       |       |   726 (100)|          |
|   1 |  SORT AGGREGATE       |                 |     1 |     4 |            |          |
|*  2 |   INDEX FAST FULL SCAN| IX2_TRANSACTION |     1 |     4 |   726   (1)| 00:00:09 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("REF_ID_N" IS NULL)

The Index Fast Full Scan is costed at 726, compared to the 3,578 of the Full Table Scan. Being a "count (*)" only, no other data columns are needed from the table itself.

So if you are allowing NULL values in columns within your database design, and you want to find those rows that have a NULL value stored, then you cannot use an index on just that column alone. You will need a multi-column index, and include another column that does not allow NULL values. However, it may be that the only column that does not allow NULL values in your database design is the primary key column(s) itself, if you simply allow NULLs for every column by default.

2 comments:

Franck Pachot said...

Hi,

If you don't want to include another (non null) column, then you can simply add a constant, as in:

create index ix2_transaction on transaction (ref_id_n, 0) ;

this is just one additional byte in the index.

Regards,
Franck.

Jeffrey Kemp said...

"So multi-column indexes can store NULL values in them, providing at least one column does not allow NULL values."

You're absolutely correct - in the scenario where at least one column has a NOT NULL constraint, the index will store an entry for every row in the database, even if a row has some NULLs in the other (NULLable) columns.

You might want to note also, however, that in the scenario where a multi-column index is across columns that are ALL NULLable (i.e. there are no NOT NULL constraints), you can still get NULLs stored in the index. This is because the index will store entries for each row where at least one of the columns happens to have a non-NULL value. Of course, in this scenario it's not guaranteed that EVERY row in the database will be represented in the index.

So, it's more accurate to say that all indexes (whether single or multi-column) will store entries for every row that has at least one non-NULL value for a column in the index. It doesn't matter whether the columns have NOT NULL constraints or not.