In a case like this, such an index will almost never be used by Oracle even though the index includes columns directly referenced in the WHERE clause of a query. This is because the number of estimated disk I/Os to use it will always be greater than the number of disk I/Os to do a full table scan. There might be exceptions in cases where the data is very skewed and you have a histogram on the columns in the index, and you are selecting a value with a low occurrence count. But generally such an index is unlikely to be used by any queries you run.
Which raises the question, how can we find out which indexes in our database are selective enough to be used by a query before we actually run those queries? Or conversely, which indexes are not selective enough given the data in the database and will never be used? It is a bit too late to find out when we are running a query and Oracle is doing a full table scan and taking a long time to do it. Of course we could do an EXPLAIN PLAN of the query beforehand, but execution plans can change over time dependent on the volumes of data in a table, and the distribution of data values.
We would like to have a query that we can run on a complete database, and which will tell us which indexes are not selective given the actual data in the tables. This would flag weak indexes to us, regardless of which queries might be running i.e. indexes that are unlikely to be chosen by the Optimizer even if all columns in them are referenced in the query.
How would such a query work? Well, we know how the Optimizer costs the different access methods to data in a table (see previous posts on Full Table Scans and Index Scans). We can calculate each of these for a table and an index, and see if the table scan cost is lower than the index access cost. In fact, using a bit of maths we can manipulate the formula for these two costs together in a way that results in a more direct comparison. If the maths bit is too much for you then just skip ahead to the final formula followed by the SQL to run it.
The costs of each type of access are:
- FTS Cost = (Blocks * mreadtim) / (MBRC * sreadtim)
- Index Cost = Levels + (Leaf Blocks + Clustering Factor) * Filter Factor
- Index Cost > Full Table Scan Cost
- (Leaf Blocks + Clustering Factor) * Filter Factor > (Blocks * mreadtim) / (MBRC * sreadtim)
- Filter Factor > (Blocks * mreadtim) / (MBRC * sreadtim * (Leaf Blocks + Clustering Factor))
- Filter Factor = 1 / Distinct Keys
- Distinct Keys < (MBRC * sreadtim * (Leaf Blocks + Clustering Factor)) / (Blocks * mreadtim)
To make some things easier we can group together values that are table and index independent, finally giving us that for the index to be ignored the following must be true:
- Distinct Keys < (MBRC * sreadtim / mreadtim) * (Leaf Blocks + Clustering Factor) / Blocks
For a system with no System Statistics gathered, which is actually the majority of cases, then the following calculations are used for the disk read times:
- sreadtim = ioseektim + (db_block_size / iotfrspeed)
- mreadtim = ioseektim + (db_file_multiblock_read_count * db_block_size / iotfrspeed)
This means that the only non-table dependent values in the fully expanded formula are for db_block_size and db_file_multiblock_read_count. These can only be obtained directly from the Oracle database if the user has SELECT permission on the dynamic performance views (V$PARAMETER specifically). And generally this is not true for normal users. To work around this, the SQL script below simply prompts the user to enter these two values.
The SQL script then calculates the derived values, and then uses these on all indexes on all tables to list those that fail the test i.e. number of distinct values in the index is too low.
As mentioned earlier, we can only ignore the depth of the index when the table is large enough, so the SQL includes an additional constraint on the number of blocks in a table, which is hardcoded at 100 blocks. You could easily change this to another value if desired.
This is a SQL*Plus script and must be run via SQL*Plus, as it uses SQL*Plus features such as substitution variables. And remember that your statistics must be up to date on each table and index in the database.
-- Check which indexes on all tables are better or worse than a full table scan
-- When the number of distinct values in a column is low, a full table scan
-- is more efficient (less I/Os) than using an index.
set define on
set verify off
set heading off
prompt Report to show which indexes are selective enough and which are not.
prompt Non-selective indexes will be ignored and full table scans preferred.
define small_table_threshold = 100
accept block_size prompt 'Enter database block size in bytes (e.g. 8192) : '
accept mbrc prompt 'Enter db_file_multiblock_read_count value (e.g. 8): '
prompt Using following database wide configuration settings:
prompt : Database Block size is &&block_size (bytes)
prompt : db_file_multiblock_read_count is &&mbrc
prompt : Threshold for a small table is &&small_table_threshold (blocks)
prompt : (Smaller tables than this are ignored in this report)
prompt : [If these are incorrect, edit this script and re-run]
column sreadtim new_value sreadtim
column mreadtim new_value mreadtim
column effective_mbrc new_value effective_mbrc
set termout off
-- Hide from screen internal calculations of derived values
select 10 + (&&block_size / 4096) sreadtim from dual ;
select 10 + (&&mbrc * &&block_size / 4096) mreadtim from dual ;
select &&mbrc * &&sreadtim / &&mreadtim effective_mbrc from dual ;
set termout on
prompt Assuming that system wide statistics have these derived values:
prompt : Single Block Read Time used is &&sreadtim (milliseconds)
prompt : Multi Block Read Time used is &&mreadtim (milliseconds)
set heading on
column table_name heading 'Table' format a20
column index_name heading 'Index Name' format a20
column distinct_keys heading 'Number of|Distinct|Values' format 999,999
column min_ndv heading 'Min|NDV' format 999,999
column selectivity heading 'Selectivity'
column sel heading 'Sel?'
column blocks heading 'Blocks' format 999,999
column mb heading 'Table|MB' format 999,999
select table_name, index_name, distinct_keys, min_ndv,
when (distinct_keys < min_ndv)
(blocks * &&block_size / (1024 * 1024)) mb
from ( select t.table_name, t.blocks, t.num_rows num_trows,
i.index_name, i.num_rows num_irows, i.distinct_keys,
((i.leaf_blocks + i.clustering_factor) / t.blocks) min_ndv
from user_ind_statistics i, user_tables t
where t.table_name = i.table_name
and i.num_rows != 0
where distinct_keys < min_ndv
and blocks >= &&small_table_threshold
order by table_name, index_name ;
prompt Selectivity (S) indicates whether index will be chosen or not in a query
prompt given the values of the database settings you have entered.
Running this should list indexes that will NOT generally be used by the Optimizer for any query you might run that references that table. There might be exceptional cases where the Optimizer might use such an index e.g. an Index Fast Full Scan, but generally such indexes would not be used as expected for direct access to the matching rows.
As ever, remember that higher values for db_file_multiblock_read_count lower the cost for a full table scan, and so lower the threshold at which a full table scan becomes cheaper than an index access.