Wednesday 30 March 2016

Full Table Scan not always as low as 0.5% of data!

Based on a reply from Jonathan Lewis to an OTN post on Explain Plans I need to correct some of the claims made in my previous post on Full Table Scans.

I'm going to repeat the inaccurate paragraphs and the Conclusions I posted then [edited here for brevity], and then point out the mistakes in them, and an attempt at a corrected set of paragraphs. I will also edit the original post and replace these incorrect paragraphs with the corrected ones.

I'll highlight the parts that are not strictly correct.
The main determinant for whether a FTS is the best access method is the fraction or percentage of the rows in the table being retrieved by the query, which are of course determined by the filter conditions in the query. If this fraction / percentage of data from the table is high enough then a Full Table Scan will always be a lower cost than using an Index Scan (except maybe for an Index Fast Full Scan, which is really another variation on a Full Scan).

So when you see a Full Table Scan in an execution plan you should check the query itself for how many rows it needs from the total number of rows in the table i.e. what percentage or fraction. If this is high enough then a FTS is indeed the "best" access method to get that data because it has the lowest cost. In which case you should really be asking why the query needs such a high percentage of the data from one table. Don't assume that the Optimizer is wrong - normally it isn't. It is just as likely to be your query that is affecting the execution plan chosen.

Conclusions

Although a Full Table Scan can seem a "brute force" approach to finding some matching records in a table, it can sometimes be the better way of doing it though. It all depends on how many rows you want back from the table as a percentage of the rows in the table.

If your query is retrieving more than about 0.20% to 0.35% of the data in a table then a Full Table Scan may well be the cheapest and best access method. [...]. Trying to force the Optimizer into using another access method in this circumstance is a waste of time, because all other access methods will be more expensive. The only exception might be another Full Scan type access, such as an Index Full Scan. But even then the gains (reduction in cost) will only be marginal i.e. not a full order of magnitude less.

When you see a FTS in a query execution plan you should check the estimated row count and what this is as a percentage of the row count in the table, and confirm if this estimate is correct or not. If the estimate is correct then a FTS is the lowest cost access method and the Optimizer is right to choose it. You should also check if your query is correct, or if there is something wrong with the filter conditions in it.
The main error is that it is wrong for me to claim any kind of actual figures for the percentage when the cutover will occur. As Jonathan points out in his reply on OTN it is possible for an index access to have a lower cost than a FTS for 24.5% of the data in a table, which is far above the 0.35% figure I mention. The actual cutover percentage is dependent on so many variables that it is incorrect for me to try and state a specific range for the cutover.

While I was not necessarily wrong in my specific examples because I made clear what my assumptions were, I was also not right in all possible scenarios. So I'm withdrawing my original Conclusions and updating them to be more strictly correct.
[paragraph deleted]

So when you see a Full Table Scan in an execution plan you should check the query and all other relevant factors to see if a FTS is indeed the "best" access method to get that data because it has the lowest cost. [deleted sentence] Don't assume that the Optimizer is wrong - normally it isn't. It is just as likely to be your query that is affecting the execution plan chosen.

Conclusions

Although a Full Table Scan can seem a "brute force" approach to finding some matching records in a table, it can sometimes be the better way of doing it though. It depends on several factors including how many rows you want back from the table, the number of blocks for the table, and the Clustering Factor of any indexes.

It is possible that for even low percentages of data being retrieved from a table that a Full Table Scan can be the cheapest and best access method. [...]. Trying to force the Optimizer into using another access method in this circumstance is a waste of time, because all other access methods will be more expensive. The only exception might be another Full Scan type access, such as an Index Full Scan. But even then the gains (reduction in cost) will only be marginal i.e. not a full order of magnitude less.

When you see a FTS in a query execution plan you should check many things including, but not limited to, the number of rows in the table, the number of blocks used for the table, the Clustering Factor for each possible index, and the estimated row count for the filters being used.

If the estimate is correct then a FTS is the lowest cost access method and the Optimizer is right to choose it. You should also check if your query is correct, or if there is something wrong with the filter conditions in it.

Counter Example

It is relatively simple to show a similar query to the ones I was using that uses an index instead of a full table scan for a far higher percentage of data in the table than 0.5%. Jonathan posted one such counter example in one of his replies on OTN, and we can easily replicate this using the same test data set from my original post.

select sum (one) from scantest where pkid between 1000000 and 2000000 ;

  SUM(ONE)
----------
   1000001

Statistics
----------------------------------------------------------
      10468  consistent gets
      10311  physical reads
   2  SQL*Net roundtrips to/from client
   6  sorts (memory)
   0  sorts (disk)
   1  rows processed

select sum (one) from scantest where pkid between 1000000 and 2000000

Plan hash value: 40453105

-----------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |              |       |       | 10282 (100)|          |
|   1 |  SORT AGGREGATE                      |              |     1 |     9 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| SCANTEST     |  1000K|  8789K| 10282   (1)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | IX_SCAN_PKID |  1000K|       |  2223   (1)| 00:00:01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("PKID">=1000000 AND "PKID"<=2000000)
Note:
  • This query is retrieving 10% of the data in the table
  • An index range scan is being used to retrieve the data
  • A Full Table Scan had a cost of just over 22,000, so this index scan is much cheaper at 10,282
  • The key difference is the Clustering Factor of the pkid column
This index range scan is still cheaper when retrieving 20% of the data in the table:

select sum (one) from scantest where pkid between 1000000 and 3000000 ;

  SUM(ONE)
----------
   2000001

Statistics
----------------------------------------------------------
      20762  consistent gets
      20697  physical reads
   2  SQL*Net roundtrips to/from client
   6  sorts (memory)
   0  sorts (disk)
   1  rows processed

select sum (one) from scantest where pkid between 1000000 and 3000000

Plan hash value: 40453105

-----------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |              |       |       | 20562 (100)|          |
|   1 |  SORT AGGREGATE                      |              |     1 |     9 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| SCANTEST     |  2000K|    17M| 20562   (1)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | IX_SCAN_PKID |  2000K|       |  4442   (1)| 00:00:01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("PKID">=1000000 AND "PKID"<=3000000)

Conclusion?

Be careful of generalising and summarising certain observed behaviours too much with Oracle. When considering index usage there is much more to it than just the number of rows being retrieved as a percentage of those in the table.

No comments: