Tuesday 21 October 2008

Staging Tables and Statistics in Oracle

As we should all know, Oracle 10g only has a cost-based optimiser for executing SQL statements, and all of its decisions are based on the statistics it has about the tables in the database. Generally this works well, and there is also a default job that runs each day to keep these statistics up to date. Where a problem can arise, though, is with tables that are used to stage intermediate results during a series of processing steps.

Such tables often start off empty, are loaded with data produced by the first steps in processing, and then read and processed by the subsequent steps. Eventually the data is deleted from the staging table once its processing is finished. Because the statistics on these tables indicate that they are empty or nearly empty, Oracle will simply scan them rather than use any index. When the data set being processed is small, this mistake can go unnoticed, with little effect on the overall performance and elapsed time. However, as the data set being processed gets larger, so the effects of the wrong execution plan get larger too, and the elapsed time grows and grows.
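
A sketch of the pattern, using hypothetical table and column names:

    -- First step: load intermediate results into the (empty) staging table
    INSERT INTO order_staging (order_id, customer_id, amount)
    SELECT order_id, customer_id, amount FROM orders WHERE status = 'NEW';
    COMMIT;

    -- Subsequent steps: read and process the staged rows
    SELECT s.order_id, c.customer_name, s.amount
      FROM order_staging s
      JOIN customers c ON c.customer_id = s.customer_id;

    -- Final step: clear the staging table down
    DELETE FROM order_staging;
    COMMIT;

It is queries like the join in the middle where the optimiser, believing order_staging to be empty, can pick the wrong access path once the table actually holds a large volume of rows.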

The solution of course is to update the statistics on such a staging table after the data has been loaded into it, so that Oracle can make the right decisions. There are a number of ways to do this, which I'll get onto below, and which you can also find elsewhere on the web easily enough.

The other thing to be aware of in this situation is temporary tables - global temporary tables as Oracle calls them - where the data in the table is session private and is deleted when the session disconnects. These temporary tables are also often used to stage data that is being processed in a series of steps, and have certain potential advantages: the data is private to each session, and it is cleaned up automatically when the session disconnects.
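
For example, a session-private staging table might be declared like this - the table and columns are the same hypothetical ones as above, and ON COMMIT PRESERVE ROWS keeps the rows across commits until the session ends:

    CREATE GLOBAL TEMPORARY TABLE order_staging (
        order_id     NUMBER,
        customer_id  NUMBER,
        amount       NUMBER
    ) ON COMMIT PRESERVE ROWS;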

However, there are issues with collecting statistics on such temporary tables that you must be aware of. The statistics on a table are shared by all sessions, while the data in a global temporary table is private to each session, so any gathered statistics can only ever reflect one session's view of the data. Worse, DBMS_STATS issues an implicit commit, so for a table defined with ON COMMIT DELETE ROWS the gather sees an empty table. The net result is that it is basically impossible to collect useful statistics on such temporary tables, and you have to use another mechanism to set the statistics on them.

For normal data tables you can use the GATHER_TABLE_STATS procedure in the DBMS_STATS package to gather up-to-date statistics on the table after the data has been loaded into it. By leaving the other options at their default values, Oracle should update the statistics on the table's indexes too.
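
As a minimal sketch, again assuming the hypothetical order_staging table is owned by the current user, the call after each data load might be:

    BEGIN
        DBMS_STATS.GATHER_TABLE_STATS (
            ownname => USER,
            tabname => 'ORDER_STAGING',
            cascade => TRUE  -- also gather statistics on the table's indexes
        );
    END;
    /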

If you want to avoid the elapsed time associated with a full GATHER_TABLE_STATS you can instead run DELETE_TABLE_STATS. When the Oracle optimiser has no statistics at all on a table, it samples the table itself at parse time to make some estimates for those statistics. This is termed dynamic sampling. In principle this ensures that the optimiser has some statistics on the table that reflect its size and data distribution. Again this will not work as intended for a temporary table.
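
Deleting the statistics is a one-off step rather than something repeated on each load - a sketch, again using the hypothetical order_staging table:

    BEGIN
        DBMS_STATS.DELETE_TABLE_STATS (
            ownname => USER,
            tabname => 'ORDER_STAGING'
        );
    END;
    /

The depth of the sampling is controlled by the OPTIMIZER_DYNAMIC_SAMPLING initialisation parameter (which defaults to level 2 in 10g), or per query with the DYNAMIC_SAMPLING hint.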

And finally you can use SET_TABLE_STATS to explicitly set statistics such as the row count of the table. This avoids the overhead of either a full statistics gather or dynamic sampling, and it works for temporary tables too.
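
A sketch of such a call, with made-up values for the row count, block count and average row length of the hypothetical order_staging table:

    BEGIN
        DBMS_STATS.SET_TABLE_STATS (
            ownname => USER,
            tabname => 'ORDER_STAGING',
            numrows => 100000,  -- expected row count after loading
            numblks => 2000,    -- expected number of blocks
            avgrlen => 120      -- expected average row length in bytes
        );
    END;
    /

The values do not need to be exact - they only need to be close enough to the real size of the loaded data to steer the optimiser towards a sensible execution plan.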

Although this may not seem earth-shattering, having correct and valid statistics is critical to the way Oracle 10g works, and poor performance and wrong behaviour from Oracle when executing queries can often be traced back to poor or incorrect statistics. Developers can often forget about this, assuming that the overnight job will keep the statistics up to date - a database administrator's responsibility. But these temporary staging tables change so frequently that they can never have up to date or correct statistics on them, no matter what the database administrators do. This in turn means that they must be dealt with explicitly within the application code, setting the statistics to suitable values using one of the mechanisms just described. Only then will Oracle access the data in these tables in an optimal way.