Monday, 21 March 2016

Creeping Slow Performance

A recent OTN post asking for help with a slow-performing update, where Oracle was doing a full table scan, also revealed that the update was getting slower each month, and that new data was loaded into the table each month to be "processed", i.e. updated. Unfortunately this kind of design suffers from a natural, creeping slowdown in the elapsed time of the update. Each month it takes longer to process the newly loaded data, until eventually any elapsed time target is exceeded, and from then on it only continues to get slower. There is a bad design pattern here, but obviously the original designers and implementers did not spot it.

Consider the following - in the first month after go-live the table contains only newly loaded data, so a full table scan is used. Let's say it takes 2 minutes to complete, and our target is 30 minutes for the whole process, which includes other steps as well. Everything looks good performance-wise, so we leave everything as it is and carry on.

The next month - month 2 - the table has doubled in size, and Oracle still uses a full table scan because the update touches half the data in the table. The scan now takes 4 minutes, but the overall process still completes within 30 minutes, so everything still looks good.

Over the next few months the table grows each month, and the elapsed time of the update increases by another 2 minutes each month, assuming a similar volume of data is loaded every month. After 5 months the update takes 10 minutes, but this is still within our 30 minute target.
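Stated as a simple formula, assuming each monthly load adds a constant 2 minutes of full table scan time, the update in month n takes:

    T(n) = 2 x n minutes

so the update alone consumes the whole 30 minute window once 2 x n >= 30, i.e. from month 15 onwards.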

But after 10 months the elapsed time is 20 minutes, a significant proportion of the 30, and people are starting to ask questions. And after 15 months, if not sooner, the 30 minute target has been missed and the overall data load and processing now takes too long. Worse still, it will continue to take longer each month.

What can be done? Could an index help? No, is the simple answer. As I recently blogged in Full Table Scan - Friend or Foe?, the cutover point at which the Optimizer chooses to use an index is when you want to retrieve less than about 0.5% of the data in a table. After 10 months you are still processing 10% of the data in the table each month. And worse, even after 100 months (8+ years) you are still processing 1% of the data in the table each month, and a full table scan still has a lower execution cost than using any index on the table.
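For illustration, here is a minimal sketch of the kind of update involved - the table, column and index names are all hypothetical:

    -- Hypothetical index on the flag column driving the update.
    CREATE INDEX monthly_data_proc_ix ON monthly_data (processed);

    -- The monthly update still touches around 10% of the table, so the
    -- Optimizer rightly prefers one multi-block full table scan over
    -- millions of single-row index lookups - and the update would also
    -- have to maintain the index entries for every row it changes.
    UPDATE monthly_data
       SET processed = 'Y'
     WHERE processed = 'N';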

What could you do about this? One obvious solution is a "staging table" of the same structure, into which the data is initially loaded each month. The update and other processing are then done on this small table, and finally all the data is inserted into the main table. This solution does not suffer the constant degradation in performance that the current one does. Oracle still uses a full table scan, but it only reads this month's data from disk, so it should complete in a roughly constant time each month - assuming data volumes are similar each month. The insert into the main table can be done quickly using the "append" hint for a direct-path load, and the staging table can then be truncated, which is very quick. Although the data passes through Oracle twice, it is still much quicker than scanning a table containing hundreds of months of data.
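In outline, the monthly cycle might look like this - again all table and column names are hypothetical:

    -- One-off setup: a staging table with the same structure as the main table.
    CREATE TABLE monthly_staging AS
    SELECT * FROM monthly_data WHERE 1 = 0;

    -- Each month the new data is loaded into monthly_staging, then processed
    -- there - still a full table scan, but only of this month's rows.
    UPDATE monthly_staging
       SET processed = 'Y'
     WHERE processed = 'N';

    -- Direct-path load the processed rows into the main table.
    INSERT /*+ append */ INTO monthly_data
    SELECT * FROM monthly_staging;

    COMMIT;

    -- Empty the staging table ready for next month's load - near instant.
    TRUNCATE TABLE monthly_staging;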

Other possibilities include partitioning the main table by a key unique to each month's data, which is also used in the update statement so that partition pruning occurs. This may be problematic if there is no clean partition key to use. It should not be a "processed flag", as this would be updated by the processing itself, which would result in row movement between partitions and extra work by Oracle.
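A sketch of what this could look like, assuming a load date column is available as a clean partition key (all names hypothetical):

    -- Main table range partitioned by month, with a new partition created
    -- automatically for each month's load (interval partitioning).
    CREATE TABLE monthly_data (
        id         NUMBER,
        load_month DATE,
        amount     NUMBER,
        processed  VARCHAR2(1)
    )
    PARTITION BY RANGE (load_month)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    (PARTITION p_first VALUES LESS THAN (DATE '2016-01-01'));

    -- Referencing the partition key in the update lets Oracle prune away
    -- every partition except the current month's.
    UPDATE monthly_data
       SET processed = 'Y'
     WHERE load_month = DATE '2016-03-01'
       AND processed = 'N';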

Ultimately the issue is about spotting these kinds of bad design patterns early on and avoiding the future performance problems they will cause. For one month's data to become less than 0.5% of the total data volume in the table would take over 200 months, or almost 17 years, and during those 17 years this monthly data load processing would always take longer and longer to complete. A different design can avoid this future performance problem completely.
