Tuesday, 21 October 2008

Staging Tables and Statistics in Oracle

As we should all know, Oracle 10g only has a cost-based optimiser for executing SQL statements, and all of its decisions are based on the statistics it has about the tables in the database. Generally this works well, and there is also a default job that runs each day to keep these statistics up to date. Where problems can arise, though, is with tables that are used to stage intermediate results during a series of processing steps.

Such tables often start off empty, are loaded with data produced by the first steps in processing, and then read and processed by the subsequent steps. Eventually the data will be deleted from the staging table as its processing is finished. Because the statistics on these tables indicate that they are empty or near empty, Oracle will simply scan them rather than use any index. When the data set being processed is small, this mistake can go unnoticed, with little effect on the overall performance and elapsed time. However, as the data set being processed gets larger and larger so the effects of the wrong execution plan become larger too, and the elapsed time grows and grows.

The solution of course is to update the statistics on such a staging table after the data has been loaded into it, so that Oracle can make the right decisions. There are a number of ways to do this, which I'll get onto, and which you can also find easily enough elsewhere on the web.

The other thing to be aware of in this situation is temporary tables - global temporary tables, as Oracle calls them - where the data in the table is private to each session and is deleted automatically when the session disconnects. These temporary tables are also often used to stage data that is being processed in a series of steps, and those two properties - session-private data and automatic clean-up - are their main advantages.

However, there are issues with collecting statistics on such temporary tables, which you must be aware of. The statistics on a table are shared by all sessions, yet each session sees only its own private rows, so the net result is that it is basically impossible to collect meaningful statistics on such temporary tables. You have to use another mechanism to set the statistics on a temporary table.
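As a concrete illustration (the table and column names here are my own invention, not from any real system), such a staging table might be declared as a global temporary table like this:

```sql
-- Hypothetical staging table, private to each session.
CREATE GLOBAL TEMPORARY TABLE stage_orders (
    order_id    NUMBER,
    account_id  NUMBER,
    status      VARCHAR2(10)
) ON COMMIT PRESERVE ROWS;  -- rows kept until the session disconnects

-- Gathering statistics on this is of little use: the statistics are
-- shared by all sessions, but each session sees only its own rows.
```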

For normal data tables you can use the GATHER_TABLE_STATS procedure in the DBMS_STATS package to gather up to date statistics on the table after the data has been loaded into it. By leaving other options to their default values, Oracle should also update the statistics on the indexes too.
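For illustration, a call to gather statistics on a (hypothetical) staging table immediately after loading might look like this; here CASCADE is set explicitly rather than relying on the default:

```sql
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS (
        ownname => USER,             -- schema owning the table
        tabname => 'STAGE_ORDERS',   -- hypothetical staging table name
        cascade => TRUE              -- gather index statistics as well
    );
END;
/
```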

If you want to avoid the elapsed time associated with a full GATHER_TABLE_STATS you can instead run DELETE_TABLE_STATS. In situations where the Oracle optimiser has no statistics on a table, it samples the table itself to make some estimates for those statistics. This is termed dynamic sampling. In principle this ensures that the optimiser has some statistics on the table that reflect its size and data distribution. Again, this will not work as intended for a temporary table.
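A sketch of this approach (again with a hypothetical table name): delete the statistics after loading, and let dynamic sampling take over when queries are parsed:

```sql
BEGIN
    DBMS_STATS.DELETE_TABLE_STATS (
        ownname => USER,
        tabname => 'STAGE_ORDERS'   -- hypothetical staging table name
    );
END;
/
-- With no statistics present, the optimiser samples the table when
-- parsing a query against it, governed by the
-- optimizer_dynamic_sampling parameter (default level 2 in Oracle 10g).
```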

And finally you can use SET_TABLE_STATS to explicitly set statistics such as the row count of the table. This avoids the gathering of statistics associated with either full statistics or dynamic sampling, and works for temporary tables too.
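A minimal sketch of this, with made-up values - the point being that the row and block counts are simply asserted by the application, not measured:

```sql
BEGIN
    DBMS_STATS.SET_TABLE_STATS (
        ownname => USER,
        tabname => 'STAGE_ORDERS',  -- hypothetical staging table name
        numrows => 100000,          -- expected row count after loading
        numblks => 2000             -- rough estimate of blocks used
    );
END;
/
```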

Although this may not seem earth-shattering, having correct and valid statistics is critical to the way Oracle 10g works, and poor performance or wrong behaviour from Oracle when executing queries can often be traced to poor or incorrect statistics. Developers often forget about this, assuming that the overnight job - a database administrator's responsibility - will keep the statistics up to date. But these temporary staging tables change so frequently that they can never have up-to-date or correct statistics on them, no matter what the database administrators do. Which in turn means that they must be dealt with explicitly within the application code, setting the statistics to suitable values using one of the mechanisms just described. Only then will Oracle access the data in these tables in an optimal way.

Monday, 22 September 2008

Contention and a bottleneck - or not?

I recently had the following situation: during a heavy data load by multiple, concurrent jobs the customer said that they saw contention, and provided me with a STATSPACK report from this period. Initially I was confused:
  • On the one hand, the top wait event was a very high value for waits on the cache buffers chains latch.
  • On the other hand, the CPU utilisation during this whole period was only around 10%.
Normally I would say that this system had capacity to spare - 90% of the CPU unused - but the wait figures indicated that something was wrong somewhere. Digging further into STATSPACK I verified that the top SQL statement was the INSERT into the table concerned, and that the number of waits correlated with the number of sleeps on the cache buffers chains latch.

Now this told me that these were actual sleeps, where a session fails to get the latch even after trying repeatedly in a busy spin loop. Such sleeps are relatively expensive and time consuming, hence the high value for the wait time experienced.
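For reference, the same spin-versus-sleep behaviour can be seen directly in the dynamic performance views (this query is my own illustration, not taken from the STATSPACK report):

```sql
-- misses = failed first attempts at the latch; sleeps = attempts that
-- gave up spinning and had to sleep, which is the expensive case.
SELECT name, gets, misses, sleeps
FROM   v$latch
WHERE  name = 'cache buffers chains';
```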

Digging further I saw that there were buffer busy waits too, and I managed to establish that these were all on the same index on the table. The table has six indexes - i1 to i6 - and the buffer busy waits were all on index i6, which is on a column called last_update_time. Clearly this was a sequential, monotonically increasing series of values, which would all end up in the final leaf block of the index. So we had a classic case of records with the same, or very close, values being inserted, causing contention on the index block where those values would be stored. And the locking of that index block during a block split and the copying of its contents would be the cause of the sleeps on the cache buffers chains latch for that block.

I also did some checking on the index structure itself, by analysing the index and then looking at the statistics. This showed that the percentage of space actually used in each block averaged just over 50%. So Oracle was actually doing 50:50 index block splits when the last index leaf block became full, rather than 90:10. This is probably due to the update times not being perfectly in sequence across the different data loading processes. It also meant that each new block was already 50% full, and so would be ready to split even sooner than if it were only 10% full.
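The index structure check can be done along these lines (index name as in the post; note that VALIDATE STRUCTURE locks the index while it runs, so beware on a busy system):

```sql
ANALYZE INDEX i6 VALIDATE STRUCTURE;

-- INDEX_STATS holds a single row describing the index just analysed.
-- PCT_USED near 50 indicates 50:50 leaf block splits, not 90:10.
SELECT name, height, lf_rows, lf_blks, pct_used
FROM   index_stats;
```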

The solution? Well, there were a couple of possibilities.

One was to use a REVERSE key index. However, reverse key indexes are only useful for equality tests, not range scans, and date columns tend to be used in range queries (from and to).

Another option would be to partition the index by another column, which would have different values in each INSERT from each of the loading processes. This basically creates a set of separate smaller indexes, and so increases the number of last leaf blocks across these sub-indexes, and so reduces contention. However, partitioning is an extra cost option, and needs extra maintenance over the number of partitions and their configuration.

The recommended option was to modify the index, adding after last_update_time the extra column just mentioned under partitioning. The index would still be useful for equality and range conditions, but rows sharing the same last_update_time value would now be spread across multiple index leaf blocks according to the value of the second column - in this case essentially an account identifier. With this extra layer of data values in the index, new entries are distributed over more leaf blocks, reducing contention and improving throughput.
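A sketch of the change, assuming the table is called orders and the second column account_id (both names invented for illustration):

```sql
DROP INDEX i6;

-- Same leading column, so equality and range predicates on
-- last_update_time can still use the index, but identical timestamps
-- from concurrent loaders now spread across many leaf blocks.
CREATE INDEX i6 ON orders (last_update_time, account_id);
```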

Tuesday, 4 March 2008

Gathering Performance Statistics - What and When

In performance analysis and tuning "Measurement is Everything". Or to put it another way - "What you do not measure you cannot control". So we want to measure our system and what is happening on it. Doing this all the time provides a baseline to which we can compare should performance suddenly change. Which leaves the questions of:
  • What should we be measuring?
  • How often should we be measuring it?
In principle the more we measure the better, as we have a finer grained level of data to analyse, and nothing is lost. We can always summarise this low level data in various ways during the initial analysis steps.

However, collecting too much data too often can itself end up being a significant workload on the system. "Nothing comes for free". So we may need to restrict how much data we collect, and how often we collect it, to minimise the impact on the system. It would also be good if within the data we collected we were able to identify the workload of the measurement collector itself.

In terms of frequency I believe that measuring once a minute is generally enough. This provides an adequate level of detail for profiling and establishing a baseline, and provides 60 data points per hour. More frequent measurements provide more data points, but not necessarily anything more useful for analysis purposes, and the measurement collector itself may become a significant workload on the system. Less frequent measurements quickly reduce you to 20 or fewer data points per hour, which I believe is too few.

Of course there are some caveats and assumptions to this 'per minute' rule of thumb I use:
  • The workloads on the system have a lifespan of significantly longer than one minute. Thus we will have multiple data points covering the lifespan of each workload on the system. If the workloads are shorter than a minute to arrive and complete, then you should investigate a higher frequency of measurement, subject to the load of collecting the measurements themselves.
  • Collecting the measurements is relatively quick and a much lighter load on the system than anything else, i.e. negligible. The collection should be a read-only activity that simply saves the measurements, rather than a complicated set of processing to arrive at them. If this is not true, and the collection involves significant system resources, then it should be done at a lower frequency. Oracle STATSPACK is an example of a heavier collector, while Oracle's AWR is a much lighter-weight alternative from Oracle 10g onwards.
  • The volume of data is not too great, and the measurements change in value between samples. If these are not true then the frequency should be reduced and measurements collected less often. Or break the measurements down into different sets - some collected more frequently than others.
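As a sketch of what a lightweight, read-only collector might look like inside the database (the table and job names here are my own invention), one could snapshot the system statistics once a minute into a history table:

```sql
-- History table for one-minute snapshots of the system statistics.
CREATE TABLE perf_snapshots AS
SELECT SYSDATE AS snap_time, name, value
FROM   v$sysstat
WHERE  1 = 0;   -- copy the structure only, no rows

BEGIN
    DBMS_SCHEDULER.CREATE_JOB (
        job_name        => 'PERF_SNAPSHOT_JOB',
        job_type        => 'PLSQL_BLOCK',
        job_action      => 'BEGIN
                              INSERT INTO perf_snapshots
                              SELECT SYSDATE, name, value FROM v$sysstat;
                              COMMIT;
                            END;',
        repeat_interval => 'FREQ=MINUTELY',
        enabled         => TRUE);
END;
/
```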
In terms of what data to collect, I believe you should collect measurements from all levels of the stack. A computer system is not just one thing (the application), but a stack of things all layered one on top of the other. The main layers of the stack include:
  • Hardware - Processor (CPU), Memory, Disk, Network
  • Operating System - Abstracts hardware to standard services and interfaces
  • Database software - Implements a persistent, transaction-oriented data store
  • Middleware - Provides various facilities such as application servers, containers, connection pools, etc
  • Application - Contains the business logic
  • User Interface - Renders graphical user interface and interacts with application. May be separate or integrated with Application itself.
By measuring performance at all levels of the stack you gain a number of benefits:
  • Measuring performance at the Application / User level gives you a meaningful and true business measure of performance e.g. orders processed
  • Measuring performance at other levels of the stack lets you see if any individual component or resource is overloaded
  • Measuring performance at all levels lets you correlate changes in activity at one level with changes at another level. This correlation helps you identify cause and effect from one level to another. Though you may need to investigate further to prove that you have a true cause and effect link.
Then at each layer of the stack, you should collect as comprehensive a set of performance measurements as you can, subject to the earlier caveats about not overloading the system. This is because of the adage that if you do not collect it, you cannot analyse it.

One of the worst situations is to have a performance problem in the future, and find that a key measurement of performance data that would indicate the nature of the problem is not being collected. And although you might be able to add this extra measurement in and collect it from now on, you do not have it in your history to compare to and establish how much it has changed, if at all.

With all this data being collected, it will soon amass to a large volume on disk, and needs to be managed. The basic principles are:
  • Keep all collected measurements for a fixed period, to allow post analysis of reported problems
  • Beyond this period you can summarise the data in various ways to reduce its volume, and delete the detail
  • Freeze and keep a period as a baseline for reference purposes, and comparison to any abnormal behaviour
  • Multiple baselines can be established and kept for different workload profiles
In summary:
  • Collect as much as you can, at a reasonable frequency.
  • Breadth (many separate measurements) is more important than depth (frequency of collection)
  • Collect at all levels of the stack to allow a holistic analysis and identify overloaded resources
  • Manage the historical measurements, retaining them for a period of time
  • Representative periods can be frozen and kept forever, and others deleted on a rolling basis

Monday, 18 February 2008

Good SQL goes hand in hand with Good Database Design

Or to put it another way - if you find poor SQL in an application, it might have as much to do with poor database and application design as with plain old poor programming.

Ever since I had to use COBOL briefly many years ago, I've learnt the lesson that often a good design can make the programming easier. By carefully designing and defining the DATA sections at the top of your COBOL program, you could make the later manipulation of the data itself that much easier.

Time and again I have seen the results of both practices - good design leading to a good, efficient implementation; and bad or weak design leading to a poor implementation. Not only can a good design help achieve a good implementation, but a poor design will often hamper the programmer and restrict the choices available to them. And so the result of a poor design is a poor application with poor SQL in it.

The SQL to access data goes hand in hand with the design of the database containing that data. You cannot separate the application code from the database design. Put the data in the right place, and the SQL becomes simple and straightforward. Put the data in the wrong place, and the SQL becomes more complicated, joining tables together and adding extra constraints and conditions, and possibly needing more horrible things such as outer joins and unions. I'm not saying that these can always be avoided, but I am saying that the core design of a database for an application should not need them.

This is why good database design is important, and how it can impact application performance as much as the application code itself. Time spent doing a good database design is repaid by a good, well performing application. A skimped, quick database design often results in a longer development cycle, probably needing changes to the database design to correct things missed out earlier, and extra effort by the developers to write the SQL that gets them the data they need when they need it.

I'm saying all this because I have seen systems with poor SQL that was causing poor performance, and it turned out to be due to the poor database design, and not the programmer's ability to write good SQL. The SQL was in fact as simple and direct as it could be, but it had to go to multiple tables and join them together and apply different restrictions to each table, to get the data it needed.

In fact, some of the worst impact SQL I have seen on a system was incredibly simple, and was written to continually count the rows in a table to see if any new records had arrived. Of course it was not that simple - only certain rows in a certain state were to be counted, and these also depended on having matching data in another table, and so on. The net result was a SELECT that queried 5 separate tables to determine if any data at all was in a 'ready' state. And because it was a count, it actually needed to touch a lot of data or index records.

Although the joins were efficient (using indexes), the constraints were very few, so that Oracle was retrieving most of the rows that matched in most of the tables, before rejecting them and counting what was left. Such a query cannot avoid a lot of row accesses, even if they are logical and not physical I/Os due to being in the buffer cache. It still takes time to access these rows in the tables and count them all up.

What made this case worse, was that for various reasons the query also did a self join. New entries could be sequenced, and so to count a record as being 'ready' it also had to have no predecessors in its sequence which were also 'ready'. This doubled the amount of work being done by the query, in terms of data records accessed.

As you can imagine, this made it a slow query. And this was being run multiple times a second to see if any new work had arrived. So it became the largest resource consumer in Oracle from the application, even though it was not doing any real work - just counting how many jobs were ready to be processed.

The solution? A redesign. We tried everything we could about using indexes and caching data, and pushing joins down into sub-queries to steer the optimiser one way or another. But still the query involved a lot of buffer gets, which took time, and it was being run very frequently. Which only left a redesign as a viable solution. By moving around where data was stored, and in what form, we were able to eliminate many of the joins, and drastically reduce the number of buffer gets per execution.

And the biggest win was to turn the design completely on its head. Instead of counting rows in a table, we had a counter column in another record that was updated every time a new record was inserted. Although this meant changing the application in more than one place (adding an UPDATE of the counter when a new record was inserted and later when processed), the original query was now reduced to a single record fetch. The remaining joins were now only joining to a single row, and not to every row in the original table, and the self join disappeared too.
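A sketch of the reworked design (all names invented for illustration): each insert of new work also bumps a counter row, and the poll becomes a single-row fetch:

```sql
-- Maintained alongside each INSERT of a new work record:
UPDATE work_counters
SET    ready_count = ready_count + 1
WHERE  queue_name = 'ORDERS';

-- The frequent polling query is now a single-row fetch:
SELECT ready_count
FROM   work_counters
WHERE  queue_name = 'ORDERS';
```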

The result was vastly improved performance - of both the query and the overall system - and in fact it reduced the total resource consumption on the system, so that CPU utilisation went down while throughput went up.

And all because of a poor initial database design, restricting a programmer to writing a poor piece of SQL, that performed inefficiently and used too many resources.

Tuesday, 15 January 2008

Limits, Contention and Capacity

One of the key concepts in performance tuning and analysis is that of the "bottleneck". At low levels of throughput everything looks okay - as you increase the incoming requests, so the output of completed work increases at the same rate. But as the throughput of the system rises you will eventually reach a point where the output levels off and no longer increases in line with input increases. At this point some part of the system is saturated and has become a bottleneck - you cannot squeeze any more through it, even though the rest of the system has capacity to spare.

Tuning a system is often about identifying this bottleneck, and establishing that it is indeed the bottleneck rather than just a slow component in the system. Fixing the bottleneck involves somehow increasing its capacity. Bottlenecks are normally considered in terms of the hardware in the computer system, but they can also be caused by the software being run.

In my view there are only 2 fundamental ways to fix any bottleneck:
1. Increase the capacity of the resource itself i.e. buy more or faster hardware, or change the software algorithm.
2. Reduce the usage of the resource by the application, freeing up capacity.

Which action you take depends on the nature of the bottleneck. If there is no contention and it is just a slow resource, then the first option is the most viable one. The second option is best when there is contention for the resource and a backlog queue of outstanding requests is forming in front of the resource for access to it. Reducing the use of the resource will reduce the size of the queue, which in turn reduces the time any request has to wait to gain access, and so has a direct impact on performance and throughput. But, as stated, this only works when the resource is overloaded and there is contention for it, which you have to identify.

Equally, the first option of replacing the resource with a faster one can work just as well for a contended resource that is the bottleneck. Being able to process each request quicker will reduce the contention and reduce or eliminate any queue of waiting requests, increasing performance and throughput. This is why many people often take the simple, brute force approach of upgrading hardware to make a system perform faster. However, it does not guarantee to remove the contention or the queuing, just to reduce it. There is the possibility that processing requests faster just leads to even more requests being submitted at a higher rate, and in turn back to having a queue of outstanding requests. Which is why the application behaviour needs to be considered.

The second way to fix a bottleneck can actually be approached in two different ways:
1. Optimise and tune the application's configuration from the outside.
2. Change the application itself from the inside.

The first approach typically involves changing configuration settings, such as increasing the number of records read at a time to reduce the number of data reads, or creating an extra index on a table. The second approach typically involves rewriting part of the application to use more efficient algorithms that use fewer resources, such as reading a set of data into an in-memory array once and doing in-memory lookups for validation, instead of repeated reads from disk.

What you do is dependent on the nature of the bottlenecked resource, the application software, and how much money and time you have available.

Another key point to be aware of, is that fixing one resource simply moves the bottleneck to another one. The bottleneck never actually goes away, it just moves somewhere else within the system. Presuming you are able to increase the effective capacity of the resource that is initially the bottleneck then throughput will increase, which is good. But eventually you will reach a point where the carrying capacity of that resource is now greater than another resource in the system. And now you have to repeat the analysis of the system again to identify the new bottleneck. And again work out the best way to either increase its capacity or reduce the number of times it is used.

Which is why performance tuning is so often an iterative exercise. You fix the first problem, but the bottleneck just moves somewhere else, and you repeat the analysis again, and again.

A valid alternative is to calculate in advance how much of each resource is available in the system, and how much can be used if performance targets are to be met, such as a given number of transactions per minute. This is more of a capacity planning exercise, but it is directly useful in such performance tuning situations. We can run the application under a relatively light workload and measure the actual amount of each resource used. These measurements can then be scaled up to the target workload and compared to the calculated resource requirements from before. This will relatively easily identify which resources will be saturated and overloaded, and by how much.

This is a much easier approach than the repeated, iterative analysis and identification of bottlenecks, and it also establishes how much of a gap there is between desired performance and actual achieved performance. If the gap is known to be too wide, then simple tuning may not be the answer, and may be a wasted effort. Something more drastic and radical may be required. But you can only know this by doing the capacity planning in advance of deploying the application, and then comparing the two results.