Tuesday 4 March 2008

Gathering Performance Statistics - What and When

In performance analysis and tuning, "Measurement is Everything". Or to put it another way: "What you do not measure you cannot control". So we want to measure our system and what is happening on it. Doing this all the time provides a baseline against which we can compare should performance suddenly change. That leaves two questions:
  • What should we be measuring?
  • How often should we be measuring it?
In principle the more we measure the better, as we have finer-grained data to analyse, and nothing is lost. We can always summarise this low-level data in various ways during the initial analysis steps.

However, collecting too much data too often can itself become a significant workload on the system. "Nothing comes for free". So we may need to restrict how much data we collect, and how often we collect it, to minimise the impact on the system. It would also be useful if, within the data we collect, we could identify the workload of the measurement collector itself.

In terms of frequency, I believe sampling once per minute is generally sufficient. This provides an adequate level of detail for profiling and establishing a baseline, giving 60 data points per hour. More frequent measurements provide more data points, but not necessarily anything more useful for analysis, and the measurement collector itself may become a significant workload on the system. Less frequent measurements quickly reduce you to 20 or fewer data points per hour, which I believe is too few.
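
To make this concrete, here is a minimal sketch of a per-minute sampler in Python. It is illustrative only: it assumes a Linux system where /proc/loadavg and /proc/meminfo are readable, and it simply prints each sample instead of storing it somewhere durable.

    import time

    SAMPLE_INTERVAL = 60  # seconds - the 'per minute' rule of thumb

    def read_loadavg():
        # First field of /proc/loadavg is the 1-minute load average
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])

    def read_meminfo(field="MemFree"):
        # Lines in /proc/meminfo look like 'MemFree:  123456 kB'
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])  # value in kB
        return None

    while True:
        started = time.time()
        print(int(started), read_loadavg(), read_meminfo())
        # Sleep for the remainder of the interval to keep samples aligned
        time.sleep(max(0.0, SAMPLE_INTERVAL - (time.time() - started)))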

Of course there are some caveats and assumptions behind this 'per minute' rule of thumb:
  • The workloads on the system have a lifespan significantly longer than one minute, so we will have multiple data points covering the lifespan of each workload. If workloads arrive and complete in under a minute, you should investigate a higher frequency of measurement, subject to the load of collecting the measurements themselves.
  • Collecting the measurements is relatively quick and a negligible load on the system compared to everything else. The collection should be a read-only activity followed by saving the measurements, rather than a complicated set of processing to arrive at them. If this is not true, and the collection consumes significant system resources, then it should be done at a lower frequency. Oracle STATSPACK is an example of a heavier collector, while Oracle's AWR is a much lighter-weight alternative from Oracle 10g onwards. (One way to record the collector's own cost is sketched after this list.)
  • The volume of data is not too great, and the measurements change in value between samples. If either of these is not true, then the frequency should be reduced and measurements collected less often. Alternatively, break the measurements down into different sets, some collected more frequently than others.
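On the caveat about collection cost, one way to keep the collector honest is to record its own cost alongside the measurements it gathers, so that it shows up in the data like any other workload. The sketch below is an assumed approach, not a standard tool: it wraps any collection function (here a dummy lambda) and uses the Unix-only resource module to add the collector's wall-clock and CPU time to each sample.

    import time
    import resource

    def timed_collect(collect):
        # Run a collection function and record its own cost in the sample
        t0 = time.perf_counter()
        r0 = resource.getrusage(resource.RUSAGE_SELF)
        sample = collect()  # any function returning a dict of measurements
        r1 = resource.getrusage(resource.RUSAGE_SELF)
        sample["collector_wall_s"] = time.perf_counter() - t0
        sample["collector_cpu_s"] = ((r1.ru_utime + r1.ru_stime)
                                     - (r0.ru_utime + r0.ru_stime))
        return sample

    print(timed_collect(lambda: {"dummy": 1}))
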
In terms of what data to collect, I believe you should collect measurements from all levels of the stack. A computer system is not just one thing (the application), but a stack of things all layered one on top of the other. The main layers of the stack include (one way to gather measurements from each layer is sketched after this list):
  • Hardware - Processor (CPU), Memory, Disk, Network
  • Operating System - Abstracts hardware to standard services and interfaces
  • Database software - Implements a persistent, transaction-oriented data store
  • Middleware - Provides facilities such as application servers, containers, connection pools, etc.
  • Application - Contains the business logic
  • User Interface - Renders the graphical user interface and interacts with the application. May be separate from or integrated with the application itself.
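As an illustration of collecting across layers, the sketch below uses stub collector functions; the names and values are hypothetical, and real ones would read from the OS, query the database (e.g. Oracle's V$ views), and so on. Prefixing each measurement with its layer keeps the combined sample easy to analyse later.

    import time

    def collect_os():
        return {"load1": 0.42}           # stub - would read /proc or similar

    def collect_database():
        return {"user_calls": 1234}      # stub - would query database statistics

    def collect_application():
        return {"orders_processed": 57}  # stub - would ask the application

    COLLECTORS = {"os": collect_os,
                  "database": collect_database,
                  "application": collect_application}

    def collect_all():
        # One sample spanning every layer, names prefixed by layer
        sample = {"ts": int(time.time())}
        for layer, collect in COLLECTORS.items():
            for name, value in collect().items():
                sample[layer + "." + name] = value
        return sample

    print(collect_all())
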
By measuring performance at all levels of the stack you gain a number of benefits:
  • Measuring performance at the Application / User level gives you a meaningful and true business measure of performance, e.g. orders processed
  • Measuring performance at other levels of the stack lets you see if any individual component or resource is overloaded
  • Measuring performance at all levels lets you correlate changes in activity at one level with changes at another. This correlation helps you identify cause and effect across levels, though you may need to investigate further to prove a true cause-and-effect link (see the sketch after this list).
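As a toy illustration of that correlation (the numbers below are made up), a simple Pearson coefficient over two per-minute series shows whether activity at two levels moves together; a value near 1.0 suggests, but does not prove, a causal link.

    def pearson(xs, ys):
        # Pearson correlation coefficient between two equal-length series
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return sxy / (sx * sy)

    # Hypothetical per-minute series: business throughput vs CPU utilisation
    orders_per_min = [120, 135, 150, 180, 210, 200]
    cpu_busy_pct = [35, 40, 46, 58, 70, 66]
    print(pearson(orders_per_min, cpu_busy_pct))  # near 1.0: they move together
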
Then at each layer of the stack, you should collect as comprehensive a set of performance measurements as you can, subject to the earlier caveats about not overloading the system. This is because of the adage that if you do not collect it, you cannot analyse it.

One of the worst situations is to have a performance problem in the future and find that a key measurement, one that would indicate the nature of the problem, is not being collected. And although you might be able to add this extra measurement and collect it from now on, you do not have it in your history to compare against and establish how much it has changed, if at all.

With all this data being collected, it will soon accumulate into a large volume on disk, and needs to be managed. The basic principles are:
  • Keep all collected measurements for a fixed period, to allow post analysis of reported problems
  • Beyond this period, reduce the data volume by a combination of summarising it in various ways and deleting it (a sketch of this ageing-out step follows the list)
  • Freeze and keep a period as a baseline for reference purposes, and comparison to any abnormal behaviour
  • Multiple baselines can be established and kept for different workload profiles
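A sketch of the ageing-out step follows. The one-month raw retention period is an assumed policy, as is summarising to hourly averages with minimum and maximum; a baseline would simply be a set of summaries exempted from deletion.

    import time
    from collections import defaultdict

    RAW_RETENTION_S = 31 * 24 * 3600  # assumed policy: one month of full detail

    def age_out(samples, now=None):
        # samples: list of (timestamp, value) pairs at per-minute resolution.
        # Returns the raw samples to keep plus hourly summaries of older ones.
        now = now if now is not None else time.time()
        kept, buckets = [], defaultdict(list)
        for ts, value in samples:
            if now - ts <= RAW_RETENTION_S:
                kept.append((ts, value))
            else:
                buckets[int(ts // 3600) * 3600].append(value)
        # Each old hour becomes (hour, average, minimum, maximum)
        hourly = [(h, sum(v) / len(v), min(v), max(v))
                  for h, v in sorted(buckets.items())]
        return kept, hourly
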
In summary:
  • Collect as much as you can, at a reasonable frequency.
  • Breadth (many separate measurements) is more important than depth (frequency of collection)
  • Collect at all levels of the stack to allow a holistic analysis and identify overloaded resources
  • Manage the historical measurements, retaining them for a period of time
  • Representative periods can be frozen and kept forever, and others deleted on a rolling basis