Thursday, 6 December 2007

Some Basics

This blog will be about anything to do with the performance of computer systems, especially when running relational database products. In fact the majority of it will be about the Oracle database on multi-processor UNIX systems (such as Solaris and Linux), and how to measure and monitor the performance of such systems and how to tune them. There is a lot of information out there already on these topics, and I hope to add to it with a pragmatic approach suggesting real things that you can do about the performance of your systems and the applications running on them.

Before I get going and dive into deeply technical things I thought I would cover a few basic points. Don't worry, I don't want to repeat a lot of stuff that is already out there in numerous publications. I'll just cover some of the things that I feel are most important. And I do want to get straight what I mean by performance, and how I go about any performance analysis exercise.

Nowadays we use the term "performance" as if it were a real physical thing - a resource that we can have a quantity of. We talk of "enough performance" and "needing more performance". And in the same way as for physical things, we need to be able to measure the performance of a system. As the saying goes - you cannot control what you do not measure - which in turn means that if you can't measure it you can't improve it.

Adequate performance really means that you are able to achieve your goals. And poor performance means that your goals are not being met. So all performance related work starts with goals. Without goals you cannot truly say whether performance is good or bad, or how close you are to achieving what is needed.

Performance analysis needs to be done in an objective fashion, with well-defined goals that can be measured. No goals means no measurements, which means no ability to determine where you are. Everything becomes subjective - "too slow" - and impossible to pin down. Often relative terms are used without a context - "slower". You have to define up front what "fast enough" means. And these goals are best defined in terms of the business requirements the system was deployed to satisfy. The higher the level at which you can define the goals, the closer the result is to something that makes sense to the business and directly benefits it. A low-level performance measure is often not particularly relevant to an overall business process.

Performance analysis and tuning needs to be done top-down, not bottom-up. Top-down tuning focuses on those activities that have the greatest impact on the performance goal, i.e. the business goal of the system. Bottom-up tuning often focuses on low-level metrics, trying to identify those that are out of bounds and fix them, frequently by applying "rules of thumb". However, there is no guarantee that this tuning effort will have any impact on the business goal at all. In my view all performance tuning actions should be linked back to their expected impact on the business goals of the system.

When analysing a system there are two ways you can look at the system's performance and the resource usage on it:
  • From a capacity perspective of how much of each resource is being used, and identify saturated resources
  • From a usage perspective of identifying which processes and tasks are consuming the resources
Ultimately both provide similar information about the resources being used on the system. But I feel it is much more useful to know what is using a resource than simply to know that the resource is heavily used, as measured by its utilization. By identifying the highest resource consumer you can work backwards to the part of the application responsible, and in turn to what it is doing. This leads to a higher-level understanding of what the application is doing, and why, as opposed to simply investigating low-level resource utilization measurements.
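To make the distinction concrete, here is a minimal sketch contrasting the two views over the same set of CPU samples. The process names and percentages are entirely made up for illustration - on a real system they would come from tools like prstat or ps:

```python
# Hypothetical per-process CPU samples (% of one CPU), invented for illustration.
samples = {
    "oracle (shadow)": 38.0,
    "oracle (pmon/smon)": 2.0,
    "app server": 31.0,
    "backup job": 22.0,
}

# Capacity view: one number - how busy is the CPU overall?
utilization = sum(samples.values())
print(f"CPU utilization: {utilization:.0f}%")  # shows the CPU is nearly saturated, but not why

# Usage view: who is consuming it? Rank the consumers.
for name, pct in sorted(samples.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pct:5.1f}%  {name}")
# The usage view points straight at the part of the workload to investigate first.
```

The capacity view tells you there may be a problem; the usage view tells you where to start looking.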

Performance itself can only really be measured in three ways:
  • Latency - the elapsed time to perform one unit of work.
  • Throughput - the amount of work done in a period of time.
  • Utilization - the amount of a fixed capacity resource that was used.
All performance measurements will be one of these types. They are measuring different characteristics of the system:
  • Latency measures the quality of the service - fast or slow?
  • Throughput measures the quantity of the service - how much?
  • Utilization measures the efficiency of the system when delivering the service - what did it cost? how much did we use?
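As a worked illustration of the three measures, here is a sketch that computes all of them from a simple log of request timings. The request times and the measurement window are invented numbers, and the utilization calculation assumes a single server handling one request at a time:

```python
# Hypothetical batch: (start, end) times in seconds of requests handled by one server.
requests = [(0.0, 0.4), (0.5, 0.9), (1.0, 1.8), (2.0, 2.3), (2.5, 3.5)]

measurement_window = 4.0  # seconds of wall-clock time observed

# Latency: elapsed time to perform one unit of work (here, mean time per request).
latencies = [end - start for start, end in requests]
mean_latency = sum(latencies) / len(latencies)

# Throughput: amount of work done in a period of time.
throughput = len(requests) / measurement_window  # requests per second

# Utilization: how much of the server's fixed capacity (its time) was used.
busy_time = sum(latencies)  # assumes one request at a time on one server
utilization = busy_time / measurement_window

print(f"latency:     {mean_latency:.2f} s/request")
print(f"throughput:  {throughput:.2f} requests/s")
print(f"utilization: {utilization:.0%}")
```

Note how the three numbers answer the three different questions: how fast (0.58 s per request), how much (1.25 requests per second), and at what cost (about 72% of the server's time).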
We have had rules of thumb about tuning Oracle systems for many years - everyone knows how to measure the "buffer cache hit ratio" - but they do not solve all performance problems. And as time goes on, they are less and less likely to be the cause of any real performance problem, because systems will have been configured adequately from the beginning following recognised best practice. This is why a top-down structured approach delivers far more benefit in the long term than a quick and dirty low-level analysis with a few rules of thumb applied.
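For reference, the classic ratio itself is trivial to compute. Here is a sketch with invented counter values - on a real system these come from Oracle's V$SYSSTAT statistics ('db block gets', 'consistent gets' and 'physical reads'):

```python
# Hypothetical counter values, as they might be read from V$SYSSTAT.
db_block_gets = 120_000
consistent_gets = 880_000
physical_reads = 50_000

# Logical reads are buffer accesses; the ratio is the fraction satisfied in memory.
logical_reads = db_block_gets + consistent_gets
hit_ratio = 1 - physical_reads / logical_reads
print(f"buffer cache hit ratio: {hit_ratio:.1%}")
```

The ratio here comes out at 95% - a "healthy" number by the old rules of thumb, yet it says nothing at all about whether any business-level goal is being met, which is precisely the point.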

And luckily Oracle has made significant improvements between versions to the measurements it takes within itself, which help with this top-down analysis and with identifying the biggest resource consumers. It too has moved from a bottom-up approach of reporting low-level utilization figures to a top-down approach showing the relative proportion of resources consumed by each SQL statement - and today it can even estimate the net impact on performance of certain changes to the database.

I'll get onto specific tuning issues for Oracle, Solaris and computers in general in future entries.

And what about me? I've been using Oracle since about 1988 (version 5), but really got into it with version 6 and every version since then. Having started in application development (C on UNIX) many years ago, I then moved into technical support and the lower-level details of how systems actually work and achieve what they do behind the scenes. That led on to benchmarking, performance tuning and problem analysis of Oracle based systems, which is what I have continued to work on, one way or another, since about 1990. I keep trying to expand and move into new fields and technologies, but Oracle and performance issues never seem to go away, and I keep coming back to them time and time again. I have worked for a number of computer hardware manufacturers (including Sequent and Sun) in a variety of benchmarking, consultancy and delivery roles, often a mix of architect / design work and hands-on implementation. This has given me exposure to some of the largest Oracle based systems in the world in their time, and to the latest versions of Oracle, tuning systems to squeeze out the last drop of performance. No matter how good the technology gets, people still keep having performance problems. And it looks like they won't go away in the future either, which will hopefully keep me busy.