Friday 7 December 2007

Oracle SQL Optimization and Variables

I was going to get around to writing something up about the Oracle Optimizer in the future, and how it can sometimes choose non-optimal plans for what it thinks are perfectly logical reasons. But suddenly today I find that the optimizer development group inside Oracle have done their own post on this. So I thought I would point you at it as an excellent description of the issues I was going to discuss, and so that you can see that in 11g Oracle have finally got around to addressing a problem caused by previous enhancements to the optimizer.

What I'd add is that I see the nature of the problem of these non-optimal SQL execution plans as being one of the consequences of previous attempts by Oracle to improve performance for the majority of SQL queries. Although most queries have indeed benefited from better and faster execution plans, some have not and have actually got worse between versions of Oracle.

Originally there was no Cost Based Optimizer (CBO) and only the Rule Based Optimizer, which was unaware of even things like table sizes. So Oracle added in the CBO and it used some simple data on table sizes and indexes on columns to make better execution decisions. Over time Oracle added more and more features to the CBO to improve the performance of most SQL statements most of the time, and added extra sets of statistics to make more accurate predictions about query execution behaviour. Unfortunately, while the vast majority of queries have indeed benefited from these changes to the way the CBO works, there have always been examples of queries that have suffered and actually got worse. The example given in the post by Oracle is just one. I know of many others.

The problem is really due to two issues, now solved by 11g, hopefully. The shared execution plan in the library cache, means that the same execution will always be used for the same SQL statement since it was first parsed ('same' meaning identical text of the SQL statement). Initially this was okay. But then Oracle introduced bind variable peeking, to try and optimize the query even further. By looking at the actual value supplied for a bind variable in a query, and using statistics on the value distribution of the queried column, Oracle could better optimize that particular query. But now all subsequent executions of the same query, regardless of the value used in the bind variable, would be forced to use the same, original execution plan. Which may not be optimal for those subsequent query executions.

There is in fact another, more manual, solution to this problem. Do not use bind variables in queries you know are sensitive to the query values used, and for which you want distinct execution plans. By putting the query value itself directly into the SQL statement, Oracle will treat each of them as distinct SQL statements and parse and optimize them individually, each with their own entry and associated execution plan in the library cache. Unless, of course, you have enabled forced cursor sharing, where Oracle dynamically rewrites SQL statements to remove hard coded values and replace them with references to bind variables. Which would simply map them all to the same, single SQL statement and execution plan in the library cache. Something to be aware of.

I've recommended this solution a few times to customers with data sets that have only a few values and very skewed distributions, and queries that do not strongly constrain on other columns. The example in Oracle's post is one such case - no other constraints other than on department, which is highly skewed. These cases are very few and far between, but often they do occur in critical pieces of code e.g. pick the next request to process that is in a 'ready to run' state. The longer such a query takes, the more time that is wasted before the application gets on with processing the request itself. And this can exasperate things when picking the request ends up taking longer than processing the request itself.

It is nice to see that Oracle have got around to addressing this problem of queries that are sensitive to the constraining values used in the query.

Thursday 6 December 2007

Some Basics

This blog will be about anything to do with the performance of computer systems, especially when running relational database products. In fact the majority of it will be about the Oracle database on multi-processor UNIX systems (such as Solaris and Linux), and how to measure and monitor the performance of such systems and how to tune them. There is a lot of information out there already on these topics, and I hope to add to it with a pragmatic approach suggesting real things that you can do about the performance of your systems and the applications running on them.

Before I get going and dive into deeply technical things I thought I would cover a few basic points. Don't worry, I don't want to repeat a lot of stuff that is already out there in numerous publications. I'll just cover some of the things that I feel are most important. And I do want to get straight what I mean by performance, and how I go about any performance analysis exercise.

Nowadays we use the term "performance" as if it were a real physical thing - a resource that we can have a quantity of. We talk of "enough performance" and "needing more performance". And in the same way as for physical things, we need to be able to measure the performance of a system. As the saying goes - you cannot control what you do not measure - which in turn means that if you can't measure it you can't improve it.

Adequate performance really means that you are able to achieve your goals. And poor performance means that your goals are not being met. So all performance related work starts with goals. Without goals you cannot truly say whether performance is good or bad, or how close you are to achieving what is needed.

Performance analysis needs to be done in an objective fashion, with well defined goals, which can be measured. No goals, means no measurements, means no ability to determine where you are. Everything becomes subjective - "too slow" - and impossible to pin down. Often relative terms are used without a context - "slower". You have to define up front what "fast enough" means. And these goals are best defined in terms of the business requirements the system was deployed to satisfy. The higher the level you can define the goals at, the closer the result is to something that makes sense to the business and directly benefits it. A low level performance measure is often not particularily relevant to an overall business process.

Performance analysis and tuning needs to be done top-down, not bottom-up. Top down tuning focuses on those activities that have the greatest impact on the performance goal i.e. the business goal of the system. Bottom-up tuning often focuses on low-level metrics and tries to identify those that are out of bounds and fix them. Often many "rules of thumb" are used. However, there is no guarantee that this tuning effort will have any impact on the business goal at all. In my view all performance tuning actions should be linked back to the expected impact on the business goals of the system.

When analysing a system there are two ways you can look at the system's performance and the resource usage on it:
  • From a capacity perspective of how much of each resource is being used, and identify saturated resources
  • From a usage perspective of identifying which processes and tasks are consuming the resources
Ultimately both provide similar information on the system's resources that are being used. But I feel it is much more useful to know what is using a resource than to just know that the resource is being used a lot as measured by its utilization. By identifying what is the highest resource consumer you can work backwards to the part of the application responsible, and in turn identify what it is doing. This will lead to an understanding at a higher level of what the application is doing, and why, as opposed to simply investigating low level resource utilization measurements.

Performance itself can only really be measured in 3 ways:
  • Latency - the elapsed time to perform one unit of work.
  • Throughput - the amount of work done in a period of time.
  • Utilization - the amount of a fixed capacity resource that was used.
All performance measurements will be one of these types. They are measuring different characteristics of the system:
  • Latency measures the quality of the service - fast or slow?
  • Throughput measures the quantity of the service - how much?
  • Utilization measures the efficiency of the system when delivering the service - what did it cost? how much did we use?
We have had many rules of thumb about tuning Oracle systems for many years - everyone knows how to measure the "buffer cache hit ratio" - but they do not solve all performance problems. And as time goes on, they are less and less likely to be the cause of any real performance problem, as they will have been configured adequately from the beginning due to recognised best practice. This is why a top-down structured approach delivers far more benefit in the long term, than a quick and dirty low level analysis with a few rules of thumb applied.

And luckily Oracle has made significant improvements to the measurements it takes within itself between versions, which help with this top down analysis and identifying the biggest resource consumers. It too has moved from a bottom up approach of reporting low level utilisation figures, to a top down approach showing the relative proportion of resources consumed by which SQL statements and even today being able to estimate the net impact on performance of certain changes to the database.

I'll get onto specific tuning issues for both Oracle, Solaris and computers in general in future entries.

And what about me? I've been using Oracle since about 1988 (version 5), but really got into it with version 6 and every version since then. Since starting in application development (C on UNIX) many years ago, I then moved into technical support and more lower level details of how systems actually work and achieve what they do behind the scenes. And this led onto benchmarking and performance tuning and problem analysis of Oracle based systems, which is what I have continued to work on since about 1990, one way or another. I keep trying to expand and move into new fields and technologies, but Oracle and performance issues never seem to go away, and I keep coming back to them time and time again. I have worked for a number of computer hardware manufacturers (including Sequent and Sun) and in a variety of benchmarking, consultancy and delivery roles, often a mix of architect / design work and hands on implementation. This has given me exposure to some of the largest Oracle based systems in the world in their time, and to the latest versions of Oracle and tuning the system to squeeze out the last drop of performance. No matter how good the technology gets, people still keep getting performance problems. And it looks like they won't go away in the future too, which will hopefully keep me busy.

Wednesday 26 September 2007

New Blog

This is a new blog for posts about relational database performance on UNIX multi-processor systems - mainly Oracle, but possibly others too. I've done a lot with Oracle on large systems over the past 15 years (mainly Sequent and Sun servers), and will post various things I continue to learn as a database developer.