Tuesday, 12 December 2017

Reading AWR Reports #1 - Start With The Problem

I hope that this will be the start of a short series of posts on how to read an AWR Report. I doubt that it will contain anything revolutionary, but it is clear that some people do not know where to start when presented with an AWR Report, so I aim to cover some essentials about this. This first post will cover what to consider before attempting to read and understand an AWR report, because ultimately AWR is just another tool and not the answer itself. I will get into the first things I look at in an AWR report in the next post.

As I've already indicated, AWR (Automatic Workload Repository) is really another tool in your toolkit for examining what happened on an Oracle database. And it is an extra cost option for the Oracle database Enterprise Edition, requiring that you purchase the Diagnostics Pack licence. Before using AWR you need to make sure that AWR is the right tool to use for the problem you have, rather than some other tool available to you. Which means that you should always start with the problem you are trying to fix, and not dive straight into using some tool or other and hope that it shows you the fault in the first screen's worth of output. You must be clear on what the problem is that you are trying to fix, the impact of that problem on users or the application or something else, and how you would measure the improvement to show that the problem has been resolved. If you don't know that information about the "problem", then you can never be sure that you have actually fixed the real cause of it - it may simply have disappeared for other reasons.

I'd strongly recommend following a top down approach to a performance problem investigation - start with the real world problem that users are experiencing, and work down and inwards from that towards the database. Do not start at the bottom directly on the database, and hope to find the problem quickly there. This is a "finding a needle in a haystack" approach. You might get lucky and find something related to the problem straight away, or you might not and end up wasting your time chasing dead ends. The "performance problem" may actually be somewhere else outside of the database, such as the client application itself, the network, or any application servers. Check and eliminate those first. Only after checking those should you move onto the database itself.

An AWR Report tells you what happened on your database between the two snapshots used. It does not tell you whether any problems did or did not occur, or what those problems might be. And it does not tell you whether your database was well behaved or not. It just provides a complete overview of everything that happened on that database between those two snapshots, and it is up to you do decide whether this meets expectations or not.

AWR reports are database wide i.e. across all the activity that happened on that database between the two snapshots. This includes activity from all sessions on all schemas in the database. So it is only really of use if the problem you are investigating is a system wide one, affecting most of the sessions and users of the database. And even then it is still not guaranteed to show you what the cause of your problem is, as it depends on what your problem is.

Furthermore, an AWR Report only tells you the grand totals of what happened over that period of time, and the averages of those (some total value divided by the elapsed time or some other measurement). It does not tell you the peak values, as it only knows the absolute change in measurements between the two snapshots being reported on. And average values over a period of time are always lower than the peak values during the most active period, because that is what an "average" means. How much lower the average values are than the peak values depends on a number of factors, such as the different time periods involved and the relative difference in activity between the peak period and the other periods.

A pitfall to avoid is just looking at the "top 10" type lists of SQL statements executed and wait events. In every database there is always a slowest SQL statement or top wait event, regardless of how active the database is. Just because that was the "slowest SQL statement" does not mean that it is a bad statement, or that it is related in any way to the problem you are investigating. You need to prove a link between the two, and not assume that they are related in any way. Trying to "improve" a "slow SQL statement" that is unrelated to the real problem users are experiencing, is just a waste of time. Check that any potential cause is really related to your problem, and not independent of it.

Rather than wade through a series of full single AWR Reports for each pair of snapshots you can instead Query the AWR data directly (again, providing you have paid for the Diagnostics Pack licence), and pull out just the key performance measurements you are interested in across a number of snapshots all at once. These might include Average Active Sessions, number of SQL statements executed, and percentage of time waiting. This can help performance investigations progress faster if you can more quickly identify the peak activity period across all the AWR snapshots available, and then drill down into only the most active period of time.

Remember that AWR provides database wide information over relatively long periods of time (between AWR snapshots), and that there are other tools out there for doing different types of performance analysis of Oracle databases:
  • Real time query and analysis of Oracle Dynamic Performance Views (V$ views), such as V$SESSION
  • Active Session History (ASH), which also requires the Diagnostics Pack license
  • SQL Trace to record all SQL statements executed by a session, and their elapsed times
Each of these have their advantages and disadvantages, and if you want to be an Oracle expert then you should be aware of these and their tradeoffs. You need to make sure that you are using the right tool for the kind of performance problem you are investigating.

In conclusion then, I'd say that my two main points in this post are:
  • Make sure you are clear on what the performance problem is that you are investigating
  • Check that AWR Reports are the right tool to be using to investigate this performance problem, rather than something else