Wednesday, 14 August 2013

Summary Reporting on AWR Data 1

The addition of the Automatic Workload Repository (AWR) in Oracle 10g was a good thing, as its regular hourly snapshots collect and store a number of database-wide activity statistics, which allows historical reporting later on. However, the Oracle supplied standard AWR report (awrrpt) only provides detailed information on the activity that happened in one period of time between two snapshots. There is no summary report of activity across all of the snapshots in the AWR, which means that to use the standard AWR report you first have to know which period of time is the important one to look at. So how do you identify the "most active" time period?

The lack of any standard report or analysis that works on the whole of this data set and summarises it seems to be something of a missed opportunity to me, and such a summary is vital when analysing historical activity and trying to identify hot spots. So I decided to write my own report summarising activity between each pair of consecutive snapshots over a number of snapshots, i.e. over a number of days.

For each snapshot the report lists the change in value of the main activity statistics since the previous snapshot, all on one output line per snapshot, ordered by snapshot time. I can then use this output to determine the most active period (e.g. high SQL executions, high DB time) or the highest wait events (wait time) or something else. Generally I do this by loading the output into Excel (fixed width format data fields) and either glancing down some critical columns of data, or graphing them and looking for spikes. And within Excel I can also enter formulae and derive other statistics e.g. Average Active Sessions from DB Time and Elapsed Time.

Your mileage will vary with my report - you may not want some of the statistics I report, you may want some other ones, or you may want them formatted differently. But you can always edit the SQL and tweak it to your satisfaction. The main benefit is that you get all of the activity history captured in the AWR summarised into key measurements per snapshot. And this can be used to easily identify the busiest period, and also examine bottlenecks that give rise to high wait events.

I won't give all the raw SQL, as it is very long with multiple joins between many AWR tables. What I will do is build up the components of the SQL in stages, so you can build your own customised AWR query. I did create this particular SQL query myself, rather than simply copying someone else's, but I have seen the same concept used by others - for instance Karl Arao - and it is interesting to see both how other people have pulled out this data (the SQL they use), and which statistics they deem most important and why.

Core Query - List of Snapshots

The core query is one that provides the list of snapshots, and the elapsed time of each one. I use the "WITH" sub-query clause of the "SELECT" query to keep each sub-query separate and make the final query simpler and easier to read.
with
snaps as 
(select csnaps.snap_id
      , csnaps.dbid
      , min (csnaps.end_interval_time) end_snap_time
      , min (csnaps.end_interval_time) - min (csnaps.begin_interval_time) snap_interval
   from dba_hist_snapshot csnaps
  group by csnaps.snap_id, csnaps.dbid
) -- snaps
This provides a list of snapshot identifiers; the snapshot identifier is a foreign key that appears in all of the other AWR tables and identifies which snapshot each row of activity data belongs to. It also provides some other key data items, which can be important when summarising the data:
  • Database Identifier (dbid). This should be the same for all snapshots, but may not be under some circumstances. It is important to use in joins to other tables to ensure you are only getting activity data for the same database.
  • When the snapshot period finished (end_interval_time)
  • Snapshot duration (end time minus start time)
On a RAC system you will get separate snapshots of data for each instance, which means that there can be more than one end and start time per snapshot (one per instance). To get just one value for the start and end times I use the "min" aggregate function.
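Just to check what this core query produces, you can run the snaps sub-query on its own before building anything else on top of it. This is only a quick sketch - the sub-query body is exactly the one above, with a simple SELECT and ORDER BY wrapped around it:
with
snaps as 
(select csnaps.snap_id
      , csnaps.dbid
      , min (csnaps.end_interval_time) end_snap_time
      , min (csnaps.end_interval_time) - min (csnaps.begin_interval_time) snap_interval
   from dba_hist_snapshot csnaps
  group by csnaps.snap_id, csnaps.dbid
) -- snaps
select snaps.snap_id
     , snaps.end_snap_time
     , snaps.snap_interval
  from snaps
 order by snaps.end_snap_time
/
This simply lists every snapshot in the AWR with its end time and duration, which is a useful sanity check that the snapshots cover the days you are interested in.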

Database Time Model

This sub-query provides the set of Database Time Model statistics per snapshot, e.g. DB Time, sql execute elapsed time.
, systimes as 
-- One row per Database Time Model with change in value between snapshots
(select systime.snap_id
      , systime.dbid
      , systime.stat_name
      , sum (systime.value - psystime.value) value
   from dba_hist_sys_time_model systime, dba_hist_sys_time_model psystime
  where systime.snap_id = psystime.snap_id + 1
    and systime.dbid = psystime.dbid
    and systime.instance_number = psystime.instance_number
    and systime.stat_id = psystime.stat_id
-- Assume that if the stat_id is the same then so is the stat_name
  group by systime.snap_id, systime.dbid, systime.stat_name
) -- systimes
As before I group by the snapshot identifier and database identifier, and also by the statistic name. As the statistic values are cumulative, I am interested in the change in value between two consecutive snapshots. So I join the Sys Time Model table to itself on the snapshot identifier being one more than the previous snapshot, and subtract the two data values (the value column in this particular case) to get the change in value between the two snapshots. Because there can be more than one data value per snapshot, which there will be on a RAC database (one per instance), I also join on the instance number and add the values up with the sum aggregate function.
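In the same way, and again only as a sketch, the systimes sub-query can be run on its own, restricted to a single named statistic, to see the per-snapshot deltas directly (remember the values are still in microseconds at this point):
with
systimes as 
(select systime.snap_id
      , systime.dbid
      , systime.stat_name
      , sum (systime.value - psystime.value) value
   from dba_hist_sys_time_model systime, dba_hist_sys_time_model psystime
  where systime.snap_id = psystime.snap_id + 1
    and systime.dbid = psystime.dbid
    and systime.instance_number = psystime.instance_number
    and systime.stat_id = psystime.stat_id
  group by systime.snap_id, systime.dbid, systime.stat_name
) -- systimes
select systimes.snap_id
     , systimes.value / 1000000 db_time_secs
  from systimes
 where systimes.stat_name = 'DB time'
 order by systimes.snap_id
/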

AWR Summary #1

Combining the two sub-queries together with suitable join conditions gives the following SQL to summarise database activity times between all snapshots:
with
snaps as 
...
, systimes as 
...
select to_char (snaps.end_snap_time, 'DD/MM/YY HH24:MI') snap_time
     , extract (second from snaps.snap_interval) 
       + (extract (minute from snaps.snap_interval) 
          + (extract (hour from snaps.snap_interval)
             + (extract (day from snaps.snap_interval) * 24)
             ) * 60
          ) * 60 snap_duration
     , dbtime.value / 1000000 db_time
     , sqlexectime.value / 1000000 sql_exec_time
  from snaps
     join (select * from systimes where stat_name = 'DB time') dbtime
       on snaps.snap_id = dbtime.snap_id and snaps.dbid = dbtime.dbid
     join (select * from systimes where stat_name = 'sql execute elapsed time') sqlexectime
       on snaps.snap_id = sqlexectime.snap_id and snaps.dbid = sqlexectime.dbid
 order by snaps.end_snap_time
/
Notes:
  • The "snap_interval" is an INTERVAL data type, so to convert to just seconds I need to extract each time field component, multiply it up and add on to the other time components.
  • The times in the Sys Time Model are in microseconds, so I divide by one million to get the value in seconds.
  • The Sys Time Model data is stored as name / value pairs, so I need to query the same data set twice restricting to one named statistic each time. I do this in an inline view within the query.
  • The joins between all data sets are on the Snapshot Identifier and the Database Identifier.
  • From these data items you can derive other useful numbers, e.g. Average Active Sessions is DB Time divided by Snapshot Duration (see the sketch just below).
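As a sketch of that last point, Average Active Sessions could be calculated directly in the query itself by adding an extra column to the select list of the query above - it is just the DB Time expression divided by the same snapshot duration expression used for the snap_duration column:
     , (dbtime.value / 1000000)
       / (extract (second from snaps.snap_interval) 
          + (extract (minute from snaps.snap_interval) 
             + (extract (hour from snaps.snap_interval)
                + (extract (day from snaps.snap_interval) * 24)
                ) * 60
             ) * 60) avg_active_sessions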
The full query reports everything in the AWR, but you can restrict it to certain days by adding in a suitable WHERE clause. Here is one I use in SQL*Plus that refers to two numbers passed in as arguments when you invoke the script.
 where snaps.end_snap_time between 
       (trunc (sysdate) - &1) and (trunc (sysdate) - &1 + &2)
You can also format the columns nicely in SQL*Plus by pre-defining column formats:
col snap_duration format 999,990.9
col db_time       format 999,990.99
col sql_exec_time format 999,990.99
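Putting it together, assuming the finished query has been saved into a script file (I'll call it awr_summary.sql here, though any name will do), a typical SQL*Plus invocation would be something like the following, which reports on 2 days of snapshots starting 7 days ago (&1 = 7, &2 = 2):
@awr_summary.sql 7 2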
In a future post I'll show some other sub-queries against the AWR tables that can be added in to pull out SQL statement execution statistics, wait events, and other statistics.
