Monday, 25 June 2018

Interactive Complex SQL Scripts - A Solution

Often you will find that you need to run a relatively simple SQL query and see the results from it. Normally your choice of how to run this SQL query in Oracle is a straight choice between either using a plain SQL script executed through SQL*Plus, or using PL/SQL (probably in an anonymous block). But often things are not quite that simple or straightforward for you, and maybe there is an extra little bit of complication in how that SQL needs to be executed. Maybe you need to execute it repeatedly with a series of different values to use in a filter. Maybe you need to ask the user to enter a value, or choose a filter to use. Maybe you want to loop through the results of one query and run a second query for each of these. Maybe you need some simple conditional logic on the results and use that to choose which second query to run. Maybe you need to provide progress feedback to the user as a series of queries are executed, or as the results are retrieved of each query.

Unfortunately both SQL*Plus and PL/SQL each have drawbacks that stop them from being the "right" solution for all possible SQL query scenarios. I believe that I have identified a better solution for such cases where the query is straightforward enough and the conditional logic is also straightforward - use Python to execute the necessary SQL queries instead. It will do what you want, it can do much more than either SQL*Plus or PL/SQL can do themselves, and lets you be consistent in how you run all such non-simple SQL queries. I'll explain more of the details why I believe it to be a better fit for running ad-hoc complex SQL queries in a moment.

SQL*Plus or PL/SQL each have their own advantages for running different types of SQL query, but each also has a set of clear disadvantages that make them unsuitable for some types of SQL query.

SQL*Plus has its own set of features that make it quite flexible for interactively running SQL queries and seeing the results. These include things like "substitution variables" and prompting the user to enter values, and controlling formatting of the results output. It can execute any valid SQL statement, including DDL to do things like create new database objects. I find SQL*Plus very capable for writing an interactive script that can show the user some data, prompt them to enter a value, run other queries using that value as a filter, and display each result of each query as it gets them.

However, SQL*Plus has no direct "if then else" conditional logic support. People may find one way or another to achieve this, but it is not intrinsically a native feature of SQL*Plus, and such workarounds can look like kludges and be difficult to read and understand. It also has very limited error detection and handling abilities - basically it can stop if anything goes wrong, but that's about it. So there are clear limits to the abilities of SQL*Plus for executing SQL queries with any kind of conditional logic between them.

PL/SQL is Oracle's procedural language extensions to wrap such conditional logic and more around the execution of SQL queries, with good error detection and handling capabilities. It is definitely the language to use for things like stored procedures and triggers stored in the database itself. However as a general purpose language for executing SQL queries it has a number of severe limitations. The main one you keep running into is that it has little to no real input or output abilities for user interaction. It has no user input capabilities because it is designed to run within the database instance itself on the database server system, for which the concept of a real user at a screen with a keyboard does not make sense. So no real fault on PL/SQL's part for that. It has a very limited user output capability (DBMS_OUTPUT.PUT_LINE), because none of the output is visible to the user while the PL/SQL block is running. Instead all the output appears all together at once at the end when the PL/SQL has completed. Which is of course too late by then for progress and status reports.

PL/SQL also only supports DML SQL statements as native SQL statements in its code, partially because all native inline SQL statements are validated for correctness when the PL/SQL block is initially compiled i.e. when it is all parsed when submitted for creation or execution, and DDL is generally impossible to properly validate for various reasons. The workaround is to put such DDL SQL statements into local string variables in PL/SQL and then perform an "EXECUTE IMMEDIATE" on that string. While this does work, there are two major disadvantages:
  1. You lose the validation of the SQL at compile time when the PL/SQL is submitted, and instead can get a run time error when that particular SQL DDL is executed
  2. Your SQL DDL statements now become quoted strings rather than plain, inline SQL statements, making reading them more difficult (for two related reasons)
    • Recognising such SQL statements when embedded within string variables when reading the PL/SQL code
    • Complications when the SQL is very long and gets split over many separate lines, and has "variable" parts within it dependent on the values in other variables
A solution that I think deals with all the negatives of both SQL*Plus and PL/SQL while also supporting execution of any SQL statement is to use Python. I believe that it does not suffer from any of the problems that SQL*Plus or PL/SQL suffer from individually (i.e. it successfully deals with those particular issues and they are no longer problems in Python), while being a modern and relatively cleanly designed programming language. Thus you can use one programming language all the time for such one off complex SQL query scenarios, providing more consistency for you when doing these.

About the only negative I can think of for using Python is that SQL statements have to be put into string variables in the Python code and then passed to an "execute" method (a library function) to run it against the connected database. Which means that you do not get compile time validation of the SQL itself that PL/SQL might give you, and instead will only get run time errors when it tries to execute each SQL query in turn. But PL/SQL suffers from the same problem for DDL SQL and certain other variations of DML SQL (such as a variable list of column names in the SELECT), so you are no worse off really by using Python.

But in all other areas I believe it is superior to SQL*Plus and PL/SQL, and so worth the trade-off for what you gain (which is quite a lot) versus what you lose (SQL validation at compile time). Furthermore, if you use Python for all such SQL scripts you need, then you get a more consistent library of your own scripts to use. Often I start writing something in straight SQL + SQL*Plus and run into a dead end because of something like conditional logic. So then I have to rewrite it all as an anonymous PL/SQL block and add in a bunch of PL/SQL variable declarations. Using Python all the time means you are only using one language all the time, not two.

Note that I am not saying to use Python for all SQL executed against your Oracle database. Clearly things like business rules and logic should be in stored procedures in the database written in PL/SQL. And very simple SQL queries can be executed directly using SQL*Plus. What I am talking about is one-off sets of non-simple SQL queries you want to run, or a report you want to be very interactive with user prompts for data values and feedback as the script progresses. In these cases I believe it is better to use Python (or maybe another programming language with SQL database execution support) than a mix of SQL*Plus and PL/SQL in different scripts.

Having proposed Python as the solution to the failings in both SQL*Plus and PL/SQL let me also add one more reason to consider using it. It is relatively easy to come up with a small set of templates for query execution that satisfy most possible scenarios. Then it is a matter of copying the appropriate template you have, editing that to add in the specific SQL queries you want executed, and then running that Python program. And a lot of this ability and power of Python is because it is a dynamic language - you do not need to declare variables in advance of using them in your code. This not only reduces the total number of lines of code needed for anything, but it also simplifies many of the issues of having to formally declare a named data type for a variable. Python will normally work out the data type needed from the source data itself that is being stored into the variable, so the same identical Python code can often be reused to process completely different source data types, which is a very powerful feature. Changing a data type in PL/SQL can involve a lot of changes to your source code. And this feature of Python becomes more powerful the more you become familiar with it and learn how to leverage it into truly generic code.

Python Examples

Here I will provide some Python code snippets that form a template and can be put together to run a SQL query for you. Note that I'm using Python 3 syntax which has a few key differences to Python 2 syntax e.g. "print" is a true function in Python 3. Also note that in Python the indentation level of statements is key, and tells Python when nested blocks of statements start and end. Indentation is not just about a pretty layout - it also tells Python how to interpret the meaning of your source code. And remember that variables do not have to be defined before you use them - Python determines the actual data type of a variable at run time when a value is stored in it.

First you need to connect to the database in question, and it may be easier to do this by prompting the user to enter their credentials rather than putting them on the command line when invoking Python.
#!/usr/bin/env python
import sys
import time
import cx_Oracle

print ("Enter database", end=": ", flush=True)
database = sys.stdin.readline ().rstrip ()
print ("Enter username", end=": ", flush=True)
username = sys.stdin.readline ().rstrip ()
print ("Enter password", end=": ", flush=True)
password = sys.stdin.readline ().rstrip ()
Note how the "sys.stdin.readline()" method is used to read from standard input what the user has typed. And how the string object method "rstrip()" is immediately used to strip off the newline character that is included in what "readline()" returns but which we don't need.
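One drawback of reading the password with "readline()" is that it is echoed on screen as it is typed. As an alternative sketch, the standard library "input" and "getpass.getpass" functions can do the prompting instead. The function name "prompt_credentials" and its two parameters are hypothetical names of my own, added only so that the prompting can be swapped out when testing.

```python
import getpass


def prompt_credentials(read_line=input, read_secret=getpass.getpass):
    """Prompt for database, username and password and return them as a tuple.

    read_line and read_secret are illustrative hooks (not part of any driver API)
    that default to the standard library prompt functions.
    """
    database = read_line("Enter database: ").strip()
    username = read_line("Enter username: ").strip()
    # getpass does not echo what the user types
    password = read_secret("Enter password: ")
    return database, username, password
```

Calling "database, username, password = prompt_credentials ()" would then replace the three print / readline pairs above.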

Personally I like to run a trivial query against the database immediately to show that the connection was successful and the database is up. In this example it queries both V$DATABASE for the database name, and then V$VERSION for the database software version.
print ()
print ("Connecting ...")

conn = cx_Oracle.connect (username, password, database)
print ("Connected successfully")

curs = conn.cursor ()

# V$DATABASE has only 1 row in it
curs.execute ("select name from v$database")
rows = curs.fetchall ()
print ("Connected to database {0}".format (rows [0][0]))

# Just the one row from V$VERSION
curs.execute ("select banner from v$version where banner like 'Oracle Database%'")
rows = curs.fetchall ()
print (rows[0][0])

print ()
Note how Python can retrieve all of the results at once into a nested structure - a list of rows, where each row is a tuple of column values that you index by position.

Also note that I am using the "cx_Oracle" module as the database driver for connecting to Oracle. Python defines a common database API that can be implemented by different driver modules for connecting to different databases. The "cx_Oracle" module is a third party module commonly used for connecting to Oracle databases. You can download a copy of "cx_Oracle" or another driver module and add it to your Python installation.
# Now do any real work we want to do, using the database connection just obtained
sql_query = """
select banner
  from v$version
"""
print ("Executing SQL Query")
curs.execute (sql_query)
rows = curs.fetchall ()
for i in range (len (rows)):
    print (rows[i][0])
The query here is just another trivial query against V$VERSION, but it could be anything you want it to be. It is very simple to iterate over all of the rows retrieved by a query in a "for" loop (len tells you the number of members in a list) and then print out the data retrieved. This is one way to iterate over the results by using "fetchall ()" to fetch all of the rows back explicitly, but there are a few other variations on this you can use depending on what you want to achieve and your preferences e.g. you can also iterate directly over the cursor itself which gives you each row of data values in turn.
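As a minimal sketch of iterating directly over the cursor, the following uses the standard library "sqlite3" module purely as a stand-in so that it runs without an Oracle database. That substitution, and the "versions" table with its two rows, are assumptions for illustration only. Both modules follow the common Python database API, so the loop itself should look the same with a "cx_Oracle" cursor.

```python
import sqlite3  # stand-in driver so the sketch runs without an Oracle database

conn = sqlite3.connect(":memory:")
curs = conn.cursor()

# Hypothetical table and rows, only here to give the query something to return
curs.execute("create table versions (banner text)")
curs.executemany("insert into versions values (?)",
                 [("Banner A",), ("Banner B",)])

curs.execute("select banner from versions order by banner")
banners = []
for row in curs:          # the cursor yields one row at a time
    banners.append(row[0])
    print(row[0])

conn.close()
```

Iterating over the cursor avoids holding the whole result set in memory at once, which may matter for large queries.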

Note how Python supports two different variations on defining string literals. Short string literals can be put inline between pairs of matching quotes - both single and double quotation characters are supported - and there are multiple cases of this here. Long string literals can be put between matching pairs of triple quotation characters and split over multiple lines, which is how I have done the SQL query here. The new line characters present in your source code are retained within the string value i.e. they are not stripped out when Python parses such a string literal value.

To execute a query in a loop with different filter values you should use bind variables rather than string concatenation (which is prone to SQL injection issues). In "cx_Oracle" we can use a leading colon in front of the bind variable name in the SQL query.
sql_query = """
select count (*) cnt
  from sales_history
 where period_id = :period_id
 group by product_id
having count (*) > 1
"""
for pid in range (1, 100) :
    # Get list of duplicate counts, then count those separately and output
    curs = conn.cursor ()
    bind_vars = {'period_id' : pid} # A dictionary of bind variable name to value mapping
    curs.execute (sql_query, bind_vars)
    rows = curs.fetchall ()
    dup_count = len (rows)
    print ("[{0}] Period {1:d} has {2:d} duplicates".format (time.strftime ("%H:%M:%S"), pid, dup_count))
Note the use of the "range" function, which in Python 3 produces the sequence of values from the first value up to one less than the last value (so 1 to 99 here).

Let's say you don't know the list of "period id" values to iterate over (1 to 99 in the previous example), and instead want to get those from another query. You can do that too:
# Calendar contains per day entries, but month ids are not contiguous
# So Select the distinct month calendar id ordered descending and iterate over those

monthcurs = conn.cursor ()
monthcurs.execute ("select distinct month_calendar_id from calendar order by 1 desc")

for monthrow in monthcurs :
    cmid = monthrow [0] # First and only column in each output row
    # Get list of duplicate counts, then count those separately and output
    curs = conn.cursor ()
    bind_vars = {'period_id' : cmid}
    curs.execute (sql_query, bind_vars)
    rows = curs.fetchall ()
    dup_count = len (rows)
    print ("[{0}] Month ID {1} has {2:d} duplicates".format (time.strftime ("%H:%M:%S"), cmid, dup_count))
The above shows how Python lets you directly iterate over the rows in the first result set (for monthrow in monthcurs:) and using a value from each row execute the second query to count duplicates for that one period.
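The same pattern extends to the conditional logic and error handling mentioned at the start of this post. The following is only a sketch under stated assumptions: it uses the standard library "sqlite3" module as a stand-in so it can run without Oracle, the "sales_history" rows are made up, and "describe_period" is a hypothetical helper name. With "cx_Oracle" the exception to catch would be "cx_Oracle.DatabaseError".

```python
import sqlite3  # stand-in for cx_Oracle so the sketch is runnable anywhere


def describe_period(curs, period_id):
    """Run a first query, then decide from its result whether to run a second."""
    bind_vars = {"period_id": period_id}
    curs.execute("select count(*) from sales_history where period_id = :period_id",
                 bind_vars)
    row_count = curs.fetchone()[0]
    if row_count == 0:
        # Nothing loaded for this period, so skip the more expensive query
        return "Period {0} has no rows".format(period_id)
    curs.execute("""
        select count(*)
          from (select product_id
                  from sales_history
                 where period_id = :period_id
                 group by product_id
                having count(*) > 1)
    """, bind_vars)
    dup_count = curs.fetchone()[0]
    return "Period {0} has {1} rows and {2} duplicated products".format(
        period_id, row_count, dup_count)


conn = sqlite3.connect(":memory:")
curs = conn.cursor()
# Made-up data: product 10 appears twice in period 1, period 2 is empty
curs.execute("create table sales_history (period_id integer, product_id integer)")
curs.executemany("insert into sales_history values (?, ?)",
                 [(1, 10), (1, 10), (1, 11)])
try:
    for pid in (1, 2):
        print(describe_period(curs, pid))
except sqlite3.Error as exc:
    print("Query failed:", exc)
finally:
    conn.close()
```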

Monday, 21 May 2018

Advanced Compression no good for Data Warehousing

This post is about experiencing "table fetch continued row" in compressed data warehouse tables that experience some level of updates to existing (old) data, and how using the Advanced Compression option of Oracle Enterprise Edition not only does not help in avoiding this but can actually introduce severe performance penalties into your daily load jobs.


Basic Compression is a free part of Oracle Enterprise Edition, allowing you to compress the data within a table, and is useful for large databases such as a Data Warehouse. Not only will compression reduce the total disk storage used by that data in the table, it can also improve query performance in various ways - full table scans will read fewer disk blocks, and potentially more rows can be cached in the buffer cache in the SGA within the same number of blocks. One downside to Basic Compression is that it only compresses data loaded through direct-path operations or existing data in a table or partition (done by "moving" that table or partition within the same tablespace). Rows added by conventional inserts and updated rows are not compressed when added to a data block, and are stored in their normal uncompressed form. This can mean that over time a compressed table or partition can become less compressed due to updates and newly inserted rows.

Oracle's official solution to this is the extra Advanced Compression Option with the OLTP Compression feature. OLTP Compression will compress the data of newly inserted rows into a table where it has been configured. It basically does this by inserting the rows in the normal uncompressed format (no performance penalty for these inserts) until the block becomes full, whereupon it then compresses that one block (performance penalty for that particular insert), which will also increase the free, unused space in that block for yet more rows to be inserted. The assumption is that overall for an average / normal OLTP application the frequency of block compressions will be low, and the impact on most transactions will be negligible (assuming they SELECT and INSERT on many other tables within the same business transaction).

A "perfect ideal" Data Warehouse would only ever load new data into the current partition of a partitioned table. Existing data would never be updated, and new data would never be inserted into a previous, old partition. In such a scenario you have two potentially straightforward strategies for compressing your data using only Basic Compression:
  • Wait until the end of a period when the database will switch to the next partition, and then compress the previous partition
  • As above plus re-compressing the current partition at a set frequency during the period in which it is the current partition
The first strategy means that all old partitions have the data in them compressed, and so take up minimal disk storage. However, the data in the current partition is not compressed at all. The old partitions only ever need to be compressed once, because this is an ideal scenario and that old data is never updated or extra data inserted into it.

The second strategy is the same as the first for old partitions, but also compresses the current partition at scheduled intervals to gain some of the benefits from compression on the most recently loaded data. How frequently you compress the current partition depends on various factors, such as the time to do this compression and the time available each night for the daily data load into the data warehouse. As the partition grows in size with more data being added to it, so the compression will take longer each time.

The Compressed Row Update Problem

Both strategies are straightforward and only use the features of Basic Compression. However, real world data warehouses are not "perfect" or "ideal" - old data can arrive after a period has ended and must be loaded into old, existing partitions, and corrections can come through for old data needing updates to the existing data. If your data volumes are very high, both within each partition and for the number of rows of old or corrected data, then your compressed partitions can become less compressed as a result and can grow in size.

As well as an increase in the disk storage used, there is also a potential performance penalty to pay for such data changes in old partitions due to the way the update physically happens within the data block. When the compressed row is updated it will not fit back into its original slot in the data block, because it will have grown in size one way or another. This means that Oracle will try and move that row to the end of the data block into the free, unused space area as reserved by the "PCTFREE" setting for that table. However, Basic Compression sets "PCTFREE" to zero by default, so there is nowhere for such updated rows to go within its current data block. What Oracle will do is "insert" that row into a new data block (as used for normal inserts), and leave behind a forward pointer in the block where the row used to be. This is termed "row migration" - the row's physical storage has migrated to another block within the table.

This row migration has a negative impact on index based reads, because the index structure has the ROWID of each data row in its leaf blocks, and the ROWID is really a map to the physical location of the data row - it specifies things such as the data file, block and offset within the block within the encoding in the ROWID. When a query using such an index goes to read a migrated data row it first reads the original data block as indicated by the ROWID and rather than the expected data row instead finds a forward pointer (another ROWID) to where the row has now been stored. So the query must issue a second read against the second data block in order to get the data row. Thus migrated updated rows cause extra disk reads when accessed through an index. Such "second, additional disk reads" are visible in the Oracle statistic of "table fetch continued row" i.e. you can see how many table row fetches cause an additional fetch due to migrated data rows.
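As a rough way of watching this statistic, the sketch below reads two instance-wide counters from V$SYSSTAT and compares them. The statistic names are the standard Oracle ones, but the function names and the idea of using this ratio as an indicator are my own assumptions, and "fetch_table_fetch_stats" needs an open "cx_Oracle" cursor to do anything useful.

```python
STATS_SQL = """
select name, value
  from v$sysstat
 where name in ('table fetch by rowid', 'table fetch continued row')
"""


def fetch_table_fetch_stats(curs):
    """Return {statistic name: value} using an already open database cursor."""
    curs.execute(STATS_SQL)
    return dict(curs.fetchall())


def continued_row_ratio(stats):
    """Fraction of rowid fetches that needed an extra 'continued row' fetch."""
    by_rowid = stats.get("table fetch by rowid", 0)
    continued = stats.get("table fetch continued row", 0)
    if by_rowid == 0:
        return 0.0
    return continued / by_rowid


# Made-up values standing in for what V$SYSSTAT might return
example = {"table fetch by rowid": 1000, "table fetch continued row": 50}
print("{0:.1%} of rowid fetches followed a forward pointer".format(
    continued_row_ratio(example)))
```

These counters are cumulative since instance startup, so comparing two readings taken some time apart is likely to be more informative than a single value.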

Advanced Compression

In such scenarios you might assume that using the OLTP Compression feature of Advanced Compression would solve this. You would hope that it would keep the data in the old partitions compressed for you without any explicit action needed from you, and avoid or minimize the number of migrated rows. Unfortunately this is not true, and in fact using OLTP Compression will probably result in both a severe performance penalty during the daily data load and have no positive impact on updated rows being migrated. Which ultimately means that there is little to no point in using OLTP Compression on most data warehouses. Of course, there may be some scenarios where it might be of some benefit, but for the vast majority I believe it will only have a negative impact overall.

The first performance penalty is that it directly slows down the INSERTs being executed by the daily data load jobs, as the blocks are compressed as they fill up. This is to be expected of course, and you would hope that the performance penalty was relatively small and would ensure that all blocks in a table were compressed.

Unfortunately there is a second and more substantial performance penalty arising from this, because each block will be compressed multiple times during the data load. And this repeated compression of each block multiplies up the total CPU time cost used during the daily data load to the level where it can more than double the elapsed time of the data load jobs. This is a consequence of the way OLTP Compression works. Compression is triggered each time a block "becomes full", which will happen multiple times during the data load process. After a block is compressed for the first time, it will now have free unused space in it, into which new, subsequent rows can be inserted. So the same block will repeatedly fill up, compress, fill up again, compress, fill up again, compress, until eventually it is full of only compressed rows and another row cannot fit in the unused, free space.

And the third performance penalty is that the frequency of such block compressions is much higher with longer, larger data rows, such as typically found in a data warehouse. Longer rows will fill up a block sooner, triggering a compression sooner and more often as fewer rows are needed each time to consume the remaining free space i.e. the average number of rows inserted between compressions gets lower as the row size increases. And the more compression operations that happen during a data load, the more CPU time is consumed for a given number of rows. See the end of this post for the detail of the maths behind this.

This triple whammy combination - every INSERT slows down, blocks are compressed multiple times during the load, longer rows lead to more frequent block compressions - can cause data load times to double or triple in elapsed time. The precise level of the performance impact depends on various factors such as the row sizes, the level of compression achievable, and the CPUs in your computer system.

Conversely Basic Compression only ever compresses each block once, so the total CPU used by it is far lower than for Advanced Compression. And because the compression operation takes place separately, there is no direct impact on the daily data loads themselves i.e. they continue to take the same elapsed time as normal.

But that is not the end of the negative impacts of using OLTP Compression in a data warehouse. The other big negative is that it does not actually solve the original problem you were experiencing - updated rows being migrated. Under OLTP Compression, only an INSERT will trigger a compression of a block, never an UPDATE. This has been shown by various people.
So OLTP Compression doesn't help at all with the original scenario - UPDATEs causing row migration with additional "table fetch continued row" block reads, and increased storage. Your data warehouse will still have migrated rows from updates, and an increase in storage and disk reads as a result. The only viable way to force migrated rows to move back and be compressed is to use Basic Compression.

And finally, for completeness, OLTP Compression typically achieves a slightly lower compression ratio than Basic Compression. This is because they both use the same compression mechanism, and OLTP Compression leaves the value of PCTFREE at the table / partition level setting unchanged (10% by default) whereas Basic Compression sets PCTFREE to 0 by default during the compression, so achieving more compressed rows per block. The difference in compression level will be very small and minor, but it is there nevertheless.


Although Basic Compression is not "perfect", for a data warehouse it is by far and away a better choice than trying to use OLTP Compression of the Advanced Compression option. The performance penalties from the significant increase in CPU usage during data loads when using OLTP Compression are very severe. And ultimately it does not even address the original problem we were concerned about - UPDATEs causing row migration into different data blocks, resulting in additional "table fetch continued row" block reads from queries needing those rows. And that's why sticking with the free Basic Compression used on a scheduled basis after a data load can be much more efficient for a data warehouse, combined with a well designed partitioning strategy.

Appendix - Why Longer Rows Lead To More Frequent Compressions

I stated earlier that when using OLTP Compression longer rows will trigger block compressions both sooner and more frequently i.e. more compressions on average per row (or conversely, fewer rows inserted to a block on average between compressions). I will here assume a simple 2:1 compression ratio between an uncompressed row and its compressed form as it keeps the maths a bit simpler, but still shows what is going on.

If you can fit 50 uncompressed rows into a block then you can fit 100 compressed rows into the same block. As the compression halves the size of the uncompressed rows, the block is compressed multiple times when data is inserted into it. First 50 rows are inserted, which triggers a compression leaving half the block free, so 25 more rows can be inserted, triggering another compression, then 13 rows can be inserted, the block is compressed, etc. Until eventually 100 rows fit in, but another one will not.

This series of halvings of how many rows can be loaded at a time is really like a sum over the series of powers of the number 2. The last load will be 1 row, the one before that of 2 rows, 4 before that, and so on. This means that the block is "full" when 2 to a certain power is larger than the number of rows that can fit in that block i.e. after that many compressions another row will not fit in the block.

For 100 rows in a block it will be 2^7 (2 to the power 7), or 128, that is needed, which is 7 compressions. If each block is compressed 7 times as it has rows inserted into it to achieve 100 rows in a block, then on average there is a compression every 100 / 7 = 14.29 rows.

For a much longer row that can only fit 10 uncompressed rows per block, then 20 compressed rows will fit in that block. And 2^5 is 32 (2 to the power 5) which is larger than 20, so 5 compressions will be needed to fill each block. As 20 rows will have been loaded into a block then a compression is occurring on average every 20 / 5 = 4 rows.

This is a lot more frequent than the previous case i.e. over 3 times worse on average! Thus OLTP Compression on a data warehouse with larger rows leads to a much larger CPU penalty from all of the repeated block compression that will be occurring during data loads. And why the alternative of just doing a normal load followed by a one-off compression using Basic Compression can be much more efficient.
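The arithmetic above can be checked with a small simulation. This is only a sketch of the simplified model used in this appendix (a fixed 2:1 compression ratio, and a compression every time the block cannot take another uncompressed row), not of how Oracle really manages block space. The function name and the "2 units per uncompressed row" sizing are my own illustrative choices.

```python
def simulate_block_fill(uncompressed_rows_per_block):
    """Count compressions while filling one block under a 2:1 compression model.

    Sizes are in arbitrary units: an uncompressed row takes 2, a compressed row 1.
    Returns (number of compressions, rows in the block when it is finally full).
    """
    capacity = 2 * uncompressed_rows_per_block
    rows = 0          # rows currently stored in the block
    used = 0          # space used just after the most recent compression
    compressions = 0
    while True:
        free = capacity - used
        new_rows = free // 2      # uncompressed rows that still fit
        if new_rows == 0:
            break                 # not even one more row fits: block is full
        rows += new_rows
        compressions += 1         # block filled up, so it gets compressed
        used = rows               # every row now takes 1 unit
    return compressions, rows


for per_block in (50, 10):
    count, rows = simulate_block_fill(per_block)
    print("{0} uncompressed rows/block: {1} compressions for {2} rows, "
          "one every {3:.2f} rows".format(per_block, count, rows, rows / count))
```

With whole-row rounding the simulation finishes one row short of the theoretical 100 and 20 rows (99 and 19), but it gives the same 7 and 5 compressions as the powers of 2 argument.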

Saturday, 21 April 2018

Top Oracle Monitoring Views & Tables


I can be a bit "old school" at times on some things, and I like to be familiar with the underlying technology being used by advanced features. So while using Enterprise Manager to see what is happening on your database can be easier and quicker than running queries manually, there may be times when EM is not available on a particular database or it doesn't provide a particular piece of information you are after. Knowing where this information is held in the database instance lets you go direct to the source to get what you need, which can either be quicker in some circumstances or the only option in other cases.

The key information about activity on a database instance is available in a handful of dynamic performance views. While using Enterprise Manager can be quicker and easier, don't be afraid of these dynamic performance views. Learn about them and what is in them and they can be your friends, and not your enemies.

Generally I break these activity related views down into current, real time activity happening now and historical activity that happened some time ago. And the historical activity can be further split into very recent as opposed to long ago activity, where "long ago" is probably more than an hour ago.

Current Session Activity

The main view has to be V$SESSION, which has a record per connected session. You can query V$SESSION to see what each session is currently doing (the SQL it is executing (if any), or if it is waiting or blocked, or idle), or query it repeatedly to see how a session changes over time (blocked or moving through different SQL).

This view has been expanded over the different versions of Oracle and now includes a richer set of data about what each session is doing spread across a number of useful columns e.g. STATUS, TYPE, SQL_ID, SQL_CHILD_NUMBER, ROW_WAIT_*, BLOCKING_SESSION, EVENT, WAIT_TIME, SECONDS_IN_WAIT, STATE.

When a session waits for something to complete, such as a disk wait, you can see this wait in V$SESSION. But this is transitory, being about the current wait that session is experiencing. These session wait times are also cumulatively recorded in V$SESSION_EVENT, which records both the wait event and the total wait time since the session connected to Oracle. You cannot just use the current, total values in this, as they are cumulative since the session was created. However you can compare the changes to this over a short period of time, which will tell you the events that the session waited on during that period of time, and how long it waited for them. Tanel Poder has given us the very useful snapper.sql script that snapshots this dynamic performance view in real time and reports on what changed i.e. the waits a session is experiencing.
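The idea of comparing two readings of V$SESSION_EVENT can be sketched in a few lines of Python. This is not how snapper.sql is implemented, just a minimal illustration of taking a difference between two snapshots. The function names are hypothetical, the sample values are made up, and "snapshot_session_events" assumes an open "cx_Oracle" cursor.

```python
EVENT_SQL = """
select event, time_waited_micro
  from v$session_event
 where sid = :sid
"""


def snapshot_session_events(curs, sid):
    """Return {event name: cumulative microseconds waited} for one session."""
    curs.execute(EVENT_SQL, {"sid": sid})
    return dict(curs.fetchall())


def wait_deltas(before, after):
    """List (event, extra microseconds waited) between two snapshots, largest first."""
    deltas = {}
    for event, total in after.items():
        change = total - before.get(event, 0)
        if change > 0:
            deltas[event] = change
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Made-up cumulative values standing in for two snapshots a few seconds apart
before = {"db file sequential read": 100, "log file sync": 40}
after = {"db file sequential read": 350, "log file sync": 40, "direct path read": 20}
for event, micros in wait_deltas(before, after):
    print("{0}: {1} microseconds".format(event, micros))
```

In a real script you would call "snapshot_session_events" twice with a "time.sleep" in between, and pass the two results to "wait_deltas".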

Recent Active Session Activity

In version 10 (10.1) Oracle added in the new dynamic performance view of V$ACTIVE_SESSION_HISTORY (ASH), which contains point in time snapshot copies of data from V$SESSION. The idea is that you can now look back over recent history to see what was happening recently in individual sessions, letting you investigate issues soon after they have happened. Potentially it also lets you investigate temporary problems that do not persist long enough to investigate in real time using V$SESSION directly.

The manual says "It contains snapshots of active database sessions taken once a second. A database session is considered active if it was on the CPU or was waiting for an event that didn't belong to the Idle wait class." The one snapshot every second is to minimise any performance impact on the system and the storage requirements, and it also only saves the data for active sessions i.e. nothing is copied for inactive, idle sessions. The data from each snapshot is then held in memory within the SGA, and accessible via the dynamic performance view of V$ACTIVE_SESSION_HISTORY. There is a limit to how much memory it will use, but it is intended to cover the past 30 minutes.

Note that Active Session History and AWR (Automatic Workload Repository) are part of the "Oracle Diagnostic Pack", and a license for this must be purchased before you have the right to use either of these i.e. these features are not part of the core Oracle Database license. And such additional "Packs" are only available for the Enterprise Edition of the Oracle Database.

Most of the columns in V$ACTIVE_SESSION_HISTORY are the same as in V$SESSION, with the addition of a SAMPLE_TIME column. You can query it in much the same way as you might query V$SESSION, bearing in mind that the same SESSION_ID can be repeated across different samples. Using ASH you can do things such as identifying the most frequent wait events, or sessions experiencing a specific wait event, or any session experiencing long waits.
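To make the "most frequent wait events" idea concrete, here is a small Python sketch that counts samples per event. It assumes the rows have already been fetched from V$ACTIVE_SESSION_HISTORY into dictionaries with "session_id" and "event" keys (that shape, and the name top_events, are hypothetical). Because there is roughly one sample per second, the count per event is also a rough estimate of the seconds spent on it.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Optional, Tuple


def top_events(
    samples: Iterable[Mapping[str, Optional[str]]], n: int = 3
) -> List[Tuple[str, int]]:
    """Count ASH samples per wait event, most frequent first."""
    counts: Counter = Counter()
    for row in samples:
        # EVENT is empty when the session was on the CPU at sample time.
        counts[row.get("event") or "ON CPU"] += 1
    return counts.most_common(n)


# Made-up sample rows standing in for a query result.
samples = [
    {"session_id": "101", "event": "db file sequential read"},
    {"session_id": "101", "event": "db file sequential read"},
    {"session_id": "102", "event": None},
    {"session_id": "103", "event": "log file sync"},
    {"session_id": "101", "event": "db file sequential read"},
    {"session_id": "102", "event": None},
]

for event, count in top_events(samples):
    print(f"{event}: {count} samples")
```

The same grouping could of course be done in the SQL itself with a GROUP BY on EVENT; doing it in Python is just convenient when you want to slice the same fetched rows several ways.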

Historical Session Activity

As stated, the data in V$ACTIVE_SESSION_HISTORY is held in memory in the SGA and is limited in size. Oracle will also automatically save 10% of this data to a table on disk in the AWR (Automatic Workload Repository), so you have an even longer period of history to query (up to the AWR retention limits) but at a lower sample rate. Even though the sample rate is much lower, the idea is that any peaks in activity or contention will still be captured and be available for analysis later on in DBA_HIST_ACTIVE_SESS_HISTORY.

The actual way it decides what samples to save is somewhat complicated i.e. not just a straight "1 in 10" of the samples from the in-memory ASH. It saves 1 in 10 of the sample sets when there was activity, but saves all of the session data from such a saved sample. This way you get all of the active session data from a saved sample, rather than just 1 in 10 of the active sessions and so be missing a complete picture of a sample set.

You can query DBA_HIST_ACTIVE_SESS_HISTORY in much the same way as you would query V$ACTIVE_SESSION_HISTORY, except that there are fewer samples covering longer periods of time.

Historical System Activity

The previous views are all about individual sessions, but Oracle also has a lot of information about other activity on the database as a whole in the other dynamic performance views. Most of these hold cumulative values - the data values only ever go up as new measurements are added in. In many cases their data is essentially a sum of the activity across all of the sessions in the system. These views can provide other useful information about what has been happening on a system. Oracle also takes snapshots of these data sets into the AWR, into a series of DBA_HIST_* tables, and again you need the Diagnostics Pack license to have the right to query them.

The default snapshot interval is every hour, and they are retained for 7 days (these can be changed if desired, which can be useful). An AWR snapshot copies all data in all of the relevant dynamic performance views into their corresponding DBA_HIST_ table with a SNAP_ID column added in to identify the relevant snapshot. Unlike ASH, which is only a sampled subset, all of the data in the corresponding dynamic performance views is saved on each snapshot, so there are no issues over any missing data within a snapshot.

You can make use of this AWR data either by running a standard AWR report which summarises all activity between two snapshots, or by writing your own queries directly against these DBA_HIST_* tables for things of interest to you. The latter can be useful if you don't want to have to produce a set of AWR reports and then read through them manually to find something of interest. The right kind of queries can summarise key activity data across multiple snapshots, helping you identify any abnormalities or peak activity periods or anything else of interest.

In the same way that you might look at a dynamic performance view to see changes in real time, you can query the corresponding DBA_HIST_* table to see what changed over longer periods of time and between multiple snapshots. But your query will need to get two sets of snapshots (using the SNAP_ID column probably) and subtract the cumulative values of one from the other to get the change in value over that time i.e. the actual measure of the activity over that period of time. I've blogged about querying the AWR DBA_HIST_ tables directly before.
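The subtraction between snapshots can be sketched in a few lines of Python. This assumes you have fetched (SNAP_ID, VALUE) pairs for a single statistic from a table such as DBA_HIST_SYSSTAT; the function name snapshot_deltas is just for illustration. Skipping negative changes is my own assumption for handling an instance restart between snapshots, when the cumulative counters start again from zero.

```python
from typing import Dict, List, Tuple


def snapshot_deltas(rows: List[Tuple[int, int]]) -> Dict[Tuple[int, int], int]:
    """Map (start_snap_id, end_snap_id) to the increase in a cumulative value."""
    ordered = sorted(rows)  # order by SNAP_ID
    deltas: Dict[Tuple[int, int], int] = {}
    for (prev_id, prev_val), (next_id, next_val) in zip(ordered, ordered[1:]):
        change = next_val - prev_val
        # A negative change suggests the counter was reset, so skip that interval.
        if change >= 0:
            deltas[(prev_id, next_id)] = change
    return deltas


# Made-up values for one statistic across four snapshots.
rows = [(101, 5000), (100, 4200), (102, 300), (103, 900)]

for (start, end), change in snapshot_deltas(rows).items():
    print(f"snapshots {start} to {end}: +{change}")
```

In practice the same result can be had in the query itself with an analytic function such as LAG over SNAP_ID, but the logic is the same: each interval's activity is the later cumulative value minus the earlier one.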

For reference here are a few dynamic performance views and the corresponding DBA_HIST_ table they get saved to by an AWR snapshot:
  • V$SYSSTAT (individual named system statistic counters) maps to DBA_HIST_SYSSTAT
  • V$SYSTEM_EVENT (wait event occurrence and time counters) maps to DBA_HIST_SYSTEM_EVENT and to DBA_HIST_BG_EVENT_SUMMARY for Background events
  • V$SQL has no direct map, but V$SQLSTATS maps to DBA_HIST_SQLSTAT

SQL Statements Executed

The other approach to investigating performance problems is to look at the SQL statements being executed, as opposed to what individual sessions are doing or the system as a whole. The most relevant dynamic performance views for SQL execution information are:
  • V$SESSION - contains the SQL_ID of the SQL statement a session is currently executing (if any)
  • V$SQL and V$SQLSTATS for all SQL statements recently executed, with execution statistics, etc.
    • Remember that these activity counters are cumulative and simply increase in value on each execution of that SQL statement
    • SQL statements are retained in V$SQL while they are being executed, but after that they can be removed for that slot to be reused by another SQL statement. So there is no guarantee that V$SQL contains all SQL statements recently executed
    • However, the most frequently executed SQL statements will tend to remain in V$SQL


The following dynamic performance views and AWR tables can directly provide you with information about what is happening and what has happened on an Oracle database instance, and are extremely useful for investigating performance problems. Remember that to use some of these you must have purchased the Diagnostics Pack for your Enterprise Edition Oracle database software.
  • V$SESSION for what is happening now in each session
  • V$SESSION_EVENT for total waits a session has experienced
  • V$ACTIVE_SESSION_HISTORY for recent session history
  • DBA_HIST_ACTIVE_SESS_HISTORY for a sampled subset of session history over a longer period of time
  • DBA_HIST_SYSSTAT, DBA_HIST_SYSTEM_EVENT, DBA_HIST_SQLSTAT and DBA_HIST_SYS_TIME_MODEL in the AWR for full copies of the corresponding dynamic performance views taken at regular intervals
  • V$SQL and V$SQLSTATS for the most recently executed SQL statements, still present in the shared pool in the SGA

Wednesday, 28 March 2018

Hash Join Overflow to Disk Summary

[I'm a bit busy at the moment working with a new client on a large and complex database, so I'm afraid I've only got time to do a quick post. I have a new post on a different topic planned, but am struggling with the time needed to write it up properly, so a short post is all I can squeeze in]

This is just a summary of the key results from my findings on how Oracle is costing a Hash Join when it overflows to disk, pulling all the conclusions together into one place.

In the first post I arrived at the following formula:
  • ( ((Build Columns Size + 12) * Build Row Count) + ((Probe Columns Size + 12) * Probe Row Count) ) * 0.0475
Where the "Columns Size" is the sum of the hash table storage for each column i.e. column data storage + 2 bytes per column, and the "Row Count" values are per 1,000 rows.

In the second post I showed that the costing was unaffected by the value of the Multi-Block Read Count parameter i.e. the costing was done only using single block disk I/O's.

In the third post I proposed a simpler formula using only values shown in an execution plan of:
  • ( (Build KBytes + (12 * Build Rows / 1000) ) + (Probe KBytes + (12 * Probe Rows / 1000) ) ) * 0.0485
This is only slightly less accurate than the original formula, but is much simpler to use as it only needs values reported in the execution plan, and no other calculations or derived values by you.

In the fourth post I showed that the reported size of the Temporary Space (TempSpc) in the execution plan is actually the size of the hash table itself, being dependent only on the size of the Build data set. And that knowing this hash table size will help you in deciding how much larger it is than the available memory in the PGA used to store such hash tables.

Wednesday, 14 February 2018

Hash Join Overflow Costing #4 - Temporary Space + PGA Use

Now I want to look at the value reported for "Temporary Space" (TempSpc) and how to check whether the hash table is close to fitting in memory within the PGA or not i.e. just a bit too big or way too big.

Temporary Space

It seems that this is simply the size of the hash table itself, which is in turn the size of the Build data set plus an additional 12 bytes per row. Which means that we can use part of the cost formula posted earlier for this to estimate the size in KiloBytes:
  • Build KBytes + (12 * Build Rows / 1000)
The evidence I have for this is:
  • The "TempSpc" value only changes when the Build size changes - never when the Probe size changes
  • When additional columns are added to Build data set in the SELECT the "TempSpc" value increases by the same amount as the Build size increases (for the same number of rows)
  • When the number of rows in the Build data set increases, the "TempSpc" value increases by an additional 11.76 bytes per additional row
    • Which is almost 12 bytes per row, as mentioned before as overhead in the hash table
You could of course "refine" the above formula and use "11.76" instead of "12" and that would give you a value for the hash table that is much closer to what is reported for the "TempSpc" in the execution plan. But I think that using 12 is good enough for only an estimate.
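Here is that estimate as a small Python sketch, with the per-row overhead as a parameter so you can try both 12 and 11.76. The function name hash_table_kb is mine, and the input values are borrowed purely for illustration from the Build line of the execution plan in post #3 (1123K and 50000 rows, where the plan reported a TempSpc of 1712K).

```python
def hash_table_kb(
    build_kbytes: float, build_rows: int, overhead_bytes: float = 12.0
) -> float:
    """Estimate hash table size in KB: Build KBytes plus a per-row overhead."""
    return build_kbytes + (overhead_bytes * build_rows / 1000)


print(round(hash_table_kb(1123, 50000)))         # 1723
print(round(hash_table_kb(1123, 50000, 11.76)))  # 1711
```

Either way the estimate lands within about 1% of the reported value, which is why I think the rounder figure of 12 is good enough.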

Hash Table Size vs PGA Size

I mentioned in the first post that the Optimizer uses the value of the internal initialization parameter "_smm_min_size" for the memory it can use as the work area for the hash table within the PGA. Thus if the size of the hash table is larger than this the Optimizer will cost the Hash Join as an overflow to disk operation. Using the formula posted above for the Temporary Space we can now estimate the size of the hash table needed and compare it to this minimum work area size value.

The notes I have on "_smm_min_size" say that it will be 0.1% of the PGA when the size of the PGA is greater than 128 MB, otherwise it will be 128 KB as a minimum. Assuming your PGA is larger than 128 MB then you can calculate what this is yourself. Note also that its value is reported in 10053 trace files along with all other initialization parameters, so you can directly verify its value if you want to.

At run time, though, the PGA may not be heavily used and may have a lot of unused memory available. An executing query can allocate and use more than the minimum work area value at run time for execution plan operations, up to a limit specified by the "_smm_max_size" initialization parameter. For PGAs larger than 1 GB this is set to be 10% of the PGA size. For smaller PGAs it is 100 MB, or 20% when the PGA is less than 500 MB.

Knowing how large the hash table needs to be lets you compare it to the maximum possible work area size in your PGA and see what the likelihood is of the hash table being able to fit within the PGA at run time. As the ratio between the minimum and maximum possible work area sizes is 100 (0.1% to 10%) this is a large range and a hash table may have a reasonable chance of fitting in the PGA at run time, depending on how active your system is and the other queries running.
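That comparison can be sketched in Python as follows. The sizing rules are only those from my notes above, not anything taken from Oracle documentation, and I am assuming binary units (1 KB = 1024 bytes) throughout, including for the "K" values in the execution plan. The function names are made up for the example.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def smm_min_size(pga_bytes: int) -> int:
    """Minimum work area: 0.1% of the PGA above 128 MB, else 128 KB."""
    return pga_bytes // 1000 if pga_bytes > 128 * MB else 128 * KB


def smm_max_size(pga_bytes: int) -> int:
    """Maximum work area: 10% above 1 GB; below that 100 MB, or 20% under 500 MB."""
    if pga_bytes > 1 * GB:
        return pga_bytes // 10
    if pga_bytes < 500 * MB:
        return pga_bytes // 5
    return 100 * MB


def classify(hash_table_kb: float, pga_bytes: int) -> str:
    """Place an estimated hash table size relative to the two work area limits."""
    size = hash_table_kb * KB
    if size <= smm_min_size(pga_bytes):
        return "fits in the minimum work area"
    if size <= smm_max_size(pga_bytes):
        return "may fit at run time"
    return "too big for any work area"


for pga in (1 * GB, 2 * GB):
    print(f"PGA {pga // MB} MB: {classify(1723, pga)}")
```

A hash table that only "may fit at run time" is the interesting case: the Optimizer will still cost it as an overflow to disk, but the actual execution may stay in memory if the PGA is quiet enough.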

Equally it lets you consider what the benefit might be from increasing the size of the PGA significantly if you have a critical query involving a large hash join operation that is overflowing to disk. That said, there are some ways of rewriting queries that can minimise the size of the data being pulled through into the hash table used by the hash join, and so potentially get a smaller hash table to fit into memory in the PGA. But this depends on multiple factors, mainly being the number of columns of data coming through in the Build data set. If this is only a few columns then there won't be anything that can be saved by such a query rewrite.

Tuesday, 6 February 2018

Hash Join Overflow Costing #3 - Simpler Formula

So far I have offered a formula for how a Hash Join that overflows to disk is costed and confirmed that this is only costed in terms of single block disk operations. While that formula produces very accurate results (less than 1% difference to the reported cost for the Hash Join operation) it requires you to obtain size information on each individual column being retrieved from both the Build and Probe data sets. And this could be quite tedious when there are many columns involved, or you don't have an easy way to work out the source columns involved in each data set. There is a simplification we can make to eliminate this detailed per column information and just use the information reported in the execution plan itself.

All of the columns being retrieved for the Build and Probe data sets are actually included within the "Bytes" value reported in the execution plan for each data set. And this value includes both the individual column data and any per-column overhead as well. What it does not include is the extra 12 bytes per row overhead needed in the hash table itself. We can approximate the size of this second part of the data by using the "Rows" value reported in the execution plan.

Examining the "Bytes" and "Rows" values in the execution plans for my test queries (see first post for details of these queries), I can see that the increase in the Hash Join cost is about 0.0485 per KB of increase in either data set (Build or Probe) for the same number of rows. Previously I determined that there was a 12 byte overhead within the hash table per data row expected.

This produces a revised Hash Join cost formula using only values from the execution plan of:
  • ( (Build KBytes + (12 * Build Rows / 1000) ) + (Probe KBytes + (12 * Probe Rows / 1000) ) ) * 0.0485
Note that the "Bytes" values used need to be in "K" units, whereas the "Rows" is not and so is divided by 1000. Checking this formula against the actual Hash Join costs of the test queries I ran I can see that it has an error of about 1% i.e. it is not as accurate as the previous formula, but is still accurate enough I believe.

Let's check how close this is using one of the test queries. Here is the execution plan produced:
| Id  | Operation           | Name   | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT    |        |       |       |       |  1266 (100)|          |
|   1 |  SORT AGGREGATE     |        |     1 |    30 |       |            |          |
|*  2 |   HASH JOIN         |        | 50000 |  1464K|  1712K|  1266   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| HJ50K  | 50000 |  1123K|       |   365   (1)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| HJ100K |   100K|   683K|       |   728   (1)| 00:00:01 |
We can see:
  • The Build and Probe data access cost is 365 + 728 = 1093
  • The cost of the Hash Join operation itself is 1266 - 1093 = 173
  • The calculated approximate cost is (1123 + (12 * 50) + 683 + (12 * 100)) * 0.0485
    • = (1123 + 600 + 683 + 1200) * 0.0485
    • = 3606 * 0.0485 
    • = 174.891
  • The difference is +1.891, which is 1.1% of the actual cost
This formula can therefore be a useful check against the cost reported when a Hash Join operation overflows to disk, and for determining which is the biggest cost factor i.e. the Build or the Probe data sets.
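The same check is easy to script if you want to repeat it for other plans. This is a minimal Python sketch of the arithmetic above, with function names of my own choosing; the inputs are the Rows, Bytes and Cost values read from the execution plan shown.

```python
def net_hash_join_cost(
    total_cost: float, build_access_cost: float, probe_access_cost: float
) -> float:
    """Cost of the Hash Join itself, excluding the two source data accesses."""
    return total_cost - (build_access_cost + probe_access_cost)


def estimated_overflow_cost(
    build_kbytes: float,
    build_rows: int,
    probe_kbytes: float,
    probe_rows: int,
    factor: float = 0.0485,
) -> float:
    """Simplified overflow cost: (KBytes + 12 bytes per row) per data set, times a factor."""
    build_kb = build_kbytes + 12 * build_rows / 1000
    probe_kb = probe_kbytes + 12 * probe_rows / 1000
    return (build_kb + probe_kb) * factor


actual = net_hash_join_cost(1266, 365, 728)
estimate = estimated_overflow_cost(1123, 50_000, 683, 100_000)
error_pct = 100 * (estimate - actual) / actual
print(f"actual={actual} estimate={estimate:.3f} error={error_pct:.1f}%")
```

Evaluating the Build and Probe terms separately (1723 and 1883 here) also shows which data set contributes more to the cost.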

Tuesday, 23 January 2018

Hash Join Overflow Costing #2

In my previous post on costing a Hash Join overflow to disk I came up with a formula for the cost of a Hash Join operation that overflows to disk based on tests I had done, but I also mentioned that there might be other factors involved not yet exposed by my testing. My main concern was whether the disk I/O's involved were all of the same type, or a mix of single block and multi-block disk I/O's. I've now done some further testing, and have something to share.

The tests show that all the disk I/O's are costed by the Optimizer as single block disk I/O's. I say this because I changed the cost of multi-block disk I/O's and the cost reported for the Hash Join operation itself did not change. Hence it must be costed as single block disk I/O's by the Optimizer.

The tests I was doing were a join between two tables that used a Hash Join in the execution plan (see previous post for the SQL used in this query). The key points are:
  • The two source tables are accessed by Full Table Scans (TABLE ACCESS FULL) in the execution plan
  • These feed into a HASH JOIN operation
  • Which in turn feeds into a final operation - a SORT AGGREGATE for the SUM's in the SELECT output
  • By adding together the costs of the source data accesses and subtracting from the total cost reported for the Hash Join, we get the net cost of the Hash Join operation itself
Oracle costs disk I/O operations using a set of statistics it has about the computer system it is running on - termed System Statistics and stored in the database itself. These can be a minimal set of just a few statistics or a more complete set of statistics providing a more detailed breakdown of different types of operations on the computer.

In my Oracle 12 instance ( I only have the "No Workload" default system statistics set in the database, which is the minimal set of statistics. This provides values for the following:
  • IOSEEKTIM - Time in milliseconds for a disk to move the read / write head to the track you want to access - default value of 10
  • IOTFRSPEED - Bytes transferred per millisecond once a transfer starts - default value of 4096
  • MBRC - Multi-block read count in blocks achieved on average - default value of 8
  • CPUSPEEDNW - Speed of CPU in units of millions of cycles per second (not relevant here)
Note that the MBRC here is not the same as the initialization parameter db_file_multiblock_read_count, even though their definitions seem to be the same. This MBRC is not set from the initialization parameter, and remains the same within the system statistics until explicitly changed (one way or another). MBRC is intended to represent the actual size of multi-block reads that were achieved on your computer system, and is set when Workload statistics are collected or when set manually.

The Optimizer uses these minimal system statistics to derive the values for the elapsed times for single and multi-block reads (SREADTIM and MREADTIM respectively). The formulae it uses are:
  • SREADTIM = IOSEEKTIM + (DB_BLOCK_SIZE / IOTFRSPEED)
  • MREADTIM = IOSEEKTIM + (MBRC * DB_BLOCK_SIZE / IOTFRSPEED)
With the default values for System Statistics as stated, and a database block size of 8192 (8 KB) this gives values of SREADTIM = 12 ms, MREADTIM = 26 ms.

What this really means for the Optimizer is that a multi-block read takes 26 / 12 = 2.167 times longer than a single block read i.e. its "cost" is 2.167 times that of a single block read. This is used by the Optimizer when costing multi-block read based operations - it "converts" the multi-block read time into an equivalent number of single block read operations, which all have a cost unit of 1 and are the basis for how the Optimizer reports and compares execution plan costs.

If I were to increase the MBRC from 8 to 16, then the MREADTIM would increase to 42 ms, giving a cost factor of 42 / 12 = 3.5 times that of a single block read. The relative net cost though would actually decrease, because each multi-block read would now be reading 16 blocks in one disk I/O rather than 8 blocks i.e. half the number of disk reads are needed to read the same number of blocks in from disk, at a relative cost ratio of 1.615 (3.5 / 2.167). So the decrease in the number of disk reads is greater than the relative increase in cost per disk read.
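The arithmetic in the last few paragraphs can be reproduced with a short Python sketch. It assumes the usual "No Workload" derivation, i.e. seek time plus transfer time for one block (SREADTIM) or for MBRC blocks (MREADTIM), which is consistent with the 12 ms, 26 ms and 42 ms values quoted here. The function names and default arguments are just for the example.

```python
def sreadtim(
    ioseektim: float = 10, iotfrspeed: float = 4096, block_size: int = 8192
) -> float:
    """Single block read time in ms: seek time plus transfer time for one block."""
    return ioseektim + block_size / iotfrspeed


def mreadtim(
    mbrc: int, ioseektim: float = 10, iotfrspeed: float = 4096, block_size: int = 8192
) -> float:
    """Multi-block read time in ms: seek time plus transfer time for MBRC blocks."""
    return ioseektim + mbrc * block_size / iotfrspeed


for mbrc in (8, 16):
    ratio = mreadtim(mbrc) / sreadtim()
    print(
        f"MBRC={mbrc}: MREADTIM={mreadtim(mbrc):.0f} ms, "
        f"cost ratio={ratio:.3f}, cost per block={ratio / mbrc:.3f}"
    )
```

The "cost per block" figure is the one that matters for a full table scan: it falls as MBRC rises, even though each individual multi-block read becomes more expensive.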

If the Hash Join operation involved multi-block disk reads then changing MBRC would change the relative cost of those disk reads and we would see a change in the net cost of the Hash Join as reported in the execution plan.

I changed the value of MBRC manually and then shut down and restarted the database:
exec dbms_stats.set_system_stats ('MBRC', 16)
shutdown immediate
startup
This was to ensure that nothing was still present in memory in the SGA - neither any existing execution plans, nor any system statistics used to cost those execution plans.

When I ran the same test queries again, the total costs decreased because the source data accesses are full table scans which use multi-block reads i.e. this was expected, and so the total cost reported at the Hash Join step decreased. However, when the costs of the source data accesses were subtracted from the total cost after the Hash Join step, the net cost of the Hash Join operation itself was exactly the same as it was before for the same query.

Likewise when I increased the MBRC value to 32 there was no change in the net cost of the Hash Join operation when it overflowed to disk.

The conclusion then is that no multi-block disk reads are used within the costing of the Hash Join operation by the Optimizer when the hash table it uses overflows to disk.

Which means that the formula I posted before for an overflowing Hash Join cost is not dependent on multi-block reads at all:-
  • ( ((Build Columns Size + 12) * Build Row Count) + ((Probe Columns Size + 12) * Probe Row Count) ) * 0.0475
Where the "Columns Size" is the sum of the hash table storage for each column i.e. data storage + 2 bytes per column.
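For completeness, here is the original per-column formula as a Python sketch. The column sizes used in the example are made up for illustration and are not from my test tables; the function names are also hypothetical. Row counts are converted to thousands inside the function, as the formula expects.

```python
from typing import Sequence


def columns_size(column_data_bytes: Sequence[float]) -> float:
    """Hash table storage for a set of columns: data size plus 2 bytes per column."""
    return sum(size + 2 for size in column_data_bytes)


def overflow_cost(
    build_columns: Sequence[float],
    build_rows: int,
    probe_columns: Sequence[float],
    probe_rows: int,
    factor: float = 0.0475,
) -> float:
    """Per-column overflow cost formula, with row counts taken per 1,000 rows."""
    build = (columns_size(build_columns) + 12) * (build_rows / 1000)
    probe = (columns_size(probe_columns) + 12) * (probe_rows / 1000)
    return (build + probe) * factor


# Hypothetical column data sizes in bytes for each data set.
cost = overflow_cost([4, 10], 50_000, [4], 100_000)
print(f"estimated Hash Join overflow cost: {cost:.2f}")
```

Note that nothing in this calculation depends on MBRC, which matches the test results above.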