Databases and Performance. All about performance of Oracle and other relational databases on Linux and UNIX based systems. How to measure it, analyse it, tune it, and manage it over time. By John Brady.<br /><br /><h2>Time SQL Execution with Python</h2>(19 January 2019)<br /><a class="externalLink" href="https://databaseperformance.blogspot.com/2012/07/python-for-dba-1.html" target="_blank" title="External link to https://databaseperformance.blogspot.com/2012/07/python-for-dba-1.html">I've said before in this blog</a> how I find <a class="externalLink" href="https://www.python.org/" target="_blank" title="External link to https://www.python.org/">Python</a>
to be very useful for doing various things, including processing data
to or from an Oracle database. Here is another example where a
relatively simple and straightforward piece of Python code can deliver
something that is very useful - in this case measuring the elapsed time
of SQL queries executed on an Oracle database. We often need to execute
a given SQL query and see how long it takes to complete as part of a
tuning or monitoring exercise, but running it via SQL*Plus or something
else can tie up a session for an extended period of time and produce
large amounts of output that are spooled to the screen. Conversely we
don't want to run the SQL query and stop when only the first data row is
returned, as that is not a "full" execution of the query. We can use
Python to achieve this - execute a query to completion, consuming all
the data it produces but not outputting any of it, and measure the
elapsed time. And we can execute this outside of any SQL*Plus or SQL
Developer or other window we might be using.<br /><br />The key requirements for this utility are:<br />
<ul>
<li>Connect to an Oracle database as a specified user</li>
<li>Execute
the given SQL query, unmodified, so the execution plan used should
match that when the same SQL is submitted by the real application</li>
<li>Consume all the rows of data produced, for a true time to complete measurement</li>
</ul>
This covers most scenarios but does leave a few out - no bind variables can be used, for instance.<br /><br />There are really three parts to achieving this:<br /><ul>
<li>Connect to the Oracle database using the supplied credentials</li>
<li>Read in the SQL query text from a file</li>
<li>Execute against an Oracle database connection and consume the results noting the start and end time of this</li>
</ul>
Each of these parts involves relatively few lines of Python code. Part of the reason for this is Python's dynamic typing of variables - you don't have to declare the data types of the data the SQL query produces in advance of retrieving it, which means it can handle almost any query producing any number of values per output row.<br /><br />The code I will now present is not
"perfect" in an absolute sense - there is very limited error handling
for instance - but it is complete and does work. I've used it many
times to execute a SQL query pulled out from an application that is
performing poorly, to verify its execution and the impact of any changes
made to try and improve performance. <br /><br /><h3>
Python Code</h3>
<strong>Connecting to an Oracle database</strong><br />I use <a class="externalLink" href="https://cx-oracle.readthedocs.io/en/latest/" target="_blank" title="External link to https://cx-oracle.readthedocs.io/en/latest/">cx_Oracle</a>
as the database access module for Oracle, and connection is relatively
simple. This assumes that the database credentials are in variables
named "<code>user_name</code>", "<code>user_password</code>" and "<code>database_name</code>".<br /><pre>import cx_Oracle
# Other code here, such as parsing command line arguments to get values into necessary variables
db_conn = cx_Oracle.connect (user_name, user_password, database_name)
</pre>
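The third argument can be a TNS alias from your tnsnames.ora or an EZConnect string of the form host:port/service. As a small sketch (the host and service names here are made up), such a connect string can be built up before passing it to "<code>cx_Oracle.connect</code>":<br /><pre>
```python
# Build an EZConnect-style connect string (hypothetical host and service names)
host, port, service = "dbhost.example.com", 1521, "orclpdb"
database_name = "{0}:{1}/{2}".format (host, port, service)
print (database_name)  # -> dbhost.example.com:1521/orclpdb
```
</pre>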
<br /><strong>Reading in the SQL query from a file</strong><br />Assume that the name of the file containing the query is in the string variable "<code>SQL_File_Name</code>".<br /><pre>sql_query = open (SQL_File_Name, "r").read ().rstrip (' ;\n') # Strip any semi-colon off from the end
</pre>
This does several things in one combined statement:<br /><ul>
<li>Opens the file for reading</li>
<li>Reads in the content of the file returning it as a single string</li>
<li>Strips off certain trailing characters at the end of the string - space, semi-colon or new line character.</li>
</ul>
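As a quick illustration, with a made-up one-line query file content, the "<code>rstrip</code>" call removes any mix of those trailing characters:<br /><pre>
```python
# What the file read plus rstrip produces for a typical SQL file ending "; newline"
file_content = "select name from v$database ;\n"
sql_query = file_content.rstrip (' ;\n')
print (sql_query)  # -> select name from v$database
```
</pre>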
We now have in the "<code>sql_query</code>"
variable the text string of the SQL query with no trailing semi-colon
(if there was one). This is now ready for use to execute against an
Oracle database.<br /><br /><strong>Executing the query and timing it</strong><br />For now assume that a database connection has been made and is referenced by the object named "<code>db_conn</code>".<br /><br />To
execute the SQL query we first need to get a cursor from the database
connection, then execute the SQL using it, retrieve all the rows of
data, and report the elapsed time:<br /><pre>import logging
import time
try :
    curs = db_conn.cursor ()
    start_time = time.time ()
    curs.execute (sql_query)
    # Now retrieve the rows of data produced by the query
    row_count = 0
    rows = curs.fetchmany (1000)
    # An empty list is False so can test it directly
    while rows :
        row_count += len (rows)
        rows = curs.fetchmany (1000)
    end_time = time.time ()
    print ("Query Complete - Elapsed {0:4.1f}s for {1} rows".format ((end_time - start_time), row_count))
except Exception as ex :
    logging.critical ("ERROR - Exception raised during processing")
    logging.critical ("Exception - %s", ex)
</pre>
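As an aside, the fetch loop can be exercised without any database at all by stubbing a cursor whose "<code>fetchmany</code>" returns batches until the rows run out - a hypothetical stand-in for the cx_Oracle cursor, useful for checking the row counting logic:<br /><pre>
```python
class FakeCursor :
    """Hypothetical stand-in for a cx_Oracle cursor, serving rows in batches."""
    def __init__ (self, total_rows) :
        self.remaining = total_rows
    def fetchmany (self, batch_size) :
        batch = min (batch_size, self.remaining)
        self.remaining -= batch
        return [(i,) for i in range (batch)]

curs = FakeCursor (2500)
row_count = 0
rows = curs.fetchmany (1000)
while rows :
    row_count += len (rows)
    rows = curs.fetchmany (1000)
print (row_count)  # -> 2500
```
</pre>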
I've included exception handling in case there is something wrong
with the SQL query or its execution, but there is not a lot we can do
about it so it just prints out some information about the exception it
caught.<br /><br /><strong>More details</strong><br />I actually allow all the necessary values to be passed in from the command line to this utility program, and I use the Python "<a class="externalLink" href="https://docs.python.org/3/library/argparse.html" target="_blank" title="External link to https://docs.python.org/3/library/argparse.html">argparse</a>" module to process these and map them to corresponding variables. So you can either assign the results from "<code>argparse</code>" to the discrete variables I have used in the code examples, or instead replace those references by the results produced by "<code>argparse</code>". The results will be the same either way.<br /><br />The following code can be placed before the code given so far in the final Python program:<br /><pre>import argparse
import sys
# Parse command line arguments
parser = argparse.ArgumentParser (description="Time execution of a SQL query")
# Add command line argument definitions one by one into the parser
parser.add_argument ("-u", "--user", help="User name", required=False, default=None)
parser.add_argument ("-p", "--password", help="Password", required=False, default=None)
parser.add_argument ("-d", "--database", help="Database name", required=False, default=None)
parser.add_argument ("SQL_File", help="Name of SQL File to execute", default=None)
args = parser.parse_args ()
# First check whether we did get a user name, password etc on the command line
# If not, prompt the user for what we are missing
if args.database is None :
    print ("Enter database", end=": ", flush=True)
    database_name = sys.stdin.readline ().rstrip ()
else :
    database_name = args.database
if args.user is None :
    print ("Enter username", end=": ", flush=True)
    user_name = sys.stdin.readline ().rstrip ()
else :
    user_name = args.user
if args.password is None :
    print ("Enter password", end=": ", flush=True)
    user_password = sys.stdin.readline ().rstrip ()
else :
    user_password = args.password
SQL_File_Name = args.SQL_File
# Get Database Connection ...
# Execute SQL and consume results ...
</pre>
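To check the argument handling without running the whole program, "<code>parse_args</code>" can be given an explicit list in place of the real command line (the user, database and file names here are made up):<br /><pre>
```python
import argparse

parser = argparse.ArgumentParser (description="Time execution of a SQL query")
parser.add_argument ("-u", "--user", help="User name", required=False, default=None)
parser.add_argument ("-p", "--password", help="Password", required=False, default=None)
parser.add_argument ("-d", "--database", help="Database name", required=False, default=None)
parser.add_argument ("SQL_File", help="Name of SQL File to execute", default=None)
# Parse a hypothetical command line instead of the real sys.argv
args = parser.parse_args (["-u", "scott", "-d", "orcl", "query.sql"])
print (args.user, args.database, args.SQL_File)  # -> scott orcl query.sql
print (args.password)  # -> None
```
</pre>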
<h2>
Summary</h2>
This relatively short and straightforward Python
program gives me a utility to time the execution of any SQL query
against an Oracle database, including consuming all of the data rows the
query produces. For me this has many advantages over other ways of
executing SQL queries when investigating performance problems, and is
something I use whenever needed. One advantage is that because it is
executing the original SQL query with no modifications, it will be using
the same execution plan as when that SQL query is executed from an
application. Which in turn means that if I do modify the SQL in some
way and manage to "improve" its performance, the new execution plan
should be the same one used when the same improved SQL is executed from
the application code.<br /><br /><h2>Interactive Complex SQL Scripts - A Solution</h2>(25 June 2018)<br />Often you will find that you need to run a relatively simple SQL query
and see the results from it. Normally your choice of how to run this
SQL query in Oracle is a straight choice between either using a plain
SQL script executed through SQL*Plus, or using PL/SQL (probably in an
anonymous block). But often things are not quite that simple or
straightforward for you, and maybe there is an extra little bit of
complication in how that SQL needs to be executed. Maybe you need to
execute it repeatedly with a series of different values to use in a
filter. Maybe you need to ask the user to enter a value, or choose a
filter to use. Maybe you want to loop through the results of one query
and run a second query for each of these. Maybe you need some simple
conditional logic on the results and use that to choose which second
query to run. Maybe you need to provide progress feedback to the user
as a series of queries are executed, or as the results are retrieved of
each query.<br /><br />Unfortunately both SQL*Plus and PL/SQL each have
drawbacks that stop them from being the "right" solution for all
possible SQL query scenarios. I believe that I have identified a better
solution for such cases where the query is straightforward enough and
the conditional logic is also straightforward - use Python to execute
the necessary SQL queries instead. It will do what you want, it can do
much more than either SQL*Plus or PL/SQL can do themselves, and lets you
be consistent in how you run all such non-simple SQL queries. I'll
explain more of the details why I believe it to be a better fit for
running ad-hoc complex SQL queries in a moment.<br /><br />SQL*Plus or
PL/SQL each have their own advantages for running different types of SQL
query, but each also has a set of clear disadvantages that make them
unsuitable for some types of SQL query.<br /><br />SQL*Plus has its own set
of features that make it quite flexible for interactively running SQL
queries and seeing the results. These include things like "substitution
variables" and prompting the user to enter values, and controlling
formatting of the results output. It can execute any valid SQL
statement, including DDL to do things like create new database objects. I
find SQL*Plus very capable for writing an interactive script that can
show the user some data, prompt them to enter a value, run other queries
using that value as a filter, and display each result of each query as
it gets them. <br /><br />However, SQL*Plus has no direct "if then else"
conditional logic support. People may find one way or another to
achieve this, but it is not intrinsically a native feature of SQL*Plus,
and such workarounds can look like kludges and be difficult to read and
understand. It also has very limited error detection and handling
abilities - basically it can stop if anything goes wrong, but that's
about it. So there are clear limits to the abilities of SQL*Plus for
executing SQL queries with any kind of conditional logic between them.<br /><br />PL/SQL
is Oracle's procedural language extensions to wrap such conditional
logic and more around the execution of SQL queries, with good error
detection and handling capabilities. It is definitely the language to
use for things like stored procedures and triggers stored in the
database itself. However as a general purpose language for executing
SQL queries it has a number of severe limitations. The main one you
keep running into is that it has little to no real input or output
abilities for user interaction. It has no user input capabilities
because it is designed to run <em><strong>within</strong></em> the
database instance itself on the database server system, for which the
concept of a real user at a screen with a keyboard does not make sense.
So no real fault on PL/SQL's part for that. It has a very limited user
output capability (<code>DBMS_OUTPUT.PUT_LINE</code>), because none of
the output is visible to the user while the PL/SQL block is running.
Instead all the output appears all together at once at the end when the
PL/SQL has completed. Which is of course too late by then for progress
and status reports.<br /><br />PL/SQL also only supports DML SQL statements
as native SQL statements in its code, partially because all native
inline SQL statements are validated for correctness when the PL/SQL
block is initially compiled i.e. when it is all parsed when submitted
for creation or execution, and DDL is generally impossible to properly
validate for various reasons. The workaround is to put such DDL SQL
statements into local string variables in PL/SQL and then perform an "<code>EXECUTE IMMEDIATE</code>" on that string. While this does work, there are two major disadvantages:<br />
<ol>
<li>You
lose the validation of the SQL at compile time when the PL/SQL is
submitted, and instead can get a run time error when that particular SQL
DDL is executed</li>
<li>Your SQL DDL statements now become quoted
strings rather than plain, inline SQL statements, making reading them
more difficult (for two related reasons)<ul>
<li>Recognising such SQL statements when embedded within string variables when reading the PL/SQL code</li>
<li>Complications
when the SQL is very long and gets split over many separate lines, and
has "variable" parts within it dependent on the values in other
variables</li>
</ul>
</li>
</ol>
A solution that I think deals with all the
negatives of both SQL*Plus and PL/SQL while also supporting execution of
any SQL statement is to use Python. I believe that it does not suffer
from any of the problems that SQL*Plus or PL/SQL suffer from
individually (i.e. it successfully deals with those particular issues
and they are no longer problems in Python), while being a modern and
relatively cleanly designed programming language. Thus you can use one
programming language all the time for such one off complex SQL query
scenarios, providing more consistency for you when doing these.<br /><br />About
the only negative I can think of for using Python is that SQL
statements have to be put into string variables in the Python code and
then passed to an "execute" method (a library function) to run it
against the connected database. Which means that you do not get compile
time validation of the SQL itself that PL/SQL might give you, and
instead will only get run time errors when it tries to execute each SQL
query in turn. But PL/SQL suffers from the same problem for DDL SQL and
certain other variations of DML SQL (such as a variable list of column
names in the <code>SELECT</code>), so you are no worse off really by using Python.<br /><br />But
in all other areas I believe it is superior to SQL*Plus and PL/SQL, and
so worth the trade-off for what you gain (which is quite a lot) versus
what you lose (SQL validation at compile time). Furthermore, if you use
Python for all such SQL scripts you need, then you get a more
consistent library of your own scripts to use. Often I start writing
something in straight SQL + SQL*Plus and run into a dead end because of
something like conditional logic. So then I have to rewrite it all as an
anonymous PL/SQL block and add in a bunch of PL/SQL variable
declarations. Using Python all the time means you are only using one
language all the time, not two.<br /><br />Note that I am not saying to use Python for <em><strong>all</strong></em>
SQL executed against your Oracle database. Clearly things like
business rules and logic should be in stored procedures in the database
written in PL/SQL. And very simple SQL queries can be executed directly using SQL*Plus. What I'm talking about is one-off sets of non-simple SQL queries you want to run, or a report you want to be very interactive, with user prompts for data values and feedback as the script progresses. In these
cases I believe it is better to use Python (or maybe another
programming language with SQL database execution support) than a mix of
SQL*Plus and PL/SQL in different scripts.<br /><br />Having proposed Python
as the solution to the failings in both SQL*Plus and PL/SQL let me also
add one more reason to consider using it. It is relatively easy to come
up with a small set of templates for query execution that satisfy most
possible scenarios. Then it is a matter of copying the appropriate
template you have, editing that to add in the specific SQL queries you
want executed, and then running that Python program. And a lot of this
ability and power of Python is because it is a dynamic language - you do
not need to declare variables in advance of using them in your code.
This not only reduces the total number of lines of code needed for
anything, but it also simplifies many of the issues of having to
formally declare a named data type for a variable. Python will normally
work out the data type needed from the source data itself that is being
stored into the variable, so the <em><strong>identical Python code can often be reused to process completely different source data types</strong></em>,
which is a very powerful feature. Changing a data type in PL/SQL can
involve a lot of changes to your source code. And this feature of
Python becomes more powerful the more you become familiar with it and
learn how to leverage it into truly generic code.<br /><br /><h2>
Python Examples</h2>
Here
I will provide some Python code snippets that form a template and can
be put together to run a SQL query for you. Note that I'm using Python 3
syntax which has a few key differences to Python 2 syntax e.g. "<code>print</code>"
is a true function in Python 3. Also note that in Python the
indentation level of statements is key, and tells Python when nested
blocks of statements start and end. Indentation is not just about a
pretty layout - it also tells Python how to interpret the meaning of
your source code. And remember that variables do not have to be defined
before you use them - Python determines the actual data type of a
variable at run time when a value is stored in it.<br /><br />First you need
to connect to the database in question, and it may be easier to do this
by prompting the user to enter their credentials rather than putting
them on the command line when invoking Python. <br /><pre>#!/usr/bin/env python
#
import sys
import time
import cx_Oracle
print ("Enter database", end=": ", flush=True)
database = sys.stdin.readline ().rstrip ()
print ("Enter username", end=": ", flush=True)
username = sys.stdin.readline ().rstrip ()
print ("Enter password", end=": ", flush=True)
password = sys.stdin.readline ().rstrip ()
</pre>
Note how the "<code>sys.stdin.readline()</code>" method is used to read from standard input what the user has typed. And how the string object method "<code>rstrip()</code>" is immediately used to strip off the newline character that is included in what "<code>readline()</code>" returns but which we don't need.<br /><br />Personally
I like to run a trivial query against the database immediately to show
that the connection was successful and the database is up. In this
example it queries both <code>V$DATABASE</code> for the database name, and then <code>V$VERSION</code> for the database software version.<br /><pre>print ()
print ("Connecting ...")
conn = cx_Oracle.connect (username, password, database)
print ("Connected successfully")
curs = conn.cursor ()
# V$DATABASE has only 1 row in it
curs.execute ("select name from v$database")
rows = curs.fetchall ()
print ("Connected to database {0}".format (rows [0][0]))
# Just the one row from V$VERSION
curs.execute ("select banner from v$version where banner like 'Oracle Database%'")
rows = curs.fetchall ()
print (rows[0][0])
print ()
</pre>
Note how Python can retrieve all of the results at once into a
nested set of arrays (lists as Python calls them) - a list of rows, and
each row is a list of columns. <br /><br />Also note that I am using the
"cx_Oracle" module as the database driver for connecting to Oracle.
Python defines a common database API that can be implemented by
different driver modules for connecting to different databases. The
"cx_Oracle" module is a third party module commonly used for connecting
to Oracle databases. You can download a copy of "cx_Oracle" or another
driver module and add it to your Python installation.<br /><pre># Now do any real work we want to do, using the database connection just obtained
#
sql_query = """
select banner
from v$version
"""
print ("Executing SQL Query")
curs.execute (sql_query)
rows = curs.fetchall ()
for i in range (len (rows)):
    print (rows[i][0])
</pre>
The query here is just another trivial query against <code>V$VERSION</code>, but it could be anything you want it to be. It is very simple to iterate over all of the rows retrieved by a query in a "<code>for</code>" loop (<code>len</code>
tells you the number of members in a list) and then print out the data
retrieved. This is one way to iterate over the results by using "<code>fetchall ()</code>"
to fetch all of the rows back explicitly, but there are a few other
variations on this you can use depending on what you want to achieve and
your preferences e.g. you can also iterate directly over the cursor
itself which gives you each row of data values in turn.<br /><br />Note how
Python supports two different variations on defining string literals.
Short string literals can be put inline between pairs of matching quotes
- both single and double quotation characters are supported - and there
are multiple cases of this here. Long string literals can be put
between matching pairs of triple quotation characters and split over
multiple lines, which is how I have done the SQL query here. The new
line characters present in your source code are retained within the
string value i.e. they are not stripped out when Python parses such a
string literal value.<br /><br />To execute a query in a loop with different
filter values you should use bind variables rather than string
concatenation (which is prone to SQL injection issues). In "cx_Oracle"
we can use a leading colon in front of the bind variable name in the SQL
query.<br /><pre>sql_query = """
select count (*) cnt
from sales_history
where period_id = :period_id
GROUP BY product_id,
customer_id,
territory_id,
distributor_id,
period_id,
sale_type,
currency_code
having count (*) > 1
"""
#
for pid in range (1, 100) :
    # Get list of duplicate counts, then count those separately and output
    curs = conn.cursor ()
    bind_vars = {'period_id' : pid} # A dictionary of bind variable name to value mapping
    curs.execute (sql_query, bind_vars)
    rows = curs.fetchall ()
    dup_count = len (rows)
    print ("[{0}] Period {1:d} has {2:d} duplicates".format (time.strftime ("%H:%M:%S"), pid, dup_count))
</pre>
Note the use of the "<code>range</code>" function, which produces the sequence of values from the first value up to one less than the second value (in Python 3 it is a lazy range object rather than a list).<br /><br />Let's
say you don't know the list of "period id" values to iterate over (1 to
99 in previous example), and instead want to get those from another
query. You can do that too:<br /><pre># Calendar contains per day entries, but month ids are not contiguous
# So Select the distinct month calendar id ordered descending and iterate over those
monthcurs = conn.cursor ()
monthcurs.execute ("select distinct month_calendar_id from calendar order by 1 desc")
for monthrow in monthcurs :
    cmid = monthrow [0] # First and only column in each output row
    # Get list of duplicate counts, then count those separately and output
    curs = conn.cursor ()
    bind_vars = {'period_id' : cmid}
    curs.execute (sql_query, bind_vars)
    rows = curs.fetchall ()
    dup_count = len (rows)
    print ("[{0}] Month ID {1} has {2:d} duplicates".format (time.strftime ("%H:%M:%S"), cmid, dup_count))
</pre>
The above shows how Python lets you directly iterate over the rows in the first result set (<code>for monthrow in monthcurs:</code>), using a value from each row to execute the second query and count duplicates for that one period.<br /><br /><h2>Advanced Compression no good for Data Warehousing</h2>(21 May 2018)<br />This post is about experiencing "<code>table fetch continued row</code>"
in compressed data warehouse tables that experience some level of
updates to existing (old) data, and how using the Advanced Compression
option of Oracle Enterprise Edition not only does not help in avoiding
this but can actually introduce severe performance penalties into your
daily load jobs.<br /><br />
<h3>
Background</h3>
Basic Compression is a free
part of Oracle Enterprise Edition, allowing you to compress the data
within a table, and is useful for large databases such as a Data
Warehouse. Not only will compression reduce the total disk storage used
by that data in the table, it can also improve query performance in
various ways - full table scans will read fewer disk blocks, and
potentially more rows can be cached in the buffer cache in the SGA
within the same number of blocks. One downside to Basic Compression is
that it can only be used to compress existing data in a table or
partition (done by "moving" that table or partition within the same
tablespace). Newly inserted rows or updated rows are not compressed
when added to a data block, and are added in their normal uncompressed
form. This can mean that over time a compressed table or partition can
become less compressed due to updates and newly inserted rows.<br /><br />Oracle's
official solution to this is the extra Advanced Compression Option with
the OLTP Compression feature. OLTP Compression <em><strong>will</strong></em>
compress the data of newly inserted rows into a table where it has been
configured. It basically does this by inserting the rows in the normal
uncompressed format (no performance penalty for these inserts) until
the block becomes full, whereupon it then compresses that one block
(performance penalty for that particular insert), which will also
increase the free, unused space in that block for yet more rows to be
inserted. The assumption is that overall for an average / normal OLTP
application the frequency of block compressions will be low, and the
impact on most transactions will be negligible (assuming they SELECT and
INSERT on many other tables within the same business transaction).<br /><br />A
"perfect ideal" Data Warehouse would only ever load new data into the
current partition of a partitioned table. Existing data would never be
updated, and new data would never be inserted into a previous, old
partition. In such a scenario you have two potentially straightforward
strategies for compressing your data using only Basic Compression:<br /><ul>
<li>Wait until the end of a period when the database will switch to the next partition, and then compress the previous partition</li>
<li>As above plus re-compressing the current partition at a set frequency during the period in which it is the current partition</li>
</ul>
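The compression step in the first strategy amounts to a single DDL statement per closed partition. A minimal sketch that just constructs the statement text (the table and partition names are hypothetical), which could then be run through any tool or script:<br /><pre>
```python
def compress_partition_ddl (table_name, partition_name) :
    # Basic Compression is applied by "moving" the partition in place
    return "alter table {0} move partition {1} compress".format (table_name, partition_name)

print (compress_partition_ddl ("sales_history", "p_2018_04"))
# -> alter table sales_history move partition p_2018_04 compress
```
</pre>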
The
first strategy means that all old partitions have the data in them
compressed, and so take up minimal disk storage. However, the data in
the current partition is not compressed at all. The old partitions only
ever need to be compressed once, because this is an ideal scenario and
that old data is never updated or extra data inserted into it.<br /><br />The
second strategy is the same as the first for old partitions, but also
compresses the current partition at scheduled intervals to gain some of
the benefits from compression on the most recently loaded data. How
frequently you compress the current partition depends on various
factors, such as the time to do this compression and the time available
each night for the daily data load into the data warehouse. As the
partition grows in size with more data being added to it, so the
compression will take longer each time.<br /><br /><h3>
The Compressed Row Update Problem</h3>
Both
strategies are straightforward and only use the features of Basic
Compression. However, real world data warehouses are not "perfect" or
"ideal" - old data can arrive after a period has ended and must be
loaded into old, existing partitions, and corrections can come through
for old data needing updates to the existing data. If your data volumes
are very high, both within each partition and for the number of rows of
old or corrected data, then your compressed partitions can become less
compressed as a result and can grow in size. <br /><br />As well as an
increase in the disk storage used, there is also a potential performance
penalty to pay for such data changes in old partitions due to the way
the update physically happens within the data block. When the
compressed row is updated it will not fit back into its original slot in
the data block, because it will have grown in size one way or another.
This means that Oracle will try and move that row to the end of the
data block into the free, unused space area as reserved by the "<code>PCTFREE</code>" setting for that table. However, Basic Compression sets "<code>PCTFREE</code>"
to zero by default, so there is nowhere for such updated rows to go
within its current data block. What Oracle will do is "insert" that row
into a new data block (as used for normal inserts), and leave behind a
forward pointer in the block where the row used to be. This is termed
"row migration" - the row's physical storage has migrated to another
block within the table.<br /><br />This row migration has a negative impact
on index based reads, because the index structure has the ROWID of each
data row in its leaf blocks, and the ROWID is really a map to the
physical location of the data row - it specifies things such as the data
file, block and offset within the block within the encoding in the
ROWID. When a query using such an index goes to read a migrated data
row it first reads the original data block as indicated by the ROWID and
rather than the expected data row instead finds a forward pointer
(another ROWID) to where the row has now been stored. So the query must
issue a second read against the second data block in order to get the
data row. Thus migrated updated rows cause extra disk reads when
accessed through an index. Such "second, additional disk reads" are
visible in the Oracle statistic of "<code>table fetch continued row</code>" i.e. you can see how many table row fetches cause an additional fetch due to migrated data rows.<br /><br /><h3>
Advanced Compression</h3>
In such scenarios you <em>might</em>
assume that using the OLTP Compression feature of Advanced Compression
would solve this. You would hope that it would keep the data in the old
partitions compressed for you without any explicit action needed from
you, and avoid or minimize the number of migrated rows. Unfortunately
this is not true, and in fact using OLTP Compression will probably
result in a severe performance penalty during the daily data load
while having no positive impact on updated rows being migrated. Which
ultimately means that there is little to no point in using OLTP
Compression on most data warehouses. Of course, there may be some
scenarios where it might be of some benefit, but for the vast majority I
believe it will only have a negative impact overall.<br /><br />The first
performance penalty is that it directly slows down the INSERTs
being executed by the daily data load jobs, as the blocks are compressed
as they fill up. This is to be expected of course, and you would hope
that the performance penalty was relatively small and would ensure that
all blocks in a table were compressed.<br /><br />Unfortunately there is a second and more substantial performance penalty arising from this, because each block <em><strong>will be compressed multiple times</strong></em>
during the data load. And this repeated compression of each block
multiplies up the total CPU time cost used during the daily data load to
the level where it can more than double the elapsed time of the data
load jobs. This is a consequence of the way OLTP Compression works.
Compression is triggered each time a block "becomes full", which will
happen multiple times during the data load process. After a block is
compressed for the first time, it will now have free unused space in it,
into which new, subsequent rows can be inserted. So the same block
will repeatedly fill up, compress, fill up again, compress, fill up
again, compress, until eventually it is full of only compressed rows and
another row cannot fit in the unused, free space.<br /><br />And the third
performance penalty is that the frequency of such block compressions is
much higher with longer, larger data rows, such as typically found in a
data warehouse. Longer rows will fill up a block sooner, triggering a
compression sooner and more often as fewer rows are needed each time to
consume the remaining free space i.e. the average number of rows
inserted between a compression occurring gets lower as the row size
increases. And the more compression operations that happen during a
data load, the more CPU time is consumed for a given
number of rows. See the end of this post for the detail of the maths
behind this.<br /><br />This triple whammy combination - every <code>INSERT</code>
slows down, blocks are compressed multiple times during the load,
longer rows lead to more frequent block compressions - can cause data
load times to double or triple in elapsed time. The precise level of
the performance impact depends on various factors such as the rows
sizes, the level of compression achievable, and the CPU's in your
computer system.<br /><br />Conversely Basic Compression only ever
compresses each block once, so the total CPU used by it is far lower
than for Advanced Compression. And because the compression operation
takes place separately, there is no direct impact on the daily data
loads themselves i.e. they continue to take the same elapsed time as
normal.<br /><br />But that is not the end of the negative impacts of using OLTP Compression in a data warehouse. The other big negative is that <em><strong>it does not actually solve the original problem</strong></em> you were experiencing - updated rows being migrated. Under OLTP Compression, only an <code>INSERT</code> will trigger a compression of a block, never an <code>UPDATE</code>. This has been shown by various people:<br /><ul>
<li>Oracle Blog post on <a class="externalLink" href="https://blogs.oracle.com/dbstorage/updates-in-row-compressed-tables" target="_blank" title="External link to https://blogs.oracle.com/dbstorage/updates-in-row-compressed-tables">Updates in Row Compressed Tables</a></li>
<li>Oracle Blog post of a copy of the <a class="externalLink" href="https://blogs.oracle.com/db/master-note-for-oltp-compression" target="_blank" title="External link to https://blogs.oracle.com/db/master-note-for-oltp-compression">Master Note for OLTP Compression</a></li>
<li>An article series on Compression by Jonathan Lewis over at Red Gate, the final one covering the OLTP Compression scenarios<ul>
<li><a class="externalLink" href="https://www.red-gate.com/simple-talk/sql/oracle/compression-oracle-basic-table-compression/" target="_blank" title="External link to https://www.red-gate.com/simple-talk/sql/oracle/compression-oracle-basic-table-compression/">Part 1 - Basic</a></li>
<li><a class="externalLink" href="https://www.red-gate.com/simple-talk/sql/oracle/compression-in-oracle-part-2-read-only-data/" target="_blank" title="External link to https://www.red-gate.com/simple-talk/sql/oracle/compression-in-oracle-part-2-read-only-data/">Part 2 - Read Only</a></li>
<li><a class="externalLink" href="https://www.red-gate.com/simple-talk/sql/oracle/compression-in-oracle-part-3-oltp-compression/" target="_blank" title="External link to https://www.red-gate.com/simple-talk/sql/oracle/compression-in-oracle-part-3-oltp-compression/">Part 3 - OLTP</a></li>
</ul>
</li>
</ul>
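The repeated fill-compress cycle described earlier can be illustrated with a short simulation. This is a minimal Python sketch assuming the simple 2:1 compression ratio used in the appendix at the end of this post; the function and its model of the block are mine, not Oracle internals:

```python
def simulate_fill(uncompressed_rows_per_block):
    """Simulate OLTP Compression filling one block: insert until full,
    compress, repeat. Returns (compressions, final compressed row count).
    Models the block in units of one compressed row; an uncompressed row
    takes 2 units (i.e. an assumed 2:1 compression ratio)."""
    capacity = 2 * uncompressed_rows_per_block
    used = 0
    compressions = 0
    while capacity - used >= 2:          # room for another uncompressed row
        used += (capacity - used) // 2   # insert rows until the block fills
        compressions += 1                # ...then the block is compressed
    return compressions, used

print(simulate_fill(50))   # 50 uncompressed rows/block → (7, 99)
print(simulate_fill(10))   # longer rows, 10/block      → (5, 19)
```

So a block that holds 50 uncompressed rows is compressed 7 times on its way to holding roughly 100 compressed rows (about one compression per 14 rows), while a block holding only 10 longer rows is compressed 5 times for roughly 20 rows (one per 4 rows), matching the figures in the appendix at the end of this post.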
So OLTP Compression doesn't help at all with the original scenario - <code>UPDATE</code>'s causing row migration with additional "<code>table fetch continued row</code>"
block reads, and increased storage. Your data warehouse will still
have migrated rows from updates, and an increase in storage and disk
reads as a result. The only viable way to force migrated rows to move
back and be compressed is to use Basic Compression.<br /><br />And finally,
for completeness, OLTP Compression typically achieves a slightly lower
compression ratio than Basic Compression. This is because they both use
the same compression mechanism, and OLTP Compression leaves the value
of <code>PCTFREE</code> at the table / partition level setting unchanged (10% by default) whereas Basic Compression sets <code>PCTFREE</code>
to 0 by default during the compression, so achieving more compressed
rows per block. The difference in compression level will be small,
but it is there nevertheless.<br /><br /><h3>
Conclusion</h3>
Although
Basic Compression is not "perfect", for a data warehouse it is by far
and away a better choice than trying to use the OLTP Compression feature of the
Advanced Compression option. The performance penalties from the
significant increase in CPU usage during data loads when using OLTP
Compression are very severe. And ultimately it does not even address
the original problem we were concerned about - <code>UPDATE</code>'s causing row migration into different data blocks, resulting in additional "<code>table fetch continued row</code>"
block reads from queries needing those rows. And that's why sticking
with the free Basic Compression used on a scheduled basis after a data
load can be much more efficient for a data warehouse, combined with a
well designed partitioning strategy.<br /><br /><h3>
Appendix - Why Longer Rows Lead To More Frequent Compressions</h3>
I
stated earlier that, when using Advanced OLTP Compression, longer
rows will both trigger block compressions sooner, and more frequently
i.e. more compressions on average per row (or conversely, fewer rows
inserted to a block on average between compressions). I will here
assume a simple 2:1 compression ratio between an uncompressed row and
its compressed form as it keeps the maths a bit simpler, but still shows
what is going on.<br /><br />If you can fit 50 uncompressed rows into a
block then you can fit 100 compressed rows into the same block. As the
compression halves the size of the uncompressed rows, the block is
compressed multiple times when data is inserted into it. First 50 rows
are inserted, which triggers a compression leaving half the block free,
so 25 more rows can be inserted, triggering another compression, then 13
rows can be inserted, the block is compressed, etc. Until eventually
100 rows fit in, but another one will not.<br /><br />This series of
halvings of how many rows can be loaded at a time is really like a sum
over the series of powers of the number 2. The last load will be 1 row,
the one before that of 2 rows, 4 before that, and so on. This means
that the block is "full" when 2 to a certain power is larger than the
number of rows that can fit in that block i.e. after that many
compressions another row will not fit in the block.<br /><br />For 100 rows in a block, 2<sup>7</sup>
(2 to the power 7) = 128 is the first power of 2 larger than 100, so 7 compressions are needed. If each
block is compressed 7 times as it has rows inserted into it to achieve
100 rows in a block, then on average there is a compression every 100 / 7
= 14.29 rows.<br /><br />For a much longer row that can only fit 10 uncompressed rows per block, then 20 compressed rows will fit in that block. And 2<sup>5</sup>
is 32 (2 to the power 5) which is larger than 20, so 5 compressions
will be needed to fill each block. As 20 rows will have been loaded
into a block then a compression is occurring on average every 20 / 5 = 4
rows.<br /><br />This is a <em><strong>lot more frequent</strong></em> than
the previous case i.e. over 3 times worse on average! Thus OLTP
Compression on a data warehouse with larger rows leads to a <em><strong>much larger CPU penalty</strong></em>
from all of the repeated block compression that will be occurring
during data loads. This is why the alternative of just doing a normal load
followed by a one-off compression using Basic Compression can be much
more efficient.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-32626072678177454182018-04-21T16:06:00.000+01:002018-04-21T16:06:50.941+01:00Top Oracle Monitoring Views & Tables<div>
<div class="viewer">
<h2>
Introduction</h2>
I can be a bit "old school" at
times on some things, and I like to be familiar with the underlying
technology being used by advanced features. So while using Enterprise
Manager to see what is happening on your database can be easier and
quicker than running queries manually, there may be times when EM is not
available on a particular database or it doesn't provide a particular
piece of information you are after. Knowing where this information is
held in the database instance lets you go direct to the source to get
what you need, which can either be quicker in some circumstances or the
only option in other cases.<br /><br />The key information about activity on
a database instance is available in a handful of dynamic performance
views. While using Enterprise Manager can be quicker and easier, don't
be afraid of these dynamic performance views. Learn about them and what
is in them and they can be your friends, and not your enemies.<br /><br />Generally I break these activity related views down into <em>current, real time</em> activity happening now and <em>historical</em> activity that happened some time ago. And the historical activity can be further split into <em>very recent</em> as opposed to <em>long ago</em> activity, where "long ago" is probably more than an hour ago.<br /><br /><h2>
Current Session Activity</h2>
The main view has to be <code>V$SESSION</code>, which has a record per connected session. You can query <code>V$SESSION</code>
to see what each session is currently doing (the SQL it is executing
(if any), or if it is waiting or blocked, or idle), or query it
repeatedly to see how a session changes over time (blocked or moving
through different SQL).<br /><br />This view has been expanded over the
different versions of Oracle and now includes a richer set of data about
what each session is doing spread across a number of useful columns
e.g. <code>STATUS</code>, <code>TYPE</code>, <code>SQL_ID</code>, <code>SQL_CHILD_NUMBER</code>, <code>ROW_WAIT_*</code>, <code>BLOCKING_SESSION</code>, <code>EVENT</code>, <code>WAIT_TIME</code>, <code>SECONDS_IN_WAIT</code>, <code>STATE</code>.<br /><br />When a session waits for something to complete, such as a disk wait, you can see this wait in <code>V$SESSION</code>.
But this is transitory, being about the current wait that session is
experiencing. These session wait times are also cumulatively recorded
in <code>V$SESSION_EVENT</code>, which records both the wait event and
the total wait time since the session connected to Oracle. You cannot
just use the current, total values in this, as they are cumulative since
the session was created. However you can compare the changes to this
over a short period of time, which will tell you the events that the
session waited on during that period of time, and how long it waited for
them. Tanel Poder has given us the very useful <a class="externalLink" href="http://blog.tanelpoder.com/files/scripts/snapper.sql" target="_blank" title="External link to http://blog.tanelpoder.com/files/scripts/snapper.sql">snapper.sql</a>
script that snapshots this dynamic performance view in real time and
reports on what changed i.e. the waits a session is experiencing.<br /><br /><h2>
Recent Active Session Activity</h2>
In version 10 (10.1) Oracle added in the new dynamic performance view of <code>V$ACTIVE_SESSION_HISTORY</code> (ASH), which contains point in time snapshot copies of data from <code>V$SESSION</code>. The idea is that you can now look back over recent history to see what <em>was</em> happening recently in individual sessions, letting you investigate issues <em><strong>soon after</strong></em>
they have happened. Potentially it also lets you investigate temporary
problems that do not persist long enough to investigate in real time
using <code>V$SESSION</code> directly. <br /><br />The manual says "<em>It
contains snapshots of active database sessions taken once a second. A
database session is considered active if it was on the CPU or was
waiting for an event that didn't belong to the <code>Idle</code> wait class.</em>"
The one snapshot every second is to minimise any performance impact on
the system and the storage requirements, and it also only saves the
data for active sessions i.e. nothing is copied for inactive, idle
sessions. The data from each snapshot is then held in memory within the
SGA, and accessible via the dynamic performance view of <code>V$ACTIVE_SESSION_HISTORY</code>. There is a limit to how much memory it will use, but it is intended to cover the past 30 minutes.<br /><br />Note
that Active Session History and AWR (Automatic Workload Repository) are
part of the "Oracle Diagnostics Pack", and a license for this must be
purchased before you have the right to use either of these i.e. these
features are not part of the core Oracle Database license. And such
additional "Packs" are only available for the Enterprise Edition of the
Oracle Database.<br /><br />Most of the columns in <code>V$ACTIVE_SESSION_HISTORY</code> are the same as in <code>V$SESSION</code>, with the addition of a <code>SAMPLE_TIME</code> column. You can query it in similar ways to how you might query <code>V$SESSION</code>, bearing in mind that the same <code>SESSION_ID</code>
can be repeated across different samples. Using ASH you can do things
such as identifying the most frequent wait events, or sessions
experiencing a specific wait event, or any session experiencing long
waits.<br /><br /><h2>
Historical Session Activity</h2>
As stated, the data in <code>V$ACTIVE_SESSION_HISTORY</code>
is held in memory in the SGA and is limited in size. Oracle will also
automatically save 10% of this data to a table on disk in the AWR
(Automatic Workload Repository), so you have an even longer period of
history to query (up to the AWR retention limits) but at a lower sample
rate. Even though the sample rate is much lower, the idea is that any
peaks in activity or contention will still be captured and be available
for analysis later on in <code>DBA_HIST_ACTIVE_SESS_HISTORY</code>.<br /><br />The
actual way it decides what samples to save is somewhat complicated i.e.
not just a straight "1 in 10" of the samples from the in-memory ASH.
It saves 1 in 10 of the sample sets when there was activity, but saves <em>all</em> of the session data from such a saved sample. This way you get <em>all</em>
of the active session data from a saved sample, rather than just 1 in
10 of the active sessions, which would leave an incomplete picture of a
sample set.<br /><br />You can query <code>DBA_HIST_ACTIVE_SESS_HISTORY</code> in much the same way as you would query <code>V$ACTIVE_SESSION_HISTORY</code>, except that there are fewer samples covering longer periods of time.<br /><br /><h2>
Historical System Activity</h2>
The
previous views are all about individual sessions, but Oracle also has a
lot of information about other activity on the database as a whole in
the other dynamic performance views. Most of these hold cumulative
values - the data values only ever go up as new measurements are added
in. In many cases their data is essentially a sum of the activity
across all of the sessions in the system. These views can provide other
useful information about what has been happening on a system. Oracle
also takes snapshots of these data sets into the AWR, into a series of <code>DBA_HIST_*</code> tables, for which you again need the Diagnostics Pack license. <br /><br />The
default snapshot interval is every hour, and they are retained for 7
days (these can be changed if desired, which can be useful). An AWR
snapshot copies all data in all of the relevant dynamic performance
views into their corresponding <code>DBA_HIST_</code> table with a <code>SNAP_ID</code>
column added in to identify the relevant snapshot. Unlike ASH, which is
only a sampled subset, all of the data in the corresponding dynamic
performance views is saved on each snapshot, so there are no issues over
any missing data within a snapshot.<br /><br />You can make use of this AWR
data either by running a standard AWR report which summarises all
activity between two snapshots, or by writing your own queries directly
against these <code>DBA_HIST_*</code> tables for things of interest to
you. The latter can be useful if you don't want to have to produce a
set of AWR reports and then read through them manually to find something
of interest. The right kind of queries can summarise key activity data
across multiple snapshots, helping you identify any abnormalities or
peak activity periods or anything else of interest.<br /><br />In the same way that you might look at a dynamic performance view to see changes in real time, you can query the corresponding <code>DBA_HIST_*</code>
table to see what changed over longer periods of time and between
multiple snapshots. But your query will need to get two sets of
snapshots (using the <code>SNAP_ID</code> column probably) and subtract
the cumulative values of one from the other to get the change in value
over that time i.e. the actual measure of the activity over that period
of time. I've <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2016/10/awr-summary-data-extracts.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2016/10/awr-summary-data-extracts.html">blogged about querying the AWR DBA_HIST_ tables directly</a> before.<br /><br />For reference here are a few dynamic performance views and the corresponding <code>DBA_HIST_</code> table they get saved to by an AWR snapshot:<br /><ul>
<li><code>V$SYSSTAT</code> (individual named system statistic counters) maps to <code>DBA_HIST_SYSSTAT</code></li>
<li><code>V$SYSTEM_EVENT</code> (wait event occurrence and time counters) maps to <code>DBA_HIST_SYSTEM_EVENT</code> and to <code>DBA_HIST_BG_EVENT_SUMMARY</code> for Background events</li>
<li><code>V$SQL</code> has no direct map, but <code>V$SQLSTATS</code> maps to <code>DBA_HIST_SQLSTAT</code></li>
</ul>
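The snapshot-delta arithmetic described above - subtracting one set of cumulative values from the other - can be sketched in a few lines of Python. The statistic names and values here are invented for illustration, not real <code>DBA_HIST_SYSSTAT</code> contents:

```python
def snapshot_delta(earlier, later):
    """Subtract the cumulative counter values of an earlier snapshot from
    a later one, giving the activity that occurred between the two."""
    return {stat: later[stat] - earlier.get(stat, 0) for stat in later}

# Hypothetical cumulative values from two snapshots (say SNAP_ID 100 and 101)
snap_100 = {"physical reads": 1_500_000, "user commits": 42_000}
snap_101 = {"physical reads": 1_620_000, "user commits": 43_500}

print(snapshot_delta(snap_100, snap_101))
# → {'physical reads': 120000, 'user commits': 1500}
```

The same subtraction applies whether the counters come from two AWR snapshots or from two manual reads of a cumulative dynamic performance view.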
<h2>
SQL Statements Executed</h2>
The
other approach to investigating performance problems is to look at the
SQL statements being executed, as opposed to what individual sessions
are doing or the system as a whole. The most relevant dynamic
performance views for SQL execution information are:<br /><ul>
<li><code>V$SESSION</code> - contains the <code>SQL_ID</code> of the SQL statement a session is currently executing (if any)</li>
<li><code>V$SQL</code> and <code>V$SQLSTATS</code> for all SQL statements recently executed, with execution statistics, etc.<ul>
<li>Data columns include:- <code>SQL_TEXT</code>, <code>CHILD_NUMBER</code>, <code>USERS_EXECUTING</code>, <code>EXECUTIONS, FETCHES, BUFFER_GETS, CPU_TIME, ROWS_PROCESSED</code></li>
<li>Remember that these activity counters are cumulative and simply increase in value on each execution of that SQL statement</li>
<li>SQL statements are retained in <code>V$SQL</code>
while they are being executed, but after that they can be removed for
that slot to be reused by another SQL statement. So there is no
guarantee that <code>V$SQL</code> contains <em>all</em> SQL statements recently executed</li>
<li>However, the most frequently executed SQL statements will tend to remain in <code>V$SQL</code></li>
</ul>
</li>
</ul>
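Since the counters above are cumulative across executions, a common way to use them is to derive a per-execution figure and rank statements by it. A minimal Python sketch, with invented rows standing in for what a fetch from <code>V$SQLSTATS</code> might return (the SQL_IDs and values are hypothetical):

```python
# Hypothetical rows as might be fetched from V$SQLSTATS (values invented)
sql_stats = [
    {"sql_id": "a1b2c3d4e5f6g", "executions": 1000, "buffer_gets": 5_000_000},
    {"sql_id": "h7i8j9k0l1m2n", "executions": 10,   "buffer_gets": 4_000_000},
    {"sql_id": "o3p4q5r6s7t8u", "executions": 500,  "buffer_gets": 1_000_000},
]

def gets_per_exec(row):
    # Guard against a statement still on its first execution
    return row["buffer_gets"] / max(row["executions"], 1)

# Most expensive statements per execution first
worst_first = sorted(sql_stats, key=gets_per_exec, reverse=True)
print([row["sql_id"] for row in worst_first])
# → ['h7i8j9k0l1m2n', 'a1b2c3d4e5f6g', 'o3p4q5r6s7t8u']
```

Ranking by total <code>BUFFER_GETS</code> alone would have put the first statement on top; the per-execution view surfaces the second one instead.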
<br /><h2>
Summary</h2>
The
following dynamic performance views and AWR tables can directly provide
you with information about what is happening and what has happened on an Oracle
database instance, and are extremely useful for investigating
performance problems. Remember that to use some of these you must have
purchased the Diagnostics Pack for your Enterprise Edition Oracle
database software.<br /><ul>
<li><code>V$SESSION</code> for what is happening now in each session</li>
<li><code>V$SESSION_EVENT</code> for total waits a session has experienced</li>
<li><code>V$ACTIVE_SESSION_HISTORY</code> for recent session history</li>
<li><code>DBA_HIST_ACTIVE_SESS_HISTORY</code> for a sampled subset of session history over a longer period of time</li>
<li><code>DBA_HIST_SYSSTAT</code>, <code>DBA_HIST_SYSTEM_EVENT</code>, <code>DBA_HIST_SQLSTAT</code> and <code>DBA_HIST_SYS_TIME_MODEL</code> in the AWR for full copies of the corresponding dynamic performance view taken at regular intervals</li>
<li><code>V$SQL</code> and <code>V$SQLSTATS</code> for the most recently executed SQL statements, still present in the shared pool in the SGA</li>
</ul>
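As a footnote to the ASH entries above: the way AWR keeps 1 in 10 of the in-memory sample sets, but every active session within a kept sample, can be sketched as a toy model. The data layout and session ids below are invented for illustration, and Oracle's real selection logic is more involved:

```python
def save_to_awr(ash_samples, keep_every=10):
    """Toy model: keep every Nth in-memory ASH sample set whole, so every
    active session in a kept sample survives; the other sample sets are
    discarded entirely."""
    return [s for i, s in enumerate(ash_samples) if i % keep_every == 0]

# Eleven one-second sample sets; each holds the active session ids seen
# in that second (all values invented for illustration)
ash = [[10, 23], [10], [], [44, 10, 23], [23],
       [10], [44], [], [23, 44], [10],
       [44, 10]]
saved = save_to_awr(ash)
print(saved)   # → [[10, 23], [44, 10]]
```

The point is that a kept sample is complete in itself, so activity within it can still be analysed as a consistent picture of that second.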
</div>
</div>
John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-1896973458592464772018-03-28T19:03:00.000+01:002018-03-28T19:03:44.796+01:00Hash Join Overflow to Disk Summary[I'm a bit busy at the moment working with a new client on a large and complex database, so I'm afraid I've only got time to do a quick post. I have a new post on a different topic planned, but am struggling with the time needed to write it up properly, so a short post is all I can squeeze in]<br />
<br />
This is just a summary of the key results from my findings on how Oracle
is costing a Hash Join when it overflows to disk, pulling all the
conclusions together into one place.<br /><br />In the <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html">first post</a> I arrived at the following formula:<br />
<ul>
<li>( ((Build Columns Size + 12) * Build Row Count) + ((Probe Columns Size + 12) * Probe Row Count) ) * 0.0475</li>
</ul>
Where
the "Columns Size" is the sum of the hash table storage for each column
i.e. column data storage + 2 bytes per column, and the "Row Count"
values are per 1,000 rows.<br /><br />In the <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-costing-2.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-costing-2.html">second post</a> I showed that the costing was unaffected by the value of the Multi-Block Read Count parameter i.e. the costing was done only using single block disk I/O's.<br /><br />In the <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2018/02/hash-join-overflow-costing-3-simpler.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2018/02/hash-join-overflow-costing-3-simpler.html">third post</a> I proposed a simpler formula using only values shown in an execution plan of:<br /><ul>
<li>( (Build KBytes + (12 * Build Rows / 1000) ) + (Probe KBytes + (12 * Probe Rows / 1000) ) ) * 0.0485</li>
</ul>
This
is only slightly less accurate than the original formula, but is much
simpler to use as it only needs values reported in the execution plan,
and no other calculations or derived values by you.<br /><br />In the <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2018/02/hash-join-overflow-costing-4-temporary.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2018/02/hash-join-overflow-costing-4-temporary.html">fourth post</a> I showed that the reported size of the Temporary Space (TempSpc)
in the execution plan is actually the size of the hash table itself,
being dependent only on the size of the Build data set. And that
knowing this hash table size will help you in deciding how much larger
it is than the available memory in the PGA used to store such hash
tables.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-18375360714471015172018-02-14T10:35:00.000+00:002018-02-14T10:35:14.430+00:00Hash Join Overflow Costing #4 - Temporary Space + PGA UseNow I want to look at the value reported for "Temporary Space" (TempSpc)
and how to check whether the hash table is close to fitting in memory
within the PGA or not i.e. just a bit too big or way too big.<br />
<br />
<h2>
Temporary Space</h2>
It
seems that this is simply the size of the hash table itself, which is
in turn the size of the Build data set plus an additional 12 bytes per
row. Which means that we can use part of the cost formula posted
earlier for this to estimate the size in KiloBytes:<br /><ul>
<li>Build KBytes + (12 * Build Rows / 1000)</li>
</ul>
The evidence I have for this is:<br /><ul>
<li>The "TempSpc" value only changes when the Build size changes - never when the Probe size changes</li>
<li>When additional columns are added to Build data set in the <code>SELECT</code> the "TempSpc" value increases by the same amount as the Build size increases (for the same number of rows)</li>
<li>When the number of rows in the Build data set increases, the "TempSpc" value increases by an additional 11.76 bytes per additional row<ul>
<li>Which is almost 12 bytes per row, as mentioned before as overhead in the hash table</li>
</ul>
</li>
</ul>
You
could of course "refine" the above formula and use "11.76" instead of
"12" and that would give you a value for the hash table that is much
closer to what is reported for the "TempSpc" in the execution plan. But I think that using 12 is good enough for an estimate.<br />
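Putting that into code, a minimal Python sketch of the estimate (the function name is mine; the example uses the Build values from the execution plan shown elsewhere in this series - 1,123 KB and 50,000 rows, with a reported TempSpc of 1712K):

```python
def tempspc_estimate_kb(build_kbytes, build_rows, per_row_overhead=12):
    """Estimate the TempSpc (hash table) size in KB from the Build data
    set values reported in an execution plan: Bytes in KB plus a fixed
    per-row overhead."""
    return build_kbytes + (per_row_overhead * build_rows / 1000)

# Build values from the example plan: 1,123 KB over 50,000 rows
print(tempspc_estimate_kb(1123, 50_000))         # → 1723.0, near the 1712K reported
print(tempspc_estimate_kb(1123, 50_000, 11.76))  # the "refined" 11.76 bytes/row figure
```

The refined 11.76 figure lands within a kilobyte of the reported value, but as noted above the round 12 is close enough for sizing decisions.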
<br />
<h2>
Hash Table Size vs PGA Size</h2>
I mentioned in the <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html">first post</a> that the Optimizer uses the value of the internal initialization parameter "<code>_smm_min_size</code>"
for the memory it can use as the work area for the hash table within
the PGA. Thus if the size of the hash table is larger than this the
Optimizer will cost the Hash Join as an overflow to disk operation.
Using the formula posted above for the Temporary Space we can now
estimate the size of the hash table needed and compare it to this
minimum work area size value.<br /><br />The notes I have on "<code>_smm_min_size</code>"
say that it will be 0.1% of the PGA when the size of the PGA is greater
than 128 MB, otherwise it will be 128 KB as a minimum. Assuming your
PGA is larger than 128 MB then you can calculate what this is yourself.
Note also that its value is reported in 10053 trace files along with
all other initialization parameters, so you can directly verify its
value if you want to.<br /><br />At run time, though, the PGA may not be
heavily used and may have a lot of unused memory available. An
executing query can allocate and use more than the minimum work area
value at run time for execution plan operations, up to a limit specified
by the "<code>_smm_max_size</code>" initialization parameter. For PGAs larger than 1 GB this is set to be 10% of the PGA size. For smaller PGAs it is 100 MB, or 20% when the PGA is less than 500 MB.<br /><br />Knowing
how large the hash table needed will be, lets you compare it to the
maximum possible work area size in your PGA and see what the likelihood
is of the hash table being able to fit within the PGA at run time. As
the ratio between the minimum and maximum possible work area sizes is
100 (0.1% to 10%) this is a large range and a hash table may have a
reasonable chance of fitting in the PGA at run time, depending on how
active your system is and the other queries running.<br /><br />Equally it
lets you consider what the benefit might be from increasing the size of
the PGA significantly if you have a critical query involving a large
hash join operation that is overflowing to disk. That said, there are
some ways of rewriting queries that can minimise the size of the data
being pulled through into the hash table used by the hash join, and so
potentially get a smaller hash table to fit into memory in the PGA. But
this depends on multiple factors, mainly being the number of columns of
data coming through in the Build data set. If this is only a few
columns then there won't be anything that can be saved by such a query
rewrite.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-64453694082122793272018-02-06T11:40:00.000+00:002018-02-06T11:40:58.455+00:00Hash Join Overflow Costing #3 - Simpler FormulaSo far I have <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html">offered a formula for how a Hash Join that overflows to disk is costed</a> and <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-costing-2.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-costing-2.html">confirmed that this is only costed in terms of single block disk operations</a>.
While that formula produces very accurate results (less than 1%
difference to the reported cost for the Hash Join operation) it requires
you to obtain size information on each individual column being
retrieved from both the Build and Probe data sets. And this could be
quite tedious when there are many columns involved, or you don't have an
easy way to work out the source columns involved in each data set.
There is a simplification we can make to eliminate this detailed per
column information and just use the information reported in the
execution plan itself.<br /><br />All of the columns being retrieved for the
Build and Probe data sets are actually included within the "Bytes"
value reported in the execution plan for each data set. And this value
includes both the individual column data and any per-column overhead as
well. What it does not include is the extra 12 bytes per row overhead
needed in the hash table itself. We can approximate the size of this
second part of the data by using the "Rows" value reported in the
execution plan.<br /><br />Examining the "Bytes" and "Rows" values in the
execution plans for my test queries (see first post for details of these
queries), I can see that the increase in the Hash Join cost is about
0.0485 per KB of increase in either data set (Build or Probe) for the
same number of rows. Previously I determined that there was a 12 byte
overhead within the hash table per data row expected.<br /><br />This produces a revised Hash Join cost formula <em>using only values from the execution plan</em> of:<br />
<ul>
<li>( (Build KBytes + (12 * Build Rows / 1000) ) + (Probe KBytes + (12 * Probe Rows / 1000) ) ) * 0.0485</li>
</ul>
Note
that the "Bytes" values used must be in units of KB (as reported in the plan), whereas the
"Rows" value is a plain count and so is divided by 1000. Checking this formula against
the actual Hash Join costs of the test queries I ran, I can see that it
has an error of about 1% i.e. it is not as accurate as the previous
formula, but I believe it is still accurate enough to be useful. <br /><br />Let's check how close this is for one of the test queries. Here is the execution plan produced:<br /><pre>--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 1266 (100)| |
| 1 | SORT AGGREGATE | | 1 | 30 | | | |
|* 2 | HASH JOIN | | 50000 | 1464K| 1712K| 1266 (1)| 00:00:01 |
| 3 | TABLE ACCESS FULL| HJ50K | 50000 | 1123K| | 365 (1)| 00:00:01 |
| 4 | TABLE ACCESS FULL| HJ100K | 100K| 683K| | 728 (1)| 00:00:01 |
--------------------------------------------------------------------------------------
</pre>
We can see:<br /><ul>
<li>The Build and Probe data access cost is 365 + 728 = 1093</li>
<li>The cost of the Hash Join operation itself is 1266 - 1093 = 173</li>
<li>The calculated approximate cost is (1123 + (12 * 50) + 683 + (12 * 100)) * 0.0485<ul>
<li>= (1123 + 600 + 683 + 1200) * 0.0485</li>
<li>= 3606 * 0.0485 </li>
<li>= 174.891</li>
</ul>
</li>
<li>The difference is +1.891, which is 1.1% of the actual cost</li>
</ul>
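The arithmetic above is easy to wrap in a small helper. Here is a minimal Python sketch; the function name is just illustrative, and the 0.0485 multiplier and 12 byte per-row overhead are the values derived in this post from my test system, so they may differ on other systems:

```python
def hash_join_overflow_cost(build_kbytes, build_rows, probe_kbytes, probe_rows,
                            factor=0.0485):
    """Approximate cost of a Hash Join that overflows to disk, using only
    the "Bytes" (in KB) and "Rows" values reported in the execution plan.
    The 12 bytes per row is the observed per-row hash table overhead."""
    build = build_kbytes + (12 * build_rows / 1000)
    probe = probe_kbytes + (12 * probe_rows / 1000)
    return (build + probe) * factor

# The worked example above: Build = 1123K / 50,000 rows, Probe = 683K / 100,000 rows
cost = hash_join_overflow_cost(1123, 50000, 683, 100000)
print(round(cost, 3))  # 174.891, about 1.1% above the reported cost of 173
```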
This
formula can therefore be a useful check against the cost reported when a
Hash Join operation overflows to disk, and for determining which is the
biggest cost factor i.e. the Build or the Probe data sets.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-3592802571319662512018-01-23T10:22:00.000+00:002018-01-23T10:22:19.472+00:00Hash Join Overflow Costing #2In my <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html">previous post on costing a Hash Join overflow to disk</a>
I came up with a formula for the cost of a Hash Join operation that overflows to disk based on tests I had done, but I also
mentioned that there might be other factors involved not yet exposed by
my testing. My main concern was whether the disk I/O's involved were
all of the same type, or a mix of single block and multi-block disk
I/O's. I've now done some further testing, and have something to share.<br /><br />The
tests show that all the disk I/O's are costed by the Optimizer as
single block disk I/O's. I say this because I changed the cost of
multi-block disk I/O's and the cost reported for the Hash Join operation
itself did not change. Hence it must be costed as single block disk
I/O's by the Optimizer.<br /><br />The tests I was doing were a join between two tables that used a Hash Join in the execution plan (see <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2018/01/hash-join-overflow-cost-formula-1.html">previous post</a> for the SQL used in this query). The key points are:<br />
<ul>
<li>The two source tables are accessed by Full Table Scans (TABLE ACCESS FULL) in the execution plan</li>
<li>These feed into a HASH JOIN operation</li>
<li>Which in turn feeds into a final operation - a SORT AGGREGATE for the SUM's in the SELECT output</li>
<li>By adding together the costs of the source data accesses and subtracting that sum from the total cost reported for the Hash Join, we get the net cost of the Hash Join operation itself</li>
</ul>
Oracle costs disk I/O operations using a set of statistics it has about the computer system it is running on - termed <a class="externalLink" href="https://docs.oracle.com/en/database/oracle/oracle-database/12.2/tgsql/gathering-optimizer-statistics.html#GUID-AC23D691-5C54-4EF9-BF9F-65121F2AFC31" target="_blank" title="External link to https://docs.oracle.com/en/database/oracle/oracle-database/12.2/tgsql/gathering-optimizer-statistics.html#GUID-AC23D691-5C54-4EF9-BF9F-65121F2AFC31">System Statistics</a>
and stored in the database itself. These can be a minimal set of just a
few statistics or a more complete set of statistics providing a more
detailed breakdown of different types of operations on the computer.<br /><br />In
my Oracle 12 instance (12.1.0.2.0) I only have the "No Workload"
default system statistics set in the database, which is the minimal set
of statistics. This provides values for the following:<br /><ul>
<li><code>IOSEEKTIM</code> - Time in milliseconds for a disk to move the read / write head to the track you want to access - default value of 10</li>
<li><code>IOTFRSPEED</code> - Bytes transferred per millisecond once a transfer starts - default value of 4096</li>
<li><code>MBRC</code> - Multi-block read count in blocks achieved on average - default value of 8</li>
<li><code>CPUSPEEDNW</code> - Speed of CPU in units of millions of cycles per second (not relevant here)</li>
</ul>
<strong>Note</strong> that the <code>MBRC</code> here is not the same as the initialization parameter <code>db_file_multiblock_read_count</code>, even though their definitions seem to be the same. This <code>MBRC</code>
is not set from the initialization parameter, and remains the same
within the system statistics until explicitly changed (one way or
another). <code>MBRC</code> is intended to represent the actual size of
multi-block reads that were achieved on your computer system, and is
set when Workload statistics are collected or when set manually.<br /><br />The
Optimizer uses these minimal system statistics to derive the values for
the elapsed times for single and multi-block reads (<code>SREADTIM</code> and <code>MREADTIM</code> respectively). The formulae it
uses are:<br /><pre>SREADTIM = IOSEEKTIM + (DB_BLOCK_SIZE / IOTFRSPEED)
MREADTIM = IOSEEKTIM + (MBRC * DB_BLOCK_SIZE / IOTFRSPEED)
</pre>
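These two derivations are simple enough to sanity check in a few lines of Python. This is just a sketch using the No Workload defaults listed above and an 8 KB block size:

```python
IOSEEKTIM = 10        # ms for the disk head to seek (No Workload default)
IOTFRSPEED = 4096     # bytes transferred per ms (No Workload default)
MBRC = 8              # average multi-block read count (No Workload default)
DB_BLOCK_SIZE = 8192  # 8 KB database block size

sreadtim = IOSEEKTIM + (DB_BLOCK_SIZE / IOTFRSPEED)
mreadtim = IOSEEKTIM + (MBRC * DB_BLOCK_SIZE / IOTFRSPEED)

print(sreadtim, mreadtim)             # 12.0 26.0 (ms)
print(round(mreadtim / sreadtim, 3))  # 2.167 - relative cost of a multi-block read
```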
With the default values for System Statistics as stated, and a database block size of 8192 (8 KB) this gives values of <code>SREADTIM</code> = 12 ms, <code>MREADTIM</code> = 26 ms.<br /><br />What
this really means for the Optimizer is that a multi-block read takes 26
/ 12 = 2.167 times longer than a single block read i.e. its "cost" is
2.167 times that of a single block read. This is used by the Optimizer
when costing multi-block read based operations - it "converts" the
multi-block read time into an equivalent number of single block read
operations, which all have a cost unit of 1 and are the basis for how
the Optimizer reports and compares execution plan costs.<br /><br />If I were to increase the <code>MBRC</code> from 8 to 16, then the <code>MREADTIM</code>
would increase to 42 ms, and a cost factor of 42 / 12 = 3.5 times that
of a single block read. The relative net cost though would actually
decrease, because each multi-block read would now be reading 16 blocks
in one disk I/O rather than 8 blocks i.e. half the number of disk reads
are needed to read the same number of blocks in from disk, at a relative
cost ratio of 1.615 (3.5 / 2.167). So the decrease in the number of
disk reads is greater than the relative increase in cost per disk read.<br /><br />If the Hash Join operation involved multi-block disk reads then changing <code>MBRC</code>
would change the relative cost of those disk reads and we would see a
change in the net cost of the Hash Join as reported in the execution
plan.<br /><br />I changed the value of <code>MBRC</code> manually and then shutdown and restarted the database:<br /><pre>exec dbms_stats.set_system_stats ('MBRC', 16)
shutdown immediate
startup
</pre>
This was to ensure that nothing was still present in memory in the
SGA - neither any existing execution plans, or any system statistics
used to cost those execution plans.<br /><br />When I ran the same test
queries again, the total costs decreased because the source data
accesses are full table scans which use multi-block reads i.e. this was
expected, and so the total cost reported at the Hash Join step
decreased. However, when the costs of the source data accesses were
subtracted from the total cost after the Hash Join step, the net cost of
the Hash Join operation itself was exactly the same as it was before
for the same query.<br /><br />Likewise when I increased the <code>MBRC</code> value to 32 there was no change in the net cost of the Hash Join operation when it overflowed to disk.<br /><br />The
conclusion then is that no multi-block disk reads are used within the
costing of the Hash Join operation by the Optimizer when the hash table
it uses overflows to disk.<br /><br />Which means that the formula I posted before for an overflowing Hash Join cost is not dependent on multi-block reads at all:<br /><ul>
<li>( ((Build Columns Size + 12) * Build Row Count) + ((Probe Columns Size + 12) * Probe Row Count) ) * 0.0475</li>
</ul>
Where the "Columns Size" is the sum of the hash table storage for each column i.e. data storage + 2 bytes per column.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-87047648648351341922018-01-11T15:02:00.000+00:002018-01-11T15:02:44.387+00:00Hash Join Overflow Cost Formula #1<h2>
Introduction</h2>
<br />
The Hash Join method was introduced in Oracle
version 7 (7.3 specifically I believe), and one of its main goals was to
be a method that lent itself well to being parallelisable. However, it
is such an efficient join method for larger data volumes even in serial
execution that it is often picked by the Optimizer over Nested Loops or
Sort Merge because of its lower execution cost. This makes the Hash
Join method probably the most frequently used method by the Oracle
Optimizer, often appearing in execution plans for SQL queries.<br /><br />A
Hash Join works by building a table in memory containing the data from
the first data set (termed the Build data set), and then reading the
second data set (the Probe data set) to lookup each data row into the
in-memory table for any match (join). The in-memory table is structured
and accessed by applying a "hash function" to the relevant join data
columns. This hashing has various benefits around performance, handling
any kind of input data value range, and distributing the input values
within the table (minimising bunching values together).<br /><br />When this
hash table fits in memory the additional cost of the Hash Join
operation is negligible because it only involves CPU and memory
operations and these are multiple orders of magnitude faster than disk
accesses i.e. there is often little or no difference to the total
reported cost of the Hash Join over the sum of the costs of access of
each source data set. However, when the hash table needed is larger
than can fit in available memory in the PGA then it must overflow to
disk, which in turn significantly increases the cost of the Hash Join
operation itself.<br /><br />A question I have had for a long time is "<em>How does Oracle cost this overflowing Hash Join operation</em>"?
Can I replicate this cost formula and understand what the main factors
are within this reported cost? Which is a bigger factor - the size of
the Build data set or the Probe data set? Knowing such things it might
offer the possibility of gaining some insights into ways of tuning such
large hash joins. At the least I would know more about how the
overflowing hash join actually works in practice.<br /><br /><a href="https://jonathanlewis.wordpress.com/">Jonathan Lewis</a> gives a description in his <strong>Cost Based Oracle Fundamentals</strong>
book of how the overflowing Hash Join operation works, with a formula
for the major cost components involved. However, I have always found
this formula to be more descriptive than quantitative, and to not be
easy to use to arrive at a comparable value to what the Optimizer has
reported.<br /><br />I would like a more straightforward quantitative
formula that I could use myself to estimate whether a Hash Join will
overflow to disk or not, and how much its cost will be. After some
research I believe I have arrived at such a formula which I will share
here. Note that I'm not saying this is a "perfect" formula, just that
this is the conclusion I have arrived at so far as a result of the tests
I have done, and it seems to fit the results I have very well. I'll
continue to post more details when I refine or revise this in the future.<br /><br />
<h2>
Why Hash Join Overflows</h2>
<br />
The
limit to the size of a single Hash Table or other "work area" in the
PGA is determined by a hidden, internal initialization parameter of "<code>_smm_min_size</code>".
If the size of the Hash Table needed would be larger than this, then
the Optimizer assumes that it will overflow to disk and costs it
accordingly. My notes say that the value of "<code>_smm_min_size</code>" is the larger of 128 KB or 0.1% of <code>PGA_AGGREGATE_TARGET</code>. I cannot find exactly where I got this information from, but my memory is telling me that it was from <a class="externalLink" href="http://oracle-randolf.blogspot.com/" target="_blank" title="External link to http://oracle-randolf.blogspot.com/">Randolf Geist</a>, possibly in a response to a question on one of the Oracle forums.<br /><br />The
main reason Oracle limits the potential size of a Hash Table is to
ensure that a high number of other queries can run concurrently and not
be starved of memory within the PGA. The Optimizer is assuming a worst
case scenario to make sure other sessions do not suffer when a very
large query is executed by one session. However, it is possible that
when executed the hash table <strong><em>will not overflow to disk</em></strong>
i.e. if at the moment that query is executed there is enough free
memory in the PGA, then Oracle will let that session have a larger "work
area" than the value of "<code>_smm_min_size</code>". So even though
the Optimizer has costed the Hash Join operation as an overflow to disk
and costed it accordingly, it does not mean that it will always overflow
to disk when executed.<br /><br />
<h2>
How Overflow Hash Join Works</h2>
<br />Jonathan
Lewis gives a description in his book of how the Hash Join works when
it overflows to disk. I won't repeat the details here as they are not
particularly relevant at the end of the day. But a crude summary would
be that:<br /><ul>
<li>The first data set is read once and broken into
"chunks" that are written out to disk (temporary storage), where each
chunk is a contiguous sub-set of the overall hash table</li>
<li>One or more of these chunks are then kept in memory ready for the pass over the second data set</li>
<li>The second data set is read in and hashed in the normal way:- <ul>
<li>If it hashes to an in-memory chunk then it is matched as normal</li>
<li>Otherwise it is written out to disk (temporary storage) split out into chunks on the same basis as the first data set</li>
</ul>
</li>
<li>Then
remaining chunks of the first data set are read into memory, and the
corresponding chunks of the second data set read again and matched as
normal. This is repeated until all of both data sets have been
processed.</li>
</ul>
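The chunked scheme just described is essentially a classic partitioned ("Grace") hash join. As an illustration only, and not Oracle's actual implementation, here is a simplified Python sketch in which every chunk is spilled (whereas Oracle keeps one or more chunks in memory during the first pass), with lists standing in for the temporary segments written to disk:

```python
def grace_hash_join(build, probe, key_build, key_probe, n_chunks=4):
    """Toy partitioned hash join: both inputs are split into chunks by a
    hash of the join key, then each chunk pair is joined independently."""
    # Pass 1: partition both data sets into chunks by hashed join key
    build_chunks = [[] for _ in range(n_chunks)]
    probe_chunks = [[] for _ in range(n_chunks)]
    for row in build:
        build_chunks[hash(key_build(row)) % n_chunks].append(row)
    for row in probe:
        probe_chunks[hash(key_probe(row)) % n_chunks].append(row)

    # Pass 2: for each chunk, build an in-memory hash table from the
    # build side and probe it with the corresponding probe chunk
    result = []
    for b_chunk, p_chunk in zip(build_chunks, probe_chunks):
        table = {}
        for row in b_chunk:
            table.setdefault(key_build(row), []).append(row)
        for row in p_chunk:
            for match in table.get(key_probe(row), []):
                result.append((match, row))
    return result

# Hypothetical data sets: join on the first element of each tuple
emp = [(1, 'A'), (2, 'B'), (3, 'C')]
dept = [(1, 'X'), (1, 'Y'), (3, 'Z')]
joined = grace_hash_join(emp, dept, lambda r: r[0], lambda r: r[0])
print(sorted(joined))
```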
Essentially this is saying that there will be an
additional pass over both sets of data - after the first read of each
data set (already costed within the execution plan), there is an
additional write out to disk of the majority of each data set, followed
by a read back of each data set. Also extra reads and writes may be
needed in the first pass of each data set, to keep the data in the
"chunks" properly grouped together on disk.<br /><br />It is not clear
whether these write and read operations will be single block or
multi-block disk read operations, or how many of them there will be.
Potentially multi-block reads could be used when reading back in the
pre-hashed chunks from disk. Luckily though this turns out to be
irrelevant to the cost formula I have arrived at.<br /><br /><h2>
Deriving Hash Join Overflow Cost</h2>
<br />Here is how I went about it. I created a set of standard test tables (see later for SQL DDL), each with a mix of <code>NUMBER</code> and <code>VARCHAR2</code>
columns to pad them out a bit, and populated them using a repeatable
"connect by" data generator with a different number of rows in each test
table. I then ran a query joining 2 of these tables together (see
later for SQL), immediately examined the execution plan (from <code>dbms_xplan.display_cursor</code>) and noted the costs of each operation.<br /><br />Without
any indexes on any of these tables the execution plan was always 2 Full
Table Scans feeding into a Hash Join operation. When the smaller,
build table became large enough the Hash Join operation would overflow
to disk, causing its cost to rise significantly, and a "TempSpc" column to appear in the execution plan with a reported value.<br /><br />By
varying only one thing at a time between queries I could see how the
Hash Join cost changed when it was overflowing to disk. I was not
interested in those executions where the Hash Join did not overflow to
disk i.e. where the hash table did fit in memory. Only those executions
that involved the Optimizer assuming it would overflow to disk. By
examining the change in cost for the Hash Join operation for a
corresponding change in only one of the joined tables I could deduce a
multiplying factor being used within the underlying Hash Join cost
calculation.<br /><br />My Oracle version is 12.1 on Oracle Linux, so my
results are only guaranteed to be accurate for that version. I would
assume the results should be the same for 11g, as I don't think anything
significant has changed in how the Hash Join operation is costed, but
that would need to be verified. <br /><pre>BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
PL/SQL Release 12.1.0.2.0 - Production
CORE 12.1.0.2.0 Production
TNS for Linux: Version 12.1.0.2.0 - Production
NLSRTL Version 12.1.0.2.0 - Production
</pre>
<br /><h2>
Hash Join Overflow Formula and its Accuracy</h2>
<br />
I started
by comparing the Hash Join cost when the number of columns in a data set
changed i.e. the size of the data set increased for the same number of
rows. Jonathan Lewis states that the memory needed per column is the
storage size of the column itself plus 2 additional bytes. I observed
that the Hash Join cost changed by a factor of 0.0475 per byte per 1000
rows. And that this multiplying factor was the same under different row
counts in either table in the query. <br /><br />This only involves the
columns needed by the query itself, which the Optimizer must extract and
process, and not all of the columns in the table. In this case it is
the columns referenced in the "<code>SELECT</code>" list and those referenced in the "<code>WHERE</code>"
clause. And the "storage size" is the number of bytes that Oracle uses
to store that data on disk, which is not necessarily 1 byte per value
or 1 byte per character or digit. For instance, a <code>NUMBER</code> is stored as 2 digits per byte. <br /><br />My
other observation was that when only the row count in one table changed
the Hash Join cost changed by a factor of 0.5675 per 1000 rows. As
this was a constant per row I wondered if this was something to do with
some extra data per row causing extra disk I/Os. And 0.5675 divided by
0.0475 gives 11.9474 which is almost 12, implying a 12 byte overhead per
data row within the Hash Table.<br /><br />Based on this, I arrived at the following formula for an overflowing Hash Join cost:<br />
<ul>
<li>( ((Build Columns Size + 12) * Build Row Count) + ((Probe Columns Size + 12) * Probe Row Count) ) * 0.0475</li>
</ul>
Where the "Columns Size" is the sum of the hash table storage for each column i.e. data storage + 2 bytes per column.<br /><br />I
then checked the calculated costs from this formula against the series
of test queries I had been using, and the resultant cost for the
overflowing Hash Join came out almost the same in all cases. The
percentage difference was under 1% in almost all cases, which I take to
be a high degree of accuracy. The only anomalies are for when the
"build data set" is only just bigger than can fit in the PGA, but even
then it is only a 3% difference. As the size of the build data set
increased so the percentage difference decreased.<br /><br />On the one hand
it might not be surprising to some people that my derived formula produces the
same results using the same inputs that were used to create the formula
in the first place. However, given that the queries I tested varied
both the number of columns being selected, the number of columns being
joined on, and the number of rows in each table, these would appear to
cover the only variables relevant to this formula. And in each of these
cases the change in the reported Hash Join cost from Oracle was always a
multiplier of this fixed constant of 0.0475.<br /><br /><h2>
Other Factors</h2>
<br />
While
I do believe that this formula is true and valid for the system I was
testing on, it may not be true for all other systems. It is likely that
the multiplying factor of 0.0475 will be different on other systems.
Given that this additional cost for the overflowing Hash Join is due to
the additional disk I/Os involved, then it would seem likely that
changes to the system statistics inside Oracle for disk read times would
result in a change in the value of this multiplying factor. I will
investigate this in my next series of tests.<br /><br />There may or may not
be some small "constant cost value" involved as well within the
formula, for some kind of constant overhead within the overflowing Hash
Join operation. This "constant cost value" would become negligible at
higher data volumes compared to the costs for the build and probe data
sets, but it might explain the slightly larger difference in calculated
cost at the smallest overflowing data set size.<br /><br />There is also the
concept of "one pass" and "multi-pass" hash joins within Oracle, as
well as "optimal" hash joins. I don't understand the real difference
between these, other than "optimal" is when it fits in memory and the
other two are when it overflows to disk. It is possible that what I've
seen has been the cost for "one pass" overflowing hash joins, and for
even larger data sets a "multi-pass" hash join would be used that would
involve a different cost formula.<br /><br />
<h2>
The SQL for the table and query</h2>
<br />
Here is the SQL to create one example table - they are all the same but for name and row counts - and the query used.<br /><br />Create Table - run from SQL*Plus with 2 command line arguments of the row count and a table name suffix e.g. "<code>@crhjtab 1000000 100k</code>".<br />
<pre>create table hj&2
tablespace testdata
as
select r pid
, 1 one
, 2 two
, 3 three
, 4 four
, 5 five
, 10 ten
, trunc (r / 10) per10
, trunc (r / 100) per100
, mod (r, 10) mod10
, mod (r, 100) mod100
, mod (r, 1000) mod1000
, mod (r, 10000) mod10000
, mod (r, 100000) mod100000
, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' filler1
, 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' filler2
, 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' filler3
from (select rownum r
from (select rownum r from dual connect by level <= 1000) a,
(select rownum r from dual connect by level <= 1000) b,
(select rownum r from dual connect by level <= 1000) c
where rownum <= &1)
;
exec dbms_stats.gather_table_stats (user, upper ('hj&2') )
</pre>
Query - run from SQL*Plus with 2 command line arguments of table name suffixes e.g. "<code>@hj0 100k 200k</code>"<br /><pre>select /* HashTest0 */ sum (hj1.one) sum_b1
from hj&1 hj1, hj&2 hj2
where hj1.pid = hj2.per10
and hj1.mod10 = hj2.mod100 ;
--
select * from table (dbms_xplan.display_cursor) ;
</pre>
I've only shown one query here, as the others I used are almost the same but for the columns in the "<code>SELECT</code>" list. The variations of this query had different numbers of columns in the "<code>SELECT</code>" list, to increase the number of columns from the build and/or the probe tables. John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-55881687784062461412018-01-03T15:10:00.000+00:002018-01-03T15:10:43.000+00:00Reading AWR Reports #2 - Report OverviewIn the <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2017/12/reading-awr-reports-1-start-with-problem.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2017/12/reading-awr-reports-1-start-with-problem.html">first post on Reading AWR Reports</a>
I made the point that you should first be clear on the details of the
"performance problem" you are investigating. If there is no specific
problem reported then there is no point looking for something that is
not there in an AWR report. I also stated that an AWR Report is one
amongst several tools available to you that you can use to investigate
Oracle database performance problems, and you should make sure that a
database wide AWR Report is the most suitable tool to be using for the
specific performance problem you are currently tasked with
investigating.<br /><br />Assuming that has all been addressed, then the
first thing I do with an AWR Report is a few high level checks - a kind
of sanity check to get a feel for whether there might be some kind of
performance issue there or not. <br /><br />A performance problem is
normally when a task takes too long to complete, and that is affected by
the various resources it uses while doing its work. The key
measurements are therefore both sides of this - the input work requests,
and the system resource usage while doing that work. I basically want
to check:<br />
<ul>
<li>The resource usage as a percentage of system capacity i.e. utilisation</li>
<li>The amount of work requested / done (assuming they are the same), which is really SQL statements executed for a database</li>
<li>Amount of "wait time" within "SQL execution time" i.e. wait as percentage of work time</li>
<li>Top wait events to see how these correlate with the previous observations</li>
</ul>
That's
it. Nothing more detailed for the first pass. If the input workload
is high enough and the resource utilisation high enough and something
looks wrong then I follow up with a second pass of the AWR Report diving
down into more detail based on what the first pass turned up.<br /><br />Let's use the following AWR Report to show what I mean:<br /><pre>WORKLOAD REPOSITORY report for
DB Name DB Id Instance Inst Num Startup Time Release RAC
------------ ----------- ------------ -------- --------------- ----------- ---
O12DB 3429470280 o12db 1 05-Dec-17 06:59 12.1.0.2.0 NO
Host Name Platform CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
xxxxxxxxxx.local Linux x86 64-bit 2 2 1 7.80
Snap Id Snap Time Sessions Curs/Sess
--------- ------------------- -------- ---------
Begin Snap: 79 05-Dec-17 09:53:43 142 3.0
End Snap: 80 05-Dec-17 10:03:44 142 4.1
Elapsed: 10.03 (mins)
DB Time: 0.39 (mins)
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Total Wait Wait % DB Wait
Event Waits Time (sec) Avg(ms) time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU 12.4 53.8
log file sync 2,727 9.3 3.42 40.4 Commit
db file sequential read 177 1.2 6.82 5.2 User I/O
control file sequential read 827 .1 0.12 .4 System I
</pre>
The snapshot duration was 10 minutes, and the system had 2 CPU
Cores, so there were 20 minutes of CPU capacity available. The database
processing time is reported as "DB Time" and is 0.39 minutes, which is
about 1.95% resource utilisation. From this I can see that the database
was doing very little work at all, so there is not a database wide
performance problem worth investigating. Even though the Top Wait
Events seem to show that 40% of the time was spent waiting on the "<code>log file sync</code>"
event, the actual amount of Time is very trivial - 9.3 seconds of
waiting out of a 600 second period (10 minutes). Such low workloads and
resource utilisation can lead to various measurement anomalies, so it
is not worth trying to drill down further into the wait event details.
Any actual problem is probably specific to a single session, and should
be investigated using session specific tools.<br /><br />Another AWR Report:<br /><pre>WORKLOAD REPOSITORY report for
DB Name DB Id Instance Inst Num Startup Time Release RAC
------------ ----------- ------------ -------- --------------- ----------- ---
O12DB 3429470280 o12db 1 05-Dec-17 06:59 12.1.0.2.0 NO
Host Name Platform CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
xxxxxxxxx.local Linux x86 64-bit 2 2 1 7.80
Snap Id Snap Time Sessions Curs/Sess
--------- ------------------- -------- ---------
Begin Snap: 71 05-Dec-17 07:55:35 39 1.8
End Snap: 72 05-Dec-17 08:05:39 40 2.0
Elapsed: 10.06 (mins)
DB Time: 9.35 (mins)
Load Profile Per Second Per Transaction Per Exec Per Call
~~~~~~~~~~~~~~~ --------------- --------------- --------- ---------
Redo size (bytes): 3,470,831.7 5,403.2
Logical read (blocks): 61,507.4 95.8
Block changes: 20,563.9 32.0
User calls: 495.9 0.8
Parses (SQL): 302.6 0.5
Executes (SQL): 13,287.8 20.7
Transactions: 642.4
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Total Wait Wait % DB Wait
Event Waits Time (sec) Avg(ms) time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU 372.9 66.5
log file sync 272,633 196.8 0.72 35.1 Commit
db file sequential read 472 2.2 4.75 .4 User I/O
</pre>
Again, same database instance and a 10 minute period of time but a
different workload. Now we see that the resource utilisation is up at
46.47%, being 9.35 minutes of DB Time out of the 20.12 minutes of potential CPU
capacity (2 CPUs for the 10.06 minute snapshot). From the Load Profile we can see that the database was
executing over 13,000 SQL statements per second. From the wait events
we see that "DB CPU" is 66.5% of the "DB Time" i.e. doing real work. So
waiting as a percentage of SQL execution time would be about 33.5%.
And we can see that almost all of this wait time is due to a single wait
event - "<code>log file sync</code>".<br /><br />Again, this does not
directly tell us whether there is a performance problem or not. In all
databases doing work there will be some wait events occurring, and one
of these will always be top when ordered by wait time. That does not
mean that it is necessarily a problem, or that it can be "fixed" in any
way. In this case the workload is doing a lot of inserts and updates in
small transactions, so there is a high frequency of commits which all
involve writes to the redo logs. And we would expect the "<code>log file sync</code>"
wait event to be there in the set of top 10 wait events in this
scenario i.e. it is not a "problem" but a "result" of the workload
running on the database.<br />
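The back-of-envelope checks used in both examples are easy to script. Here is a minimal Python sketch (the function name is just illustrative; the inputs plugged in below are taken from the two report headers and wait event sections above):

```python
def awr_first_pass(elapsed_mins, cpus, db_time_mins, db_cpu_secs, top_wait_secs):
    """First-pass sanity numbers from an AWR report header:
    CPU capacity utilisation, and DB CPU / top wait as shares of DB Time."""
    capacity_mins = elapsed_mins * cpus
    utilisation_pct = 100 * db_time_mins / capacity_mins
    db_time_secs = db_time_mins * 60
    cpu_pct_of_db_time = 100 * db_cpu_secs / db_time_secs
    wait_pct_of_db_time = 100 * top_wait_secs / db_time_secs
    return utilisation_pct, cpu_pct_of_db_time, wait_pct_of_db_time

# First report: 10.03 min elapsed, 2 CPUs, 0.39 min DB Time
print(round(awr_first_pass(10.03, 2, 0.39, 12.4, 9.3)[0], 2))  # about 1.9% - effectively idle

# Second report: 10.06 min elapsed, 2 CPUs, 9.35 min DB Time,
# 372.9 s DB CPU, 196.8 s log file sync waits
u, cpu, wait = awr_first_pass(10.06, 2, 9.35, 372.9, 196.8)
print(round(u, 2), round(cpu, 1), round(wait, 1))  # 46.47 66.5 35.1
```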
John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-26157504264202384282017-12-12T14:57:00.000+00:002018-10-30T20:39:29.094+00:00Reading AWR Reports #1 - Start With The ProblemI hope that this will be the start of a short series of posts on how to
read an AWR Report. I doubt that it will contain anything
revolutionary, but it is clear that some people do not know where to
start when presented with an AWR Report, so I aim to cover some
essentials about this. This first post will cover what to consider <b><i>before</i></b>
attempting to read and understand an AWR report, because ultimately AWR
is just another tool and not the answer itself. I will get into the
first things I look at in an AWR report in the next post.<br />
<br />
As I've
already indicated, AWR (Automatic Workload Repository) is really
another tool in your toolkit for examining what happened on an Oracle
database. And it is an extra cost option for the Oracle database
Enterprise Edition, requiring that you purchase the Diagnostics Pack
licence. Before using AWR you need to make sure that AWR is the right
tool to use for the problem you have, rather than some other tool
available to you. Which means that <i><b>you should always start with the problem you are trying to fix</b></i>,
and not dive straight into using some tool or other and hope that it
shows you the fault in the first screen's worth of output. You must be
clear on what the problem is that you are trying to fix, the impact of
that problem on users or the application or something else, and how you
would measure the improvement to show that the problem has been
resolved. If you don't know that information about the "problem", then
you can never be sure that you have actually fixed the real cause of it -
it may simply have disappeared for other reasons.<br />
<br />
I'd strongly
recommend following a top down approach to a performance problem
investigation - start with the real world problem that users are
experiencing, and work down and inwards from that towards the database.
Do not start at the bottom directly on the database, and hope to find
the problem quickly there. This is a "finding a needle in a haystack"
approach. You might get lucky and find something related to the problem
straight away, or you might not and end up wasting your time chasing
dead ends. The "performance problem" may actually be somewhere else
outside of the database, such as the client application itself, the
network, or any application servers. Check and eliminate those first.
Only after checking those should you move onto the database itself.<br />
<br />
An
AWR Report tells you what happened on your database between the two
snapshots used. It does not tell you whether any problems did or did
not occur, or what those problems might be. And it does not tell you
whether your database was well behaved or not. It just provides a
complete overview of everything that happened on that database between
those two snapshots, and it is up to you to decide whether this meets
expectations or not.<br />
<br />
AWR reports are database wide i.e. across
all the activity that happened on that database between the two
snapshots. This includes activity from all sessions on all schemas in
the database. So it is only really of use if the problem you are
investigating is a system wide one, affecting most of the sessions and
users of the database. And even then it is still not guaranteed to show
you what the cause of your problem is, as it depends on what your
problem is.<br />
<br />
Furthermore, an AWR Report only tells you the grand
totals of what happened over that period of time, and the averages of
those (some total value divided by the elapsed time or some other
measurement). It does not tell you the peak values, as it only knows
the absolute change in measurements between the two snapshots being
reported on. And average values over a period of time are always lower
than the peak values during the most active period, because that is what
an "average" means. How much lower the average values are than the
peak values depends on a number of factors, such as the different time
periods involved and the relative difference in activity between the
peak period and the other periods.<br />
<br />
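As a tiny illustration of this averaging effect, with purely hypothetical numbers:

```python
# Hypothetical per-minute Average Active Sessions within one 10 minute
# AWR interval: nine quiet minutes and one busy one.
per_minute_aas = [0.5] * 9 + [8.0]

interval_average = sum(per_minute_aas) / len(per_minute_aas)
print(interval_average)   # 1.25 -- the 8.0 peak is invisible in the average
```

Anything sized or judged against the 1.25 average would be badly wrong for that one busy minute.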
A pitfall to avoid is just looking at the "<i>top 10</i>" type lists of SQL statements executed and wait events. <b><i>In every database there is always a slowest SQL statement or top wait event</i></b>,
regardless of how active the database is. Just because that was the
"slowest SQL statement" does not mean that it is a bad statement, or
that it is related in any way to the problem you are investigating. You
need to prove a link between the two, and not assume that they are
related in any way. Trying to "improve" a "slow SQL statement" that is
unrelated to the real problem users are experiencing is just a waste of
time. Check that any potential cause is really related to your
problem, and not independent of it. <br />
<br />
Rather than wade through a series of full single AWR Reports for each pair of snapshots you can instead <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2016/10/awr-summary-data-extracts.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2016/10/awr-summary-data-extracts.html">Query the AWR data directly</a>
(again, providing you have paid for the Diagnostics Pack licence), and
pull out just the key performance measurements you are interested in
across a number of snapshots all at once. These might include Average
Active Sessions, number of SQL statements executed, and percentage of
time waiting. This can help performance investigations progress faster
if you can more quickly identify the peak activity period across all the
AWR snapshots available, and then drill down into only the most active
period of time.<br />
<br />
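As an example of one such measurement, Average Active Sessions for an interval is just the change in the DB time statistic divided by the elapsed wall-clock time. A hedged sketch of the calculation (the actual query against the AWR history views is in the linked post):

```python
# Sketch: Average Active Sessions (AAS) for one snapshot interval, computed
# from the delta of the "DB time" statistic between the two snapshots.
# DBA_HIST_SYS_TIME_MODEL reports this value in microseconds.

def average_active_sessions(db_time_delta_us, elapsed_sec):
    return (db_time_delta_us / 1_000_000) / elapsed_sec

# e.g. 561 s of DB Time accumulated over a 600 s snapshot interval:
print(round(average_active_sessions(561_000_000, 600), 3))   # 0.935
```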
Remember that AWR provides database wide
information over relatively long periods of time (between AWR
snapshots), and that there are other tools out there for doing different
types of performance analysis of Oracle databases:<br />
<ul>
<li>Real time query and analysis of Oracle Dynamic Performance Views (<code>V$</code> views), such as <code>V$SESSION</code></li>
<li>Active Session History (ASH), which also requires the Diagnostics Pack licence</li>
<li>SQL Trace to record all SQL statements executed by a session, and their elapsed times</li>
</ul>
Each
of these has its advantages and disadvantages, and if you want to be
an Oracle expert then you should be aware of these and their tradeoffs.
You need to make sure that you are using the right tool for the kind
of performance problem you are investigating.<br />
<br />
In conclusion then, I'd say that my two main points in this post are:<br />
<ul>
<li>Make sure you are clear on what the performance problem is that you are investigating</li>
<li>Check that AWR Reports are the right tool to be using to investigate this performance problem, rather than something else</li>
</ul>
<br />John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-73768720640997893342017-11-22T09:50:00.002+00:002017-11-22T09:50:37.163+00:00Chatty Applications and Simple SQLOne type of "poor performance" scenario I have come across a few times is due to what I call "<em>chatty applications</em>".
These are applications that execute a disproportionately high number of
what look like very simple SQL queries for every business transaction
they do. And often this is a deliberate design choice by the
application architects and developers, claiming that simpler SQL
statements on single tables using indexed columns always leads to
efficient execution plans, so it must be better than anything else.
Unfortunately it is the high volume of execution of such "simple SQL
queries" that is the cause of the poor performance these applications
often experience. Both in terms of poor response times for individual
users, and poor total throughput across all the users. Why then are
such "<em>chatty applications</em>" the cause of poor performance and not the solution?<br /><br />First,
there are network round trips needed for each SQL statement executed,
and each of these typically costs on the order of milliseconds of
elapsed time. The SQL statement text is sent from the client to the
database server, and then the resultant data sent back from the server
to the client. And when there are hundreds of SQL statements being
executed per business transaction, then these all add up. Executing a
"business transaction" in such a "chatty application" can end up
spending more time transferring data back and forth over the network
than actually doing work with that data on either the client or database
server system.<br /><br />Second, the high SQL execution volume results in
high library cache usage and contention for access to it inside Oracle.
For each SQL statement executed, Oracle will check if that SQL is
already in the library cache (in the shared pool in the SGA) and if so
will go ahead and use the associated execution plan. Such library cache
look ups involve various temporary internal locks on the library cache,
to make sure it doesn't change while your session is examining it. And
the more SQL statements being executed leads to more library cache
lookups and more contention for the controlling locks on it.<br /><br />Third,
the volume of data transferred to the client and then processed on it
can be much higher than if a single, more complex SQL statement was used
instead. This is also about network delays, but this time it is due to
the amount of intermediate result data being transferred to the client,
as it works its way through data from different tables as part of its
"business transaction".<br /><br />Ultimately this all a variation on the
"row by row" processing mind set where all the raw data is pulled back
into the client, which further processes it such as joining together
data from multiple tables. And a "row set" based approach can be more
efficient - executing fewer but more complex SQL statements on the
database server, to send only the final resultant data back to the
client.<br /><br />I have seen this kind of scenario in action at a number
of customers, where it was their application design that was slowing
down and limiting the performance of their application. In some cases
we were able to identify the highest use cases in the application and
successfully modify it to use fewer, more complex SQL queries instead,
so that more work is done per execution by far fewer SQL statements. In
other cases it was not as clear due to the application developers using
a "glue layer" of third party software to handle the application to
database mappings and constructing the SQL queries to be executed in
each case. Fixing this requires reconfiguration of the "glue layer" to
issue more complex SQL queries involving multi-table joins. However, in
at least one case the customer did not at that time have enough
expertise in how to use this "glue layer" to achieve that. They had
just used it in an "out of the box" type fashion, and were now suffering
as a consequence of that.<br /><br />Probably the worst case I saw was an
application using another "glue layer" that only ever issued SQL queries
for a single column of data at a time of the form "<code>select one-column from table-name where pkid_col = value</code>", and consequently did multiple such SQL queries to get each individual column it needed.
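A rough sketch of a primary-key record buffer that avoids this pattern is shown below; the names here are hypothetical, not the actual client code:

```python
# Hedged sketch (hypothetical names): a primary-key record buffer that turns
# N single-column queries per record into one whole-record query plus
# in-memory lookups.

class RecordCache:
    def __init__(self, fetch_record):
        # fetch_record(pkid) runs one "select *" style whole-record query
        self._fetch = fetch_record
        self._pkid = None
        self._row = None

    def get_column(self, pkid, column):
        if pkid != self._pkid:        # new record: one SQL round trip
            self._row = self._fetch(pkid)
            self._pkid = pkid
        return self._row[column]      # same record: no SQL at all
```

With this in place, fetching ten columns of one record costs one query instead of ten, with the remaining nine accesses served from local memory.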
And obviously this led to horrible performance and scalability.
Luckily the application had been written to use a well defined internal
API for data access, which was in turn wrapped around the "glue layer".
We were able to insert an extra layer between the application's API and
the "glue software", intercepting the application API data access calls
<em>before</em> the "glue software" was called. We
implemented a caching layer here, based on the primary key of a record
from each table, which was always used on each query. For a new record
query, with a different primary key value, the API now did a "<code>select *</code>"
into a local record buffer of the whole data record, and then returned
the one individual column requested. If the primary key requested
matched that already in the current record buffer, then the code just
returned the individual column value directly from that, avoiding any
further SQL execution and avoiding executing any code in the "glue layer".<br /><br />As
you might expect performance increased significantly as a result of
this change - replacing the multiple single column SQL queries with just
one whole record SQL query followed by in-memory, local accesses,
avoiding all those network round trips. Also the net load on the Oracle
database server decreased significantly, with less contention on the
library cache, and the CPU and disk utilisation decreased significantly
as well. Consequently the performance of all the other SQL statements
being executed on that database server improved as well due to those
resources being freed up. A definite "win-win" all around, by accessing
the data in the database in a much more efficient way.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-9849577378655760432017-11-16T10:02:00.000+00:002017-11-16T10:02:50.256+00:00Redundant Grandparent Foreign Keys and Cardinality Estimate ErrorsThis post is about how a slightly de-normalized database design
involving redundant foreign keys to other tables can end up producing
sub-optimal execution plans for queries that use those extra joins as
additional filter conditions.<br /><br />By default the Oracle Optimizer
assumes that different columns of data in a table are independent of
each other, and that their data values are not correlated with each
other in any way. When a query has filter conditions on multiple
columns in a table, the Optimizer will combine their filter factors
together assuming they are independent of each other i.e. unless it has
explicit information to the contrary. So if each filter factor was
estimated to match 1% of the rows in the table, then their combined
filter factor would be to match 0.01% of the rows in the table (1% of
1%).<br /><br />However, if the data in these columns is in fact correlated
then this filter factor estimate will be wrong. A classic example is
"month of birth" and "zodiac sign". Each has only 12 possible values,
and the Optimizer will assume that there are 144 possible combinations.
But in fact there are only 24 possible value combinations because each
zodiac sign straddles only two months. Assuming a uniform distribution
amongst these 24 possible values within the data rows in the table, then
each pair of values would match 1 / 24 or 4.17% of the rows in the
table. However, the Optimizer will assume that a pair of values would
match 1 / 144 or only 0.69% of the rows in the table. Such misestimates
can make the Optimizer choose a relatively inefficient access such as
using an index instead of a full table scan, due to estimating too few
rows to match the filters and a too low cost for the data access.<br /><br />Often
this kind of correlation between data values in columns in a table
occurs naturally, and there is little you can directly do about it. The
"month of birth" and "zodiac sign" is one such example. However, this
kind of correlation between columns can also occur as a result of how
the tables have been designed, and can result in sub-optimal execution
plans being produced.<br /><br />One scenario is adding a redundant column
to a table as a foreign key to an indirectly related table, such as a
"grandparent" table that is the parent of the table's direct parent. A
specific example would be something like putting the "customer / account
identifier" into the "order line item" table, as a redundant copy of
that in the "order" table, along with the "order identifier". Or
putting both the "product main category id" and "product sub-category
id" into the "product" table, when there is a hierarchy of categories. <br /><br />Again,
in these cases, the Optimizer will assume that such columns are
independent of each other, unless it has explicit information otherwise.
The danger now is that the Optimizer will produce <em><strong>different execution plans</strong></em> for what are essentially the <em><strong>same queries</strong></em> depending on whether the <em><strong>redundant joins</strong></em> are included or not. <br /><br />Let's
look at a concrete example. I created three tables each with a parent /
child relationship between them, and more rows in the child tables than
the parent tables. I won't post the SQL DDL for these as it will make
the post too long, but their structure and contents should be
straightforward enough, and all values are uniformly distributed with an
equal number of child records per parent record. There are indexes on
the primary key columns, and the individual foreign key columns, but
nothing else. Primary key and foreign key constraints are explicitly
specified.<br /><br />The tables are Grandparent (1,000 rows), Parent
(100,000 rows), and Child (10,000,000 rows) i.e. 100:1 ratios, and each
has a primary key column named "pkid". The Parent table has a foreign
key column (gp_id) to Grandparent, while the Child table has a foreign
key to Parent (of p_id) and also a redundant foreign key to Grandparent
(of gp_id). <br /><br />I will now execute three different but equivalent
queries against these tables using different combinations of the
possible foreign key joins:<br />
<ol>
<li>Only direct joins from child to parent, and then parent to grandparent</li>
<li>As previous query plus additional (redundant) join from child to grandparent</li>
<li>Direct from child to grandparent only, and parent table omitted from query</li>
</ol>
Each
query has a filter condition on another column in the grandparent table
restricting the matching rows to just 5 rows in that table.<br /><br />I'm using Oracle 11gR2 on Oracle Linux:<br /><pre>SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE 11.2.0.4.0 Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production
</pre>
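As a reminder, the independence assumption from the month-of-birth / zodiac-sign example above works out as follows (a quick sketch, assuming uniform distributions):

```python
# The optimizer multiplies individual selectivities as if the columns were
# independent; with correlated columns the real combined selectivity is higher.

assumed_pct = 100 * (1 / 12) * (1 / 12)   # 1/144: what the optimizer assumes
actual_pct = 100 * (1 / 24)               # 1/24: only 24 real combinations

print(round(assumed_pct, 2))   # 0.69
print(round(actual_pct, 2))    # 4.17
```

A factor of six misestimate like this is easily enough to flip the choice between an index access and a full table scan.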
<br />First query - direct joins between parent / child tables:<br /><pre>SQL> @test1
ROW_COUNT
----------
50000
SQL_ID fddwv6gp5m26h, child number 0
-------------------------------------
select sum (c.one) row_count
from child c, parent p, grandparent gp
where c.p_id = p.pkid
and p.gp_id = gp.pkid
and gp.pct05 = 1
Plan hash value: 4174412392
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 25668 (100)| |
| 1 | SORT AGGREGATE | | 1 | 25 | | |
|* 2 | HASH JOIN | | 49591 | 1210K| 25668 (1)| 00:05:09 |
|* 3 | HASH JOIN | | 500 | 8500 | 244 (1)| 00:00:03 |
|* 4 | TABLE ACCESS FULL| GRANDPARENT | 5 | 40 | 5 (0)| 00:00:01 |
| 5 | TABLE ACCESS FULL| PARENT | 100K| 878K| 238 (1)| 00:00:03 |
| 6 | TABLE ACCESS FULL | CHILD | 10M| 76M| 25398 (1)| 00:05:05 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("C"."P_ID"="P"."PKID")
3 - access("P"."GP_ID"="GP"."PKID")
4 - filter("GP"."PCT05"=1)
</pre>
We can see that it is estimating 5 rows to be retrieved from the
Grandparent table, 500 from the join to Parent, and about 50,000 from
the join to Child - operations 4, 3, and 2. This is correct given the
100:1 ratio between the rows in each table. The cost is estimated at
just over 25,000, which is really just the sum of the costs of the full
table scans involved.<br /><br />Second query - redundant join added between Child and Grandparent:<br /><pre>SQL> @test2
ROW_COUNT
----------
50000
SQL_ID 0a703yvda9z4h, child number 0
-------------------------------------
select sum (c.one) row_count
from child c, parent p, grandparent gp
where c.p_id = p.pkid
and p.gp_id = gp.pkid
and gp.pct05 = 1
and c.gp_id = gp.pkid
Plan hash value: 4140991566
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 12271 (100)| |
| 1 | SORT AGGREGATE | | 1 | 29 | | |
| 2 | NESTED LOOPS | | 50 | 1450 | 12271 (1)| 00:02:28 |
| 3 | NESTED LOOPS | | 49500 | 1450 | 12271 (1)| 00:02:28 |
|* 4 | HASH JOIN | | 500 | 8500 | 244 (1)| 00:00:03 |
|* 5 | TABLE ACCESS FULL | GRANDPARENT | 5 | 40 | 5 (0)| 00:00:01 |
| 6 | TABLE ACCESS FULL | PARENT | 100K| 878K| 238 (1)| 00:00:03 |
| 7 | BITMAP CONVERSION TO ROWIDS | | | | | |
| 8 | BITMAP AND | | | | | |
| 9 | BITMAP CONVERSION FROM ROWIDS| | | | | |
|* 10 | INDEX RANGE SCAN | IX_CHILD_PID | 99 | | 2 (0)| 00:00:01 |
| 11 | BITMAP CONVERSION FROM ROWIDS| | | | | |
|* 12 | INDEX RANGE SCAN | IX_CHILD_GPID | 99 | | 21 (0)| 00:00:01 |
| 13 | TABLE ACCESS BY INDEX ROWID | CHILD | 1 | 12 | 12271 (1)| 00:02:28 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("P"."GP_ID"="GP"."PKID")
5 - filter("GP"."PCT05"=1)
10 - access("C"."P_ID"="P"."PKID")
12 - access("C"."GP_ID"="GP"."PKID")
</pre>
The only difference to the previous query is the addition of the extra filter condition of "<code>c.gp_id = gp.pkid</code>",
using the redundant join between Child and Grandparent. The row count
is the same - 50,000 - because the extra filter condition is redundant
and doesn't change the number of matching rows in any way. But the
execution plan is completely different, because the Optimizer has
assumed that the two filters on Child are independent of each other, but
this is not true.<br /><br />The execution plan starts the same, filtering
on Grandparent to 5 estimated rows, then joining to Parent to produce
500 estimated rows (operation 4 Hash Join). Now it does 2 Nested Loops -
the inner to get the ROWIDs of matching rows from Child, and the outer to
get the data row itself for the "<code>one</code>" column used in the
output. And this outermost Nested Loop (operation 2) is only estimated
to produce 50 rows, which is clearly incorrect.<br /><br />What has happened is that because the "<code>gp_id</code>"
column in Child has 1,000 distinct values in it, the Optimizer has
reduced the final row estimate by this ratio i.e. it is estimating 50
matching rows and not 50,000 matching rows. Based on this it has costed
the index based access to Child to come out at 12,271 which is lower
than the 25,000+ cost of the full table scan in the first execution
plan. <br /><br />This cost seems to be arrived at as the cost of index
access for one pair of values - 21 + 2 = 23 - plus one more for the
access to the data row in the table by ROWID i.e. a cost of 24 per probe.
For the 500 rows estimated from the Hash Join this gives a total cost of 12,000,
which added to the cost of 244 for the steps up to the Hash Join comes very close to
the reported total cost of 12,271.<br /><br />In fact this execution plan
will take longer to execute because it will actually be processing
50,000 Child records and not just 50, and the execution plan of the
first query would actually be the better one. But it is the assumption
that these two foreign key columns are independent of each other that
has led the Optimizer to produce this sub-optimal plan.<br /><br />Third query - remove Parent table completely from query, and join directly from Child to Grandparent:<br /><pre>SQL> @test4
ROW_COUNT
----------
50000
SQL_ID 76tptvjkbx08x, child number 0
-------------------------------------
select sum (c.one) row_count
from child c, grandparent gp
where c.gp_id = gp.pkid
and gp.pct05 = 1
Plan hash value: 2796906588
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 25435 (100)| |
| 1 | SORT AGGREGATE | | 1 | 15 | | |
|* 2 | HASH JOIN | | 50000 | 732K| 25435 (1)| 00:05:06 |
|* 3 | TABLE ACCESS FULL| GRANDPARENT | 5 | 40 | 5 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL| CHILD | 10M| 66M| 25403 (1)| 00:05:05 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("C"."GP_ID"="GP"."PKID")
3 - filter("GP"."PCT05"=1)
</pre>
The result is the same as before - 50,000 rows - and the execution
plan is basically the first one with Parent removed - two full table
scans and a Hash Join. With only one filter condition on the Child
table the Optimizer has correctly estimated the number of matching rows -
50,000 - and produced the appropriate execution plan.<br />
<br />
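For completeness, one way to tell the Optimizer that the two foreign key columns on Child are correlated is a column group created via extended statistics. A hedged sketch using the table and column names above (not verified against every Oracle version):

```sql
-- Create extended statistics on the correlated column group, then regather
-- statistics so the Optimizer has a combined selectivity for (p_id, gp_id).
select dbms_stats.create_extended_stats(user, 'CHILD', '(P_ID,GP_ID)') from dual;

begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'CHILD',
    method_opt => 'for columns (P_ID,GP_ID) size auto');
end;
/
```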
<h2>
Conclusion</h2>
Beware
of database designs that include redundant foreign key columns between
child and grandparent tables, or across even more levels of a
relationship hierarchy, and how you write queries involving such tables.
While it might seem somehow better to include all known joins between
all the tables in a query, this can cause the Optimizer to misestimate
the number of matching rows from a table, and to choose non-optimal
access methods as a result. Rather than helping the Optimizer, adding
such redundant join conditions to queries actually makes things worse.
As a general rule you are better off only having joins in a query
between directly related tables, and not to indirectly related ones.
Otherwise, eliminate the middle, intermediate tables if possible and
join directly between the relevant tables, which should produce the same
results.<br /><br />I saw this happen recently at a client, and removing
such a redundant join condition led to the Optimizer producing more
realistic row estimates and a much better execution plan, with a
corresponding faster execution given the data volumes involved.<br /><br />The
other approach to dealing with this kind of scenario is to provide
extra information to the Optimizer so that it knows when two or more
columns are correlated. This can be done simply by creating an index on
just those columns in the table, or by creating extended statistics on
the column group. As long as this extra information is present, then
the Optimizer will produce a better row estimate and a better execution
plan. When I tested this with such an index on the Child table on both "<code>p_id</code>" and "<code>gp_id</code>"
together, the execution plan for the second query with the redundant join
went back to the original one - 3 full table scans and 2 hash joins.<br /><br />Also
be aware that Oracle is always trying to make the Optimizer more
"intelligent" so the behaviour and execution plans produced can change
between versions of Oracle for the same query on the same database
tables. I briefly tested these same queries on Oracle 12c and got
completely different execution plans. In 12c I actually got an Adaptive
Execution Plan which combined both plan variations under a parent
Statistics Collector operation. Depending on how many matching rows
were initially retrieved it could either do a full table scan on Parent
and Child with Hash Joins, or use the indexes on the foreign key columns
to do direct row lookups in Nested Loops. Combined with other feedback
mechanisms in the 12c Optimizer, it is possible for the Optimizer to
now detect such non-optimal execution plans and produce better ones on
the next execution of the same query.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-2140441607405951492017-10-31T10:59:00.003+00:002017-10-31T10:59:47.881+00:00I'm back, again, late 2017Nothing much to say other than I'm back again, and hope to do some more
blog posting soon. My excuse for the lack of posts is that I've been
busy helping a client with a large data migration project. That is now
over, with all the data successfully extracted for loading into their
new system, so I've got more time available to properly write up some
technical Oracle or performance posts.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-39111572964114585532016-12-29T15:57:00.003+00:002016-12-29T16:01:02.647+00:00Fixing Popular Posts Margin on BloggerI use the <a class="externalLink" href="https://www.blogger.com/" target="_blank" title="External link to https://www.blogger.com">Blogger</a> platform for this blog, and I recently added the "<i>Popular Posts</i>"
widget to the sidebar. Unfortunately it did not display correctly,
with the first character or two of each blog post title being lost and
chopped off, as if the whole thing had been shifted to the left for some
reason. Here is how I fixed it to display properly.<br />
<br />
After a lot
of reading up on HTML and CSS I realised that these posts appeared
within a section with an associated style name, in this case a class.
In turn I could add an entry to the HTML template for my blog to define a
shift to the right for this particular class style so that it would
display properly. And this shift would only be applied to this section
of the blog page and no other.<br />
<br />
What this boils down to is the following:<br />
<ul>
<li>The "Popular Posts" widget reference in the HTML has an attribute of "<code>id='PopularPosts1'</code>"</li>
<li>The list of posts within this widget has two class attributes, one of which is "<code>popular-posts</code>"</li>
<li>I could shift the list of posts far enough across to the right using the style property of "<code>padding-left</code>"</li>
</ul>
To apply this to your Blogger blog template do the following:<br />
<ul>
<li>In Blogger click on "Template" on the left hand side of your blog dashboard (under Layout and above Settings)</li>
<li>Click the on "Edit HTML" button under the small image of your blog </li>
<li>Expand the "<pre><code><b:skin></b:skin></code></pre>" section near the top by clicking on it or on the right pointing triangle on the left hand side<ul>
<li>This was at line 7 on my template</li>
</ul>
</li>
<li>Scroll down to the end of this section, which finishes with </li>
</ul>
<pre class="source-code">]]></b:skin></pre>
<ul>
<li>Above this on their own lines add the following:</li>
</ul>
<pre class="source-code">#PopularPosts1 .popular-posts {
padding-left: 15px
}
</pre>
<ul>
<li>Then click on the "Save Template" button at the top of the screen</li>
<li>If
you now redisplay your blog you should see the "Popular Posts" list has
now been shifted over to the right and aligned nicely under the
"Popular Posts" heading</li>
</ul>
Note the following:<br />
<ul>
<li>The first line states that this style is only to be used within a "popular-posts" class element that itself occurs within a "PopularPosts1" ID element.<ul>
<li>This should match exactly the content section we want to be shifted across in the blog page</li>
</ul>
</li>
<li>The second line sets the left hand side padding before the displayed content to be "<code>15px</code>" (pixels)<ul>
<li>The value of "<code>15px</code>" was obtained by simple trial and error starting with smaller values until the list was shifted over enough.</li>
</ul>
</li>
</ul>
I'm
not saying my solution is perfect or absolutely correct; I'm just
saying that it works for me, and it seems to conform to the way CSS
works and how Blogger defines its page sections. It would seem that all
of the different sidebar elements have a right shift built into them,
except for a Popular Posts list of just titles (no snippets or
thumbnails). This solution adds in such a right shift so the blog
post titles are now all visible in the blog page.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-55507372298567889012016-12-12T13:02:00.000+00:002016-12-12T13:02:19.877+00:00Announcing Bottleneck Data SolutionsSPOILER ALERT: This particular blog post is a blatant self promotion for me and the various Oracle database services I am offering.<br /><br />I
may not have mentioned it explicitly before but I have left the
corporate world of permanent employment and have gone independent to
offer my services direct to clients (at reasonable rates, of course).
I'm doing this through my newly created company <a class="externalLink" href="http://bottleneckdatasolutions.com/index.html" target="_blank" title="External link to http://bottleneckdatasolutions.com/index.html">Bottleneck Data Solutions</a>, which has its own associated, obligatory web site.<br /><br />Basically
I am offering my many years of knowledge, experience and expertise on
Oracle database design and performance tuning directly to clients for
short term consulting engagements or longer term contracts. I'm not big
on doing the self promotion thing unfortunately, but I won't succeed
unless I get the message out and publicise myself one way or another.
So this is a one-off post to raise awareness of what I'm offering
through <a class="externalLink" href="http://bottleneckdatasolutions.com/index.html" target="_blank" title="External link to http://bottleneckdatasolutions.com/index.html">Bottleneck Data Solutions</a> and make you aware that I'm available for Oracle database performance tuning, database design and development engagements.<br /><br />I have also uploaded some of the <a class="externalLink" href="http://bottleneckdatasolutions.com/resources.html" target="_blank" title="External link to http://bottleneckdatasolutions.com/resources.html">SQL Scripts</a>
that I commonly use, to share with everyone. Nothing revolutionary I
am afraid, just the standard "free space" and "system activity summary"
type reports.<br /><br />At the moment I am offering to do a free, initial <a class="externalLink" href="http://bottleneckdatasolutions.com/awr-review.html" target="_blank" title="External link to http://bottleneckdatasolutions.com/awr-review.html">Performance Review of an AWR Report</a>. If I can spot anything obvious affecting performance in the AWR Report then I'll provide feedback to you. <br /><br />Why "<em><strong>Bottleneck Data Solutions</strong></em>"?
Well, data is everywhere and is the lifeblood of the modern company.
But I have seen a significant number of cases where slow application
performance was caused by a badly designed and implemented database. In
other words, the way the data was organized and stored in the database
was becoming the bottleneck, limiting how quickly data could be supplied
to the application software. Solving this kind of bottleneck requires a
mix of short term performance tuning to deal with the most obvious hot
spots, and a longer term review of the database and redesigning the most
critical parts.<br /><br />Thank you for your patience. The normal service of technical blog posts will be resumed again in the future.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-90944199189909492932016-12-01T16:37:00.000+00:002016-12-02T14:59:07.169+00:00When 2 Queries Are Better Than 1A general rule of thumb with queries in Oracle is to use just one query
when you can and leave it to the Optimizer to work out the best possible
execution plan i.e. decomposing one query into multiple separate
queries can end up making the database do more work than it needs to. I
came across a case in the past week where the opposite was true - making
Oracle do separate "smaller" queries was far quicker than using one
overall query.<br />
<br />
<h2>
Background</h2>
Imagine that we have two tables -
new data arrives in one table (INCOMING) and we process it in some way
and insert the resultant data into another table (OUTGOING). Data
records in both tables have a timestamp on them for when they were
created, and they are inserted in timestamp order. We want to process
new messages at regular intervals adding them to the OUTGOING table, but
they are not deleted from the INCOMING table when processed. Instead a
separate purge job runs each night to delete incoming data older than
48 hours. In this scenario we can use the timestamp on each record to
work out the maximum timestamp of the OUTGOING data records we have
processed so far, and only retrieve INCOMING data records that have been
added since then.<br />
<pre class="source-code"><code>select max (ts) from outgoing ;
</code></pre>
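For completeness, the nightly purge and the incremental retrieval driven by that maximum timestamp might be sketched as follows (the INCOMING table, TS column and category values follow the description above; the bind variable name is my own illustration):<br />
<pre class="source-code"><code>-- Sketch only: nightly purge of incoming data older than 48 hours
delete from incoming
where ts < systimestamp - interval '48' hour ;
commit ;
--
-- Sketch only: retrieve just the records added since the last processed timestamp
select * from incoming
where category = 'CATEGORY1'
and ts > :max_outgoing_ts ;
</code></pre>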
An additional complexity in this is that the INCOMING data records
fall into different categories or types, and these are processed
separately to each other due to different processing rules and
requirements. This means that when getting the maximum timestamp from
the OUTGOING data set we also need to restrict on the category.<br />
<pre class="source-code"><code>select max (ts) from outgoing
where category = 'CATEGORY1' ;
</code></pre>
But it also turns out that some categories are similar enough to
each other that they can be processed together at the same time from
INCOMING, which should be more efficient than executing multiple
separate queries against what can be a large input table with historical
data in it. This means we now need the maximum timestamp from OUTGOING
across a number of categories. This is most obviously done with this
query.<br />
<pre class="source-code"><code>select max (ts) from outgoing
where category in ( 'CATEGORY1' , 'CATEGORY2', 'CATEGORY3' ) ;
</code></pre>
This is valid because these 3 categories are all extracted from
INCOMING at the same time and processed together, so we do want the
maximum timestamp across all of them together.<br />
<br />
<h2>
The Problem</h2>
When executed this "<code>SELECT MAX (TS) WHERE CATEGORY IN (...)</code>"
query took a long time to execute (tens of seconds). Previous tests
had shown that Oracle can efficiently use an index to get a minimum or
maximum value from a table (sub-second), so why was this particular
query taking so much longer?<br /><br />The conclusion was that a single
category maximum or minimum is executed differently to one on a group of
categories, and this was the cause of the longer execution time.
Furthermore, rewriting the query as a union of separate single category
queries was much faster (over 100 times faster in this case!)
and almost as fast as the original single
category query.<br />
<br />
<h2>
Test Environment</h2>
This happened for real on
Oracle 12c, on a non-CDB database. I'm going to replicate it on a
separate test environment, also on Oracle 12c on a non-CDB database,
running on Oracle Linux 7.<br />
<pre class="source-code"><code>SQL> select * from v$version ;
BANNER CON_ID
---------------------------------------------------------------------------- ----------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production 0
PL/SQL Release 12.1.0.2.0 - Production 0
CORE 12.1.0.2.0 Production 0
TNS for Linux: Version 12.1.0.2.0 - Production 0
NLSRTL Version 12.1.0.2.0 - Production 0
</code></pre>
I have taken a copy of the data in <code>DBA_OBJECTS</code> which
in my test instance had over 90,000 rows in it occupying 12 MB of disk
storage, and then doubled this 9 times to be 512 times larger than the
original data set, giving over 45 million rows using over 5 GB of disk
storage.<br /><br />Initially I thought about using the <code>CREATED</code> column for the timestamp, and the <code>OBJECT_TYPE</code>
for the category, but this ran into problems. The first is that
doubling the rows each time does not change any data values, which means
that we have very few different, distinct values for <code>CREATED</code>
and each value repeats many times. This has an impact on calculations
the Optimizer makes and the costs it estimates. Also the real data type
in the real system was <code>TIMESTAMP</code> and not <code>DATE</code>. So I decided to add an extra column <code>TS</code>
to the test table, and to update this later on to be a near-unique
value. By giving the column an initial value it means that space is
allocated within each record in each block in Oracle, and the later
update should not cause any more space to be allocated or rows to be
moved between blocks.<br /><br />My test system has only 2 GB for UNDO,
which means that I cannot do an UPDATE on a table larger than this. So
what I have done is double the table 7 times to just over 1 GB in size,
do the UPDATE on the TS to a near-unique random value, then double the
table another 2 times. This means that each <code>TS</code> value occurs 4 times, which I hope will still be good enough for the tests I am doing.<br /><br />In
terms of the spread of values, in the real system the incoming data is
purged every day, deleting data records older than 48 hours. The UPDATE
is using the current system timestamp value, offset by up to 48 hours
as a value in seconds.<br />
<pre class="source-code"><code>create table outgoing
tablespace testdata
as select * from dba_objects ;
--
alter table outgoing modify (owner not null, object_name not null,
object_id not null, object_type not null, created not null) ;
--
-- Try and make the data loading go faster by minimising redo logging
alter table outgoing nologging ;
--
alter table outgoing add (ts timestamp default systimestamp not null) ;
--
insert /*+ append */ into outgoing select * from outgoing ;
commit ;
insert /*+ append */ into outgoing select * from outgoing ;
commit ;
insert /*+ append */ into outgoing select * from outgoing ;
commit ;
insert /*+ append */ into outgoing select * from outgoing ;
commit ;
insert /*+ append */ into outgoing select * from outgoing ;
commit ;
insert /*+ append */ into outgoing select * from outgoing ;
commit ;
insert /*+ append */ into outgoing select * from outgoing ;
commit ;
--
update outgoing
set ts = systimestamp - numtodsinterval (dbms_random.value (0, 60*60*24*2), 'second') ;
commit ;
--
insert /*+ append */ into outgoing select * from outgoing ;
commit ;
insert /*+ append */ into outgoing select * from outgoing ;
commit ;
--
select count (*) cnt , count (distinct ts) ts_cnt from outgoing ;
--
select bytes / (1024 * 1024) mb
from user_segments
where segment_name = 'OUTGOING' ;
</code></pre>
This produced 46,384,640 rows occupying 6.3 GB of disk storage. I
also checked the minimum and maximum values of the timestamp column,
and they did cover the past 48 hours (29-NOV-16 14.56.36.872969 to 01-DEC-16 14.56.36.863852).<br />
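That range check is just a simple aggregate over the timestamp column:<br />
<pre class="source-code"><code>select min (ts) min_ts, max (ts) max_ts from outgoing ;
</code></pre>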
<br />
<h2>
Test Queries & Execution Plans</h2>
In
the real system there are less than 10 categories, and the processing
involves picking some of the ones that occur the most frequently. I'll
use the <code>OBJECT_TYPE</code> for the category, and pick the top 3 most frequently occurring values for the test queries:<br />
<pre class="source-code"><code>select object_type, count (*)
from outgoing
group by object_type order by 2, 1 ;
OBJECT_TYPE COUNT(*)
----------------------- ----------
...
TABLE 1217536
TYPE 1296384
INDEX 2208256
VIEW 3252224
JAVA CLASS 15777792
SYNONYM 18917376
</code></pre>
So our test categories will be the bottom 3.<br /><br />Let's create an index on the category and timestamp columns, again just like the real system:<br />
<pre class="source-code"><code>create index ix_outgoing on outgoing (object_type, ts)
tablespace testdata ;
</code></pre>
And let's make sure our statistics on this table are up to date:<br />
<pre class="source-code"><code>exec dbms_stats.gather_table_stats ('JOHN', 'OUTGOING')
</code></pre>
When I ran these test queries and then displayed the execution
plans, the costs shown seemed wrong for the second query of multiple
categories. The reported cost was too low given the much higher elapsed
time and what I knew it was doing during the execution. To show the
real elapsed time I have done "<code>set timing on</code>" in SQL*Plus, and I have run the queries with the "<code>gather_plan_statistics</code>"
hint, and used extra options to get the run time execution statistics
for the query execution. This will show more about what really happened
during the execution.<br /><br />And to try and make things more even and
comparable between the queries I also flushed the buffer cache and
shared pool between the execution of each query (done from another
SQL*Plus session connected as SYSDBA).<br /><br />A single maximum for one query has the following execution plan:<br />
<pre class="source-code"><code>SQL> select /*+ gather_plan_statistics */ max (ts) from outgoing where object_type = 'SYNONYM' ;
MAX(TS)
---------------------------------------------------------------------------
01-DEC-16 14.56.36.841635
Elapsed: 00:00:00.14
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'TYPICAL -BYTES IOSTATS LAST'));
SQL_ID 5mxhp3mh3gzmj, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ max (ts) from outgoing
where object_type = 'SYNONYM'
Plan hash value: 856418741
---------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers | Reads |
---------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 4 (100)| | 1 |00:00:00.01 | 4 | 4 |
| 1 | SORT AGGREGATE | | 1 | 1 | | | 1 |00:00:00.01 | 4 | 4 |
| 2 | FIRST ROW | | 1 | 1 | 4 (0)| 00:00:01 | 1 |00:00:00.01 | 4 | 4 |
|* 3 | INDEX RANGE SCAN (MIN/MAX)| IX_OUTGOING | 1 | 1 | 4 (0)| 00:00:01 | 1 |00:00:00.01 | 4 | 4 |
---------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("OBJECT_TYPE"='SYNONYM')
</code></pre>
<br />The multi-category query has the following execution plan:<br />
<pre class="source-code"><code>SQL> select /*+ gather_plan_statistics */ max (ts) from outgoing
where object_type in ( 'SYNONYM' , 'JAVA CLASS' , 'VIEW' ) ;
MAX(TS)
---------------------------------------------------------------------------
01-DEC-16 14.56.36.841635
Elapsed: 00:00:36.98
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'TYPICAL -BYTES IOSTATS LAST'));
SQL_ID anb3s44frw6xa, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ max (ts) from outgoing
where object_type in ( 'SYNONYM' , 'JAVA CLASS' , 'VIEW' )
Plan hash value: 2754621723
-----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers | Reads |
-----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 4 (100)| | 1 |00:00:36.87 | 169K| 169K|
| 1 | SORT AGGREGATE | | 1 | 1 | | | 1 |00:00:36.87 | 169K| 169K|
| 2 | INLIST ITERATOR | | 1 | | | | 37M|00:02:23.89 | 169K| 169K|
|* 3 | INDEX RANGE SCAN| IX_OUTGOING | 3 | 1 | 4 (0)| 00:00:01 | 37M|00:01:02.14 | 169K| 169K|
-----------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access(("OBJECT_TYPE"='JAVA CLASS' OR "OBJECT_TYPE"='SYNONYM' OR "OBJECT_TYPE"='VIEW'))
</code></pre>
<br />The rephrased query using 3 separate per-category queries is:<br />
<pre class="source-code"><code>SQL> select /*+ gather_plan_statistics */ max (maxts) from (
select max (ts) maxts from outgoing where object_type = 'SYNONYM'
union all
select max (ts) maxts from outgoing where object_type = 'JAVA CLASS'
union all
select max (ts) maxts from outgoing where object_type = 'VIEW'
) ;
MAX(MAXTS)
---------------------------------------------------------------------------
01-DEC-16 14.56.36.841635
Elapsed: 00:00:00.15
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'TYPICAL -BYTES IOSTATS LAST'));
SQL_ID 1yv5abfvs46y7, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ max (maxts) from (
select max (ts) maxts from outgoing where object_type = 'SYNONYM'
union all select max (ts) maxts from outgoing where object_type = 'JAVA CLASS'
union all select max (ts) maxts from outgoing where object_type = 'VIEW' )
Plan hash value: 2647832233
------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost(%CPU)| E-Time | A-Rows | A-Time | Buffers | Reads |
------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 12(100)| | 1 |00:00:00.04 | 12 | 9 |
| 1 | SORT AGGREGATE | | 1 | 1 | | | 1 |00:00:00.04 | 12 | 9 |
| 2 | VIEW | | 1 | 3 | 12 (0)| 00:00:01 | 3 |00:00:00.04 | 12 | 9 |
| 3 | UNION-ALL | | 1 | | | | 3 |00:00:00.04 | 12 | 9 |
| 4 | SORT AGGREGATE | | 1 | 1 | | | 1 |00:00:00.01 | 4 | 4 |
| 5 | FIRST ROW | | 1 | 1 | 4 (0)| 00:00:01 | 1 |00:00:00.01 | 4 | 4 |
|* 6 | INDEX RANGE SCAN (MIN/MAX)| IX_OUTGOING | 1 | 1 | 4 (0)| 00:00:01 | 1 |00:00:00.01 | 4 | 4 |
| 7 | SORT AGGREGATE | | 1 | 1 | | | 1 |00:00:00.01 | 4 | 3 |
| 8 | FIRST ROW | | 1 | 1 | 4 (0)| 00:00:01 | 1 |00:00:00.01 | 4 | 3 |
|* 9 | INDEX RANGE SCAN (MIN/MAX)| IX_OUTGOING | 1 | 1 | 4 (0)| 00:00:01 | 1 |00:00:00.01 | 4 | 3 |
| 10 | SORT AGGREGATE | | 1 | 1 | | | 1 |00:00:00.02 | 4 | 2 |
| 11 | FIRST ROW | | 1 | 1 | 4 (0)| 00:00:01 | 1 |00:00:00.02 | 4 | 2 |
|* 12 | INDEX RANGE SCAN (MIN/MAX)| IX_OUTGOING | 1 | 1 | 4 (0)| 00:00:01 | 1 |00:00:00.02 | 4 | 2 |
------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
6 - access("OBJECT_TYPE"='SYNONYM')
9 - access("OBJECT_TYPE"='JAVA CLASS')
12 - access("OBJECT_TYPE"='VIEW')
</code></pre>
<br />
<h2>
Conclusions</h2>
The single category query executes in under 1 second (0.14 seconds on my test system), while the multiple category query using "<code>IN</code>" takes over 30 seconds (36.98 seconds), which is much slower (over 100 times slower for elapsed time).<br /><br />We can see that the single category query uses an optimized index access to get just the maximum timestamp value - an "<code>INDEX RANGE SCAN (MIN/MAX)</code>" combined with a "<code>FIRST ROW</code>" operation in the execution plan. However, the multiple category query using "<code>IN</code>" does not do this and instead does an "<code>INDEX RANGE SCAN</code>" combined with an "<code>INLIST ITERATOR</code>", which takes far longer to execute (as seen from the A-Time value in the reported execution plan and the 37M for A-Rows).<br /><br />The solution is to rewrite the query as a "<code>UNION ALL</code>" of 3 separate single category queries. We can safely use "<code>UNION ALL</code>"
because the categories are distinct from each other. This query
executes in under a second which is almost as fast as for just one
category, and uses the optimized "<code>INDEX RANGE SCAN (MIN/MAX)</code>" access method combined with the "<code>FIRST ROW</code>" operation.<br />
<br />
Sometimes it can be quicker to run separate <code>SELECT</code> queries and merge the results together than to run one single query that does it all at once. Sometimes it can even be 100 times faster to run separate queries than one single merged query.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-86907223391730469602016-10-20T11:57:00.000+01:002016-11-25T14:51:10.980+00:00AWR Summary Data ExtractsA long time ago (in a galaxy far away) I wrote a series of blog posts
about directly extracting useful sets of data from the AWR snapshots in
the <code>DBA_HIST</code> tables for subsequent analysis and graphing
using things like Excel. This post is to summarise where I got to on
this, and provide links back to the main posts for future reference.<br /><br />A
guiding principle of my approach was that I wanted to avoid the "mother
of all AWR queries" that would extract every possible data value from
all <code>DBA_HIST</code> tables for all snapshots, as it would be a
really big and complex query (meaning difficult to debug if a mistake
was introduced anywhere) and it could use up a lot of system resources
and take some time to finish each time it was executed. Instead I
prefer the smaller, targeted type of query that extracts just what you
need from the minimum number of AWR tables, because it is simpler to
understand and will be quicker to execute. That said, you could extend
my approach of one sub-query per <code>DBA_HIST</code> table and include
all of the main AWR snapshot tables into one single query, and then
create a view using it. This would allow you to write simpler analysis
queries later on against the view, without needing to know the
details of all the various sub-queries and <code>DBA_HIST</code> tables.<br /><br />Along the way I covered the following main topics in the blog posts:-<br />
<ul>
<li><a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2013/08/summary-reporting-on-awr-data-1.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2013/08/summary-reporting-on-awr-data-1.html">AWR Snapshots themselves and DB Time Model data per snapshot</a><ul>
<li>Introduction and how to get the change in values between snapshots</li>
</ul>
</li>
<li><a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2013/09/awr-summary-reporting-3-waits.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2013/09/awr-summary-reporting-3-waits.html">Wait Times broken down by class</a> - I/O, Network, Cluster, etc.</li>
<li><a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2013/10/awr-summary-reporting-4-more-waits.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2013/10/awr-summary-reporting-4-more-waits.html">Individual wait times by wait event name</a>, allowing specific waits to be reported<ul>
<li>e.g. "db file sequential read", "log file sync"</li>
</ul>
</li>
<li><a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2013/10/awr-summary-reporting-5-system.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2013/10/awr-summary-reporting-5-system.html">System statistics (across all sessions)</a><ul>
<li>e.g. "execute count", "user calls", "physical reads", "physical writes"</li>
</ul>
</li>
</ul>
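As an illustration of the view idea mentioned above, the per-table sub-queries can be combined and wrapped in a view; here is a minimal sketch with just the snapshot times and DB time (the view name is my own, and further sub-queries would be joined in the same way):<br />
<pre class="source-code"><code>-- Sketch only: a view combining per-snapshot sub-queries,
-- so later analysis queries need not repeat the DBA_HIST joins
create or replace view awr_snap_summary as
with snaps as
(select snap_id, dbid, min (end_interval_time) end_snap_time
from dba_hist_snapshot
group by snap_id, dbid
)
, dbtimes as
(select systime.snap_id, systime.dbid
, sum (systime.value - psystime.value) db_time
from dba_hist_sys_time_model systime, dba_hist_sys_time_model psystime
where systime.snap_id = psystime.snap_id + 1
and systime.dbid = psystime.dbid
and systime.instance_number = psystime.instance_number
and systime.stat_id = psystime.stat_id
and systime.stat_name = 'DB time'
group by systime.snap_id, systime.dbid
)
select snaps.snap_id, snaps.end_snap_time, dbtimes.db_time
from snaps
join dbtimes on snaps.snap_id = dbtimes.snap_id and snaps.dbid = dbtimes.dbid ;
</code></pre>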
Although I did not post a query that merged all the different component sub-queries together, I did post a minimal <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2013/11/basic-performance-analysis-using-awr.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2013/11/basic-performance-analysis-using-awr.html">Summary AWR Query</a>
at the end to show how you could use the basic AWR data in practice to
look at the overall performance and behaviour of a system. By extending
the main <code>SELECT</code> of the final query itself, you can add
extra columns of data you want into the output, without any changes to
the various component sub-queries (the "<code>with</code>" or Common
Table Expressions). To put it another way, if there is some extra AWR
data you want extracted then you need to work out which of the main data
sets it falls under, and then which columns or rows in that AWR data
set the data you want is stored in.<br /><br />For instance, here is another
variation of the AWR Summary Query that also extracts the number of SQL
statements executed, the separate wait times for User I/O and Cluster
waits (normalised to a percentage of the total wait time in the
snapshot), physical reads (normalised to a per second value), average
disk read and redo write times.<br /><pre class="source-code"><code>/**
* Name : awr_sum_lite
* Purpose : Minimal one line summary of each AWR snapshot from a day
* Descrip'n : Based on awr_sum_range with unneeded stuff deleted
* Only key essential stuff is output:
* SQL executions / sec, #User calls / sec
* Average Active Sessions, Wait % of Busy time
* Cluster Wait % of wait time, User I/O % of wait time
* Avg disk read time, Avg redo write time, #Disk reads / sec
* Values directly calculated, rather than raw underlying values
* Usage : @awr_sum_lite
* Assumption: Want the last 4 whole days
* Which is from midnight 4 days ago to last midnight
* To Do :
*/
--
set feedback off
set heading off
set newpage none
set verify off
--
set linesize 1000
set pages 28
set trimout on
set trimspool on
--
select ' ' from dual ;
select ' AWR Lite Summary Report' from dual ;
select ' =======================' from dual ;
-- select ' ' from dual ;
-- select '&lt;snap&gt;'
-- || '<= Database Time =>'
-- || '<= System Statistics ==>'
-- || '<== Waits ================>'
-- from dual ;
set newpage 1
set heading on
--
col snap_time format a15 heading 'SNAP TIME'
col aas format 90.0 heading 'AAS'
col wait_pct format 990.0 heading 'WAIT%'
col cluster_pct format 90.0 heading 'CLUS%'
col user_io_pct format 90.0 heading 'UIO%'
col executions_sec format 9,990 heading 'EXEC/S'
col user_calls_sec format 9,990 heading 'UCALL/S'
col physical_reads_sec format 9,990 heading 'READS/S'
col avg_disk_read format 990.0 heading '(MS)|AV RD'
col avg_redo_write format 990.0 heading '(MS)|REDO W'
--
--
with
snaps as
(select snap_id
, dbid
, end_snap_time
, snap_interval
, extract (second from snap_interval)
+ (extract (minute from snap_interval)
+ (extract (hour from snap_interval)
+ (extract (day from snap_interval) * 24)
) * 60
) * 60 snap_duration
from (select csnaps.snap_id
, csnaps.dbid
, min (csnaps.end_interval_time) end_snap_time
, min (csnaps.end_interval_time) - min (csnaps.begin_interval_time) snap_interval
from dba_hist_snapshot csnaps
group by csnaps.snap_id, csnaps.dbid
)
) -- snaps
, systimes as
-- One row per Database Time Model with change in value between snapshots
(select systime.snap_id
, systime.dbid
, systime.stat_name
, sum (systime.value - psystime.value) value
from dba_hist_sys_time_model systime, dba_hist_sys_time_model psystime
where systime.snap_id = psystime.snap_id + 1
and systime.dbid = psystime.dbid
and systime.instance_number = psystime.instance_number
and systime.stat_id = psystime.stat_id
-- Assume if stat_id the same so is the stat_name
group by systime.snap_id, systime.dbid, systime.stat_name
) -- systimes
, sysstats as
-- One row per System Statistic with change in value between snapshots
(select sysstat.snap_id
, sysstat.dbid
, sysstat.stat_name
, sum (sysstat.value - psysstat.value) value
from dba_hist_sysstat sysstat, dba_hist_sysstat psysstat
where sysstat.snap_id = psysstat.snap_id + 1
and sysstat.dbid = psysstat.dbid
and sysstat.instance_number = psysstat.instance_number
and sysstat.stat_id = psysstat.stat_id
-- Assume if stat_id the same so is the stat_name
group by sysstat.snap_id, sysstat.dbid, sysstat.stat_name
) -- sysstats
, syswaits as
-- One row for total wait time, plus break down into major wait classes, and events
(select syswaitevents.snap_id
, syswaitevents.dbid
, sum (syswaitevents.time_waited_micro) all_wait_time
, sum (decode (syswaitevents.wait_class, 'Commit', syswaitevents.time_waited_micro, 0)) commit_time
, sum (decode (syswaitevents.wait_class, 'Cluster', syswaitevents.time_waited_micro, 0)) cluster_time
, sum (decode (syswaitevents.wait_class, 'Concurrency', syswaitevents.time_waited_micro, 0)) concurrency_time
, sum (decode (syswaitevents.wait_class, 'Network', syswaitevents.time_waited_micro, 0)) network_time
, sum (decode (syswaitevents.wait_class, 'System I/O', syswaitevents.time_waited_micro, 0)) system_io_time
, sum (decode (syswaitevents.wait_class, 'User I/O', syswaitevents.time_waited_micro, 0)) user_io_time
from
(select sysevent.snap_id
, sysevent.dbid
, sysevent.wait_class
, sysevent.event_name
, sum (sysevent.time_waited_micro - psysevent.time_waited_micro) time_waited_micro
, sum (sysevent.total_waits - psysevent.total_waits) wait_count
from dba_hist_system_event sysevent, dba_hist_system_event psysevent
where sysevent.snap_id = psysevent.snap_id + 1
and sysevent.dbid = psysevent.dbid
and sysevent.instance_number = psysevent.instance_number
and sysevent.event_id = psysevent.event_id
and sysevent.wait_class != 'Idle' -- Ignore Idle wait events
group by sysevent.snap_id
, sysevent.dbid
, sysevent.wait_class
, sysevent.event_name
) syswaitevents
group by syswaitevents.snap_id
, syswaitevents.dbid
) -- syswaits
-- Average Active Sessions, Wait % of Busy time
-- Cluster Wait % of wait time, User I/O % of wait time
-- SQL executions / sec, #User calls / sec
-- Avg disk read time, Avg redo write time, #Disk reads / sec
select to_char (snaps.end_snap_time, 'DD/MM/YY HH24:MI') snap_time
, (user_calls_st.value / snaps.snap_duration) user_calls_sec
, (execs.value / snaps.snap_duration) executions_sec
, (dbtime.value / 1000000) / snaps.snap_duration aas
-- If database active time is 1% of duration time or less ignore wait (0)
, case when (dbtime.value / (1000 * snaps.snap_duration) ) > 1
then (100 * syswaits.all_wait_time / dbtime.value)
else 0.0
end wait_pct
, (100 * syswaits.cluster_time / syswaits.all_wait_time) cluster_pct
, (100 * syswaits.user_io_time / syswaits.all_wait_time) user_io_pct
, (phys_reads.value / snaps.snap_duration) physical_reads_sec
, (syswaits.user_io_time / phys_reads.value) / 1000 avg_disk_read
, (redo_time_st.value * 10 / redo_write_st.value) avg_redo_write
from snaps
join (select * from systimes where stat_name = 'DB time') dbtime
on snaps.snap_id = dbtime.snap_id and snaps.dbid = dbtime.dbid
join syswaits
on snaps.snap_id = syswaits.snap_id and snaps.dbid = syswaits.dbid
join (select * from sysstats where stat_name = 'execute count') execs
on snaps.snap_id = execs.snap_id and snaps.dbid = execs.dbid
join (select * from sysstats where stat_name = 'user calls') user_calls_st
on snaps.snap_id = user_calls_st.snap_id and snaps.dbid = user_calls_st.dbid
join (select * from sysstats where stat_name = 'redo writes') redo_write_st
on snaps.snap_id = redo_write_st.snap_id and snaps.dbid = redo_write_st.dbid
join (select * from sysstats where stat_name = 'redo write time') redo_time_st
on snaps.snap_id = redo_time_st.snap_id and snaps.dbid = redo_time_st.dbid
join (select * from sysstats where stat_name = 'physical reads') phys_reads
on snaps.snap_id = phys_reads.snap_id and snaps.dbid = phys_reads.dbid
where snaps.end_snap_time between
(trunc (sysdate) - 4) and (trunc (sysdate))
order by snaps.end_snap_time
/
--
set feedback on
set lines 80
</code></pre>
This query provides what I believe are a good, minimal set of
useful measurements of activity and relative performance of an Oracle
database. It allows easy viewing of relative activity (Average Active
Sessions, User Calls / sec), waiting time as a percentage of database
time, the main types of wait occurring (User I/O and Cluster), and some
key disk I/O measurements (physical reads / sec, average disk read time,
average redo write time). From these you can see how active the system
is, whether any significant waits are occurring, and if they are I/O
related, and see how these change over time between AWR snapshots.<br /><br />Of
course everyone has a different opinion of what the most important
measurements are on an Oracle database. My key message here is that the
approach I have followed lets you customise these queries to report out
the performance and activity measurements you want from the data
captured in the AWR snapshots. So if you want a different set of data
values reported out, then it should be relatively easy to modify these
queries to get what you want.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-67994205252372417862016-03-30T11:55:00.000+01:002016-03-30T11:55:26.707+01:00Full Table Scan not always as low as 0.5% of data!Based on a <a class="externalLink" href="https://community.oracle.com/message/13756179#13756179" target="_blank" title="External link to https://community.oracle.com/message/13756179#13756179">reply from Jonathan Lewis</a> to an <a class="externalLink" href="https://community.oracle.com/message/13755118" target="_blank" title="External link to https://community.oracle.com/message/13755118">OTN post on Explain Plans</a> I need to correct some of the claims made in my <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2016/03/full-table-scan-friend-or-foe.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2016/03/full-table-scan-friend-or-foe.html">previous post on Full Table Scans</a>.<br /><br />I'm
going to repeat the inaccurate paragraphs and the Conclusions I posted
then [edited here for brevity], point out the mistakes in them, and then
attempt a corrected set of paragraphs. I will also edit the original
post and replace these incorrect paragraphs with the corrected
ones.<br /><br />I'll highlight the parts that are not strictly correct.<br />
<blockquote>
<em>The
main determinant for whether a FTS is the best access method is the
fraction or percentage of the rows in the table being retrieved by the
query</em>, which are of course determined by the filter conditions in the query. <em>If
this fraction / percentage of data from the table is high enough then a
Full Table Scan will always be a lower cost than using an Index Scan</em> (except maybe for an Index Fast Full Scan, which is really another variation on a Full Scan).<br /><br />So when you see a Full Table Scan in an execution plan you should check the query itself for <em>how many rows it needs from the total number of rows in the table i.e. what percentage or fraction. If this is high enough</em>
then a FTS is indeed the "best" access method to get that data because
it has the lowest cost. In which case you should really be asking why
the query needs such a high percentage of the data from one table. Don't
assume that the Optimizer is wrong - normally it isn't. It is just as
likely to be your query that is affecting the execution plan chosen.<br /><br /><h2>
Conclusions</h2>
Although
a Full Table Scan can seem a "brute force" approach to finding some
matching records in a table, it can sometimes be the better way of doing
it though. <em>It all depends on how many rows you want back from the table as a percentage of the rows in the table.</em><br /><br /><em>If
your query is retrieving more than about 0.20% to 0.35% of the data in a
table then a Full Table Scan may well be the cheapest and best access
method.</em> [...]. Trying to force the Optimizer into using another
access method in this circumstance is a waste of time, because all other
access methods will be more expensive. The only exception might be
another Full Scan type access, such as an Index Full Scan. But even then
the gains (reduction in cost) will only be marginal i.e. not a full
order of magnitude less.<br /><br />When you see a FTS in a query execution plan <em>you should check the estimated row count and what this is as a percentage of the row count in the table</em>,
and confirm if this estimate is correct or not. If the estimate is
correct then a FTS is the lowest cost access method and the Optimizer is
right to choose it. You should also check if your query is correct, or
if there is something wrong with the filter conditions in it.</blockquote>
The
main error is that it is wrong for me to claim any kind of actual
figures for the percentage when the cutover will occur. As Jonathan
points out in his reply on OTN it is possible for an index access to
have a lower cost than a FTS for 24.5% of the data in a table, which is
far above the 0.35% figure I mention. The actual cutover percentage is
dependent on so many variables that it is incorrect for me to try and
state a specific range for the cutover.<br />
<br />While I was not
necessarily wrong in my specific examples because I made clear what my assumptions were, I was
also not right in all possible scenarios. So I'm withdrawing my
original Conclusions and updating them to be more strictly correct.<br /><blockquote>
[paragraph deleted]<br /><br />So when you see a Full Table Scan in an execution plan you should check the query <em>and all other relevant factors to see if</em>
a FTS is indeed the "best" access method to get that data because it
has the lowest cost. [deleted sentence] Don't assume that the
Optimizer is wrong - normally it isn't. It is just as likely to be your
query that is affecting the execution plan chosen.<br /><br /><h2>
Conclusions</h2>
Although
a Full Table Scan can seem a "brute force" approach to finding some
matching records in a table, it can sometimes be the better way of doing
it though. <em>It depends on several factors including how many rows
you want back from the table, the number of blocks for the table, and
the Clustering Factor of any indexes.</em><br /><br /><em>It is possible
that for even low percentages of data being retrieved from a table that a
Full Table Scan can be the cheapest and best access method.</em> [...].
Trying to force the Optimizer into using another access method in this
circumstance is a waste of time, because all other access methods will
be more expensive. The only exception might be another Full Scan type
access, such as an Index Full Scan. But even then the gains (reduction
in cost) will only be marginal i.e. not a full order of magnitude less.<br /><br />When you see a FTS in a query execution plan <em>you
should check many things including, but not limited to, the number of
rows in the table, the number of blocks used for the table, the
Clustering Factor for each possible index, and the estimated row count
for the filters being used</em>.<br /><br />If the estimate is correct then a
FTS is the lowest cost access method and the Optimizer is right to
choose it. You should also check if your query is correct, or if there
is something wrong with the filter conditions in it.</blockquote>
<br />
<h2>
Counter Example</h2>
It
is relatively simple to show a similar query to the ones I was using
that uses an index instead of a full table scan for a far higher
percentage of data in the table than 0.5%. Jonathan posted one such
counter example in one of his replies on OTN, and we can easily
replicate this using the same test data set from my original post.<br />
<pre class="source-code"><code>
select sum (one) from scantest where pkid between 1000000 and 2000000 ;
SUM(ONE)
----------
1000001
Statistics
----------------------------------------------------------
10468 consistent gets
10311 physical reads
2 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
1 rows processed
select sum (one) from scantest where pkid between 1000000 and 2000000
Plan hash value: 40453105
-----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 10282 (100)| |
| 1 | SORT AGGREGATE | | 1 | 9 | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| SCANTEST | 1000K| 8789K| 10282 (1)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | IX_SCAN_PKID | 1000K| | 2223 (1)| 00:00:01 |
-----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("PKID">=1000000 AND "PKID"<=2000000)
</code></pre>
<strong>Note:</strong><br /><ul>
<li>This query is retrieving 10% of the data in the table</li>
<li>An index range scan is being used to retrieve the data</li>
<li>A Full Table Scan had a cost of just over 22,000, so this index scan is much cheaper at 10,282</li>
<li>The key difference is the much lower Clustering Factor of the index on the <code>pkid</code> column</li>
</ul>
This index range scan is still cheaper when retrieving 20% of the data in the table:<br />
<pre class="source-code"><code>
select sum (one) from scantest where pkid between 1000000 and 3000000 ;
SUM(ONE)
----------
2000001
Statistics
----------------------------------------------------------
20762 consistent gets
20697 physical reads
2 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
1 rows processed
select sum (one) from scantest where pkid between 1000000 and 3000000
Plan hash value: 40453105
-----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 20562 (100)| |
| 1 | SORT AGGREGATE | | 1 | 9 | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| SCANTEST | 2000K| 17M| 20562 (1)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | IX_SCAN_PKID | 2000K| | 4442 (1)| 00:00:01 |
-----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("PKID">=1000000 AND "PKID"<=3000000)
</code></pre>
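Both of these plan costs line up well with the standard index cost formula once the low Clustering Factor of the pkid index is taken into account. Here is a rough Python check, using levels = 3, leaf blocks = 22,132 and Clustering Factor = 80,528 from the statistics of this test data set; the small shortfall against the reported plan costs of 10,282 and 20,562 is the CPU cost component, which this sketch ignores:

```python
# Index range scan cost = levels + (leaf blocks * filter factor)
#                                + (clustering factor * filter factor)
# Statistics for the unique index IX_SCAN_PKID on the scantest table.
levels, leaf_blocks, clustering_factor = 3, 22_132, 80_528

def index_cost(filter_factor):
    return levels + (leaf_blocks + clustering_factor) * filter_factor

print(round(index_cost(0.1)))   # 10% of rows: about 10,269 vs plan cost 10,282
print(round(index_cost(0.2)))   # 20% of rows: about 20,535 vs plan cost 20,562
```

Because the table rows are stored in the same order as the index (a Clustering Factor close to the number of blocks in the table rather than the number of rows), the data access cost grows far more slowly than for the pct columns, which is why this index stays cheaper than the roughly 22,000 cost of a Full Table Scan even at 20% of the rows.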
<br />
<h2>
Conclusion?</h2>
Be careful of generalising and summarising
certain observed behaviours too much with Oracle. When considering
index usage there is much more to it than just the number of rows being
retrieved as a percentage of those in the table.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-17975147019584826012016-03-21T10:21:00.000+00:002016-11-25T15:01:04.504+00:00Creeping Slow PerformanceA <a class="externalLink" href="https://community.oracle.com/message/13730154" target="_blank" title="External link to https://community.oracle.com/message/13730154">recent OTN post</a>
requesting help for a slow performing update where Oracle was doing a
full table scan also revealed that the update was getting slower and
slower each month, and that new data was loaded into the table each
month to be "processed" i.e. updated. Unfortunately this kind of design
suffers from a natural creeping slowdown in the elapsed time of the
update process. Each month it will take longer and longer to complete
the processing of the newly loaded data, until any elapsed time targets
are exceeded and it continues to take longer and longer to complete.
There is a bad design pattern here, but obviously the original designers
and implementers did not spot it.<br /><br />Consider the following - in
the first month after go-live the table only contains newly loaded data,
so a full table scan is used. Let's say it takes 2 minutes to
complete, and our target is 30 minutes for the whole process to
complete, which includes other steps as well. Everything looks good
performance wise, so we leave everything as it is and carry on.<br /><br />The
next month - month 2 - the size of the table has doubled, and Oracle
still uses a full table scan because you want to update half the data in
the table. This now takes 4 minutes because the table has doubled in
size. But the overall process still completes within 30 minutes, so
everything still looks good.<br /><br />Over the next few months the table
grows each month, and the elapsed time of the update continues to
increase by 2 minutes each month, assuming a similar volume of data is
loaded in each month. After 5 months the update now takes 10 minutes,
but this is still within our 30 minute target.<br /><br />But after 10
months the elapsed time is now 20 minutes, which is a significant
proportion of the 30 minutes, and people are starting to ask questions.
And after 15 months, if not sooner, the 30 minute target has been
missed and the overall data load and process is now taking too long.
Not only that, it is continuing to take longer and longer to complete
each month, and will only get worse.<br /><br />What can be done? Could an index help? No, is the simple answer. As I recently blogged in <a class="externalLink" href="http://databaseperformance.blogspot.co.uk/2016/03/full-table-scan-friend-or-foe.html" target="_blank" title="External link to http://databaseperformance.blogspot.co.uk/2016/03/full-table-scan-friend-or-foe.html">Full Table Scan - Friend or Foe?</a>,
the cutover point for the Optimizer to choose to use an index is when
you want to retrieve less than 0.5% or so of the data in a table. After
10 months you are still processing 10% of the data in the table each
month. And worse, even after 100 months (8+ years) you are still
processing 1% of the data in the table each month, and a full table scan
is <strong><em>still a lower execution cost</em></strong> than using any index on the table.<br /><br />What
could you do about this? One obvious solution is to use a "staging
table" of the same structure into which the data is initially loaded
each month. Then do the update and other processing on this table, and
finally insert all the data into the main table. This solution does not
suffer from the constant degradation in performance that the current
one does. Oracle will still use a full table scan, but it is only
reading from disk the data for this month, and so should complete in a
relatively constant time each month - assuming data volumes are similar
each month. The insert to the main table could be done quite quickly
using the "<code>append</code>" hint for a direct load, and then the
staging table could be truncated which should be very quick. Although
the data is copied around in Oracle twice, it should still be much
quicker than scanning a table with hundreds of months of data in it.<br /><br />Other
possibilities include partitioning by a key unique to each month's
data, which is also used in the update statement so that partition
pruning occurs. This may be problematic if there is not a clean
partition key to use. It should not be a "processed flag", as this
would be updated by the processing itself which would result in row
movement between partitions and extra work by Oracle. <br /><br />Ultimately
the issue is about spotting these kind of bad design patterns early on
and avoiding these pending future performance problems. To get to one
month's data being less than 0.5% of the total data volume in the table
would take you over 200 months, or almost 17 years. And during those 17
years this monthly data load processing would always take longer and
longer to complete. A different design can avoid this future performance
problem completely.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-59513675755710148852016-03-01T11:42:00.003+00:002016-03-30T12:07:42.341+01:00Full Table Scan - Friend or Foe?[Or <b>Don't be afraid of Full Table Scans]</b><br />
<br />
<b>UPDATE 30 March 2016</b><br />
I've come to realise that there are some mistakes and inaccuracies in this post [<a href="http://databaseperformance.blogspot.co.uk/2016/03/full-table-scan-not-always-as-low-as-05.html">see this post for more details</a>], so I've edited this post and corrected what I can. Rather than leave behind incorrect claims, I've replaced them by the corrected text, so hopefully now this post no longer has its previous faults.<br />
<br />
<b>ORIGINAL POST:</b><br />
Many people consider a Full Table Scan (FTS) in a query execution plan to be a bad thing - reading <b><i>every</i></b> record from a table to find only those records the query needs. This is indicated by a "<code>TABLE ACCESS FULL</code>"
in an execution plan. But is it really all that bad? Is it actually
sometimes the right tool for the job? Can a Full Table Scan sometimes
have a lower cost than using an Index?<br />
<br />
My view is that the Oracle Optimizer will only choose a FTS under the following two conditions:<br />
<ul>
<li>There is no other possible access method to get the needed data from the table<ul>
<li>No other access method is currently available, but another access method might have a lower cost if it were available</li>
</ul>
</li>
<li>A FTS is a lower cost (and so should be faster) than all other possible access methods<ul>
<li>The FTS is the "best" access method of all available and the Optimizer is correct to use it</li>
</ul>
</li>
</ul>
In other words, when you see a "<code>TABLE ACCESS FULL</code>"
in an execution plan you should not jump to the conclusion that this is
wrong and "needs to be fixed", because maybe it is correct and it is
indeed the best access method for that particular step of the query
operation. If you have concerns about the FTS then you need to double
check things to find out whether it is the right thing for the Optimizer
to be choosing, or whether a better access method would reduce the
execution cost. Jumping to the wrong conclusion can lead you down a
dead end when trying to improve a query's performance.<br />
<br />
So when you see a Full Table Scan in
an execution plan you should check the query and all other relevant factors to see if a FTS is indeed the "best"
access method to get that data because it has the lowest cost. Don't assume that the Optimizer
is wrong - normally it isn't. It is just as likely to be your query
that is affecting the execution plan chosen.<br />
<br />
<h2>
Calculating the Cutover Point</h2>
The
actual percentage or fraction of rows in the table that is the cutover
point between the Optimizer using a Full Table Scan or an available
Index is itself mainly dependent on the average number of rows stored
per data block. The other main factor is the value being used by the
Optimizer for Multi-Block Read Count (MBRC), which is part of the System
Statistics stored in the database. It is possible to calculate this
cutover point yourself on a given table to see when a FTS really is
cheaper and when an Index might help.<br />
<br />
For a specific query
involving an equality filter on a column, the other factor is what
fraction of the data rows in the table have the same value stored in
them. Oracle maintains this as a statistic on each column named
"Density". By comparing the Density value for a column against the
fraction of rows in the table needed by the query, you can see whether a
FTS would be cheaper or not for a filter on that column. Remember that
a percentage is just a fractional decimal value multiplied by 100 - so a
Density of 0.0025 means 0.25%.<br />
<br />
First, the value used for MBRC (Multi-Block Read Count). This is stored in the <code>AUX_STAT$</code> table owned by <code>SYS</code>.
Its value only gets set when you gather system statistics. If set,
then that particular value is used. If not set then a default value of 8
is used (note that it seems to ignore the value of the initialization
parameter <code>db_file_multiblock_read_count</code>). The Optimizer also uses the values for <code>MREADTIM</code> and <code>SREADTIM</code> (Multi-block read time and Single block read time) also in the <code>AUX_STAT$</code> table. Again, if these have not been set then it will use a default formula to derive them from the values for <code>IOSEEKTIM</code> and <code>IOTFRSPEED</code>.<br />
<br />
To try and keep this explanation short we can jump to the following formulae used when you have not gathered system statistics.<br />
<pre class="source-code"><code>SREADTIM = IOSEEKTIM + (DB_BLOCK_SIZE / IOTFRSPEED)
MREADTIM = IOSEEKTIM + (MBRC * DB_BLOCK_SIZE / IOTFRSPEED)
</code></pre>
For the default values of <code>MBRC</code> (8), <code>IOSEEKTIM</code> (10) and <code>IOTFRSPEED</code> (4096) and a <code>DB_BLOCK_SIZE</code> of 8192 (8 KB) you get an <code>SREADTIM</code> value of 12 milliseconds, and a <code>MREADTIM</code> value of 26 milliseconds.<br />
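These defaults are easy to verify. A small Python sketch of the two formulae with the default values just described:

```python
# Default "no-workload" system statistics values
IOSEEKTIM = 10        # I/O seek time in milliseconds
IOTFRSPEED = 4096     # I/O transfer speed in bytes per millisecond
MBRC = 8              # default Multi-Block Read Count
DB_BLOCK_SIZE = 8192  # 8 KB database block size

sreadtim = IOSEEKTIM + (DB_BLOCK_SIZE / IOTFRSPEED)         # single block read time
mreadtim = IOSEEKTIM + (MBRC * DB_BLOCK_SIZE / IOTFRSPEED)  # multi-block read time
print(sreadtim, mreadtim)  # 12.0 26.0 milliseconds
```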
<br />
When
a Full Table Scan occurs the Optimizer knows that it will be doing
multi-block reads. However, it needs to cast or convert the cost of
these reads into units of single block reads. This is because all other
I/O costs are in terms of single block reads internal to the Optimizer.
And the "cost" of a multi-block read is not the same as the "cost" of a
single block read - a multi-block read should take longer given the
greater number of blocks being transferred.<br />
<br />
Instead of dividing the number of blocks in the table by the full <code>MBRC</code> value, it "adjusts" the MBRC by the ratio of <code>SREADTIM</code> to <code>MREADTIM</code>, and then uses the adjusted value in the cost calculation. <br />
<br />
The cost of a FTS would then be calculated as (note the brackets):<br />
<pre class="source-code"><code>FTS Cost = #Blocks in Table / (MBRC * SREADTIM / MREADTIM)
FTS Cost = (#Blocks in Table * MREADTIM) / (MBRC * SREADTIM)
</code></pre>
For an Index lookup to be cheaper than this FTS cost, we can
calculate the fraction of rows in the table where the cost of using an
Index would be slightly less than this FTS cost. For this we need to
know the following values from the statistics Oracle has on the table
and the particular column used in the equality filter:<br />
<ul>
<li>Number of Rows in the table - <code>NUM_ROWS</code> in <code>USER_TABLES</code> or <code>USER_TAB_STATISTICS</code></li>
<li>Number of Blocks used for the table - <code>BLOCKS</code> in <code>USER_TABLES</code> or <code>USER_TAB_STATISTICS</code></li>
<li>Density of the column - <code>DENSITY</code> in <code>USER_TAB_COLS</code> or <code>USER_TAB_COL_STATISTICS</code></li>
</ul>
A Full Table Scan is cheaper than using an Index lookup when:<br />
<pre class="source-code"><code>Density > (BLOCKS * MREADTIM) / (MBRC * SREADTIM * NUM_ROWS)
</code></pre>
When the Density of a column is less than this value then an Index lookup would be cheaper.<br />
<br />
<h2>
Conclusions</h2>
Although
a Full Table Scan can seem a "brute force" approach to finding some
matching records in a table, it can sometimes be the better way of doing
it. It depends on several factors including how many rows you want back
from the table, the number of blocks for the table, and the Clustering
Factor of any indexes.<br />
<br />
It is possible, even for low percentages of data being retrieved
from a table, that a Full Table Scan can be the cheapest and best access
method. Trying
to force the Optimizer into using another access method in this
circumstance is a waste of time, because all other access methods will
be more expensive. The only exception might be another Full Scan type
access, such as an Index Full Scan. But even then the gains (reduction
in cost) will only be marginal i.e. not a full order of magnitude less.<br />
<br />
When
you see a FTS in a query execution plan you should check many things including, but not limited to, the
number of rows in the table, the number of blocks used for the table,
the Clustering Factor for each possible index, and the estimated row
count for the filters being used. If the estimate
is correct then a FTS is the lowest cost access method and the Optimizer
is right to choose it. You should also check if your query is correct,
or if there is something wrong with the filter conditions in it.<br />
<br />
<h2>
Tests</h2>
Let's
show whether this holds true with some tests. There is quite a lot of
output here, but I want to provide everything so that anyone else can
reproduce these tests on their own systems.<br />
<br />
Database version - 12.1.0.2.0<br />
Operating System - Oracle Linux 7.2<br />
<br />
Scan Test table - 10 million rows, with columns of different repeated values:<br />
<pre class="source-code"><code>drop table scantest ;
--
prompt Loading data ....
create table scantest
tablespace testdata
as
select r pkid
, 1 one -- a constant, which forces actual data row access
, mod (r, 10) pct10 -- 10 values = 10% of data in table
, mod (r, 20) pct5 -- 20 values = 5% of data in table
, mod (r, 50) pct2 -- 50 values = 2% of data in table
, mod (r, 100) pct1 -- 100 values = 1% of data in table
, mod (r, 200) pct05 -- 200 values = 0.5% of data in table
, mod (r, 500) pct02 -- 500 values = 0.2% of data in table
, mod (r, 1000) pct01 -- 1,000 values = 0.1% of data in table
, mod (r, 2000) pct005 -- 2,000 values = 0.05% of data in table
, mod (r, 5000) pct002 -- 5,000 values = 0.02% of data in table
, mod (r, 10000) pct001 -- 10,000 values = 0.01% of data in table
, mod (r, 20000) pct0005 -- 20,000 values = 0.005% of data in table
, mod (r, 50000) pct0002 -- 50,000 values = 0.002% of data in table
from (select rownum r
from (select rownum r from dual connect by level <= 1000) a,
(select rownum r from dual connect by level <= 1000) b,
(select rownum r from dual connect by level <= 1000) c
where rownum <= 10000000) ;
--
prompt Gathering Statistics ....
exec dbms_stats.gather_table_stats ('JOHN', 'SCANTEST')
--
prompt Creating Indexes ....
create unique index ix_scan_pkid on scantest (pkid) ;
create index ix_scan_pct2 on scantest (pct2) ;
create index ix_scan_pct1 on scantest (pct1) ;
create index ix_scan_pct05 on scantest (pct05) ;
create index ix_scan_pct02 on scantest (pct02) ;
create index ix_scan_pct01 on scantest (pct01) ;
</code></pre>
This produces a table with the following statistics:<br />
<pre class="source-code"><code> TABLE STATISTICS
Table %F IT In Ext Next Ext %I
------------------------------ --- --- ---------------- ---------------- ---
SCANTEST 10 1 65,536 1,048,576
Avg Spc
Num Rows Blocks E Blocks Free/Blk Chains Avg Row Len
------------ ------------ -------- --------- ------ -----------
10,000,000 80,951 0 .00 0 53.000
Table Num Extents Blocks Avg Blocks
------------------------------ ------------ ------------ ------------
SCANTEST 151 81,920 543
Leaf Distinct Clustering
Index Height Blocks Keys Factor
-------------------- -------- -------- ---------------- ----------------
IX_SCAN_PKID 3 22,132 10,000,000 80,528
IX_SCAN_PCT2 3 19,503 50 4,026,368
IX_SCAN_PCT1 3 19,518 100 8,052,718
IX_SCAN_PCT05 3 20,212 200 10,000,000
IX_SCAN_PCT02 3 20,629 500 10,000,000
IX_SCAN_PCT01 3 20,768 1,000 10,000,000
</code></pre>
<br />
Full Table Scan cost is calculated as follows:<br />
<pre class="source-code"><code>FTS Cost = (BLOCKS * MREADTIM) / (MBRC * SREADTIM)
</code></pre>
On my system <code>MREADTIM</code> and <code>SREADTIM</code> are not set, and the others have default values in the system statistics in <code>AUX_STAT$</code> (<code>IOSEEKTIM</code> = 10, <code>IOTFRSPEED = 4096</code>), so using the formula from before this gives <code>SREADTIM</code> of 12 ms and <code>MREADTIM</code> of 26 ms.<br />
<br />
Plugging these values into the previous formula gives about 22,000 as the cost for the FTS:<br />
<pre class="source-code"><code>(80951 * 26) / (8 * 12) = 2104726 / 96 = 21,924.23
</code></pre>
We can calculate the cutover point up to which a FTS would be cheaper than an Index Scan using the formula given before:<br />
<pre class="source-code"><code>Density > (BLOCKS * MREADTIM) / (MBRC * SREADTIM * NUM_ROWS)
Density > (80,951 * 26) / (8 * 12 * 10,000,000) = 0.002192 or 0.22% approximately
</code></pre>
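The same arithmetic in a short Python sketch, using the statistics of the scantest table, reproduces both the Full Table Scan cost and the cutover Density:

```python
# Table statistics for scantest, plus the derived system statistics values
blocks, num_rows = 80_951, 10_000_000
mbrc, sreadtim, mreadtim = 8, 12, 26

fts_cost = (blocks * mreadtim) / (mbrc * sreadtim)
cutover_density = (blocks * mreadtim) / (mbrc * sreadtim * num_rows)

print(round(fts_cost, 2))         # 21924.23 - about 22,000
print(round(cutover_density, 6))  # 0.002192 - about 0.22%
```

Note that the cutover Density is just the FTS cost divided by the number of rows, which shows why the rows stored per block matter so much: fewer rows per block means more blocks for the same row count, a higher FTS cost, and so a higher cutover percentage.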
Remember that this is based on several assumptions (Index
Clustering Factor) and simplifications (ignoring CPU costs). Thus the
cutover point will not be a precise value of 0.0022 (0.22%) but
something around this value.<br />
<br />
Having calculated the FTS cost at
about 22,000 we can also calculate the expected Index Scan costs, and
see whether they do drop below the FTS cost under 0.22% of the data in
the table.<br />
<br />
An Index Scan cost has 2 components:<br />
<pre class="source-code"><code>Index Access Cost = Levels + (Leaf Blocks * Filter Factor)
Data Access Cost = Clustering Factor * Filter Factor
</code></pre>
The Filter Factor is the selectivity of a single value for an
equality predicate filter, being the Density of the column, or one over
the Number of Distinct Values. In real terms it is the percentage of
rows being retrieved, which in our test table is indicated by the column
name.<br />
<br />
For all indexes the number of Levels is 3 and the number
of Leaf Blocks is about 20,500. Yes, it does vary but it is not
significantly different for the tests we are doing here. Also, from
index IX_SCAN_PCT05 onwards the Clustering Factor is always 10 million, i.e. the data is not clustered with respect to the index, so the Clustering Factor equals the row count of the table. <br />
<br />
<table class="twtable"><tbody>
<tr class="evenRow"><th>Column Name</th><th>% of Rows returned</th><th>Fraction of Rows</th><th>Index Cost</th><th>Data Cost</th><th>Total Cost</th></tr>
<tr class="oddRow"><td>pct1</td><td>1%</td><td>0.01</td><td>3 + (20,500 * 0.01) = 3 + 205 = 208</td><td>8,000,000 * 0.01 = 80,000</td><td>80,208</td></tr>
<tr class="evenRow"><td>pct05</td><td>0.5%</td><td>0.005</td><td>3 + (20,500 * 0.005) = 3 + 102.5 = 105.5</td><td>10M * 0.005 = 50,000</td><td>50,106</td></tr>
<tr class="oddRow"><td>pct02</td><td>0.2%</td><td>0.002</td><td>3 + (20,500 * 0.002) = 3 + 41 = 44</td><td>10M * 0.002 = 20,000</td><td>20,044</td></tr>
<tr class="evenRow"><td>pct01</td><td>0.1%</td><td>0.001</td><td>3 + 20.5 = 23.5</td><td>10M * 0.001 = 10,000</td><td>10,024</td></tr>
</tbody></table>
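The table above can be regenerated directly from the two cost components. A Python sketch using the same rounded figures (3 levels, roughly 20,500 leaf blocks, and a Clustering Factor of 8 million for pct1 and 10 million for the others):

```python
levels, leaf_blocks = 3, 20_500

def total_cost(filter_factor, clustering_factor):
    index_access = levels + (leaf_blocks * filter_factor)  # walk the index
    data_access = clustering_factor * filter_factor        # fetch the table rows
    return index_access + data_access

for name, ff, cf in [("pct1",  0.010, 8_000_000),
                     ("pct05", 0.005, 10_000_000),
                     ("pct02", 0.002, 10_000_000),
                     ("pct01", 0.001, 10_000_000)]:
    print(f"{name}: {total_cost(ff, cf):,.1f}")
```

This prints 80,208.0, 50,105.5, 20,044.0 and 10,023.5, matching the Total Cost column. The data access cost dominates in every row, which is why the Clustering Factor is such an important factor in index costing.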
<br />
Our
calculations confirm that the expected cost of an Index Scan should
drop below that of a Full Table Scan when less than 0.22% of the data in
the table is being selected.<br />
<br />
Does this bear out in practice?
Will Oracle switch from a Full Table Scan to an Index Scan when the
fraction of rows requested drops below 0.0022? Let's see. <br />
<br />
<b>Note:</b>
The buffer cache and shared pool were flushed from a SYSDBA session
before each query execution, so the buffer cache was empty in each case.
And we are only interested in the I/O statistics from the query
execution - other statistics such as Redo and SQL*Net bytes sent are not
relevant and have been removed.<br />
<br />
<b>Query 1:</b><br />
<pre class="source-code"><code>set autotrace on statistics
--
select sum (one) from scantest where pct2 = 1 ;
SUM(ONE)
----------
200000
Statistics
----------------------------------------------------------
80704 consistent gets
80541 physical reads
2 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
1 rows processed
</code></pre>
Execution Plan (from <code>dbms_xplan.display_cursor</code> for the <code>SQL_ID</code>):<br />
<pre class="source-code"><code>select sum (one) from scantest where pct2 = 1
Plan hash value: 1745049784
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 22028 (100)| |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
|* 2 | TABLE ACCESS FULL| SCANTEST | 200K| 1171K| 22028 (1)| 00:00:01 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("PCT2"=1)
</code></pre>
<b>Note:</b><br />
<ul>
<li>2% of 10 million rows is 200 thousand, which the Optimizer has correctly estimated.</li>
<li>Full
table scan cost is 22,028. This is very close to my own estimate of
about 22,000. The missing cost component would be for CPU work to
filter each row.</li>
<li>80,541 physical reads occurred, which is very close to the 80,951 blocks reported for the table.</li>
</ul>
<br />
<b>Query 2:</b><br />
Repeat this with a lower percentage column - pct05<br />
<pre class="source-code"><code>select sum (one) from scantest where pct05 = 1 ;
SUM(ONE)
----------
50000
1 row selected.
Statistics
----------------------------------------------------------
80704 consistent gets
80541 physical reads
2 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
1 rows processed
select sum (one) from scantest where pct05 = 1
Plan hash value: 1745049784
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 22040 (100)| |
| 1 | SORT AGGREGATE | | 1 | 7 | | |
|* 2 | TABLE ACCESS FULL| SCANTEST | 50000 | 341K| 22040 (1)| 00:00:01 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("PCT05"=1)
</code></pre>
<b>Note:</b><br />
<ul>
<li>Rows estimate is now 50,000, as expected</li>
<li>Full table scan cost is slightly different (22,040 versus 22,028), probably due to a small difference in the CPU cost component, but still about 22,000.</li>
<li>Same number of physical reads occurred - 80,541.</li>
</ul>
<br />
<b>Query 3:</b><br />
Continue down to the next lower percentage column - pct02<br />
<pre class="source-code"><code>select sum (one) from scantest where pct02 = 1 ;
SUM(ONE)
----------
20000
1 row selected.
Statistics
----------------------------------------------------------
20212 consistent gets
20060 physical reads
2 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
1 rows processed
select sum (one) from scantest where pct02 = 1
Plan hash value: 3458310886
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 20049 (100)| |
| 1 | SORT AGGREGATE | | 1 | 7 | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| SCANTEST | 20000 | 136K| 20049 (1)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | IX_SCAN_PCT02 | 20000 | | 44 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("PCT02"=1)
</code></pre>
<b>Note:</b><br />
<ul>
<li>Now 0.2% of data in this table, which is low enough that an Index Scan should be cheaper</li>
<li>Index cost is indeed lower than a Full Table Scan, as expected - 20,049 versus 22,040</li>
<li>Index cost is very close to our own calculation of 20,044 - again CPU cost is the difference</li>
<li>Only 20,060 physical reads now - Index branch blocks + Leaf blocks + Data rows</li>
</ul>
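As a quick sanity check, the same cost formula reproduces the figure the Optimizer reported here (the constants are assumed from the table earlier in the article):<br />

```python
# pct02 selects 0.2% of a 10 million row table; the index has a blevel
# of 3 and 20,500 leaf blocks (values from the article's cost table).
blevel, leaf_blocks, table_rows = 3, 20_500, 10_000_000
fraction = 0.002

io_cost = blevel + leaf_blocks * fraction + table_rows * fraction
print(io_cost)  # 20044.0 - the Optimizer reported 20,049, the small
                # difference being its additional CPU cost component
```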
<br />
<b>Query 4:</b><br />
Repeat this with a lower percentage column - pct01<br />
<pre class="source-code"><code>select sum (one) from scantest where pct01 = 1 ;
SUM(ONE)
----------
10000
1 row selected.
Statistics
----------------------------------------------------------
10192 consistent gets
10038 physical reads
2 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
1 rows processed
select sum (one) from scantest where pct01 = 1
Plan hash value: 1707962624
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 10025 (100)| |
| 1 | SORT AGGREGATE | | 1 | 7 | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| SCANTEST | 10000 | 70000 | 10025 (1)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | IX_SCAN_PCT01 | 10000 | | 23 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("PCT01"=1)
</code></pre>
<b>Note:</b><br />
<ul>
<li>Index cost is halved because row count is halved - 10,025 versus 10,024 calculated</li>
<li>Only 10,038 physical reads now - Index branch blocks + Leaf blocks + Data rows</li>
</ul>
<br />
<b>Query 5:</b><br />
If we force the use of an index for the query on the PCT05 column using a hint we get the following:<br />
<pre class="source-code"><code>select /*+ index (scantest (pct05)) */ sum (one) from scantest where pct05 = 1 ;
SUM(ONE)
----------
50000
Statistics
----------------------------------------------------------
50271 consistent gets
50133 physical reads
2 SQL*Net roundtrips to/from client
6 sorts (memory)
0 sorts (disk)
1 rows processed
select /*+ index (scantest (pct05)) */ sum (one) from scantest where pct05 = 1
Plan hash value: 3145193111
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 50116 (100)| |
| 1 | SORT AGGREGATE | | 1 | 7 | | |
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| SCANTEST | 50000 | 341K| 50116 (1)| 00:00:02 |
|* 3 | INDEX RANGE SCAN | IX_SCAN_PCT05 | 50000 | | 104 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("PCT05"=1)
</code></pre>
Note how the estimated cost is 50,116, which is far greater than
the 22,040 full table scan cost for this query, and agrees with my earlier calculation of 50,106 for this index.
Instead of the 80,000
physical disk reads done before, only 50,000 have been done. However,
the 80,000 disk reads were really multi-block disk reads i.e. nearer
10,000 real disk reads would have been done, each for 8 disk blocks at
once. The 50,000 disk reads for the index execution plan would be
single block disk reads. John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-8208484885086189712016-01-25T15:14:00.000+00:002016-11-25T15:00:45.888+00:00What's Going On? Oracle Activity Monitoring ViewsHow do you find out what is happening <em><strong>now</strong></em>
inside an Oracle database instance when you only have SQL level access
to the database instance, and no nice GUI management tool? I'll cover
how to start doing this with some SQL query examples.<br /><br />A user is
complaining that the application is running slowly, or a report someone
scheduled is taking too long to finish - where do you look to find out
what is going on inside the Oracle database right now? Well if you have
a nice GUI based tool such as Enterprise Manager then go there first,
because that will have summary screens of what is going on, and also
drill downs into the details behind what is going on. Such tools are
easier and quicker to use, and mean you don't have to remember the names
of internal Oracle dynamic performance views.<br /><br />But what if you
don't have a nice GUI based tool? Or you want to work out for yourself
what is really going on in Oracle? Or something has happened to the
nice GUI tool and you cannot use it for one reason or another? Then you
need to know about Dynamic Performance Views in Oracle, also known as
the <code>V$</code> tables, and which ones are the most relevant.
The Dynamic Performance Views are views defined internally by Oracle that
in fact map onto memory structures inside the SGA (System Global Area).
They enable you to see in real time various data within the SGA using
normal SQL queries.<br /><br />When investigating performance problems you
should generally narrow it down to a specific problem and the users
being affected by it. In Oracle each connected user has a corresponding
"session" inside the Oracle database instance, and information on each
session is made available in the <code>V$SESSION</code> dynamic performance view. <br /><br />Bear
in mind one characteristic that is both an advantage and a disadvantage of the Dynamic Performance Views -
they are a snapshot of current values in the SGA, and these values can
be changing all of the time. Look again and you will get different
results. Some views show cumulative values rather than current values,
but they are always increasing over time and also constantly changing.<br /><br />But with that in mind, you can use the <code>V$SESSION</code>
view to look at what specific sessions are doing at the moment, and
look for any slow, long running SQL statements. We can group the data
columns into the following related groups:<br />
<ul>
<li>Session Connection Information - User name, Client Application software, Logon time</li>
<li>Current Activity Information - Status, SQL statement executing</li>
<li>Delays - both normal Waits and abnormal Blocks i.e. small delays or completely blocked<ul>
<li>Waits are delays that will definitely finish e.g. read a block from disk</li>
<li>Blocks are delays that are outside the control of the session and can potentially go on forever <ul>
<li>e.g. blocked waiting for a row lock held by another session</li>
</ul>
</li>
</ul>
</li>
</ul>
For a full list of the columns in <code>V$SESSION</code>
look in the "Oracle Database Reference" Manual under "Dynamic
Performance Views", where you will get a brief description of each
column.<br />
<br />
<h2>
All Active Sessions</h2>
Want a list of all active sessions and whether they are waiting or not? Try this query from SQL*Plus:<br />
<pre class="source-code"><code>-- List the active sessions and break down key data
--
column sid format 999999
column db_user format a8 heading 'USER'
column db_login_time format a6 heading 'LOGON'
column program format a10 heading 'APP_N'
column active_for format 999 heading 'ACTV|FOR'
column state format a5 heading 'STATE'
column waiting_for format 999 heading 'WAIT|FOR'
column event format a12 heading 'WAIT ON'
column sql_id format a13 heading 'SQL ID'
--
select s.sid,
nvl (s.username, 'SYS') db_user,
to_char (s.logon_time, 'hh24:mi') db_login_time,
substr (s.program, 1, 10) program,
s.last_call_et active_for,
decode (s.state,
'WAITING', 'WAIT',
'WAITED KNOWN TIME', 'NW L',
'NW S') state,
decode (s.wait_time, 0, seconds_in_wait, 0) waiting_for,
substr (s.event, 1, 12) event,
nvl (s.sql_id, 'Not Executing') sql_id
FROM v$session s
WHERE s.type = 'USER'
AND s.status = 'ACTIVE'
AND s.wait_class != 'Idle'
/
--
</code></pre>
Note the following about this query:<br /><ul>
<li>It restricts the
sessions to only those from real users (not internal background
sessions) that are active and experiencing non-idle waits.<ul>
<li>Idle waits are generally waits external to Oracle, such as waiting for the next SQL statement to execute</li>
</ul>
</li>
<li>The number of columns of data output is kept low to fit within an 80 column wide terminal screen<ul>
<li>If you have a wider output format then you could add extra data columns to the "select"</li>
</ul>
</li>
<li>The "<code>decode</code>" of "<code>state</code>" is meant to encode the following combinations of possibilities:<ul>
<li>WAIT = WAITING, NW = NOT WAITING; PREV WAIT:- S = SHORT, L = LONG</li>
<li>i.e. it combines both whether the session is currently waiting or not, with whether the previous wait was a short or long wait</li>
</ul>
</li>
<li>It is assumed that users have all connected today, so only the time of connection is shown<ul>
<li>If connections are long lived, then change the date format used in the "<code>to_char</code>" to include the day</li>
</ul>
</li>
</ul>
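The nested <code>decode</code> logic for the state and wait columns can be expressed in ordinary code. This Python sketch is illustrative only (the real work is done inside the SQL query above):<br />

```python
def state_label(state, wait_time, seconds_in_wait):
    """Mirror the DECODEs in the query above.

    Returns (state label, seconds waiting so far), where the label is
    WAIT (currently waiting), NW L (not waiting, previous wait of known
    i.e. long duration) or NW S (not waiting, previous wait was short).
    """
    if state == 'WAITING':
        label = 'WAIT'
    elif state == 'WAITED KNOWN TIME':
        label = 'NW L'
    else:
        label = 'NW S'
    # WAIT_TIME = 0 means the session is still waiting, so the elapsed
    # wait time is only meaningful in that case
    waiting_for = seconds_in_wait if wait_time == 0 else 0
    return label, waiting_for

print(state_label('WAITING', 0, 12))           # ('WAIT', 12)
print(state_label('WAITED SHORT TIME', 5, 3))  # ('NW S', 0)
```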
From this you can see the following:<br /><ul>
<li>How many sessions are active, and how many are currently waiting on something?<ul>
<li>Sessions that are active and not waiting are therefore running on a CPU</li>
</ul>
</li>
<li>How many sessions are running the same SQL statement? Look at the values of <code>SQL_ID</code></li>
<li>How many sessions are experiencing the same kind of wait event?</li>
<li>Are sessions experiencing long or short waits most of the time?</li>
</ul>
<h2>
One Session's Details</h2>
If you know the <code>SID</code>
(Session ID) of one session you are interested in and want to drill
down further, then you could run the following set of queries from
within SQL*Plus:<br />
<pre class="source-code"><code>--
set newpage none
--
set define on
undefine SESSION_ID
--
accept SESSION_ID prompt 'Enter Session ID > '
--
column sid FORMAT 99999
column username FORMAT a12
column logt FORMAT a12 HEADING 'Logon Time'
column sql_start format a9 heading 'SQL Start'
column last_call_et heading 'Time at Status'
column module FORMAT a16 heading 'Module'
column client_info format a16 heading 'Client Info'
column command format 999 heading 'cmd'
column taddr format a8 heading 'TX Addr'
column server heading 'Server|Type'
column schemaname format a12 heading 'Schema|Name'
column type heading 'Session|Type'
column sql_id heading 'Curr SQL ID'
column prev_sql_id heading 'Prev SQL ID'
column lockwait format a8 heading 'Wait Lock|Addr'
column event format a25 heading 'Waiting For'
column wait_class format a12 heading 'Wait Class'
column row_wait_obj# heading 'Object|Waiting On'
column wait_time heading 'Last Wait Time|(0=Waiting)'
column seconds_in_wait heading 'Elapsed From|Last Wait'
column blocking_session heading 'Blocking|Session ID'
column blocking_session_status heading 'Blocking|Sess Status'
--
select sid,
username,
to_char (logon_time, 'dd/mm hh24:mi') logt,
status,
last_call_et,
to_char (sql_exec_start, 'HH24:MI:SS') sql_start
from v$session
where sid = &&SESSION_ID
/
--
select sid,
substr (module, 1, 24) module,
substr (client_info, 1, 30) client_info,
server, schemaname, type
from v$session
where sid = &&SESSION_ID
/
--
select sid,
sql_id, prev_sql_id,
event,
substr (wait_class, 1, 20) wait_class
from v$session
where sid = &&SESSION_ID
/
--
select sid,
lockwait,
row_wait_obj#,
wait_time, seconds_in_wait,
blocking_session, blocking_session_status
from v$session
where sid = &&SESSION_ID
/
--
column sql_text format a62
--
select sql_id, sql_text from v$sql
where sql_id = (select sql_id from v$session where sid = &&SESSION_ID)
/
--
undefine SESSION_ID
--
set newpage 1
</code></pre>
<br />This gives you a lot more information from <code>V$SESSION</code> for the one session, broken down into related sets of data, as described before. Note that the final query selects from <code>V$SQL</code>
which is probably the other main Dynamic Performance View, which has an
entry for each SQL statement currently in the library cache in the SGA.
Remember that a SQL statement in the library cache can be shared
between many sessions.<br />
<br />
<h2>
One Session Detailed Drill Down</h2>
If you
really want to drill down more into what a single session or a set of
sessions is doing over a period of time then you could use <a class="externalLink" href="http://blog.tanelpoder.com/files/scripts/snapper.sql" target="_blank" title="External link to http://blog.tanelpoder.com/files/scripts/snapper.sql">Tanel Poder's snapper.sql</a>. This snapshots the contents of <code>V$SESSION</code>
and other Dynamic Performance views over a period of time and reports
on what it captured. Mainly this will be the wait events experienced by
the sessions, but a lot of other activity statistics are reported too.<br /><br />Snapper
is written in PL/SQL and uses two in-memory PL/SQL tables for each
snapshot, and then lists out anything that has changed between them. It
does not write anything to the database, and only needs minimal
permissions - <code>SELECT_CATALOG_ROLE</code> and execute on <code>DBMS_LOCK.SLEEP</code>. Thus it is safe to run on any database.<br />
<br />
<h2>
Conclusion</h2>
I've
shown how to list all the currently active sessions on a database
including the SQL they are executing and what they are waiting on, and
how to get more details on what just one session is doing.<br /><br />If you
want even more information on what a single session is doing then you
can SQL Trace it, which creates a trace file on the database server of
all the SQL statements executed by that session and all the waits it
experienced while tracing was enabled. Nothing is missed out because
every executed SQL statement from that session is traced. You can then
use utilities such as "<code>tkprof</code>" to post process the trace file and report on the SQL statements executed, their elapsed times, and the waits they experienced.<br /><br />If
the slow down is not currently happening but was very recent then you
can look back at the recent past using ASH (Active Session History), but
note that ASH is an extra cost option - you must have the Diagnostic
Pack of the Enterprise Edition of Oracle. I'll write something about
using ASH in a future post.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-70637558033645878572015-12-29T15:15:00.000+00:002015-12-29T15:15:42.261+00:00GRUB, os-prober and Red Hat / Oracle LinuxI've been successfully using VirtualBox to have test environments to run
Oracle Linux and Oracle Database in for some time, but there are
limitations to what you can do. So I decided that I wanted to install
Oracle Linux onto another disk partition on my PC so I could dual boot
into it for some more advanced Oracle Database tests. Well the Oracle
Linux installation itself went ahead trouble free - I just had to
remember to disable GRUB bootloader installation as GRUB was already
installed from my other Linux (Arch Linux) - but then I ran into some
problems trying to get this newly installed Linux added properly to the
main GRUB boot menu I was using. This post describes why this occurred
(after much digging around on my part), and a very quick and simple
solution for it.<br /><br />Detecting other Linux installations on other
disk partitions to add to the GRUB boot menu is done on Arch Linux by
installing the "<code>os-prober</code>" package which adds some extra scripts used by "<code>grub-mkconfig</code>". The specific problem is that while "<code>os-prober</code>"
did detect the Oracle Linux kernels, they were being added to the GRUB
menu in the wrong order - it looked like an alphabetical ordering rather
than a numeric ordering by kernel version number. This meant that the
first Oracle Linux kernel listed in the GRUB menu was not the latest one
installed, and in fact more likely to be the oldest one or even a
rescue kernel.<br /><br />To cut a long story short the problem is due to a combination of the code in the "<code>/usr/lib/linux-boot-probes/mounted/40grub2</code>" detection script and the contents of the "<code>/boot/grub2/grub.cfg</code>" file in the Oracle Linux installation. The "<code>grub.cfg</code>" file in the Oracle Linux installation uses some keywords that are not detected by the "<code>40grub2</code>" script in Arch Linux, so the bootable Linux kernels are not listed in the same order as they are in the source "<code>grub.cfg</code>" file. Instead it is the "<code>90fallback</code>" script that detects the bootable Linux kernels when it is run afterwards by "<code>os-prober</code>". Actually it is run by "<code>linux-boot-prober</code>" and it does a direct listing of Linux kernel files in the "<code>/boot</code>"
directory of the other Linux disk partition, and adds each of these to
the local GRUB configuration file. And the result of this is that the
other Linux kernels are detected and added in alphabetical order.<br /><br />
<h2>
Details on the Problem</h2>
The "<code>40grub2</code>" script works by opening the "<code>/boot/grub2/grub.cfg</code>" file from another Linux installation and looking for the entries for bootable Linux kernels. The idea is that "<code>40grub2</code>" will find Linux kernels in the same order they are in the "<code>grub.cfg</code>" on the other Linux installation, and they will be added to the local "<code>grub.cfg</code>"
file in the same order. The benefit of this method is that the first
Linux kernel listed for this other installation in the main GRUB boot
menu will be the same one as listed by the other Linux installation
itself. Which in turn means that if it sorts the Linux kernels in any
way or puts a specific Linux kernel first as the default in its "<code>grub.cfg</code>" configuration file, then this is also reflected in the local GRUB configuration file of my main Linux installation.<br /><br />The "<code>40grub2</code>" script works by opening the "<code>/boot/grub2/grub.cfg</code>" file of the other Linux installation and then reads each line in turn looking for ones that begin "<code>menuentry</code>", "<code>linux</code>" or "<code>initrd</code>". I believe that these are "standard" keywords that GRUB should be using. Unfortunately Oracle Linux is using keywords of "<code>linux16</code>" and "<code>initrd16</code>" instead, which means that these lines are not matched at all by the "<code>40grub2</code>"
script and no bootable Linux kernels are matched at all. It seems that
Red Hat, on which Oracle Linux is based, uses these keywords,
apparently to load the kernel through its legacy 16-bit boot entry point. Either way, they are used and they
do not match what "<code>40grub2</code>" is looking for.<br /><br />Instead the bootable Linux kernels are detected by the "<code>90fallback</code>" script when it runs afterwards, and they are detected in alphabetical naming order as mentioned before.<br /><br /><h2>
Solutions</h2>
There is a quick, easy and good enough solution you can do yourself, and then there is a more official solution.<br /><br />First, you can just manually edit your local "<code>40grub2</code>" file and change two lines in it. Add a "<code>16</code>" variation to the lines in the "<code>case</code>" block that test for "<code>linux</code>" and "<code>initrd</code>". Here is the output from "<code>diff</code>" showing the before (&lt;) and after (&gt;) versions of the two lines I changed.<br />
<pre class="source-code"><code>
67c67
&lt; linux)
---
&gt; linux | linux16 )
80c80
&lt; initrd)
---
&gt; initrd | initrd16 )
</code></pre>
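The effect of this two-line change can be illustrated with a small Python sketch of the keyword matching (the real script is shell, and the configuration file contents below are made up for the example):<br />

```python
def find_kernels(grub_cfg_lines, accept_16=True):
    # The unpatched 40grub2 script only recognises "linux", so Red Hat
    # style "linux16" entries are silently skipped; the patched version
    # accepts both spellings of the keyword.
    keywords = {'linux', 'linux16'} if accept_16 else {'linux'}
    kernels = []
    for line in grub_cfg_lines:
        parts = line.strip().split()
        if parts and parts[0] in keywords:
            kernels.append(parts[1])  # the kernel image path
    return kernels

# Hypothetical fragment of an Oracle Linux grub.cfg:
cfg = [
    "menuentry 'Oracle Linux Server' {",
    "    linux16 /vmlinuz-3.8.13-118.el7uek ro root=/dev/sda3",
    "    initrd16 /initramfs-3.8.13-118.el7uek.img",
    "}",
]
print(find_kernels(cfg, accept_16=False))  # [] - nothing detected
print(find_kernels(cfg))                   # ['/vmlinuz-3.8.13-118.el7uek']
```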
Once edited run "<code>grub-mkconfig</code>" again to regenerate your "<code>grub.cfg</code>" file, and it should correctly pick up those entries from the other Linux installation now.<br /><br />Second,
it does not look like there is actually an official solution, which can
often be the case with open source software. I found some bug reports
about this problem but there was some finger pointing going on both ways
between the GRUB people and the Red Hat people. It looked like the
GRUB people felt that the official keywords were "<code>linux</code>" and "<code>initrd</code>", so it was a Red Hat problem to solve; while the Red Hat people felt that "<code>linux16</code>" and "<code>initrd16</code>" were valid in a GRUB configuration file and did work so it was a GRUB problem with the "<code>40grub2</code>" script. <br /><br />One person did raise the question on how the main Linux that is adding these entries to its local "<code>grub.cfg</code>" file should be treating these entries with the "<code>16</code>" suffix from the other Linux. Should it ignore them and just use the normal keywords in its own "<code>grub.cfg</code>" file, or should it use exactly the same keywords? The latter solution is a problem because the keywords found in the other "<code>grub.cfg</code>" file are NOT returned back to the "<code>os-prober</code>" script i.e. it is assumed they are only "<code>linux</code>" and "<code>initrd</code>". Making "<code>40grub2</code>" return these extra keywords as extra data fields would need a lot of changes in other places - both "<code>40grub2</code>" and "<code>os-prober</code>" at least, and possibly others too if there is a common format used for passing around information on bootable Linux kernels. <br /><br />So
you can see how something that looks simple can grow into something
much bigger, and could have significant changes to something as
important as GRUB. And GRUB is a very critical piece of software used
at system boot time, so no "obvious solution" should be rushed through
without a lot of extra thought and testing. Unfortunately I don't know
when we will get any kind of "official solution" to this.John Bradyhttp://www.blogger.com/profile/08719865814813032071noreply@blogger.com0tag:blogger.com,1999:blog-7302956329008495023.post-79269274500963692682015-12-28T15:09:00.000+00:002015-12-28T15:09:43.262+00:00Oracle In Memory and CPU Execution Efficiency<a class="externalLink" href="http://blog.tanelpoder.com/" target="_blank" title="External link to http://blog.tanelpoder.com/">Tanel Poder</a> has been doing a series of posts on "<a class="externalLink" href="http://blog.tanelpoder.com/2015/08/09/ram-is-the-new-disk-and-how-to-measure-its-performance-part-1/" target="_blank" title="External link to http://blog.tanelpoder.com/2015/08/09/ram-is-the-new-disk-and-how-to-measure-its-performance-part-1/">RAM is the new disk</a>" trying to show how the new "<a class="externalLink" href="http://www.oracle.com/us/corporate/features/database-in-memory-option/index.html" target="_blank" title="External link to http://www.oracle.com/us/corporate/features/database-in-memory-option/index.html">Oracle Database In-Memory</a>"
feature in Oracle 12c uses memory in a different and more efficient way
than the normal, traditional buffer cache does. He hasn't finished the
series of posts yet, but I started to draw some of my own conclusions
from the data he published in <a class="externalLink" href="http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles" target="_blank" title="External link to http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles">his last post</a> and I thought I'd publish my thoughts.<br /><br />I
know a reasonable amount about how CPU's work internally, so I assumed
that the main performance improvement of Oracle Database In-Memory would
be from a better memory access pattern that would significantly reduce
the number of execution stalls inside the CPU. Tanel is one of the good
Oracle bloggers, and he has been sharing the details of the tests he
has been doing in this series of blog posts as well as the results from
them so I can see if my hypothesis on the new Database In-Memory holds up.<br />
<br />
<h2>
Tanel's Tests</h2>
The main results are in <a class="externalLink" href="http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles" target="_blank" title="External link to http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles">this post</a>
where he gives the 6 SQL queries he ran and their results in terms of
elapsed time and CPU execution cycles - some against the traditional
buffer cache and some against the new In-Memory cache. No disk accesses were involved as all the data was already in memory in the Oracle SGA.<br /><br />I'm
only interested in the two main SQL queries, which are identical except
that the first gets its data from the traditional data
block based buffer cache in memory while the second gets its data from
the new In Memory column storage based cache i.e. both execute the same "<code>SELECT</code>" against the same table and all of the data needed is already in memory in the SGA.<br />
<pre class="source-code"><code>3. FULL TABLE SCAN BUFFER CACHE (NO INMEMORY)
SELECT /*+ MONITOR FULL(c) NO_INMEMORY */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_id > 0;
4. FULL TABLE SCAN IN MEMORY WITH WHERE cust_id > 0
SELECT /*+ MONITOR FULL(c) INMEMORY */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_id > 0;
</code></pre>
<br />
Tanel has also made all the low level CPU execution measurements available in a <a class="externalLink" href="https://docs.google.com/spreadsheets/d/1ss0rBG8mePAVYP4hlpvjqAAlHnZqmuVmSFbHMLDsjaU/edit?usp=sharing" target="_blank" title="External link to https://docs.google.com/spreadsheets/d/1ss0rBG8mePAVYP4hlpvjqAAlHnZqmuVmSFbHMLDsjaU/edit?usp=sharing">Google online spreadsheet</a>. Here is a cut and paste of the CPU measurements for just the 2 queries I am interested in:<br />
<br />
<table class="twtable"><tbody>
<tr class="evenRow"><th>Metric</th><th>TABLE BUFCACHE</th><th>TABLE INMEMORY PRED</th></tr>
<tr class="oddRow"><td>task-clock-ms</td><td align="right">27374</td><td align="right">1578</td></tr>
<tr class="evenRow"><td>cycles</td><td align="right">86428653040</td><td align="right">4573793724</td></tr>
<tr class="oddRow"><td>instructions</td><td align="right">32115412877</td><td align="right">7080326242</td></tr>
<tr class="evenRow"><td>branches</td><td align="right">7386220210</td><td align="right">940579984</td></tr>
<tr class="oddRow"><td>branch-misses</td><td align="right">22056397</td><td align="right">4637243</td></tr>
<tr class="evenRow"><td>stalled-cycles-frontend</td><td align="right">76697049420</td><td align="right">2251325295</td></tr>
<tr class="oddRow"><td>stalled-cycles-backend</td><td align="right">58627393395</td><td align="right">1328333827</td></tr>
<tr class="evenRow"><td>cache-references</td><td align="right">256440384</td><td align="right">11507915</td></tr>
<tr class="oddRow"><td>cache-misses</td><td align="right">222036981</td><td align="right">7316366</td></tr>
<tr class="evenRow"><td>LLC-loads</td><td align="right">234361189</td><td align="right">9712269</td></tr>
<tr class="oddRow"><td>LLC-load-misses</td><td align="right">218570294</td><td align="right">7272805</td></tr>
<tr class="evenRow"><td>LLC-stores</td><td align="right">18493582</td><td align="right">1697666</td></tr>
<tr class="oddRow"><td>LLC-store-misses</td><td align="right">3233231</td><td align="right">27797</td></tr>
<tr class="evenRow"><td>L1-dcache-loads</td><td align="right">7324946042</td><td align="right">1069917316</td></tr>
<tr class="oddRow"><td>L1-dcache-load-misses</td><td align="right">305276341</td><td align="right">85368159</td></tr>
<tr class="evenRow"><td>L1-dcache-prefetches</td><td align="right">36890302</td><td align="right">25169253</td></tr>
</tbody></table>
<br />
The most obvious observation is that the In-Memory query execution is 17.34 times faster than the traditional buffer cache query execution (elapsed <code>task-clock-ms</code> time ratio of 27374 / 1578). I'm most interested in trying to explain what has caused this difference in elapsed time.<br />
<br />
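Two derived figures make the difference concrete: instructions per cycle (IPC) and the fraction of cycles stalled. This short Python calculation uses the numbers copied from the table above:<br />

```python
# Measurements copied from the table above (Tanel Poder's results).
runs = {
    "TABLE BUFCACHE":      dict(cycles=86_428_653_040,
                                instructions=32_115_412_877,
                                stalled_frontend=76_697_049_420),
    "TABLE INMEMORY PRED": dict(cycles=4_573_793_724,
                                instructions=7_080_326_242,
                                stalled_frontend=2_251_325_295),
}

for name, r in runs.items():
    ipc = r["instructions"] / r["cycles"]          # work done per clock cycle
    stalled = r["stalled_frontend"] / r["cycles"]  # fraction of cycles stalled
    print(f"{name}: {ipc:.2f} IPC, {stalled:.0%} of cycles stalled (front end)")
# Buffer cache run: ~0.37 IPC with ~89% of cycles stalled;
# In-Memory run:    ~1.55 IPC with ~49% of cycles stalled.
```

So the In-Memory scan not only executes about 4.5 times fewer instructions, it also completes roughly four times as many instructions per clock cycle, and together these account for the large elapsed time difference.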
<h2>
My Hypothesis & CPU / Memory Access</h2>
At
the beginning of this post I stated that I assumed that the main
performance improvement would be a result of a reduction in the number
of execution stalls inside the CPU. Let me try and explain what I mean
by that.<br /><br />In the very old days of Intel CPUs (from the original 8086 up to the first '486, I think) the internal clock of the CPU (the rate at which it executed instructions) was the same as the external clock of the system, which also governed memory access times. A read of a memory location by the CPU would take something like 3 or 4 external clock cycles (put the address on the system bus, wait for the memory to look up that address and produce its output, then put the data on the bus back to the CPU). During these clock cycles the CPU cannot proceed with the execution of that instruction i.e. it stalls until the data it needs from memory has been read into the CPU.<br /><br />During the '486 lifetime Intel introduced
technology that allowed the internal clock of the CPU to run faster
than the external clock i.e. at a multiple of the external system clock.
Now we typically have external clocks of 100 MHz and internal CPU clocks of 3 GHz or more i.e. the CPU is running internally at 30 or more times the external system clock rate.<br /><br />A
side effect of this change is that now when a CPU "stalls" to read
needed data from memory, the stall is for much longer in terms of CPU
instruction cycles. As the external clock is now 30 or more times
slower than the internal clock of the CPU, the CPU may end up waiting
doing nothing for over 100 execution cycles. NB. This is a
simplification of what really happens, but it does describe a real
effect.<br /><br />We can see this effect in the separately reported
statistics for "CPU cycles" and "CPU instructions completed" - the first
one will be much higher than the second one. The "efficiency" of how a
CPU executed a given piece of program code can be seen from the
calculated "instructions per cycle" value i.e. "CPU instructions
completed" divided by "CPU cycles". The closer this value is to 1 the
more efficient the execution has been within the CPU. The lower this
value is then the more "stalls" occurred within the CPU during execution
that stopped it completing execution of an instruction every clock
cycle.<br /><br />Of course CPU designers have tried to work around or
reduce the effect of such stalls in various ways, the details of which
are not relevant here. It just means that a) there is various
technology in the CPU to try and minimise the occurrence and impact of
such "memory load stalls", and b) the net impact of such "memory load
stalls" is much less than the 100 or so wasted execution cycles I've
described. In fact, and this is also important, modern CPUs can
actually issue more than one instruction at the same time for execution
within the same clock cycle e.g. an integer operation and a floating
point operation. This can result in an "instructions completed per
clock cycle" value larger than 1, which can occur when executing well
optimised program code.<br /><br />If my hypothesis is correct then we
should see a difference in the number of "stalls" in the CPU execution
statistics between the 2 SQL query executions, and the ratio between
these 2 sets of "stall" statistics should be close to the observed ratio
in execution times.<br />
<br />
<h2>
My Initial Conclusions</h2>
To cut to the
chase a bit, the observed difference in execution times between the 2
SQL queries is not wholly explained by just a reduction in the number of
"stalls" inside the CPU, but the reduction in "stalls" <strong>is</strong>
a significant contributor to this reduction in elapsed time. The
reduction in elapsed time is roughly a 50:50 split between a reduction
in the total number of instructions executed (code efficiency) and CPU
memory access stalls (memory access efficiency).<br /><br />We can see from
the measurements that Tanel has reported that the number of instructions
executed by the CPU decreased by a factor of 4.54 (32115412877 /
7080326242) to process the same amount of data (rows in the table).
This is a significant improvement in code efficiency to process this
volume of data - probably a result of the In-Memory cache using far simpler internal data structures than the traditional buffer cache, which can in turn be processed much more easily.<br /><br />This
leaves an improvement factor of 3.82 (17.34 / 4.54) that needs to be
explained by something else. I propose that this other factor is due to
a reduction in the number of "stalls" within the CPU when executing
these instructions as described earlier. This will have the net effect
of increasing the "instructions per cycle" value during the execution of
the SQL query, and indeed we can see this from the measurements
reported by Tanel:<br /><ul>
<li>Buffer cache SQL query execution has CPU instructions per cycle of 32115412877 / 86428653040 = 0.37</li>
<li>In-Memory cache SQL query execution has CPU instructions per cycle of 7080326242 / 4573793724 = 1.55</li>
<li>This gives an improvement ratio of 1.55 / 0.37 = 4.17 (4.17 comes from dividing the unrounded values; the rounded figures shown give 4.19)</li>
</ul>
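These instructions-per-cycle figures fall straight out of the raw counters in the table at the top of this post; a quick sketch of the arithmetic in Python (the counter values are Tanel's, as reported above):

```python
# Instructions per cycle (IPC) from the raw perf counters reported above.
counters = {
    "buffer_cache": {"instructions": 32115412877, "cycles": 86428653040},
    "in_memory":    {"instructions": 7080326242,  "cycles": 4573793724},
}

# IPC = instructions completed / CPU cycles, for each SQL query execution
ipc = {name: c["instructions"] / c["cycles"] for name, c in counters.items()}
improvement = ipc["in_memory"] / ipc["buffer_cache"]

print(round(ipc["buffer_cache"], 2))  # 0.37
print(round(ipc["in_memory"], 2))     # 1.55
print(round(improvement, 2))          # 4.17
```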
And this 4.17 value is reasonably close to the 3.82 value for the other performance improvement factor.<br /><br />Trying
to drill down further into this difference in "instructions per cycle"
and identifying "CPU memory stalls" due to data access is difficult if
not impossible for a number of reasons:<br /><ul>
<li>Above the L1 cache on
the CPU the other CPU caches are shared by both instructions and data.
We cannot tell which cache reads and which cache misses were due to
fetching an instruction to execute or fetching a piece of data needed by
an instruction being executed. Both lead to CPU stalls but for
different reasons.</li>
<li>The In-Memory cache SQL query is executing a
completely different code path within Oracle with almost 78% fewer instructions being executed in total (100 / 4.54 i.e. only about 22% as many instructions). So we cannot assume
that the instruction execution profile and also instruction cache misses
are similar in both cases.</li>
</ul>
We can however see other evidence
that this other factor for the improvement in performance is due to a
reduction in CPU stalls. Tanel has reported two other CPU execution
statistics - "<code>stalled-cycles-frontend</code>" and "<code>stalled-cycles-backend</code>". Tanel gives descriptions for these in his <a class="externalLink" href="http://blog.tanelpoder.com/2015/09/21/ram-is-the-new-disk-and-how-to-measure-its-performance-part-2-tools/" target="_blank" title="External link to http://blog.tanelpoder.com/2015/09/21/ram-is-the-new-disk-and-how-to-measure-its-performance-part-2-tools/">part 2 post</a>, which I will summarise as follows:<br /><ul>
<li>Stalled
cycles front end occurs before an instruction gets executed by the CPU
and is generally due to fetching an instruction from memory</li>
<li>Stalled
cycles back end occurs during instruction execution and is generally
due to fetching data from memory needed by that instruction</li>
</ul>
So
the most relevant statistic for stalls due to reading data from memory
is "stalled-cycles-backend". We cannot compare these directly between
the two SQL query executions because they executed a different number of
instructions (remember the 4.54 reduction factor). What we can do is
normalise these to a "stalled cycle per cycle" value and then compare
them:<br /><ul>
<li>Buffer cache SQL query execution has backend stalls per cycle of 58627393395 / 86428653040 = 0.68</li>
<li>In-Memory cache SQL query execution has backend stalls per cycle of 1328333827 / 4573793724 = 0.29</li>
<li>This gives an improvement ratio of 0.68 / 0.29 = 2.34</li>
</ul>
Again,
this is a significant improvement and shows that the number of stalls
due to memory accesses by the CPU for data has been reduced.<br />
<br />
<h2>
Why Have Stalls Reduced?</h2>
I hope that I have shown that part of the improvement in the performance of the In-Memory
cache has been the reduction in the number of "memory access stalls" in
the CPU when executing the SQL query i.e. less time wasted waiting to
read data in from main memory with a corresponding increase in the
number of instructions executed per clock cycle and less total elapsed time to execute those instructions. But how has Oracle actually managed to
achieve this? How has it managed to reduce the number of "memory access
stalls" while executing the same SQL query i.e. if it needs to read the
same "set of data" to perform the same SQL query, how has it managed to
read it more efficiently?<br /><br />I am guessing that this is due to two specific factors:<br /><ol>
<li>Data layout in memory - the layout is more "compact" which decreases the total amount of memory that must be read</li>
<li>Sequential
access to that memory, causing the CPU to "prefetch" the next memory
line into the CPU cache, so it is already there when it is needed i.e.
overlapping the processing of one memory line with the fetching of the
next memory line</li>
</ol>
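The difference between the two layouts can be sketched in a few lines of Python (purely an illustration of the data organisation, with made-up data; it obviously says nothing about actual CPU cache line behaviour):

```python
# Row store: each row is a tuple holding every column's value together,
# loosely analogous to rows in a buffer cache data block.
row_store = [(i, "name%d" % i, i * 10) for i in range(1000)]

# Column store: all the values for one column stored contiguously,
# loosely analogous to the In-Memory column store.
col_store = {
    "id":     [row[0] for row in row_store],
    "name":   [row[1] for row in row_store],
    "amount": [row[2] for row in row_store],
}

# Summing one column from the row store has to visit every row tuple,
# skipping over the other columns' data each time...
row_sum = sum(row[2] for row in row_store)

# ...while the column store scans one contiguous list from start to finish.
col_sum = sum(col_store["amount"])

assert row_sum == col_sum  # same answer, very different memory access pattern
```

Scanning <code>col_store["amount"]</code> walks one contiguous list from start to finish, which is exactly the kind of sequential access pattern that lets a CPU prefetcher do its job.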
Note that this is based on what I have read about In-Memory, as I've not yet had a chance to do any real testing of it. Again,
thanks to Tanel for both doing some real testing and for sharing all the
results he got from those tests.<br /><br />In the traditional buffer cache
a data block will span multiple memory lines (assuming a 64 byte memory
cache line, as Tanel states) and this will contain data for all the
columns in a row. When you only need the data from one column it means
that you are skipping about in memory quite a lot, jumping from row to
row, and not using most of the data in each block or memory line. In
the new In-Memory cache the data is stored by column, so all the data
for one column is stored together, next to each other in memory. This
immediately reduces the total amount of memory that must be read to
access the data for one column across all the rows in the table.<br /><br />It
will also increase the cache efficiency within the CPU, because one
memory line in the cache will contain data for multiple rows i.e. the
CPU can process the data for that column from multiple rows before it
needs to read in another memory line into cache to get the next set of
rows' data.<br /><br />If the data for a column is physically laid out
sequentially and contiguously in the system's memory, then it means that
Oracle will be doing a series of sequential memory reads to process all
the data for one column. Modern CPUs can detect this kind of
sequential memory access and have an optimisation to "prefetch" the next
data line from memory while the previous data line is still being
processed. The net result is that stalls are either eliminated because
the next line of data is already in the CPU cache or the impact of
stalls is significantly reduced because the data is already on its way
from memory to the CPU cache.<br /><br />We can see evidence for this occurring in the "<code>L1-dcache-prefetches</code>"
CPU statistic. The "L1 data cache" is the first on chip cache
immediately used by the CPU core, and is small but very fast. All data
needed by the CPU goes through this cache. Again, normalising to a
"prefetch per instruction executed" we can see that the number of data
prefetches by the CPU increased significantly during the In-Memory cache SQL query execution:<br /><ul>
<li>Buffer cache SQL query execution has L1 dcache prefetch per instruction executed of 36890302 / 32115412877 = 0.00115</li>
<li>In-Memory cache SQL query execution has L1 dcache prefetch per instruction executed of 25169253 / 7080326242 = 0.00355</li>
<li>This gives an improvement ratio of 0.00355 / 0.00115 = 3.09</li>
</ul>
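Both of these normalisations (backend stalls per cycle earlier, and prefetches per instruction here) are simple ratios of the raw counters; a sketch in Python using Tanel's reported values:

```python
# Normalised stall and prefetch rates from the raw perf counters above.
buffer_cache = {"cycles": 86428653040, "instructions": 32115412877,
                "stalled_backend": 58627393395, "l1d_prefetches": 36890302}
in_memory    = {"cycles": 4573793724, "instructions": 7080326242,
                "stalled_backend": 1328333827, "l1d_prefetches": 25169253}

def stalls_per_cycle(c):
    # Backend stalls normalised per CPU cycle - lower is better
    return c["stalled_backend"] / c["cycles"]

def prefetches_per_instruction(c):
    # L1 dcache prefetches normalised per instruction executed - higher
    # suggests the sequential-access prefetch optimisation is kicking in
    return c["l1d_prefetches"] / c["instructions"]

print(round(stalls_per_cycle(buffer_cache), 2))            # 0.68
print(round(stalls_per_cycle(in_memory), 2))               # 0.29
print(round(prefetches_per_instruction(buffer_cache), 5))  # 0.00115
print(round(prefetches_per_instruction(in_memory), 5))     # 0.00355
```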
<br />
<h2>
The End</h2>
Well, that's been a longer post than I intended, but I wanted to get everything together in one place, as I'm sure Tanel will finish his series of posts soon and probably make similar observations. I was curious to see whether he would come to exactly the same conclusions or slightly different ones, and writing this post first was the only way to find out. Thanks again to
Tanel for doing the tests and sharing all the data with us.<br /><br /><h2>Python for the DBA (2) - Data Unloading (2015-12-14)</h2>I've mentioned before that I like the <a class="externalLink" href="http://www.python.org/" target="_blank" title="External link to http://www.python.org">Python</a>
programming language. Not only is it a "clean" language for writing
programs in with a rich set of data types and structures for
manipulating data, it also has a standard API for database access making
it "database neutral". This makes it relatively easy to get data that
is in a database into and out of your Python code where you can use the
power of Python for whatever kind of data manipulation you need.<br /><br />To
show how easy it is to get data into a Python program I'll show a
little utility to unload data from one table in a database to a flat,
text file as one record per line with each data field separated by
special delimiters.<br /><br />Assuming the following:<br />
<ul>
<li>Your Python program already has a database connection<ul>
<li>This is relatively easy, but does vary by specific database driver being used</li>
</ul>
</li>
<li>And it has opened the output file to unload to i.e. an open file handle and not the file name</li>
<li>And it has the name of the table, the field delimiter characters and the record terminator characters in local variables</li>
</ul>
Then you can write a simple function to call that will unload the data from that table to the specified file:<br />
<pre class="source-code"><code>def unload_table_to_file (db_conn, table_name, out_data_file, field_separator, record_terminator) :
    # Preparation steps before doing the real work
    # Replace literals of special characters in the separator and terminator
    record_terminator = record_terminator.replace ('\\n', '\n')
    field_separator = field_separator.replace ('\\t', '\t')
    # Simplest SQL to get the data, assume column ordering is correct
    select_SQL = "select * from " + table_name
    # Now do the work - read in all records and output them
    print ("=== Unloading . . .")
    select_cursor = db_conn.cursor ()
    select_cursor.execute (select_SQL)
    for data_row in select_cursor.fetchall () :
        # Convert Python native data types to simple string representations
        output_fields = []
        for data_field in data_row :
            output_fields.append (str(data_field))
        # Join the fields with the separator and terminate the record
        out_data_file.write (field_separator.join (output_fields) + record_terminator)
    select_cursor.close ()
    print ("=== Finished")
</code></pre>
The main lines are:<br /><ul>
<li>First make sure that if any "escaped characters" are in the delimiters then they are the real escape characters<ul>
<li>"<code>\t</code>" means the TAB character, and "<code>\n</code>" means the NEWLINE character</li>
</ul>
</li>
<li>We just use a "<code>select * from table</code>" to get all the data into the Python code<ul>
<li>The String concatenate operator is just the "<code>+</code>" operator i.e. "<code>+</code>" is overloaded depending on data type</li>
</ul>
</li>
<li>And we execute this "select" via a Cursor created from the database connection passed in</li>
<li>The real work happens in 2 nested loops - for each data row, for each data field in each data row<ul>
<li>The fields from one record are put into a list (essentially equivalent to an array)</li>
<li>This
is so we can use a single action at the end to concatenate them all
together with each data field value separated by the "field separator"</li>
<li>Then we "write" out this concatenated list of fields to the output file</li>
</ul>
</li>
<li>The "<code>str</code>" function will return the String representation of a data value e.g. convert an integer to its String representation<ul>
<li>This assumes that there exists a data type specific "<code>str</code>" method for each possible data type</li>
</ul>
</li>
<li>The "<code>join</code>"
method of a String variable iterates over a list of data, and returns
one String of the list contents concatenated together with the value of
the String variable between each one.</li>
</ul>
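The two string operations doing most of the work here can be seen in isolation (the sample values are made up for illustration):

```python
# A field separator argument often arrives as the two-character literal "\t"
# (backslash then 't'); replace turns it into a real TAB character.
field_separator = "\\t".replace('\\t', '\t')
assert field_separator == "\t"

# join concatenates a list of strings with the separator between each pair.
fields = ["42", "Smith", "2015-12-14"]
record = "|__|".join(fields)
print(record)  # 42|__|Smith|__|2015-12-14
```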
I think this shows some
of the power and elegance of Python that reading all the data from a
table can be done with so few lines of code, and that the code itself is
so readable. And this code is flexible enough to deal with different
values for the field separator characters and the record terminator
characters. You could just do a classic comma separated list ('<code>,</code>' as the field separator) or something more complicated to avoid issues where a comma may appear in the data itself e.g. "<code>|__|</code>"
as the field separator (vertical bar, underscore, underscore, vertical
bar). And this flexibility is all handled by this Python code.<br />
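Because the function only relies on the standard DB-API (a connection's <code>cursor()</code>, <code>execute()</code> and <code>fetchall()</code>), the same pattern works unchanged against other databases. Here is a self-contained sketch of the whole flow using the standard library's <code>sqlite3</code> driver and an in-memory <code>StringIO</code> as the output "file" (the table name and data are made up):

```python
import io
import sqlite3

# Build a small throw-away table. sqlite3 is used here purely because it
# ships with Python; any DB-API compliant driver presents the same interface.
db_conn = sqlite3.connect(":memory:")
db_conn.execute("create table emp (emp_id integer, emp_name text, salary integer)")
db_conn.executemany("insert into emp values (?, ?, ?)",
                    [(1, "Smith", 1000), (2, "Jones", 2000)])

# The same unload pattern as the function above: select all, convert each
# field to a string, join with the field separator, terminate the record.
out_data_file = io.StringIO()
select_cursor = db_conn.cursor()
select_cursor.execute("select * from emp")
for data_row in select_cursor.fetchall():
    out_data_file.write("|__|".join(str(field) for field in data_row) + "\n")
select_cursor.close()

print(out_data_file.getvalue())
# 1|__|Smith|__|1000
# 2|__|Jones|__|2000
```

Exactly the same loop would run unchanged against an Oracle connection obtained from a driver such as cx_Oracle.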
<br />
<h2>
Caveats</h2>
This
is probably the simplest Python code to achieve the goal of reading
all the data in a table and writing out to a flat, text file. There are
several things it does not do at all, and other things that could be
enhanced with it.<br /><ul>
<li>It is assumed that all data types being used can be cleanly converted to valid String representations using "<code>str()</code>"<ul>
<li>This may not be true for some data types, or you may wish to handle some data types differently e.g. Dates</li>
</ul>
</li>
<li>Dates should probably be handled differently, to ensure the full date and time values are output<ul>
<li>You could execute an "<code>ALTER SESSION</code>" SQL statement to set the value of <code>NLS_DATE_FORMAT</code> to something suitable</li>
<li>Python has a native Date / Time data type, "<code>datetime</code>", which Oracle <code>DATE</code> columns are mapped to, and you could test for this</li>
</ul>
</li>
<li>You could add extra functionality such as debugging / tracing messages (use the "<code>logging</code>" module) or other progress messages<ul>
<li>You may wish to add extra counters to count the number of records and fields per record</li>
</ul>
</li>
</ul>
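Picking up the date-handling caveat, one option is a small type-aware conversion function used in place of the bare <code>str()</code> call; a sketch (the output format here is just an example choice):

```python
import datetime

def field_to_str(data_field):
    # Type-aware replacement for the bare str() call (sketch only).
    if data_field is None:
        return ""  # or "NULL", depending on what the load tool expects
    if isinstance(data_field, datetime.datetime):
        # Fixed format including the full time component
        return data_field.strftime("%Y-%m-%d %H:%M:%S")
    return str(data_field)

print(field_to_str(datetime.datetime(2015, 12, 14, 11, 6, 0)))  # 2015-12-14 11:06:00
print(field_to_str(None))  # (empty string)
print(field_to_str(42))    # 42
```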