Saturday, 19 January 2019

Time SQL Execution with Python

I've said before in this blog how I find Python to be very useful for doing various things, including processing data to or from an Oracle database. Here is another example where a relatively simple and straightforward piece of Python code can deliver something that is very useful - in this case measuring the elapsed time of SQL queries executed on an Oracle database. We often need to execute a given SQL query and see how long it takes to complete as part of a tuning or monitoring exercise, but running it via SQL*Plus or something else can tie up a session for an extended period of time and produce large amounts of output that are spooled to the screen. Conversely we don't want to run the SQL query and stop when only the first data row is returned, as that is not a "full" execution of the query. We can use Python to achieve this - execute a query to completion, consuming all the data it produces but not outputting any of it, and measure the elapsed time. And we can execute this outside of any SQL*Plus or SQL Developer or other window we might be using.

The key requirements for this utility are:
  • Connect to an Oracle database as a specified user
  • Execute the given SQL query, unmodified, so the execution plan used should match that when the same SQL is submitted by the real application
  • Consume all the rows of data produced, for a true time to complete measurement
This covers most scenarios but does leave a few out - no bind variables can be used, for instance.

There are three parts to achieving this really
  • Connect to the Oracle database using the supplied credentials
  • Read in the SQL query text from a file
  • Execute against an Oracle database connection and consume the results noting the start and end time of this
And each of these parts involves relatively few lines of Python code to achieve what it does. And part of the reason for this is Python's dynamic data typing of variables - you don't have to declare the data types of the data the SQL query produces in advance of retrieving it. Which means that it can handle almost any query producing any number of values per output row.

The code I will now present is not "perfect" in an absolute sense - there is very limited error handling for instance - but it is complete and does work. I've used it many times to execute a SQL query pulled out from an application that is performing poorly, to verify its execution and the impact of any changes made to try and improve performance.

Python Code

Connecting to an Oracle database
I use cx_Oracle as the database access module for Oracle, and connection is relatively simple. This assumes that the database credentials are in variables named "user_name", "user_password" and "database_name".
import cx_Oracle

# Other code here, such as parsing command line arguments to get values into necessary variables

db_conn = cx_Oracle.connect (user_name, user_password, database_name)

Reading in the SQL query from a file
Assume that the name of the file containing the query is in the string variable "SQL_File_Name".
sql_query = open (SQL_File_Name, "r").read ().rstrip (' ;\n') # Strip any semi-colon off from the end
This does several things in one combined statement:
  • Opens the file for reading
  • Reads in the content of the file returning it as a single string
  • Strips off certain trailing characters at the end of the string - space, semi-colon or new line character.
We now have in the "sql_query" variable the text string of the SQL query with no trailing semi-colon (if there was one). This is now ready for use to execute against an Oracle database.

Executing the query and timing it
For now assume that a database connection has been made and is referenced by the object named "db_conn".

To execute the SQL query we first need to get a cursor from the database connection, then execute the SQL using it, retrieve all the rows of data, and report the elapsed time:
import time

try :
    curs = db_conn.cursor ()  
    start_time = time.time ()  
    curs.execute (sql_query) 
    # Now retrieve the rows of data produced by the query
    row_count = 0  
    rows = curs.fetchmany (1000)  
    # Empty list will be False so can test it directly  
    while rows :  
        row_count += len (rows)  
        rows = curs.fetchmany (1000)  
    end_time = time.time ()  
    print ("Query Complete - Elapsed {0:4.1f}s for {1} rows".format ((end_time - start_time), row_count))  
except Exception as ex :
    logging.critical ("ERROR - Exception raised during processing")
    logging.critical ("Exception - %s" % (ex))
I've included exception handling in case there is something wrong with the SQL query or its execution, but there is not a lot we can do about it so it just prints out some information about the exception it caught.

More details
I actually allow all the necessary values to be passed in from the command line to this utility program, and I use the Python "argparse" module to process these and map them to corresponding variables. So you can either assign the results from "argparse" to the discrete variables I have used in the code examples, or instead replace those references by the results produced by "argparse". The results will be the same either way.

The following code can be placed before the code given so far in the final Python program:
import argparse
import sys

# Parse command line arguments
parser = argparse.ArgumentParser (description="Time execution of a SQL query")
# Add command line argument definitions one by one into the parser
parser.add_argument ("-u", "--user",      help="User name", required=False, default=None)
parser.add_argument ("-p", "--password",  help="Password", required=False, default=None)
parser.add_argument ("-d", "--database",  help="Database name", required=False, default=None)
parser.add_argument ("SQL_File", help="Name of SQL File to execute", default=None)
args = parser.parse_args ()
# First check whether we did get a user name, password etc on the command line
# If not, prompt the user for what we are missing
if args.database is None :
    print ("Enter database", end=": ", flush=True)
    database_name = sys.stdin.readline ().rstrip ()
else :
    database_name = args.database
if args.user is None :
    print ("Enter username", end=": ", flush=True)
    user_name = sys.stdin.readline ().rstrip ()
else :
    user_name = args.user
if args.password is None :
    print ("Enter password", end=": ", flush=True)
    user_password = sys.stdin.readline ().rstrip ()
else :
    user_password = args.password
SQL_File_Name = args.SQL_File

# Get Database Connection ...

# Execute SQL and consume results ...

Summary

This relatively short and straightforward Python program gives me a utility to time the execution of any SQL query against an Oracle database, including consuming all of the data rows the query produces. For me this has many advantages over other ways of executing SQL queries when investigating performance problems, and is something I use whenever needed. One advantage is that because it is executing the original SQL query with no modifications, it will be using the same execution plan as when that SQL query is executed from an application. Which in turn means that if I do modify the SQL in some way and manage to "improve" its performance, the new execution plan should be the same one used when the same improved SQL is executed from the application code.

2 comments:

DR said...

Are you available to discuss the idea of writing some Python code to parse data from Trello with a CRUD solution? I think it would simplify a lot of the challenges in this space and actually allow society to move on into really being productive.

DR said...

I have found a way to do this with a simple Pandas json function to read the Trello file with a formatted URL.json in my code.