Monday, 25 June 2018

Interactive Complex SQL Scripts - A Solution

Often you will find that you need to run a relatively simple SQL query and see the results from it. Normally your choice of how to run this SQL query in Oracle is a straight choice between either using a plain SQL script executed through SQL*Plus, or using PL/SQL (probably in an anonymous block). But often things are not quite that simple or straightforward for you, and maybe there is an extra little bit of complication in how that SQL needs to be executed. Maybe you need to execute it repeatedly with a series of different values to use in a filter. Maybe you need to ask the user to enter a value, or choose a filter to use. Maybe you want to loop through the results of one query and run a second query for each of these. Maybe you need some simple conditional logic on the results and use that to choose which second query to run. Maybe you need to provide progress feedback to the user as a series of queries are executed, or as the results are retrieved of each query.

Unfortunately both SQL*Plus and PL/SQL each have drawbacks that stop them from being the "right" solution for all possible SQL query scenarios. I believe that I have identified a better solution for such cases where the query is straightforward enough and the conditional logic is also straightforward - use Python to execute the necessary SQL queries instead. It will do what you want, it can do much more than either SQL*Plus or PL/SQL can do themselves, and lets you be consistent in how you run all such non-simple SQL queries. I'll explain more of the details why I believe it to be a better fit for running ad-hoc complex SQL queries in a moment.

SQL*Plus or PL/SQL each have their own advantages for running different types of SQL query, but each also has a set of clear disadvantages that make them unsuitable for some types of SQL query.

SQL*Plus has its own set of features that make it quite flexible for interactively running SQL queries and seeing the results. These include things like "substitution variables" and prompting the user to enter values, and controlling formatting of the results output. It can execute any valid SQL statement, including DDL to do things like create new database objects. I find SQL*Plus very capable for writing an interactive script that can show the user some data, prompt them to enter a value, run other queries using that value as a filter, and display each result of each query as it gets them.

However, SQL*Plus has no direct "if then else" conditional logic support. People may find one way or another to achieve this, but it is not intrinsically a native feature of SQL*Plus, and such workarounds can look like kludges and be difficult to read and understand. It also has very limited error detection and handling abilities - basically it can stop if anything goes wrong, but that's about it. So there are clear limits to the abilities of SQL*Plus for executing SQL queries with any kind of conditional logic between them.

PL/SQL is Oracle's procedural language extensions to wrap such conditional logic and more around the execution of SQL queries, with good error detection and handling capabilities. It is definitely the language to use for things like stored procedures and triggers stored in the database itself. However as a general purpose language for executing SQL queries it has a number of severe limitations. The main one you keep running into is that it has little to no real input or output abilities for user interaction. It has no user input capabilities because it is designed to run within the database instance itself on the database server system, for which the concept of a real user at a screen with a keyboard does not make sense. So no real fault on PL/SQL's part for that. It has a very limited user output capability (DBMS_OUTPUT.PUT_LINE), because none of the output is visible to the user while the PL/SQL block is running. Instead all the output appears all together at once at the end when the PL/SQL has completed. Which is of course too late by then for progress and status reports.

PL/SQL also only supports DML SQL statements as native SQL statements in its code, partially because all native inline SQL statements are validated for correctness when the PL/SQL block is initially compiled i.e. when it is all parsed when submitted for creation or execution, and DDL is generally impossible to properly validate for various reasons. The workaround is to put such DDL SQL statements into local string variables in PL/SQL and then perform an "EXECUTE IMMEDIATE" on that string. While this does work, there are two major disadvantages:
  1. You lose the validation of the SQL at compile time when the PL/SQL is submitted, and instead can get a run time error when that particular SQL DDL is executed
  2. Your SQL DDL statements now become quoted strings rather than plain, inline SQL statements, making reading them more difficult (for two related reasons)
    • Recognising such SQL statements when embedded within string variables when reading the PL/SQL code
    • Complications when the SQL is very long and gets split over many separate lines, and has "variable" parts within it dependent on the values in other variables
A solution that I think deals with all the negatives of both SQL*Plus and PL/SQL while also supporting execution of any SQL statement is to use Python. I believe that it does not suffer from any of the problems that SQL*Plus or PL/SQL suffer from individually (i.e. it successfully deals with those particular issues and they are no longer problems in Python), while being a modern and relatively cleanly designed programming language. Thus you can use one programming language all the time for such one off complex SQL query scenarios, providing more consistency for you when doing these.

About the only negative I can think of for using Python is that SQL statements have to be put into string variables in the Python code and then passed to an "execute" method (a library function) to run it against the connected database. Which means that you do not get compile time validation of the SQL itself that PL/SQL might give you, and instead will only get run time errors when it tries to execute each SQL query in turn. But PL/SQL suffers from the same problem for DDL SQL and certain other variations of DML SQL (such as a variable list of column names in the SELECT), so you are no worse off really by using Python.

But in all other areas I believe it is superior to SQL*Plus and PL/SQL, and so worth the trade-off for what you gain (which is quite a lot) versus what you lose (SQL validation at compile time). Furthermore, if you use Python for all such SQL scripts you need, then you get a more consistent library of your own scripts to use. Often I start writing something in straight SQL + SQL*Plus and run into a dead end because of something like conditional logic. So then I have to rewrite it all as an anonymous PL/SQL block and add in a bunch of PL/SQL variable declarations. Using Python all the time means you are only using one language all the time, not two.

Note that I am not saying to use Python for all SQL executed against your Oracle database. Clearly things like business rules and logic should be in stored procedures in the database written in PL/SQL. And very simple SQL queries can be executed directly using SQL*Plus, I'm on about one off sets of non-simple SQL queries you want to run, or a report you want to be very interactive with user prompts for data values and feedback as the script progresses. In these cases I believe it is better to use Python (or maybe another programming language with SQL database execution support) than a mix of SQL*Plus and PL/SQL in different scripts.

Having proposed Python as the solution to the failings in both SQL*Plus and PL/SQL let me also add one more reason to consider using it. It is relatively easy to come up with a small set of templates for query execution that satisfy most possible scenarios. Then it is a matter of copying the appropriate template you have, editing that to add in the specific SQL queries you want executed, and then running that Python program. And a lot of this ability and power of Python is because it is a dynamic language - you do not need to declare variables in advance of using them in your code. This not only reduces the total number of lines of code needed for anything, but it also simplifies many of the issues of having to formally declare a named data type for a variable. Python will normally work out the data type needed from the source data itself that is being stored into the variable, so the same identical Python code can often be reused to process completely different source data types, which is a very powerful feature. Changing a data type in PL/SQL can involve a lot of changes to your source code. And this feature of Python becomes more powerful the more you become familiar with it and learn how to leverage it into truly generic code.

Python Examples

Here I will provide some Python code snippets that form a template and can be put together to run a SQL query for you. Note that I'm using Python 3 syntax which has a few key differences to Python 2 syntax e.g. "print" is a true function in Python 3. Also note that in Python the indentation level of statements is key, and tells Python when nested blocks of statements start and end. Indentation is not just about a pretty layout - it also tells Python how to interpret the meaning of your source code. And remember that variables do not have to be defined before you use them - Python determines the actual data type of a variable at run time when a value is stored in it.

First you need to connect to the database in question, and it may be easier to do this by prompting the user to enter their credentials rather than putting them on the command line when invoking Python.
#!/usr/bin/env python
import sys
import time
import cx_Oracle

print ("Enter database", end=": ", flush=True)
database = sys.stdin.readline ().rstrip ()
print ("Enter username", end=": ", flush=True)
username = sys.stdin.readline ().rstrip ()
print ("Enter password", end=": ", flush=True)
password = sys.stdin.readline ().rstrip ()
Note how the "sys.stdin.readline()" method is used to read from standard input what the user has typed. And how the string object method "rstrip()" is immediately used to strip off the newline character that is included in what "readline()" returns but which we don't need.

Personally I like to run a trivial query against the database immediately to show that the connection was successful and the database is up. In this example it queries both V$DATABASE for the database name, and then V$VERSION for the database software version.
print ()
print ("Connecting ...")

conn = cx_Oracle.connect (username, password, database)
print ("Connected successfully")

curs = conn.cursor ()

# V$DATABASE has only 1 row in it
curs.execute ("select name from v$database")
rows = curs.fetchall ()
print ("Connected to database {0}".format (rows [0][0]))

# Just the one row from V$VERSION
curs.execute ("select banner from v$version where banner like 'Oracle Database%'")
rows = curs.fetchall ()
print (rows[0][0])

print ()
Note how Python can retrieve all of the results at once into a nested set of arrays (lists as Python calls them) - a list of rows, and each row is a list of columns.

Also note that I am using the "cx_Oracle" module as the database driver for connecting to Oracle. Python defines a common database API that can be implemented by different driver modules for connecting to different databases. The "cx_Oracle" module is a third party module commonly used for connecting to Oracle databases. You can download a copy of "cx_Oracle" or another driver module and add it to your Python installation.
# Now do any real work we want to do, using the database connection just obtained
sql_query = """
select banner
  from v$version
print ("Executing SQL Query")
curs.execute (sql_query) 
rows = curs.fetchall ()
for i in range (len (rows)):
   print (rows[i][0])
The query here is just another trivial query against V$VERSION, but it could be anything you want it to be. It is very simple to iterate over all of the rows retrieved by a query in a "for" loop (len tells you the number of members in a list) and then print out the data retrieved. This is one way to iterate over the results by using "fetchall ()" to fetch all of the rows back explicitly, but there are a few other variations on this you can use depending on what you want to achieve and your preferences e.g. you can also iterate directly over the cursor itself which gives you each row of data values in turn.

Note how Python supports two different variations on defining string literals. Short string literals can be put inline between pairs of matching quotes - both single and double quotation characters are supported - and there are multiple cases of this here. Long string literals can be put between matching pairs of triple quotation characters and split over multiple lines, which is how I have done the SQL query here. The new line characters present in your source code are retained within the string value i.e. they are not stripped out when Python parses such a string literal value.

To execute a query in a loop with different filter values you should use bind variables rather than string concatenation (which is prone to SQL injection issues). In "cx_Oracle" we can use a leading colon in front of the bind variable name in the SQL query.
sql_query = """
select count (*) cnt
  from sales_history
 where period_id = :period_id
  GROUP BY product_id,
having count (*) > 1
for pid in range (1, 100) :
    # Get list of duplicate counts, then count those separately and output
    curs = conn.cursor ()
    bind_vars = {'period_id' : pid} # A dictionary of bind variable name to value mapping
    curs.execute (sql_query, bind_vars)
    rows = curs.fetchall ()
    dup_count = len (rows)
    print ("[{0}] Period {1:d} has {2:d} duplicates".format (time.strftime ("%H:%M:%S"), pid, dup_count))
Note the use of the "range" function which returns a list of values from the first value to one less than the last value.

Lets say you don't know the list of "period id" values to iterate over (1 to 99 in previous example), and instead want to get those from another query. You can do that too:
# Calendar contains per day entries, but month ids are not contiguous
# So Select the distinct month calendar id ordered descending and iterate over those

monthcurs = conn.cursor ()
monthcurs.execute ("select distinct month_calendar_id from calendar order by 1 desc")

for monthrow in monthcurs :
    cmid = monthrow [0] # First and only column in each output row
    # Get list of duplicate counts, then count those separately and output
    curs = conn.cursor ()
    bind_vars = {'period_id' : cmid}
    curs.execute (sql_query, bind_vars)
    rows = curs.fetchall ()
    dup_count = len (rows)
    print ("[{0}] Month ID {1} has {2:d} duplicates".format (time.strftime ("%H:%M:%S"), cmid, dup_count))
The above shows how Python lets you directly iterate over the rows in the first result set (for monthrow in monthcurs:) and using a value from each row execute the second query to count duplicates for that one period.

No comments: