Sunday 22 July 2012

Oracle 11gR2 on Arch Linux

The Oracle Database software is, I believe, only "supported" on a few distributions such as Red Hat Enterprise Linux and Oracle Enterprise Linux. But it can be made to work on many other Linux distributions one way or another, as most of what it needs is the same or equally available elsewhere. While this is not supported or recommended in any way by Oracle, it can be useful if you have a different Linux distribution installed on a test system for one reason or another and cannot change it. Here I will present the recipe for how I got Oracle 11g Release 2 installed on Arch Linux. Remember that none of this should be used in any kind of production-like environment, as it is unsupported.

Arch Linux

Arch Linux is a Linux distribution based on a rolling release model. This primarily means that there is only ever one version of Arch, being the "latest version", and whenever new releases of software packages are made by the developer they are then added to the Arch repositories after a small delay. This means that you can always get the latest version of any available software package on Arch, and that there is never a big upgrade or reinstall hassle with Arch every 6 or 12 months. Instead you regularly do a system upgrade which pulls down the latest versions of all installed packages from the repository, so that your system is always "up to date".

Another thing about Arch is that it is highly customisable to whatever combination of Linux software applications you want. It does this by only installing a "bare bones" set of Linux packages that gives you a text only console with a "login" prompt (classic old school UNIX). From this you then install all the extra packages you want, to arrive at your customised Linux, just as you want it. No bloat from other applications you never wanted in the first place. This means installing and configuring Arch is more than just booting from a Live CD and clicking on "Install". But Arch has an excellent Wiki including an "Installation Guide" that explains everything.

Arch has its own package manager (pacman) that handles all of the downloading, resolving dependencies and installation of the packages. So if a package is available you can just install it and everything it needs through one simple command.

I'm assuming for this article that you already have Arch Linux installed, and know enough about how to install it. And that you have installed Arch plus X Windows plus a full Desktop Environment (e.g. Gnome or Xfce) plus a Display Manager to handle the initial user login (e.g. GDM). If you don't use a full blown Desktop Environment then you will need the equivalent combination of Window Manager, Application Menu, etc., which is also possible in Arch.
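For reference, a minimal graphical setup like that can be pulled in with pacman - the group and package names below are only indicative of what I mean, so check the current Arch repositories as names change over time:
# install X, the Xfce desktop environment and the GDM display manager (names are indicative)
pacman -S xorg xfce4 gdm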

Extra packages

Oracle 11g requires "gcc", which is in the "base-devel" package group along with other development tools such as "make". Install this, plus some other needed packages:
pacman -S base-devel elfutils libaio unixodbc sysstat pdksh icu gdb
pacman -S libstdc++5
The last one - "libstdc++5" - is needed because the make files provided with Oracle refer to it explicitly. Oracle 11g was originally built against the Red Hat Linux release current at the time; the GNU standard C++ library has since moved on to version 6, but the Oracle installation still wants something calling itself version 5.

Symbolic links

Oracle expects some commands (and one library) to exist under /bin and /lib, whereas on Arch they now live under /usr/bin and /usr/lib as standard. You can work around this by creating symbolic links from one location to the other for these files:
ln -s /usr/bin/basename /bin/basename
ln -s /usr/bin/tr /bin/tr
ln -s /usr/lib/libgcc_s.so.1 /lib/libgcc_s.so.1
Other symbolic links are mentioned by other people for other Linux distributions, but for Arch these are the only ones I needed to make - other links already existed for some other files. Also bash is the default shell on Arch, which is the same as on Red Hat.

Normal Oracle Pre-Installation

  • Configure the Linux kernel and other settings as stated in the Oracle installation instructions. This includes:
    • Settings in /etc/sysctl.conf
    • Settings in /etc/security/limits.conf
    • Adding a line to /etc/pam.d/login if not already present
  • Create the groups needed (oinstall and dba), and the "oracle" user account.
  • Create the directories where the software will be installed e.g. /apps/oracle and /apps/oraInventory
  • I create a ".profile" file for the "oracle" user and set ORACLE_BASE and ORACLE_HOME based on the installation directories. Also set ORACLE_SID, and add ORACLE_HOME/bin to PATH (a minimal example is sketched below).
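A minimal ".profile" along those lines might look like the following - the Oracle home path and SID here are just examples, so adjust them to your own installation directories:
# Oracle environment for the "oracle" user - paths and SID are examples only
export ORACLE_BASE=/apps/oracle
export ORACLE_HOME=$ORACLE_BASE/product/11.2.0/dbhome_1
export ORACLE_SID=orcl
export PATH=$ORACLE_HOME/bin:$PATH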
 

Install Oracle 11g

Get the Oracle 11g software distribution - downloadable from oracle.com - and unzip the files to another directory. Run the installer and make your choices from the screens it presents - I normally go for "Install Software Only" and "Single Instance".
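For example, assuming the two 11gR2 zip files as named on the download site (the exact file names may differ):
# as the "oracle" user, in a scratch directory holding the downloaded zip files
unzip linux.x64_11gR2_database_1of2.zip
unzip linux.x64_11gR2_database_2of2.zip
# both zips unpack into a single "database" directory containing the installer
cd database
./runInstaller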

You will get the following issues when running the Installer:
  • All prerequisite checks will fail. This is because Oracle uses "rpm" to find out what packages are installed, and Arch does not use "rpm", so Oracle thinks everything is missing.
    • Providing you have installed the packages listed earlier, simply click on the "Ignore All" check box on the top right of the screen, then click "Next".
    • Click "Yes" on the pop-up dialog box that warns you about ignoring prerequisites. Then click "Install" as normal on the final screen.
  • You will get a failure message during installation about "target agent nmhs", from the make file "sysman/lib/ins_emagent.mk".
    • This only affects the Enterprise Manager Agent, which is generally not used, and so can be ignored.
    • It is due to a change in how "gcc" scans libraries to find referenced symbols in a program.
    • If you really want this Agent, a workaround is possible by editing the make file - just Google it for the details (the commonly cited edit is sketched after this list).
    • Assuming you don't want this Agent, just click "Continue" to ignore it and carry on.
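For reference, the workaround most commonly described for the "agent nmhs" error is to add the nnz11 library to the link line in that make file and then click "Retry" - this is from memory, so check it against one of the many write-ups online:
# edit $ORACLE_HOME/sysman/lib/ins_emagent.mk and find the line containing:
#     $(MK_EMAGENT_NMECTL)
# change it to include the missing library reference:
#     $(MK_EMAGENT_NMECTL) -lnnz11
# then click "Retry" in the installer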
Eventually you will get the normal pop-up dialog box about running the "root.sh" scripts as root, and then the installation is complete.

From here you can go on and create a database using the Database Configuration Assistant (dbca) or the "CREATE DATABASE" SQL statement, whichever you are more familiar with.
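To give a flavour of the manual route, a minimal CREATE DATABASE might look something like the sketch below. All the file paths, sizes and passwords are made up for illustration, and you also need an init.ora pfile (e.g. $ORACLE_HOME/dbs/initorcl.ora containing at least db_name=orcl) and a STARTUP NOMOUNT before running it:
-- connected as SYSDBA after STARTUP NOMOUNT; paths, sizes and passwords are examples only
CREATE DATABASE orcl
   USER SYS    IDENTIFIED BY syspassword
   USER SYSTEM IDENTIFIED BY systempassword
   CHARACTER SET AL32UTF8
   LOGFILE GROUP 1 ('/apps/oracle/oradata/orcl/redo01.log') SIZE 50M,
           GROUP 2 ('/apps/oracle/oradata/orcl/redo02.log') SIZE 50M
   DATAFILE '/apps/oracle/oradata/orcl/system01.dbf' SIZE 500M AUTOEXTEND ON
   SYSAUX DATAFILE '/apps/oracle/oradata/orcl/sysaux01.dbf' SIZE 300M AUTOEXTEND ON
   DEFAULT TEMPORARY TABLESPACE temp
      TEMPFILE '/apps/oracle/oradata/orcl/temp01.dbf' SIZE 100M
   UNDO TABLESPACE undotbs1
      DATAFILE '/apps/oracle/oradata/orcl/undotbs01.dbf' SIZE 200M;
-- then run catalog.sql and catproc.sql from $ORACLE_HOME/rdbms/admin as SYS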

Tuesday 17 July 2012

Python for the DBA (1)

I really like the Python programming language for general purpose programming and quickly putting together little utilities. There are similarities to Perl in terms of being an interpreted language with powerful data processing capabilities, but there the direct comparisons end. Python is a more modern language than Perl, and has a relatively clean design (nothing is ever perfect though, and Python itself has evolved over the years). The main reasons I prefer Python over Perl are:
  • Very clean syntax - no need for obscure characters in front of variable names and so on
  • Very readable code - it is generally clear what a piece of code is doing
  • Minimal syntax dressing overhead - Python avoids the need for begin / end block statement markers
  • True support for functions
    • Perl just pushes all the arguments into a single list that you must disassemble
  • Strongly typed - Python supports multiple data types and checks at run time that operations are valid
  • Rich set of types - number, string, list, dictionaries, plus others e.g. set, tuple
  • Supports classic "function" based programming - just like 'C' does
    • Very simple to write code to directly do what you want, and modularise common code into functions
  • Also supports full "object oriented programming" - full support for Classes as first level objects
    • Perl does not truly do classes as first level objects
I'm not trying to say that Python is better than Perl in all cases, or other programming languages. I'm just saying that Python is a very good and usable programming language, and that I prefer it over Perl now. With Perl I find it gets very confusing when you try and use any level of complexity and the syntax is not obvious or consistent, whereas with Python I don't get any of this because it is such a cleanly designed language. With Python I can easily build a small application by just putting together the necessary source code while using functions for modularity (what I call "just build and run"), avoiding the overheads of defining classes with fully object oriented programming. However, if I have a more complex set of requirements then I can do a full blown object based solution using classes within Python (too complex to just build, so some initial design is needed, probably with test cases too).

Python is very useful for a variety of different scenarios:
  • building a small application or utility by just writing the code - no "compile and link" steps as it is interpreted
  • building something iteratively (top down design & development) as your code is always runnable (being interpreted)
  • prototyping object based code quicker and easier than compiled languages such as Java
  • ability to extend Python with your own libraries - it is written in C and can call your own compiled libraries
  • quickly prototyping real Java code that uses standard Java Classes in the JVM via Jython 
    • Jython is a Python interpreter written in Java
    • this also uses far fewer lines of source code than would be needed in Java, because of Python's fundamentally different design
Again, some of these capabilities are not unique to Python, but the combination of them all together make it a winner for me.

Python Overview

Python should be straightforward enough to understand on reading it. The only major difference to other languages is that Python does not have begin / end statement block markers and instead relies on statement indenting to determine which statements are part of the same block. You will see that statements with embedded statement blocks in them have a colon (':') at the end of the first line, and the following lines are indented. Initially it takes some getting used to, but it does end up with less typing on your part and a consistent code layout.
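A tiny illustrative sketch (not from any real application) shows the pattern - the colon introduces a block, and the indentation alone marks where that block starts and ends:
def classify (n):
    # the indented lines form the body of the function
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

for i in [-2, 0, 7]:
    print "Value", i, "is", classify (i)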

Although Python supports what I call direct coding (write some statements and then just run them), it is really fully object oriented behind the scenes. Thus you see many standard functions actually returning objects, against which you invoke one of their methods (see Database Example below). The syntax of "object.method (arguments)" occurs frequently, rather than the non-object way of "function (object, other-arguments)".

Python Development

You could edit and run Python programs from the command line ("python filename.py" - the ".py" extension is just a common convention), or you could use an Integrated Development Environment, such as Eclipse with the PyDev plugin. The nice thing about this is that it does syntax checking for you as you type, so you can easily spot various errors before you run your code. And you can run your Python application from within the development environment, so you don't have to leave the editor each time you want to test something.

Python Database Access

Like Perl, Python defines a standard API for database access. The intention is that developers can write code based on this standard API, and a variety of drivers can be provided for access to different databases. By changing the driver used you can connect to different databases, but the bulk of your code remains the same.

There are a bunch of drivers out there for Oracle, but the most common one seems to be cx_Oracle.

Python Database Examples

Using the database API is relatively straightforward. I won't describe it in detail, as it becomes obvious once you use it. In summary, you call a "connect" library function with connection details and get back a connection handle (really an object). From this you create a cursor (statement handle), execute a SQL statement through it, and fetch back any data or other results. One neat thing is that the cursor has a "fetchall" method that fetches all the data rows for a SELECT into a single list (a list of rows, each row being a sequence of column values). This then lets you traverse the data using Python logic, with no more fetch calls. I assume that "fetchall" brings some efficiencies in how the data is transferred back over the network, but I have not done any tests on this. It will of course use more memory within your Python program to hold all the results fetched.

Example

Connect to an Oracle database and execute a SELECT from a table, printing out the data fetched.
from cx_Oracle import connect

conn = connect ("scott", "tiger", "orcl")
# Or: conn = connect('scott/tiger@orcl')
curs = conn.cursor()
curs.execute("SELECT name, value FROM atable")
rows = curs.fetchall()
for i in range(len(rows)):
    print "Row", i, "name", rows[i][0], "value", rows[i][1]
Notes:
  • len is a built in function that returns the number of entries in a list
  • range is a built in function that produces a list of numbers from 0 to one less than the supplied value
  • List members can be accessed using array like notation

Thursday 12 July 2012

SQL*Net Wires Crossed

I was trying to set up Data Guard between two servers and I kept getting connection errors from RMAN on the primary to the secondary:
RMAN>  connect auxiliary sys/syspassword@dgtest_stdby

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-04006: error from auxiliary database: ORA-01017: invalid username/password; logon denied
I tried to validate everything and it looked okay ("tnsping" worked), but I could not connect through SQL*Net at all, whether from RMAN or SQL*Plus. Local direct connections worked fine though. Eventually I managed to work out what was wrong.  So in case anyone else runs into the same problem ...

What I had originally was:
  • A newly created database (using CREATE DATABASE the old fashioned way)
  • An Oracle password file created using orapwd with the SYS password
  • A listener.ora on each server defining a static service name for the database on that server
  • A tnsnames.ora on each server defining the mappings of connection names to servers and services (example entries for both files are sketched after this list)
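To illustrate what I mean, the relevant entries look something like the sketch below - the host name, directory path and SID ("dgtest" on the standby, with "dgtest_stdby" as its static service name) are invented for this example:
# listener.ora on the standby server - static registration of the standby service
SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = dgtest_stdby)
      (SID_NAME = dgtest)
      (ORACLE_HOME = /apps/oracle/product/11.2.0/dbhome_1)
    )
  )

# tnsnames.ora on the primary server - the name RMAN uses to reach the standby
DGTEST_STDBY =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = standby-server)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = dgtest_stdby))
  )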
The problem was that I had never explicitly set a password for the SYS user in the newly created database. I could still connect locally, which I assume is because I was in the "dba" group in the operating system. But I could not connect remotely no matter what password I used. So the solution was:
  • ALTER USER SYS IDENTIFIED BY NEW_PASSWORD
  • This automatically updates the password file on the server ($ORACLE_HOME/dbs/orapw$ORACLE_SID)
  • Copy (ftp) this file over to the second, standby server
Now I could connect over SQL*Net to either database.
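Put together, the fix amounted to something like the following - the file name is illustrative, but on Linux the password file follows the orapw$ORACLE_SID naming convention:
-- on the primary, connected as SYSDBA: set a real SYS password, which rewrites the password file
ALTER USER SYS IDENTIFIED BY new_password;

# then, from the primary server's shell, copy the updated file to the standby (scp, ftp, etc.)
scp $ORACLE_HOME/dbs/orapwdgtest standby-server:$ORACLE_HOME/dbs/orapwdgtest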

I would say that the error message is not particularly helpful, as it seems that all manner of different error conditions can give rise to the same error message. And there is almost no way of investigating this, other than trial and error - which I was getting sick of after 2 hours of getting nowhere. I knew that the underlying SQL*Net worked, as tnsping worked, and I knew that the Listener configurations were correct. Eventually I simply decided to change the SYS password to be something completely different and lo and behold I could now connect to the primary database remotely. It then took another few minutes before I realised that I needed to copy the Oracle password file to the standby server again, because it had changed.

Monday 2 July 2012

All Outer Joins Are Not Equal

When executing SQL that uses ANSI syntax OUTER JOINs, Oracle first converts it to its own native syntax before executing it. Sometimes this conversion is not totally transparent, and it can restrict the execution plans the Optimizer will consider.

This started when I had a long running query, and it happened to involve outer joins between most of the tables (but not all of them). From the execution plan it was clear that the main problem was a full table scan on a very large table - out of a total cost of 35,000, 28,000 was from this full table scan alone. So replacing the full table scan by a more efficient index access, if possible, could lead to a significant improvement.

After some analysis I created an index on all of the referenced columns in the table. This meant that all necessary data for the query could be obtained from the index, with no need to visit the data rows in the table at all. As the data row was quite wide (180 bytes), the index was much narrower and much smaller in size. Also, because the index stores data in sorted order by the leading columns, it meant that the rows the query wanted would all be stored together within the index structure, further reducing the number of disk I/Os needed to retrieve it all. But, the Oracle Optimizer would not use this index at all. It just seemed to ignore it, and always do a full table scan.

At this point I tried another of my techniques for understanding how a query gets executed - remove tables from the query one at a time, until you get a more "sensible" execution plan, and then build it back up again to the final query by adding a table at a time, monitoring the execution plan after each change.

What I noticed was that when I got down to 3 tables from the original 7, the Optimizer would now choose to use the new index (doing an Index Fast Full Scan, at a lower cost of 4,000). But when I added back in a fourth table, it went back to doing a Full Table Scan at a cost of 28,000.

I did a 10053 trace on the execution of each query (3 table and 4 table queries) and noted that in the SINGLE TABLE ACCESS PATH section of the 3 table query it was costing an Index Fast Full Scan (IndexFFS), whereas in the 4 table query it was not costing this access path, and only had the Full Table Scan costed.
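For reference, the 10053 trace for each query was produced in the usual way, along these lines (the trace file appears in the session's trace directory under the diagnostic destination):
alter session set events '10053 trace name context forever, level 1';

-- run, or EXPLAIN PLAN FOR, the query being investigated here

alter session set events '10053 trace name context off';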

At this point while I could see what was going on (in one case the Optimizer knows about the Index Fast Full Scan as an option, but in the other case it doesn't) I could not explain why it was happening, nor what to do about it. So I turned to the Oracle Forums and raised a question about Why Optimizer ignoring Index Fast Full Scan? While waiting for replies I also continued to tinker with the queries in the hope of gaining some extra understanding of what was going on.

The cause of the problem was identified by Dom Brooks as being the fact that Oracle internally rewrites ANSI syntax OUTER JOINs to its own native syntax (using "(+)") before then optimizing and executing the query, and that this can result in some complex SQL that cannot be optimised as you might otherwise expect. Also see his earlier reply, which gives more background information and links to other explanations of this.

When it rewrites the ANSI syntax JOINs to Oracle native syntax, it wraps each join up in its own in-line view, to ensure that it is equivalent to the ANSI meaning. It then tries to merge these inline views back into the main query. For a single outer join, and other simple cases, this view merging happens smoothly. But in other cases it does not, and the Optimizer is left with some complex SQL with the joins being performed within the inline views. And this restricts the options available to it when building the final execution plan.

There is more information on this at the Oracle Optimizer Blog - Outer Joins in Oracle
"In Oracle, ANSI left and right outerjoins are internally expressed in terms of left outerjoined lateral views. In many cases, a left outerjoined lateral view can be merged and the ANSI left (or right) outerjoin can be expressed entirely in terms of Oracle native left outerjoin operator. (A lateral view is an inline view that contains correlation referring to other tables that precede it in the FROM clause.)"
So for a simple single outer join such as the following:
SELECT T1.d, T2.c
FROM T1 LEFT OUTER JOIN T2 ON (T1.x = T2.x);
it first becomes this on initial conversion to a lateral inline view:
SELECT T1.d, LV.c
FROM T1,  LATERAL (SELECT T2.C FROM T2 WHERE T1.x = T2.x)(+) LV;
which then becomes this when the view is merged:
SELECT T1.d, T2.c
FROM T1, T2
WHERE T1.x = T2.x (+) ;
However, if the joins are more complex, or there are other issues e.g. NULL values allowed within indexed columns, then it may not be possible to collapse the lateral views back into a single level query. In which case the options then open to the Optimizer are less than they would be otherwise, as it must treat the unmerged views on their own.

So if you have a query involving OUTER JOINs that is not getting as good an execution plan as you think it could, it may be because you have used ANSI join syntax, and when Oracle rewrites this to its native syntax it is not able to merge the inline views back together for one reason or another. In my case it seems to have been because almost all columns allowed NULL values (except the primary key columns), and NULL values are not stored within indexes on disk. As a result, Oracle was not able to utilise this extra index when more tables were added to the query that were also outer joined.

Wednesday 20 June 2012

NULLs & Multi-Column Indexes

Previously I showed some of the issues with allowing NULL values in columns and how indexes might be ignored by the Optimizer. All of the examples I gave involved single column indexes only, and showed that NULL values are not stored in such an index. As others have pointed out in comments, this is not necessarily true for multi-column indexes. If at least one of the columns in the index does not allow NULL values then an index entry is stored with the values of all of the indexed columns. This means that a NULL value can be stored in a multi-column index. Let's test that through an example.

Previously we saw that if we had an index on the ref_id_n column in the transaction table, which allowed NULL values, then the Optimizer would not use the index for an "IS NULL" constraint:
SQL_ID  cwm2cmgn8q09n, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 185056574

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |  3578 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION |     1 |     4 |  3578   (1)| 00:00:43 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("REF_ID_N" IS NULL)
The index on ref_id_n is being ignored, as the NULL values are not stored in that index. However, if we create another index, with ref_id_n as the first column followed by another column that does not allow NULL values, then the index will contain entries where ref_id_n is NULL. With such an index, the Optimizer can now use it for an "IS NULL" constraint:
SQL> create index ix2_transaction on transaction (ref_id_n, ref_id_nn) ;

Index created.

SQL> select count (*) from transaction where ref_id_n is null ;

  COUNT(*)
----------
       100

SQL> @xplastexec

SQL_ID  8tubzdty7vdnv, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 176942238

-------------------------------------------------------------------------------------
| Id  | Operation         | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                 |       |       |     2 (100)|          |
|   1 |  SORT AGGREGATE   |                 |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| IX2_TRANSACTION |     1 |     4 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("REF_ID_N" IS NULL)
So multi-column indexes can store NULL values in them, providing at least one column does not allow NULL values. The benefit of such an index can be significant - in this test case the cost came down from 3,578 to 2, simply because the number of NULL rows were so few. But an index on only the ref_id_n column itself is of no use for this query, and is ignored by the Optimizer.

Potentially you can also gain some benefit from an index where the ref_id_n column is not the leading column, as the index may be smaller in size than the table, and the Optimizer may choose an Index Fast Full Scan rather than a Full Table Scan. And that is the case with the test data set I have been using:
SQL> drop index ix2_transaction ;

Index dropped.

SQL> create index ix2_transaction on transaction (ref_id_nn, ref_id_n) ;

Index created.

SQL> select count (*) from transaction where ref_id_n is null;

  COUNT(*)
----------
       100

SQL> @xplastexec

SQL_ID  2j64r2n1nq4xm, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 1095380460

-----------------------------------------------------------------------------------------
| Id  | Operation             | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                 |       |       |   726 (100)|          |
|   1 |  SORT AGGREGATE       |                 |     1 |     4 |            |          |
|*  2 |   INDEX FAST FULL SCAN| IX2_TRANSACTION |     1 |     4 |   726   (1)| 00:00:09 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("REF_ID_N" IS NULL)

The Index Fast Full Scan is costed at 726, compared to the 3,578 of the Full Table Scan. Being a "count (*)" only, no other data columns are needed from the table itself.

So if you are allowing NULL values in columns within your database design, and you want to find the rows that have a NULL value stored, then an index on just that column alone is of no use. You will need a multi-column index that also includes another column that does not allow NULL values. However, it may be that the only columns that do not allow NULL values in your database design are the primary key column(s), if you simply allow NULLs for every other column by default.

Wednesday 13 June 2012

NULLs and Indexes

It is quite clear that Oracle does not store NULL values within indexes (B-Tree indexes specifically, being the most common type used within Oracle). This has been the case for many years - I remember reading about it in the version 7 days, over 15 years ago. My understanding was that this was an optimization on Oracle's part, as they believed that NULL values would not be used in most queries, and that storing NULL values in an index might skew the index somehow if there were a great many NULL values in rows. There are several consequences of this non-storage of NULL values in an index, as I mentioned in a previous post on NULL values. But it seems that some people are still unaware of this issue, believing that NULLs are stored within indexes, and that as a result all indexes are equal. Not having any hard evidence to refer people to, I thought I would invent some simple tests to show that NULLs are not stored in indexes, and the various side effects of this. Also, assumptions can be dangerous things when taken too far, and the memory plays tricks over what I once read in the past. So providing some real test cases would either verify any assumptions I had made, or show them up as being wrong.

Tests Overview

I'll create two simple tables - one as a master table of codes (reference with 10,000 rows), and another larger table that refers to the master table as a foreign key (transaction with 1,000,000 rows). This means that there are 100 rows per Reference Identifier value in Transaction. The Transaction table will have two such foreign key columns - one without NULL's (ref_id_nn), and one with NULL's (ref_id_n).

Indexes will be created on these columns, and some example queries run using either reference column to show whether the indexes are being used or not. We can also look at the statistics on the indexes themselves to tell us something about whether NULL values are stored in them or not. I provide all the SQL to create and populate these two tables at the end of this post - I assume most people are interested in the results first.

Index Statistics

Immediately we can see that NULL values are not stored in an index by looking at the statistics for the indexes on the Transaction table.

col index_name heading 'Index'     format a20
col lb heading 'Leaf|Blocks'       format 999,999     
col dk heading 'Distinct|Keys'     format 999,999,999 
col cf heading 'Clustering|Factor' format 999,999,999 
col nr heading 'Num Rows'          format 999,999,999 
--
select i.index_name,
       i.leaf_blocks lb,
       i.num_rows nr,
       i.distinct_keys dk,
       i.clustering_factor cf
  from user_ind_statistics i
 where i.table_name = 'TRANSACTION'
/
The results of this query are:
                         Leaf                  Distinct   Clustering
Index                  Blocks     Num Rows         Keys       Factor
-------------------- -------- ------------ ------------ ------------
PK_TRANSACTION          1,875    1,000,000    1,000,000       13,147
IX1_TRANSACTION_NN      2,090    1,000,000       10,000    1,000,000
IX1_TRANSACTION_N       2,090      999,900        9,999      999,900

Look at the "Number of Rows" values and the "Number of Distinct Keys" values for the two indexes on the two foreign key columns. The index on the column allowing NULL's has one less distinct key value - because the NULL value has not been stored in the index. Whereas the index on the column without any NULL's has the full 10,000 distinct key values in it. This is also reflected in the number of rows covered by the index - 100 less for the index with NULL values than for the index without NULL values. So already we have evidence that NULL values are not stored within a B-Tree index i.e. they are ignored by the index itself.

Queries

In my previous blog post I made two claims about indexes on columns allowing NULL values not being used:
  • Such an index cannot be used to satisfy an "IS NULL" query constraint
  • The index also cannot be used to satisfy a "column != value" query constraint
Let's test these claims. In principle we want to run the same query against each of the two foreign key columns, and see whether the corresponding index is used. However, this may not be possible for one reason or another, as I'll explain in each case.

"IS NULL" Constraint

The query is:
select count (*) from transaction where ref_id_n is null ;
When executed the query has the following execution plan (as from dbms_xplan.display_cursor immediately after executing the query):
SQL_ID  cwm2cmgn8q09n, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n is null

Plan hash value: 185056574

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |  3578 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION |     1 |     4 |  3578   (1)| 00:00:43 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("REF_ID_N" IS NULL)
So the index was ignored and a full table scan was done, even though there were only 100 rows to retrieve.

Unfortunately we cannot run this query using the other column, because it does not allow NULL values. So the Optimizer will know that this query cannot return any rows at all. (It actually did an Index Fast Full Scan when I ran it, and returned a count of 0 rows). Instead we can change the query to compare to a real value, rather than NULL, and see that it uses the index for the execution now, even though it is still 100 rows again from the 1,000,000 in the table.
SQL_ID  7burxn278qj8b, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n = 5

Plan hash value: 1807025728

---------------------------------------------------------------------------------------
| Id  | Operation         | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                   |       |       |     3 (100)|          |
|   1 |  SORT AGGREGATE   |                   |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| IX1_TRANSACTION_N |   101 |   404 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("REF_ID_N"=5)
This shows that an index on a column allowing NULL values cannot be used when executing a query with an "IS NULL" constraint on that column.

"!=" Constraint

I've since realised that for a "not equals" constraint the Optimizer would be more likely to choose a Full Table Scan whether the column allowed NULLs or not, simply because of the number of rows being returned. If 100 rows match a value, then 999,900 do not match that value. And a Full Table Scan will be the optimal access method regardless of whether the column allows NULL values or not.

However, if the query only did a "count (*)" and did not retrieve any data columns from the table, then potentially an index could be used - with the Index Fast Full Scan access method. Consider the following performed on each of the two foreign key columns:
select count (*) from transaction where ref_id_nn != 5 ;
When I execute this I get 999,900 for the non-NULL column, and 999,800 for the NULL allowed column. Whoops! Something is not right somewhere. Again, we are back to the point that NULL is not a "normal value" within the value space of the data type of the column. And a NULL value is treated differently by Oracle rather than just another value. So although NULL is never equal to another real value, it is also never not equal to another real value. Even if I force a full table scan via the "full (transaction)" hint, I still get a count of 999,800 for the "ref_id_n != 5" constraint. (The execution plan from dbms_xplan.display_cursor showed that "TABLE ACCESS FULL" was used on TRANSACTION). [This very point about "!=" was also mentioned in a comment by Narenda on my previous post, and I had independently stumbled upon this as a result of this series of tests].

So even though we have not been able to show one of my claims about NULL values and indexes, we have instead stumbled upon another issue with NULL values. A NULL value is never equal to a real value, and also never not equal to a real value. I am sure that I have seen Jonathan Lewis discuss this in a presentation when talking about rewriting SQL statements into their equivalents. If a column allows NULL values then you must be careful when rewriting equality constraints into equivalent inequality constraints.
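To make this concrete with the test data: the two counts below should differ by exactly the 100 NULL rows, because the inequality on its own never matches a row whose ref_id_n is NULL:
-- excludes the 100 rows equal to 5 AND the 100 NULL rows (999,800 in my test data)
select count (*) from transaction where ref_id_n != 5 ;

-- the NULL rows have to be asked for explicitly (which should give 999,900 here)
select count (*) from transaction where ref_id_n != 5 or ref_id_n is null ;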

Further testing threw up another anomaly. Rather than test "!=" I thought I would test "not in", but contrived so that there is only one value in the "not in" list. For the NULL allowed column the Optimizer chose a full table scan:
SQL> select count (*) from transaction where ref_id_n not in (select 5 from dual) ;

  COUNT(*)
----------
    999800

SQL> @xplastexec

SQL_ID  0f3nh5h11yu6p, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n not in (select 5 from dual)

Plan hash value: 297680891

-----------------------------------------------------------------------------------
| Id  | Operation           | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |             |       |       | 23425 (100)|          |
|   1 |  SORT AGGREGATE     |             |     1 |     4 |            |          |
|*  2 |   FILTER            |             |       |       |            |          |
|   3 |    TABLE ACCESS FULL| TRANSACTION |   998K|  3900K|  3578   (1)| 00:00:43 |
|*  4 |    FILTER           |             |       |       |            |          |
|   5 |     FAST DUAL       |             |     1 |       |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NULL)
   4 - filter(LNNVL(:B1<>5))

While for the non-NULL column it used the index:
SQL> select count (*) from transaction where ref_id_nn not in (select 5 from dual) ;

  COUNT(*)
----------
    999900

SQL> @xplastexec

SQL_ID  fw2svuam6vzb1, child number 0
-------------------------------------
select count (*) from transaction where ref_id_nn not in (select 5 from dual)

Plan hash value: 1550919231

---------------------------------------------------------------------------------------------
| Id  | Operation              | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                    |       |       | 20422 (100)|          |
|   1 |  SORT AGGREGATE        |                    |     1 |     4 |            |          |
|*  2 |   FILTER               |                    |       |       |            |          |
|   3 |    INDEX FAST FULL SCAN| IX1_TRANSACTION_NN |   998K|  3900K|   575   (2)| 00:00:07 |
|*  4 |    FILTER              |                    |       |       |            |          |
|   5 |     FAST DUAL          |                    |     1 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NULL)
   4 - filter(LNNVL(:B1<>5))
So I think that this example does show that there are times when the Optimizer will not use an index on a column allowing NULL values, but would otherwise use an index if NULL's were not allowed.

NULL as "Anything"

Another pattern I have seen is where a NULL value is used to mean "any value", rather than "no value". So in a system that involves processing transactions according to type, and each type is processed by a different handler, then a NULL value means that the transaction can be processed by any available handler. This leads to a query similar to the following:
select ... from transaction where (ref_id_n = 5 or ref_id_n is NULL) ...
The execution plan for this cannot use the index because NULL values are not stored in it. Hence you get an execution plan of
SQL_ID  4tsn8vbt3s3gh, child number 0
-------------------------------------
select count (*) from transaction where ref_id_n = 5 or ref_id_n is null

Plan hash value: 185056574

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |  3580 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION |   101 |   404 |  3580   (1)| 00:00:43 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("REF_ID_N"=5 OR "REF_ID_N" IS NULL))
However, if the database design was different, and the value '0' used as a special case for "any handler", with a corresponding row inserted into the "reference" table, then the query could be changed to the following and Oracle would use the index on the column:
SQL_ID  d2v9fgscm8nrq, child number 0
-------------------------------------
select count (*) from transaction where ref_id_nn = 5 or ref_id_nn = 0

Plan hash value: 1215059433

-----------------------------------------------------------------------------------------
| Id  | Operation          | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                    |       |       |     4 (100)|          |
|   1 |  SORT AGGREGATE    |                    |     1 |     4 |            |          |
|   2 |   INLIST ITERATOR  |                    |       |       |            |          |
|*  3 |    INDEX RANGE SCAN| IX1_TRANSACTION_NN |   201 |   804 |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access(("REF_ID_NN"=0 OR "REF_ID_NN"=5))
Again, the presence of NULL values results in sub-optimal query execution, and a different design can provide a better solution, allowing the Optimizer to use an index for much faster access (cost of 4 versus 3580).

Conclusion

For now I think I have written enough in one post about NULL's and indexes. My main point was to provide some real evidence that NULL values are not stored in B-Tree indexes in Oracle, and that if NULL values are allowed in columns it can affect whether the Optimizer will use an index or not. And I think I have done that. I'm sure that people experienced with Oracle already know this, but I just wanted to provide some proof points for anybody who doubted it for any reason. As ever, the benefit of some real tests is that you get to either verify any assumptions you have made or show them up to be wrong and correct them. In this case I have been able to correct some weak assumptions I had, and have learnt some more about how the Oracle Optimizer works, specifically when handling NULL values.

Table Creation SQL

Here is the SQL to create the tables, populate them with data and index them. Note the two foreign key columns in "transaction", and that one value is mapped to NULL when generating the data.
--
-- Create the tables necessary for the NULL values investigation
-- Create tables, load in generated data, index, gather statistics
--
prompt Creating tables
--
-- Reference table - a lookup of master values
--
create table reference (
ref_id number (*,0) not null,
group_id number (*,0) not null,
description varchar2 (512) not null,
constraint pk_reference primary key (ref_id)
) ;
--
-- Transaction table
--
create table transaction (
trans_id number (*,0) not null,
ref_id_nn number (*,0) not null,
ref_id_n number (*,0) ,
location_id number (*,0) not null,
padding varchar2 (512),
constraint pk_transaction primary key (trans_id)
) ;
--
alter table transaction add constraint fk_ref_nn 
foreign key (ref_id_nn) references reference (ref_id) ;
--
alter table transaction add constraint fk_ref_n
foreign key (ref_id_n) references reference (ref_id) ;
--
prompt Loading data
--
-- SPECIAL DUMMY ROW!! Keep in sync with any newly added columns!*!
insert into reference values (0, 0, 'Unknown - Dummy entry for referential integrity purposes') ;
--
insert into reference
with generator as 
(select rownum r
          from (select rownum r from dual connect by rownum <= 1000) a,
               (select rownum r from dual connect by rownum <= 1000) b,
               (select rownum r from dual connect by rownum <= 1000) c
         where rownum <= 1000000
       ) 
select r, 
    mod (r, 1000),
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' || to_char (r, '00000000')
  from generator g
where g.r <= 10000
/
commit ;
--
insert into transaction
with generator as 
(select rownum r
          from (select rownum r from dual connect by rownum <= 1000) a,
               (select rownum r from dual connect by rownum <= 1000) b,
               (select rownum r from dual connect by rownum <= 1000) c
         where rownum <= 1000000
       ) 
select r, 
    mod (r, 10000),
    decode (mod (r, 10000), 0, null, mod (r, 10000) + 1),
    mod (r, 1999),
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789' || to_char (r, '000000000')
  from generator g
where g.r <= 1000000
/
commit ;
--
prompt Creating indexes
--
create index ix1_transaction_nn on transaction (ref_id_nn) ;
create index ix1_transaction_n  on transaction (ref_id_n) ;
--
prompt Gathering Statistics
--
exec dbms_stats.gather_schema_stats ('JB') 
--
So there are 10,000 rows in Reference, and 1,000,000 rows in Transaction. This means that there are 100 rows per Reference Identifier value in Transaction.

Sunday 10 June 2012

Avoiding NULLs

On the one hand I understand the added value of the concept of a "NULL" value - meaning that no value has been stored in a column in a row in a database table. On the other hand I have always felt that allowing NULL values in a column leads to more trouble of one form or another down the line. And recent experience with another database design has again reinforced my concerns. For another description about some of the problems with NULL values and what they do or do not mean, you can also read Much Ado About Nothing?, which is quite informative.

If I had to summarise my position on allowing NULL values, I think of them rather like a loaded gun - not directly harmful in itself, but with a reasonable chance of something going wrong and resulting in serious damage. If you need to use them, and you have a clear reason why, then use them. But otherwise avoid them as much as possible.

Allowing NULL values in a column can be bad for all kinds of direct and indirect reasons.
  • Oracle does not store NULL values in normal B-Tree indexes, which limits when the indexes can be used on columns that allow NULL values
    • Such an index cannot be used to satisfy an "IS NULL" query constraint
    • The index also cannot be used to satisfy a "column != value" query constraint
  • Foreign keys that allow NULL are "optional", but again have index and optimizer side effects.
  • Outer joins may become necessary, which again limit the choices available to the Optimizer.
  • Many programming languages have no concept of NULL, being a value outside of the normal value range by definition.
    • Either NULL values require special programming, or require some kind of shadow boolean variable to indicate if it is really NULL (no real value in the main variable).
  • Many programmers struggle with the difference between NULL and empty or blank, which can all mean different things within a programming language e.g. a zero length string, or the number zero.
  • NULL cannot be used in normal equality and range tests. As it is outside the normal value range it cannot be greater than or less than a real value. And it cannot be equal to any other value in a column, including another NULL. You must use the special constraint "IS NULL" to test for a NULL value, and never "= NULL" (see the short example after this list).
  • Calculations involving a NULL value evaluate to NULL i.e. unknown. 
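A short example of the last two points, using a hypothetical table t with a nullable column col:
select * from t where col = NULL ;     -- never matches any row, not even the NULL ones
select * from t where col != 5 ;       -- rows where col is NULL are not returned either
select * from t where col is null ;    -- the only way to find the rows with no value
select nvl (col, 0) + 10 from t ;      -- without NVL the whole expression would be NULL for those rows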
So NULLs are definitely useful in some cases, but each case should be considered carefully before using them. Generally it is easier not to allow NULL values in columns, as all operations on them then behave in the obvious way. Allowing NULL values requires some thought to make sure that everything ends up working as you hoped it would.

Thursday 31 May 2012

What Is Performance?

Just what is Performance? I don't mean a formal definition of Performance - though it would probably involve things like measurements and elapsed time and resources used and units of work completed. And I don't mean Performance Tuning - which is a reactive action to a perceived problem. I'm on about how you design Performance into a new application or database, as a proactive action when designing and building a new database based application. What do you do differently compared to if you did not care about Performance and simply ignored it?

I believe a "Performance Based Design" has to think about how the application will be used. One of my personal sayings about performance is that computer systems and applications do absolutely nothing until end users submit some work for them to do. The application sits there waiting for another request to be submitted. So the performance of the application totally depends on the work the application is trying to do, which in turn depends on what the end users are doing. I don't believe Performance can exist in isolation from an Application, much like Data without a Program to process it is not much use to anyone.

Which means that designing with Performance in mind is really about clearly documenting how the application will be used, so that those use cases can feed into the final design, and be used to review the application once implemented. From a database point of view, this not only means finding out all the data items and the relationships between them - fairly standard database design type stuff - but also establishing which data items will be accessed more frequently than others.

Generally I agree with the argument about "premature performance optimization", where someone tries to immediately optimize part of the first design for best possible performance, without knowing whether it is a critical part of the application or not. A good design will be good enough (Agile principles too I believe), while being deliberately designed to allow changes in the future. Performance bottlenecks can be redesigned and rewritten when they are identified as real bottlenecks, and not before.

Keeping your options open within your design is important in being able to achieve good performance. Inflexible designs do not leave any room for improvement, without major modifications and rewrites.

Instrumenting the application is another very important thing to factor in. Building in performance measurement to your application will set you up to be able to identify any bottlenecks, and to turn nebulous "slowness" as perceived by end users into real numbers - both before and after any performance improvements you have made. It can also provide an audit trail of performance over time and can be used to show whether response time is indeed increasing, and whether the workload of transactions submitted has also increased.