Thursday, 29 January 2009

Java Data Types and Oracle Index Usage

I'm not a great fan of Java. I'm not saying I hate it, just that I've never exactly fallen in love with it. While it offers some potential advantages as an Object Oriented language over others, it also involves a lot of complexity in getting it to do anything useful that you want it to do. One way or another I keep running into issues with Java.

I've been working with a client who has a Java-based application, which through various layers of Classes and Objects ends up issuing SQL statements against Oracle 10g through JDBC. Up to now the client has been happy with the application, in spite of some of the hoops they have had to jump through to get it to work. Finding data abstraction layers that work together and can be used in production systems has been an issue for them - Open Source legal liabilities, etc. But one way or another they have managed to assemble a set of Java-based components that they have integrated into their application.

They have been live with a new module for 3 months now, and reported that elapsed times on the module were gradually getting longer and longer. It is a batch job that takes transactions posted into the system today, and updates various counts and totals in other data sets. Elapsed times had gone from an initial 2 hours to 10 hours only 2 months later, which was far too long.

One of the first weird things was that the client could not believe that their application ran so slowly, because they had done lots of testing before going live with it. And their testing had used similar data volumes to their production environment. More on that later.

Looking at some AWR reports I was quickly able to establish that the elapsed time was down to one UPDATE statement. The client knew this too, and could not tell why the UPDATE was so slow. It was using bind variables throughout, with a very simple WHERE clause. There were no sub-queries or joins, it was using the main primary index on the table, and was not doing any disk I/O (physical reads).

When they next ran that module of the application, I monitored V$SQL until I saw that UPDATE, and then did a DBMS_XPLAN.DISPLAY_CURSOR on it using its SQL_ID and CHILD_NUMBER. The execution plan I got did indeed say it was using an index, but with a SKIP SCAN on it. This means that the index was not being used as efficiently as it could be. Something was stopping Oracle from using all of the index - it has 9 columns in it.

Further examination showed that the WHERE clause in the UPDATE did specify all columns in the unique index it was using. So it should have been doing a UNIQUE SCAN. Checking the execution plan again, I spotted that there was a reference to one column within a call to an INTERNAL_FUNCTION. This got me thinking. The column in question was a DATE column, and was the second column in the 9 column unique index.

Ah-ha. Something was wrong with the DATE value provided by the application at run time, and Oracle could not use it directly to compare to the DATE values in the database. Instead Oracle was converting the value in each record in the database to something else, and then comparing it to the value in the query - with the net effect that it was not using all of the index. Thus, instead of doing a UNIQUE SCAN to find one single record, it was finding all records with the same value for the first column, and then comparing each of these to the converted DATE value. And in some cases one value of the first column could have many hundreds or thousands of matching data rows.

Why was the query gradually getting slower each day and week? Because the DATE column referred to the date of the business transaction itself, and sometimes they had to process late transactions from several days ago - hence the UPDATE of an existing record rather than an INSERT. And over time, more and more records were being added for different DATE values. So the UPDATEs were having to retrieve and check more records each week that went by.

What about the testing done prior to production? They had assumed that there would be very few late, back dated transactions, and had only tested new transactions for the past day. Thus all the processing involved INSERTs only into the table. Any late transactions in the test data were negligible in volume. However, in production up to 50% of the transactions were late. And the UPDATE was taking 100 times longer to execute than the INSERT.

What caused this? Digging further I established that Oracle saw the value it was getting was of type TIMESTAMP, which has sub-second values in it. And rather than lose precision by truncating its value, it was instead converting the DATE value in each data row to a TIMESTAMP value before comparing.

And some further digging shows that this is a known issue with Oracle and Java applications. See their own FAQ at:

Most of the Java date types (java.sql.Timestamp or java.util.Date) are considered by Oracle to be Timestamps, and the JDBC layer passes such values through as TIMESTAMP. Hence the conversion by Oracle, and the lack of use of the index. What is clear from reading the documentation is that Oracle got this mapping wrong. Instead of mapping the Oracle DATE only to the java.sql.Date type, they should also have allowed a mapping from Java Timestamp to Oracle DATE. Which is why they have changed the behaviour in the Oracle 11.1 JDBC driver. With this corrected mapping, no conversion is done, the Java-supplied query value is used directly, and the index is used properly for a UNIQUE SCAN.

A solution to this is tricky, as the customer is on Oracle 10g, and not planning on moving to 11g at the moment. Why is it tricky? Well, if it were simple there would already be a solution published by Oracle and people would be using it. And there isn't - see the FAQ referenced before. The database is not new, and the Java module is something new added on to an existing application suite. So switching all columns in the database from DATE to TIMESTAMP is not an option. Using java.sql.Date in the Java application is not an option either, because it has a zero value for its time part - i.e. no time of day. So something else needs to be done for the proper solution.
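The time-part problem with java.sql.Date is easy to demonstrate in plain Java, with no database involved. A minimal sketch - the date and time used are just made up for illustration:

```java
import java.util.Calendar;

public class DateMappingDemo {
    public static void main(String[] args) {
        // Build a point in time with a non-zero time of day: 29 Jan 2009, 14:30:45
        Calendar cal = Calendar.getInstance();
        cal.set(2009, Calendar.JANUARY, 29, 14, 30, 45);
        cal.set(Calendar.MILLISECOND, 0);
        long millis = cal.getTimeInMillis();

        // java.sql.Timestamp keeps the time of day (plus sub-second precision)
        java.sql.Timestamp ts = new java.sql.Timestamp(millis);
        // java.sql.Date represents the date only - no usable time of day
        java.sql.Date d = new java.sql.Date(millis);

        System.out.println("As Timestamp: " + ts); // 2009-01-29 14:30:45.0
        System.out.println("As sql.Date:  " + d);  // 2009-01-29
    }
}
```

So if the application needs the time of day preserved - as this one does - java.sql.Date loses information, and java.sql.Timestamp is what the JDBC layers end up sending.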

One option is to change the SQL used in the UPDATE, so that instead of a WHERE clause condition of:
  • date_column = :bind_variable
it uses:
  • date_column = CAST(:bind_variable AS DATE)

This way the value will always be a date, and the UPDATE will use the full index. However, the customer is worried about other such queries that may be lurking within their application code. They want a generic solution that just works for all DATE columns. Which means a change to the Java data type mapping somewhere, somehow.
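As a sketch of what that CAST looks like from the JDBC side - the table and column names here are invented for illustration, the point being only the CAST around the date bind variable:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class LateTransactionUpdate {
    // Hypothetical table and columns; the CAST is the important part
    static final String UPDATE_SQL =
        "UPDATE daily_totals SET txn_count = txn_count + 1 " +
        "WHERE account_id = ? AND txn_date = CAST(? AS DATE)";

    static int apply(Connection conn, String accountId, Timestamp txnDate)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(UPDATE_SQL)) {
            ps.setString(1, accountId);
            // The driver still binds a TIMESTAMP here, but the CAST means
            // Oracle compares DATE to DATE and can use the full index
            ps.setTimestamp(2, txnDate);
            return ps.executeUpdate();
        }
    }
}
```

The Java-side binding code is unchanged - only the SQL text differs - which is why this fix has to be repeated for every affected statement.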

While the customer works out how to implement their own change to their Java code (at which abstraction layer of all of the layers involved in mapping an application Object down to something that calls JDBC do they make a change?) they needed a quick fix to reduce the elapsed time of this job. The solution? Create an index with the same columns, but with the DATE column last in it. Yes, it takes time to create the index, and some disk space is used up. But the UPDATE now runs a lot faster, because it uses more of the index structure with the other 8 columns in it, and ends up with far fewer data rows to check against the DATE value.

The net result? The 10 hour job has come down to 2.5 hours, putting the customer back inside their 8 hour window. And it will probably remain at similar levels for at least the next month or so. Giving the customer time to work out how to fix their Java data type mappings.

As I said at the beginning, using Java can often cause all kinds of problems elsewhere, even when you are using it as the Java documentation tells you to. And isn't it amazing how the best tests in the world always seem to miss something that turns out to be more important and relevant than the tests they concentrated on? It seems that the Testing Team never actually took the time and effort to establish exactly what mix of business transactions would be processed by the application. They just made some assumptions about what it would be.

Monday, 26 January 2009

Queuing Theory & Resource Utilisation

While Queuing Theory can be quite academic and mathematical at times, it does include a number of core rules or laws about how systems and their components behave under higher levels of utilization and the effect on the length of queues of pending requests.

One of the conclusions from this is that you cannot have high utilization of any resource without some level of queuing occurring. It may not seem obvious, but it is true. Essentially the queue of pending requests or jobs is needed to keep the resource busy most of the time. If the queue were frequently empty, the resource would sit idle between requests and its utilization would be low. High utilization is only achieved by having a queue of waiting requests, so that there is always a next thing to do when the current thing completes.

You could theoretically also achieve 100% CPU utilization by only having a few very active processes and no queues. So on a 4 CPU system you could achieve 100% CPU utilization with only 4 processes, each always busy and executing instructions. In this scenario there is no queuing for the CPUs, as there are only 4 processes, and the efficiency of the system is very good.

However, such a scenario is very, very unlikely. The processes could never block for anything and would always be executing CPU instructions. Which means that they would not be doing any disk I/Os, or communicating with each other by some kind of messages, or using locks or latches on shared structures. Which is the complete opposite of the Oracle Database Architecture.

An Oracle database server will have many processes running on it - one shadow server process per connected session (unless you are using the Shared Server configuration) - and they will use shared memory, and locks and latches to control data consistency, and perform lots of disk I/Os.

So an Oracle database server does conform to the general model in Queuing Theory of having lots of separate clients (the shadow servers) making requests on the resources in the system. And as a result, it does conform to the golden rule of high resource utilisation equals queues of pending requests.

As a result, 100% CPU utilization is very bad, and is symptomatic of very large queues of waiting processes. Queuing Theory also shows that above 50% utilization of a resource there is, more often than not, at least one request waiting in the queue. Note that this is ‘on average’ - sometimes the queue can be empty and sometimes it can have several requests in it - but on average the number of waiting requests will be more than zero.

A general rule of thumb is to get worried at 80% utilization, as the number of concurrent requests in the system will average around four, and rises steeply above this. An explanation of some of this and a nice graph of queue length versus utilization is available in this Microsoft article on Modeling Principles for Sizing. I know it is a Microsoft article, but it does summarize the effects of queuing well, and has the nice graph in it.

These queues can apply to all aspects of a computer system - CPU, Disk, Network and Memory. To drive any one of these at 80% utilisation or above means that you have queues of pending requests, which are needed to keep the utilisation that high.

The net effect of such high utilisation is an increase in response time. The total elapsed time for a given request is now the wait time in the queue plus the service time of the request itself. When the queue is 4 long on average, the response time is actually 5 times the service time - e.g. you might spend 40 ms waiting in the queue to perform a disk I/O that itself takes 10 ms. So high utilisation and long queues have a direct effect on the elapsed time of individual transactions.

The formula for the Total Number of jobs in the system (N) in proportion to Utilisation (U) is:
  • N = U / (1 - U)
Note that N is not the same as Q - the number of requests waiting in the queue. N is both the number in the Queue plus any requests currently being serviced - the total within the system.

And the formula for Response Time is:
  • R = Service Time / (1 - U)

So higher utilisation directly leads to longer queues, and larger response times.
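As a quick sanity check of those two formulas, a few lines of Java will do - the 10 ms service time is just an assumed figure for illustration:

```java
public class QueueMetrics {
    // N = U / (1 - U): average total number of requests in the system
    static double jobsInSystem(double u) {
        return u / (1.0 - u);
    }

    // R = S / (1 - U): average response time given a service time S
    static double responseTime(double serviceTimeMs, double u) {
        return serviceTimeMs / (1.0 - u);
    }

    public static void main(String[] args) {
        double serviceTimeMs = 10.0; // assumed service time for illustration
        for (double u : new double[] {0.5, 0.8, 0.9, 0.95}) {
            System.out.printf("U = %2.0f%%  N = %4.1f  R = %5.1f ms%n",
                    u * 100, jobsInSystem(u), responseTime(serviceTimeMs, u));
        }
    }
}
```

At 80% utilisation this gives N = 4 and R = 50 ms - matching the rule of thumb above, with 40 of the 50 ms spent waiting in the queue rather than being serviced.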

Whether this is acceptable in your situation depends on whether your goal is total throughput irrespective of transaction response time, or individual transaction response time. In other words, is it an online system, or more of a batch system processing relatively large units of work?

While Queuing Theory can seem theoretical at times, to me it reinforces the message that when you scale up the load on any system there will always be a bottleneck. And that bottleneck will reach high utilisation as it nears its capacity, and large queues will form in front of it. Identifying where the queues are in a system, what the bottleneck is, and what can be done to fix it - reduce service time or increase capacity - are key to performance tuning. And an appreciation of Queuing Theory has helped me get a deeper understanding of this area.