I've been working with a client who has a Java-based application, which through various layers of classes and objects ends up issuing SQL statements against Oracle 10g through JDBC. Up to now the client has been happy with the application, in spite of some of the hoops they have had to jump through to get it to work. Finding data abstraction layers that work together and can be used in production systems has been an issue for them - Open Source legal liabilities, etc. But one way or another they have managed to assemble a set of Java-based components and integrate them into their application.
They have been live with a new module for 3 months now, and reported that elapsed times on the module were gradually getting longer and longer. It is a batch job that takes transactions posted into the system today, and updates various counts and totals in other data sets. Elapsed times had gone from an initial 2 hours to 10 hours only 2 months later, which was far too long.
One of the first weird things was that the client could not believe that their application ran so slowly, because they had done lots of testing before going live with it. And their testing had used data volumes similar to their production environment. More on that later.
Looking at some AWR reports I was quickly able to establish that the elapsed time was down to one UPDATE statement. The client knew this too, and could not tell why the UPDATE was so slow. It was using bind variables throughout, with a very simple WHERE clause. There were no sub-queries or joins, it was using the main primary index on the table, and was not doing any disk I/O (physical reads).
When they next ran that module of the application, I monitored V$SQL until I saw that UPDATE, and then did a DBMS_XPLAN.DISPLAY_CURSOR on it using its SQL_ID and CHILD_NUMBER. The execution plan I got did indeed say it was using an index, but via a SKIP SCAN. A skip scan means the index is not being used as efficiently as it could be - something was stopping Oracle from using all of the index, which has 9 columns in it.
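For reference, this is roughly how the plan of that specific cursor can be pulled. It is a minimal sketch done here through JDBC, with made-up connection details and a made-up SQL text filter; the same two queries can just as easily be run from a normal SQL session, given access to V$SQL:

    import java.sql.*;

    public class ShowCursorPlan {
        public static void main(String[] args) throws SQLException {
            // Placeholder connection details, not the client's real system
            Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/orcl", "monitor_user", "password");

            // Find the UPDATE in the shared pool and note its SQL_ID and CHILD_NUMBER
            String find = "SELECT sql_id, child_number FROM v$sql " +
                          "WHERE sql_text LIKE 'UPDATE txn_totals%'";
            String sqlId = null;
            int childNo = 0;
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery(find);
            if (rs.next()) {
                sqlId = rs.getString(1);
                childNo = rs.getInt(2);
            }
            rs.close();
            st.close();

            // Pull the execution plan for that exact child cursor
            PreparedStatement ps = con.prepareStatement(
                    "SELECT plan_table_output FROM table(dbms_xplan.display_cursor(?, ?))");
            ps.setString(1, sqlId);
            ps.setInt(2, childNo);
            ResultSet plan = ps.executeQuery();
            while (plan.next()) {
                System.out.println(plan.getString(1));
            }
            plan.close();
            ps.close();
            con.close();
        }
    }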
Further examination showed that the WHERE clause in the UPDATE did specify all of the columns in the unique index it was using, so it should have been doing a UNIQUE SCAN. Checking the execution plan again, I spotted that one of those columns was referenced inside a call to an INTERNAL_FUNCTION. This got me thinking. The column in question was a DATE column, and was the second column in the 9-column unique index.
Ah-ha. Something was wrong with the DATE value provided by the application at run time, and Oracle could not use it directly to compare to the DATE values in the database. Instead, Oracle was converting the value in each record in the database to something else, and then comparing that to the value in the query - with the net effect that it was not using all of the index. So instead of doing a UNIQUE SCAN to find one single record, it was finding all records with the same value for the first column of the index, and then comparing each of these to the converted DATE value. And in some cases one value of the first column could have many hundreds or thousands of matching data rows.
Why was the query gradually getting slower each day and week? Because the DATE column referred to the date of the business transaction itself, and sometimes they had to process late transactions from several days ago - hence the UPDATE of an existing record rather than an INSERT. And over time, more and more records were being added for different DATE values. So the UPDATEs were having to retrieve and check more records each week that went by.
What about the testing done prior to production? They had assumed that there would be very few late, back-dated transactions, and had only tested new transactions for the past day. Thus all the processing involved only INSERTs into the table. Any late transactions in the test data were negligible in volume. However, in production up to 50% of the transactions were late. And the UPDATE was taking 100 times longer to execute than the INSERT.
What caused this? Digging further, I established that Oracle saw the value it was being given as a TIMESTAMP, which has sub-second precision. And rather than lose precision by truncating that value, it was converting the DATE value in each data row to a TIMESTAMP before comparing.
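To make that concrete, here is a sketch of the kind of bind the application ends up sending. The table and column names are invented for illustration, and the client's real code sits behind several abstraction layers, but whether those layers call setTimestamp() directly or convert a java.util.Date into a java.sql.Timestamp first, the value arrives at the 10g driver as a TIMESTAMP:

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    public class LateTransactionUpdate {
        // Only 3 of the 9 key columns are shown; the rest are omitted for brevity.
        static int applyLateTransaction(Connection con, long accountId,
                                        java.util.Date businessDate,
                                        BigDecimal amount) throws SQLException {
            String sql = "UPDATE txn_totals SET total_amount = total_amount + ? " +
                         "WHERE account_id = ? AND txn_date = ?";
            PreparedStatement ps = con.prepareStatement(sql);
            try {
                ps.setBigDecimal(1, amount);
                ps.setLong(2, accountId);
                // The business date comes down from the object layers as a
                // java.util.Date and is bound as a java.sql.Timestamp, so the
                // driver sends a TIMESTAMP. The plan then shows
                // INTERNAL_FUNCTION("TXN_DATE") and an INDEX SKIP SCAN.
                ps.setTimestamp(3, new Timestamp(businessDate.getTime()));
                return ps.executeUpdate();
            } finally {
                ps.close();
            }
        }
    }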
And some further digging showed that this is a known issue with Oracle and Java applications. See Oracle's own FAQ at:
http://www.oracle.com/technology/tech/java/sqlj_jdbc/htdocs/jdbc_faq.html#08_01
Most of the Java date types (java.sql.Timestamp and java.util.Date) are treated by Oracle as Timestamps, and the JDBC layer passes such values through as TIMESTAMP binds. Hence the conversion by Oracle, and the failure to use all of the index. What is clear from reading the documentation is that Oracle got this mapping wrong. Instead of mapping the Oracle DATE only to the Java java.sql.Date type, they should also have allowed a mapping from a Java Timestamp to an Oracle DATE. Which is why they have changed the behaviour in the Oracle 11.1 JDBC driver. With this corrected mapping, no conversion is done, the Java-supplied query value is used directly, and the index is used properly for a UNIQUE SCAN.
A solution to this is tricky, as the customer is on Oracle 10g and not planning on moving to 11g at the moment. Why tricky? Well, if it were simple there would already be a solution published by Oracle and people would be using it - and there isn't; see the FAQ referenced above. The database is not new, and the Java module is something new added on to an existing application suite, so switching all the columns in the database from DATE to TIMESTAMP is not an option. Using java.sql.Date in the Java application is not an option either, because it has a zero value for its time part, i.e. no time of day. So something else needs to be done for a proper solution.
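That java.sql.Date limitation is easy to demonstrate; a minimal, self-contained sketch with nothing client-specific in it:

    import java.sql.Date;
    import java.sql.Timestamp;

    public class DateMappingDemo {
        public static void main(String[] args) {
            long now = System.currentTimeMillis();

            // java.sql.Timestamp keeps the full date, time of day and sub-second value
            System.out.println(new Timestamp(now));   // e.g. 2009-03-02 14:37:05.123

            // java.sql.Date is a date-only type: its string form has no time of
            // day, and per the JDBC mapping it is bound without one, so it cannot
            // reproduce the time of day held in the existing DATE column values
            System.out.println(new Date(now));        // e.g. 2009-03-02
        }
    }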
One option is to change the SQL used in the UPDATE, so that in the WHERE clause, instead of:
- date_column = :bind_variable
it uses:
- date_column = cast (:bind_variable as date)
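Applied to the application's statement text, the change would look something like this - a sketch, using the same invented table and column names as above rather than the client's real ones:

    public class CastFix {
        // Before: the TIMESTAMP bind is compared to the DATE column, so Oracle
        // wraps the column in INTERNAL_FUNCTION and can only skip scan the index
        static final String BEFORE =
                "UPDATE txn_totals SET total_amount = total_amount + ? " +
                "WHERE account_id = ? AND txn_date = ?";

        // After: the bind itself is cast down to a DATE, the comparison becomes
        // DATE = DATE, and the optimizer can use the whole unique index
        static final String AFTER =
                "UPDATE txn_totals SET total_amount = total_amount + ? " +
                "WHERE account_id = ? AND txn_date = CAST(? AS DATE)";
    }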
This way the value will always be a date, and the UPDATE will use the full index. However, the customer is worried about other such queries that may be lurking within their application code. They want a generic solution that just works for all DATE columns. Which means a change to the Java data type mapping somewhere, somehow.
While the customer worked out how to implement their own change to their Java code (at which of all the abstraction layers involved in mapping an application object down to something that calls JDBC do they make the change?), they needed a quick fix to reduce the elapsed time of this job. The solution? Create another index with the same columns, but with the DATE column last. Yes, it takes time to create the index, and some disk space is used up. But the UPDATE now runs a lot faster, because it can use the other 8 columns of the index structure and ends up with far fewer data rows to check against the DATE value.
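The stop-gap index looked something along these lines - again a sketch with invented names, and in reality a one-off piece of DDL run from a normal SQL session rather than from Java:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class StopGapIndex {
        // Same 9 columns as the unique index, but with the DATE column moved to
        // the end: the leading 8 columns are then matched exactly, leaving only
        // a handful of rows for the converted TIMESTAMP comparison
        static void createFixIndex(Connection con) throws SQLException {
            Statement st = con.createStatement();
            try {
                st.execute("CREATE INDEX txn_totals_fix_ix ON txn_totals " +
                           "(account_id, key_col3, key_col4, key_col5, key_col6, " +
                           "key_col7, key_col8, key_col9, txn_date)");
            } finally {
                st.close();
            }
        }
    }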
The net result? The 10-hour job has come down to 2.5 hours, putting the customer back inside their 8-hour window. And it will probably remain at similar levels for at least the next month or so, giving the customer time to work out how to fix their Java data type mappings.
As I said at the beginning, using Java can often cause all kinds of other problems elsewhere, even when you are using it as the Java documentation tells you to. And isn't it amazing how the best tests in the world always seem to miss something that turns out to be more important and relevant than the things they concentrated on? It seems that the Testing Team never actually took the time and effort to establish exactly what mix of business transactions would be processed by the application. They just made some assumptions about what it would be.