Thursday 9 June 2011

Customers who won't wait to be helped

[I'm back. I haven't posted for a long while due to a combination of being busy and somewhat losing focus on the blogging side, but I hope to post more frequently in the future. I was going to start with a post on histograms in Oracle but ...]

I work for a software house, whose heavyweight application runs on top of Oracle. Earlier this week I was dealing with a customer who had poor performance on just one SQL query after having updated their statistics. I spent a day analysing the SQL and the data - what was it trying to do and why - and another day trying different solutions to make it use a different execution plan and run faster for them. I then sent the results to the support people dealing with this customer so that the solution could be implemented. Imagine my surprise then when I see today that Jonathan Lewis has a blog post about the same SQL statement, and makes reference to a post on one of the Oracle Technical Forums.

I'm surprised that a customer would post SQL from a third party application on the web like that, and that they would somehow expect a better answer to come back than from the software vendor themselves. I was even more surprised because they were asking for a set of hints to change the execution plan, and I had just come up with such a set of hints just 2 days earlier. Okay, maybe this had not been forwarded to the customer yet, because our Support department was doing further testing on it first. So it is highly likely that the customer had not seen my solution yet. But I'm still amazed that a customer would be willing to put into production some recommendation they had got off a web forum, and on the basis that they could not work out a better solution themselves - the poster did not know how to do hints properly.

In terms of the problem itself, the cause of the poor performance is actually the extreme skew within the data in this table. On average most of the date values occur thousands of times, but the particular value used in the query only occurs about ten times in the table. Hence another execution plan is faster for this particular value.

Which brings me back to histograms, which I was planning on doing a completely different post on anyway. There is a histogram on the constrained column, but it is a height balanced histogram because there are over 254 distinct values, and so there are no statistics on row counts of individual values other than the most popular values. The value used in the query is unpopular, and the average number of rows per value across all the unpopular values is in the tens of thousands. Hence the choice of an execution plan using a HASH JOIN, as that scales in a different way to higher row counts than a NESTED LOOP join does.

Maybe Oracle should introduce a new type of histogram? One that records both the most popular values, and the least popular values, and then stores a single average for all the values in between (the density). That would have helped here. It certainly seems the case that Oracle does not handle heavily skewed data sets well, though of course you should never expect a "one size fits all" solution to be perfect for all possible scenarios. What the Optimizer does is try and produce an execution plan that is the best for most of the possible input values. Which is what it did here, based on the statistics available for the average number of rows per date value via the density statistic.

Another viable solution is to make the Optimizer believe that the date constraint will match far fewer rows and so choose the NESTED LOOP join method itself. This is the approach put forward by Wolfgang Breitling in his paper on Tuning by Cardinality Feedback, in which he suggests changing the Density statistic stored for the column. And indeed reducing the density value by a factor of 100 has this effect. The upside of this approach is that it avoids the need to change the SQL or use hints, which do not always work as intended.

No comments: