Tuesday 20 September 2011

Blowing my own trumpet - #1

On the one hand I don't want to boast, but on the other hand there are a couple of things I am proud of at the moment and thought I would take the opportunity to share. If only to record them for posterity.

First, I am presenting at the UK Oracle User Group UNIX SIG in a few days time (Thu 22 Sept) on Queueing Theory. I thought I would volunteer for something in the spirit of giving back, and was pleasantly surprised when they accepted my submission. Nothing revolutionary in the presentation itself, but its nice to know that I have something useful to offer others.

Second, I answered a question posed on Richard Foote's blog the other day and was again pleasantly surprised when he said I got the answer right. "Spot on" was Richard's reply. Which was a nice surprise. I expected to be close but I also expected to have missed something out somewhere. That's the danger when you quickly reply to a question posted on the web, which becomes obvious when someone else points your mistake later. But this time I got it right.

Back to normal posts later,
John

Tuesday 19 July 2011

Beware: CPU Threads are not the same as CPU Cores #1

Craig Shallhamer recently did a number of posts about reporting CPU usage from the data Oracle was gathering, and in this he made a distinction between CPU Cores and CPU Threads. His focus was on the correlation between the values Oracle was reporting and those reported by other operating system tools - were they the same and could you directly use the values reported by Oracle? (His conclusion was that they were the same). What he didn't go into was the difference between CPU Cores and CPU Threads, and what the "real" CPU capacity of each type is. I think this is an important issue for anyone dealing with performance on modern computer systems, as the two types of "multi-CPU" are not the same, and they exhibit radically different scalability behaviour. And what compounds this is that most operating systems do not distinguish between the two types of CPU, and so will misreport (or lie if you want to) how much CPU capacity you have and how much is being used.

The purpose of this post is to try and bring out this distinction between the two types of "multi-CPU" in a single processor chip, and show that multi-threaded CPUs do not behave as expected, leading to unexpected and negative scalability results under increased workloads. The wrong time to discover that the reported "spare, unused CPU capacity" in your system is wrong is when you experience an increase in your transaction volume, and everything just runs slower rather maintaining the same response times.

Some definitions:
  • A CPU Core or Execution Core is a physical piece of silicon in the CPU chip that actually executes instructions on data values.
  • A CPU Thread is a virtualized thing that in reality sits on top of an underlying CPU Core.
The key difference is that because of the virtualization, multiple CPU Threads will share a single CPU core. Thus I might have 2 CPU Threads but only 1 CPU Core. If I have 2 CPU Cores then they can execute 2 instructions simultaneously, independently and in parallel. I have 2 times the capacity of a single CPU core. If I have 2 CPU Threads sharing a single CPU Core, then it can still only execute 1 instruction at a time. This is true regardless of the number of Threads being virtualized on the same CPU core - there is only one CPU core and it can only execute one instruction at a time. The virtualization layer may let it switch between the different Threads very often, so that they all have some of their instructions executed, but it is always only executing one instruction at a time.

If I have 2 CPU Cores I can get twice the work done in a period of time. But if I have 2 CPU Threads sharing the same physical CPU Core then I can get no more work done in a period of time than if there were only 1 CPU Thread. And I will show this in a moment.

Most operating systems though don't distinguish between these two types of multi-CPU flavours. They report CPU Cores and CPU Threads as if they each had an equal capacity to do work. Which means that most operating systems are over reporting the available CPU capacity if the underlying CPUs use CPU Threads on shared CPU Cores.

Tests
A solid example. I have a laptop with an Intel Core i3 M350 CPU in it. This has 2 physical CPU Cores, and on each of these it has 2 CPU Threads. This gives a total of 4 exposed CPUs, but on only 2 CPU cores. The laptop is running Linux (XUbuntu) and I have installed SAR via the sysstat package. /proc/cpuinfo shows 4 CPUs (processors 0 to 3), and clearly this is used by SAR in calculating CPU capacity and CPU utilisation.

I have a simple CPU test using the factor utility, which prints out the prime factors of any number. Any CPU only test will do, and factor is a CPU bound utility. Feed it a list of very big numbers and it will use up some CPU testing prime factors. By using the same numbers I can repeat this any time I want to. How will this behave when I run more instances of this test at the same time? The system says that it can see 4 CPUs. But I know that the Core i3 has only 2 physical CPU Cores to execute instructions on.

The only fly in the ointment is that because this laptop is running X Windows and I am typing in my results to it there is a small constant CPU overhead of around 5%. So the system is not 100% idle when I run these tests. I mention this to be as honest and open as I can be about the tests I did.

Running factor once against a list of large numbers (see elsewhere) and using "time" to measure its elapsed time, takes 7.779 seconds and produces the following SAR output:
12:49:44        CPU     %user     %nice   %system   %iowait    %steal     %idle
12:49:46 all 0.74 0.00 0.74 0.00 0.00 98.51
12:49:48 all 1.74 0.00 0.62 0.00 0.00 97.64
12:49:50 all 10.88 0.00 1.64 0.00 0.00 87.49
12:49:52 all 27.66 0.00 2.78 0.00 0.00 69.57
12:49:54 all 27.64 0.00 2.67 0.48 0.00 69.21
12:49:56 all 27.79 0.00 2.55 0.73 0.00 68.93
12:49:58 all 15.12 0.00 2.03 0.00 0.00 82.85
12:50:00 all 3.29 0.00 3.06 0.94 0.00 92.71
As expected, because Linux believes that there are 4 CPUs in the system and I have run one CPU bound task it reports that it is using 25% of the CPU capacity. "time" reported that 7.440 seconds of user CPU time was used. System CPU time is always negligible in these tests.

When I run 2 of these factor tests at the same time they take 8.034 seconds - about the same elapsed time allowing for other factors affecting this measurement.
12:52:16        CPU     %user     %nice   %system   %iowait    %steal     %idle
12:52:18 all 2.09 0.00 1.72 0.00 0.00 96.19
12:52:20 all 0.61 0.00 0.49 0.00 0.00 98.90
12:52:22 all 25.73 0.00 3.88 0.85 0.00 69.54
12:52:24 all 52.32 0.00 3.58 0.72 0.00 43.38
12:52:26 all 51.49 0.00 3.93 0.00 0.00 44.58
12:52:28 all 51.57 0.00 3.13 0.96 0.00 44.34
12:52:30 all 22.18 0.00 1.72 0.00 0.00 76.10
12:52:32 all 1.25 0.00 0.25 1.50 0.00 97.01
12:52:34 all 1.10 0.00 0.74 0.00 0.00 98.16
As expected, SAR is reporting 50% CPU utilization - double our 25% before - and still around 45% CPU capacity unused (idle).

Note also that "time" reports that 15.465 seconds of user CPU time was used by these 2 factor tests. This ratio of 2:1 between CPU used and elapsed time shows that the 2 factors were indeed running simultaneously in parallel.

Next I run 4 of the factor tests together. How long will they take - still 8 seconds? And what will SAR report?

In fact they took 13.555 seconds, with time reporting 47.991 user CPU seconds. SAR reported 100% utilisation:

12:56:40 CPU %user %nice %system %iowait %steal %idle
12:56:42 all 0.70 0.00 0.58 0.00 0.00 98.72
12:56:44 all 0.12 0.00 0.74 1.74 0.00 97.39
12:56:46 all 22.10 0.00 1.85 0.12 0.00 75.93
12:56:48 all 96.62 0.00 3.38 0.00 0.00 0.00
12:56:50 all 96.88 0.00 3.12 0.00 0.00 0.00
12:56:52 all 96.62 0.00 3.38 0.00 0.00 0.00
12:56:54 all 97.00 0.00 3.00 0.00 0.00 0.00
12:56:56 all 96.00 0.00 4.00 0.00 0.00 0.00
12:56:58 all 96.88 0.00 3.12 0.00 0.00 0.00
12:57:00 all 26.37 0.00 1.74 0.87 0.00 71.02
12:57:02 all 0.25 0.00 0.62 0.99 0.00 98.14
12:57:04 all 0.62 0.00 0.86 0.00 0.00 98.52

Now, if there were 4 real CPU cores I would not expect the elapsed time to increase from about 8 to 13.5 seconds (over 50% increase). Furthermore, something is definitely wrong with the reported CPU usage - 47.991 seconds (nearly 48) for 4 compared to 15.465 (15.5) for 2 factors. I'll come back to this in another post (the operating system is actually measuring allocation of CPU time, not real usage of CPU cycles). The fact that one factor used 7.440 seconds of CPU means that four factors should not use more than about 30 seconds (7.5 * 4). So 48 seconds is way off the mark.

For completeness 3 factors together take 10.946 seconds, with 29.738 CPU seconds, and SAR said about 75% utilisation (75% for factor + 5% for other background processes):

13:00:24 CPU %user %nice %system %iowait %steal %idle
13:00:26 all 1.24 0.00 0.99 0.12 0.00 97.65
13:00:28 all 1.70 0.00 0.73 0.85 0.00 96.72
13:00:30 all 21.98 0.00 2.56 0.00 0.00 75.47
13:00:32 all 79.15 0.00 4.49 0.62 0.00 15.73
13:00:34 all 79.72 0.00 3.50 0.00 0.00 16.77
13:00:36 all 79.75 0.00 3.88 0.00 0.00 16.38
13:00:38 all 74.63 0.00 2.74 0.00 0.00 22.64
13:00:40 all 52.11 0.00 3.49 1.32 0.00 43.08
13:00:42 all 12.07 0.00 1.52 0.00 0.00 86.40
13:00:44 all 2.31 0.00 0.85 1.33 0.00 95.51
13:00:46 all 1.43 0.00 0.95 0.00 0.00 97.62
From this I draw the following conclusions:

First, as I described, CPU Threads are virtual and the real measure of available CPU capacity is CPU cores. That is the "real CPU capacity available", regardless of the number of threads on top of the physical CPU cores.

This is shown by the increase in elapsed time when running more than 2 factors together. If there really were 50% CPU capacity unused and idle then there should be no significant increase in elapsed time for 3 or 4 factors together compared to only 1 or 2.

Interestingly 8 factors took 26.790 seconds and "time" reported 1 minute 37.194 user CPU seconds (97.194 seconds). Both almost perfectly double the 4 factors test results, because the CPUs are fully saturated.

Second, the operating system and hence SAR is lying when it says that there is 50% CPU capacity unused and available when I ran 2 factors together. A "CPU Thread" is unfortunately not a real CPU, but is rather sharing a real CPU with other "CPU Threads".

Third, this is also important to Oracle based systems. A "perfect" or "ideal" Oracle system is both well tuned and running a well designed and written application. It will minimise disk I/Os because they are the slowest operation, and it will scale well as both workload and system capacity are increased. This means that in reality such an Oracle system is CPU bound - transactions are limited by how fast the CPUs are, and not by how fast the disks are. Adding more transactions by increasing the workload will require more CPU capacity to maintain current response times.

Using CPUs that expose multiple CPU Threads on shared CPU execution cores will result in poor scalability under increasing workloads, and a significant increase in response times when the used CPU capacity reported equals the number of physical CPU cores. This is exactly what happened in my tests - up to 50% reported CPU utilisation the elapsed times of factor remained constant, but beyond 50% reported CPU utilisation the elapsed times increased in proportion to the number of concurrent factor's running.

Personally I would either avoid all such multi-threaded CPUs, or would switch off all the extra Threads leaving only one CPU Thread per CPU Core. That way I know that the operating system is reporting "real" CPU capacity and utilisation, and that there is no hidden knee in the response time as I increase the CPU usage under increasing workloads.



Source Code
The factor test is just running factor with input redirected from a list of very large prime numbers:
factor < factors 
I can run this and time how long it takes. I run multiple factors by putting the same command into a file and running that. Each factor runs at the same time (& runs it as a separate, child process).

#!/bin/sh
#
factor < factors &
factor < factors &
#
wait
The list of factors is the same set of large prime numbers repeated 10 times in the file:
999999999989
99999999999973
99999999999971
99999999999959
99999999999931
99999999999929
99999999999923
99999999999853
99999999999829
99999999999821
99999999999797
99999999999791
99999999999701
99999999999673
99999999999503
99999999999481
99999999999469
99999999999467
99999999999463
99999999999457

Tuesday 28 June 2011

Histograms are not what I assumed

One of the themes I'll keep coming back to is how things are often not how you have assumed they are, and how there are always opportunities to learn something new about Oracle. I am quite familiar with histograms in Oracle from reading the manuals and white papers, but have always left it to Oracle to decide what columns histograms were needed. This is mainly due to the databases I have worked on not being large or complicated enough to need further analysis so far. But knowing the day would come when I would have to do something I thought I would review what I knew, and in doing so learnt a few new things - or rather realised that some of my assumptions had been incorrect.

Rather than re-read the manual I decided to go to a better source of knowledge rather than just raw information - Jonathan Lewis's Cost Based Oracle Fundamentals. I already knew that Oracle had 2 types of histogram - Frequency and Height Balanced. And I knew that Frequency stored a row count per individual data value, while a Height Balanced stores a row count per range of values (strictly the row count is the same per range, and it is the data range size that varies). And I knew that the Optimizer in Oracle used this extra row count data in the histogram to calculate better estimates for row counts that would match constraints in queries. In reading the chapter on Histograms and working through the examples and doing my own tests I realised that some of my assumptions had been wrong about how Oracle uses these histograms.

Assumption 1 - Oracle uses all the entries in a histogram to calculate its row count estimate

I knew this to be true for a Frequency Histogram, and had assumed something similar for a Height Balanced one. But I was wrong. In a height balanced histogram although Oracle may have up to 254 buckets, it only uses the ones that contain the same end point values to identify "popular values" - those that occur more frequently than the width of a single bucket, and so span more than one bucket in size. For all other values it ignores the histogram completely and uses the column level Density statistic in the same manner as if there were no histogram.

In other words there is a threshold, being the number of rows covered by a single bucket. For data values that occur more often than this and are also recorded in two or more buckets in a height balanced histogram, the Optimizer uses the number of buckets to calculate the estimated row count, along with the number of rows per bucket. For all other values Oracle assumes a uniform distribution and uses the column level Density statistic. This will not be the same as one over the Number of Distinct Values (NDV), but is calculated differently to remove the effect of the rows for those popular values i.e. the Density is lowered to the average of the "unpopular" values.

Assumption 2 - A Frequency Histogram has an entry for all data values occurring in the table

This is true if either the table is relatively small, or you force Oracle to read all the data rows when gathering statistics. If you leave Oracle to use its default sampling method and you have a large table then some values may not be sampled, and they will be missing from the Frequency Histogram produced.

What Oracle does is to calculate and store a Density value for the column that is half that of the least frequent occurring value in the histogram. Values that appear in query constraints that do not appear in the Frequency Histogram are therefore assumed to occur at half the row count of the value with the smallest row count in the histogram. So again, the Optimizer may end up not using a histogram that exists and instead use the Density statistic of the column when executing a particular query.

None of this is new, and I was able to double check this via various other sources. It was just another thing to add to the list of assumptions I've made in the past that turn out not to be true. Even with histograms in place there is a reasonable chance that the Optimizer will actually not be using the histogram itself but instead the Density statistic value of the column. And also changing the value of the Density for the column can have an impact on queries, even when there is a histogram on that column.

Thursday 16 June 2011

Hinting - Lessons Learnt

Following on from my previous post about some SQL that needed to be tuned, I thought I'd summarise some additional important lessons I've learnt about using hints. I already knew quite a lot about hints in Oracle from reading different sources, and to avoid them as much as possible. In spite of this I have still learnt a few new things about using hints in practise. I know that others have said most of this before, but it bears repeating because it really does affect whether any hints do work or not.
  1. Name all your query blocks using the qb_name hint. This clarifies things both in terms of how Oracle itself reports back your execution plan, and in terms of how you specify your hints.
  2. Qualify all object references with query block names e.g. t1@mainq. Again, this is how Oracle reports object references, and it is precise in terms of where the hint should be applied.
  3. Check your hints are being used with the 'outline' option to dbms_xplan.display_cursor. If the hints being listed are the same ones that you used, then all is well. If not, then it is likely that some of your hints are actually being ignored.
  4. Test the hints individually. This follows on from the previous point about proving the hint is recognised and honoured by Oracle. It is possible that a similar hint is being automatically produced by the Optimizer as a result of another hint you are using. In my case it looks like a USE_NL hint was being ignored, but a Nested Loop was produced anyway because of an INDEX hint.
  5. Include all other relevant hints, such as LEADING and UNNEST. Previously I would have assumed that these would be produced by the Optimzer automatically but Jonathan Lewis includes them in his hint sets so they must be relevant.

Thursday 9 June 2011

Customers who won't wait to be helped

[I'm back. I haven't posted for a long while due to a combination of being busy and somewhat losing focus on the blogging side, but I hope to post more frequently in the future. I was going to start with a post on histograms in Oracle but ...]

I work for a software house, whose heavyweight application runs on top of Oracle. Earlier this week I was dealing with a customer who had poor performance on just one SQL query after having updated their statistics. I spent a day analysing the SQL and the data - what was it trying to do and why - and another day trying different solutions to make it use a different execution plan and run faster for them. I then sent the results to the support people dealing with this customer so that the solution could be implemented. Imagine my surprise then when I see today that Jonathan Lewis has a blog post about the same SQL statement, and makes reference to a post on one of the Oracle Technical Forums.

I'm surprised that a customer would post SQL from a third party application on the web like that, and that they would somehow expect a better answer to come back than from the software vendor themselves. I was even more surprised because they were asking for a set of hints to change the execution plan, and I had just come up with such a set of hints just 2 days earlier. Okay, maybe this had not been forwarded to the customer yet, because our Support department was doing further testing on it first. So it is highly likely that the customer had not seen my solution yet. But I'm still amazed that a customer would be willing to put into production some recommendation they had got off a web forum, and on the basis that they could not work out a better solution themselves - the poster did not know how to do hints properly.

In terms of the problem itself, the cause of the poor performance is actually the extreme skew within the data in this table. On average most of the date values occur thousands of times, but the particular value used in the query only occurs about ten times in the table. Hence another execution plan is faster for this particular value.

Which brings me back to histograms, which I was planning on doing a completely different post on anyway. There is a histogram on the constrained column, but it is a height balanced histogram because there are over 254 distinct values, and so there are no statistics on row counts of individual values other than the most popular values. The value used in the query is unpopular, and the average number of rows per value across all the unpopular values is in the tens of thousands. Hence the choice of an execution plan using a HASH JOIN, as that scales in a different way to higher row counts than a NESTED LOOP join does.

Maybe Oracle should introduce a new type of histogram? One that records both the most popular values, and the least popular values, and then stores a single average for all the values in between (the density). That would have helped here. It certainly seems the case that Oracle does not handle heavily skewed data sets well, though of course you should never expect a "one size fits all" solution to be perfect for all possible scenarios. What the Optimizer does is try and produce an execution plan that is the best for most of the possible input values. Which is what it did here, based on the statistics available for the average number of rows per date value via the density statistic.

Another viable solution is to make the Optimizer believe that the date constraint will match far fewer rows and so choose the NESTED LOOP join method itself. This is the approach put forward by Wolfgang Breitling in his paper on Tuning by Cardinality Feedback, in which he suggests changing the Density statistic stored for the column. And indeed reducing the density value by a factor of 100 has this effect. The upside of this approach is that it avoids the need to change the SQL or use hints, which do not always work as intended.

Monday 7 March 2011

I hate Agile Development

I really hate, loath and detest "Agile Development". At least I hate how they use it where I work. For the record I actually agree with most of the principles and points of Agile Development. I'm pretty sure I have said so before. But what they do in the name of Agile Development where I work is really terrible.

The list of "crimes" conducted by the development teams against databases and development in general by their "use" of Agile methods continues to get longer and longer. Lately I have had to deal with a number of such crimes and their consequences.
  • One group went ahead with development against nothing more than a strawman design, that in turn was for only verbally stated requirements. No one could agree in a meeting I attended what the actual requirements were. The number of times someone said "I assumed he meant ..." was incredible.
  • Another group designed and implemented their own tables, but now have no time to have their database design reviewed before the release deadline. It "must" be delivered as it is, because it is too late to change anything.
  • Another group simply did not bother to document anything. When asked at how they arrived at their final solution and the analysis they went through, they could not explain or justify anything.
  • All of the groups have used the phrase "we left that for another sprint" for some major, critical piece of functionality. They are simply ignoring anything that is too difficult.
  • Iteration speed is more important than getting it right or meeting all the requirements. The fact that the speed of iteration is introducing more errors that need to be corrected by future sprints never gets flagged by anyone. I am sure the quality of what is being delivered is going down, but everyone else is happy that we now have some more buttons to click on the user facing screens and forms.
  • Design and code reviews are simply ignored. The only measurement seems to be the volume of exposed and visible functionality delivered. Whether it works or not is never really considered. By that time the developers have moved onto the next sprint, and the next piece of functionality to be delivered. The push is for additional functionality added to the application software at all costs.
  • Many application level requirements are simply ignored in early development sprints, just to deliver some usable functionality as early as possible. The fact that the initial design and development cannot be extended in the future to meet the ignored requirements is itself simply ignored.
I really want to like Agile, but the things done in the name of Agile are truly horrible. The attitude of simply ignoring critical dependencies and then expecting the rest of the world to reorganise itself around their particular problem and solve it for them is amazing. And if you don't put everything down immediately and jump , you are accussed of being the blocker who is holding up development and stopping delivery of the wonderful software they have all developed. The fact that it is seriously crippled and will never work properly is somehow never their problem. And the most common answer to my any question I ask about the development they did is "we never had time to do that (so we didn't)".

I now term Agile the "ignore it today, it will be someone else's problem tomorrow" development methodology. I really do think that many of the developer's believe in a world of magic fairies or pixies that will sprinkle magic Agile dust everywhere in the middle of the whole development, and all of their ignored issues will just disappear and go away. Unfortunately, everyone else is living and dealing with the real world. Have you ever tried living in a house built and delivered one room at a time, all with separately laid foundations, and installed plumbing, electrics, windows and doors? It is a mess, and it gets worse over time as more and more rooms are added on, because there was never an overall design, and no one ever bothered to think about future requirements until they had to deliver them.