Tuesday, 19 July 2011

Beware: CPU Threads are not the same as CPU Cores #1

Craig Shallhamer recently did a number of posts about reporting CPU usage from the data Oracle was gathering, and in this he made a distinction between CPU Cores and CPU Threads. His focus was on the correlation between the values Oracle was reporting and those reported by other operating system tools - were they the same and could you directly use the values reported by Oracle? (His conclusion was that they were the same). What he didn't go into was the difference between CPU Cores and CPU Threads, and what the "real" CPU capacity of each type is. I think this is an important issue for anyone dealing with performance on modern computer systems, as the two types of "multi-CPU" are not the same, and they exhibit radically different scalability behaviour. And what compounds this is that most operating systems do not distinguish between the two types of CPU, and so will misreport (or lie if you want to) how much CPU capacity you have and how much is being used.

The purpose of this post is to try and bring out this distinction between the two types of "multi-CPU" in a single processor chip, and show that multi-threaded CPUs do not behave as expected, leading to unexpected and negative scalability results under increased workloads. The wrong time to discover that the reported "spare, unused CPU capacity" in your system is wrong is when you experience an increase in your transaction volume, and everything just runs slower rather maintaining the same response times.

Some definitions:
  • A CPU Core or Execution Core is a physical piece of silicon in the CPU chip that actually executes instructions on data values.
  • A CPU Thread is a virtualized thing that in reality sits on top of an underlying CPU Core.
The key difference is that because of the virtualization, multiple CPU Threads will share a single CPU core. Thus I might have 2 CPU Threads but only 1 CPU Core. If I have 2 CPU Cores then they can execute 2 instructions simultaneously, independently and in parallel. I have 2 times the capacity of a single CPU core. If I have 2 CPU Threads sharing a single CPU Core, then it can still only execute 1 instruction at a time. This is true regardless of the number of Threads being virtualized on the same CPU core - there is only one CPU core and it can only execute one instruction at a time. The virtualization layer may let it switch between the different Threads very often, so that they all have some of their instructions executed, but it is always only executing one instruction at a time.

If I have 2 CPU Cores I can get twice the work done in a period of time. But if I have 2 CPU Threads sharing the same physical CPU Core then I can get no more work done in a period of time than if there were only 1 CPU Thread. And I will show this in a moment.

Most operating systems though don't distinguish between these two types of multi-CPU flavours. They report CPU Cores and CPU Threads as if they each had an equal capacity to do work. Which means that most operating systems are over reporting the available CPU capacity if the underlying CPUs use CPU Threads on shared CPU Cores.

Tests
A solid example. I have a laptop with an Intel Core i3 M350 CPU in it. This has 2 physical CPU Cores, and on each of these it has 2 CPU Threads. This gives a total of 4 exposed CPUs, but on only 2 CPU cores. The laptop is running Linux (XUbuntu) and I have installed SAR via the sysstat package. /proc/cpuinfo shows 4 CPUs (processors 0 to 3), and clearly this is used by SAR in calculating CPU capacity and CPU utilisation.

I have a simple CPU test using the factor utility, which prints out the prime factors of any number. Any CPU only test will do, and factor is a CPU bound utility. Feed it a list of very big numbers and it will use up some CPU testing prime factors. By using the same numbers I can repeat this any time I want to. How will this behave when I run more instances of this test at the same time? The system says that it can see 4 CPUs. But I know that the Core i3 has only 2 physical CPU Cores to execute instructions on.

The only fly in the ointment is that because this laptop is running X Windows and I am typing in my results to it there is a small constant CPU overhead of around 5%. So the system is not 100% idle when I run these tests. I mention this to be as honest and open as I can be about the tests I did.

Running factor once against a list of large numbers (see elsewhere) and using "time" to measure its elapsed time, takes 7.779 seconds and produces the following SAR output:
12:49:44        CPU     %user     %nice   %system   %iowait    %steal     %idle
12:49:46 all 0.74 0.00 0.74 0.00 0.00 98.51
12:49:48 all 1.74 0.00 0.62 0.00 0.00 97.64
12:49:50 all 10.88 0.00 1.64 0.00 0.00 87.49
12:49:52 all 27.66 0.00 2.78 0.00 0.00 69.57
12:49:54 all 27.64 0.00 2.67 0.48 0.00 69.21
12:49:56 all 27.79 0.00 2.55 0.73 0.00 68.93
12:49:58 all 15.12 0.00 2.03 0.00 0.00 82.85
12:50:00 all 3.29 0.00 3.06 0.94 0.00 92.71
As expected, because Linux believes that there are 4 CPUs in the system and I have run one CPU bound task it reports that it is using 25% of the CPU capacity. "time" reported that 7.440 seconds of user CPU time was used. System CPU time is always negligible in these tests.

When I run 2 of these factor tests at the same time they take 8.034 seconds - about the same elapsed time allowing for other factors affecting this measurement.
12:52:16        CPU     %user     %nice   %system   %iowait    %steal     %idle
12:52:18 all 2.09 0.00 1.72 0.00 0.00 96.19
12:52:20 all 0.61 0.00 0.49 0.00 0.00 98.90
12:52:22 all 25.73 0.00 3.88 0.85 0.00 69.54
12:52:24 all 52.32 0.00 3.58 0.72 0.00 43.38
12:52:26 all 51.49 0.00 3.93 0.00 0.00 44.58
12:52:28 all 51.57 0.00 3.13 0.96 0.00 44.34
12:52:30 all 22.18 0.00 1.72 0.00 0.00 76.10
12:52:32 all 1.25 0.00 0.25 1.50 0.00 97.01
12:52:34 all 1.10 0.00 0.74 0.00 0.00 98.16
As expected, SAR is reporting 50% CPU utilization - double our 25% before - and still around 45% CPU capacity unused (idle).

Note also that "time" reports that 15.465 seconds of user CPU time was used by these 2 factor tests. This ratio of 2:1 between CPU used and elapsed time shows that the 2 factors were indeed running simultaneously in parallel.

Next I run 4 of the factor tests together. How long will they take - still 8 seconds? And what will SAR report?

In fact they took 13.555 seconds, with time reporting 47.991 user CPU seconds. SAR reported 100% utilisation:

12:56:40 CPU %user %nice %system %iowait %steal %idle
12:56:42 all 0.70 0.00 0.58 0.00 0.00 98.72
12:56:44 all 0.12 0.00 0.74 1.74 0.00 97.39
12:56:46 all 22.10 0.00 1.85 0.12 0.00 75.93
12:56:48 all 96.62 0.00 3.38 0.00 0.00 0.00
12:56:50 all 96.88 0.00 3.12 0.00 0.00 0.00
12:56:52 all 96.62 0.00 3.38 0.00 0.00 0.00
12:56:54 all 97.00 0.00 3.00 0.00 0.00 0.00
12:56:56 all 96.00 0.00 4.00 0.00 0.00 0.00
12:56:58 all 96.88 0.00 3.12 0.00 0.00 0.00
12:57:00 all 26.37 0.00 1.74 0.87 0.00 71.02
12:57:02 all 0.25 0.00 0.62 0.99 0.00 98.14
12:57:04 all 0.62 0.00 0.86 0.00 0.00 98.52

Now, if there were 4 real CPU cores I would not expect the elapsed time to increase from about 8 to 13.5 seconds (over 50% increase). Furthermore, something is definitely wrong with the reported CPU usage - 47.991 seconds (nearly 48) for 4 compared to 15.465 (15.5) for 2 factors. I'll come back to this in another post (the operating system is actually measuring allocation of CPU time, not real usage of CPU cycles). The fact that one factor used 7.440 seconds of CPU means that four factors should not use more than about 30 seconds (7.5 * 4). So 48 seconds is way off the mark.

For completeness 3 factors together take 10.946 seconds, with 29.738 CPU seconds, and SAR said about 75% utilisation (75% for factor + 5% for other background processes):

13:00:24 CPU %user %nice %system %iowait %steal %idle
13:00:26 all 1.24 0.00 0.99 0.12 0.00 97.65
13:00:28 all 1.70 0.00 0.73 0.85 0.00 96.72
13:00:30 all 21.98 0.00 2.56 0.00 0.00 75.47
13:00:32 all 79.15 0.00 4.49 0.62 0.00 15.73
13:00:34 all 79.72 0.00 3.50 0.00 0.00 16.77
13:00:36 all 79.75 0.00 3.88 0.00 0.00 16.38
13:00:38 all 74.63 0.00 2.74 0.00 0.00 22.64
13:00:40 all 52.11 0.00 3.49 1.32 0.00 43.08
13:00:42 all 12.07 0.00 1.52 0.00 0.00 86.40
13:00:44 all 2.31 0.00 0.85 1.33 0.00 95.51
13:00:46 all 1.43 0.00 0.95 0.00 0.00 97.62
From this I draw the following conclusions:

First, as I described, CPU Threads are virtual and the real measure of available CPU capacity is CPU cores. That is the "real CPU capacity available", regardless of the number of threads on top of the physical CPU cores.

This is shown by the increase in elapsed time when running more than 2 factors together. If there really were 50% CPU capacity unused and idle then there should be no significant increase in elapsed time for 3 or 4 factors together compared to only 1 or 2.

Interestingly 8 factors took 26.790 seconds and "time" reported 1 minute 37.194 user CPU seconds (97.194 seconds). Both almost perfectly double the 4 factors test results, because the CPUs are fully saturated.

Second, the operating system and hence SAR is lying when it says that there is 50% CPU capacity unused and available when I ran 2 factors together. A "CPU Thread" is unfortunately not a real CPU, but is rather sharing a real CPU with other "CPU Threads".

Third, this is also important to Oracle based systems. A "perfect" or "ideal" Oracle system is both well tuned and running a well designed and written application. It will minimise disk I/Os because they are the slowest operation, and it will scale well as both workload and system capacity are increased. This means that in reality such an Oracle system is CPU bound - transactions are limited by how fast the CPUs are, and not by how fast the disks are. Adding more transactions by increasing the workload will require more CPU capacity to maintain current response times.

Using CPUs that expose multiple CPU Threads on shared CPU execution cores will result in poor scalability under increasing workloads, and a significant increase in response times when the used CPU capacity reported equals the number of physical CPU cores. This is exactly what happened in my tests - up to 50% reported CPU utilisation the elapsed times of factor remained constant, but beyond 50% reported CPU utilisation the elapsed times increased in proportion to the number of concurrent factor's running.

Personally I would either avoid all such multi-threaded CPUs, or would switch off all the extra Threads leaving only one CPU Thread per CPU Core. That way I know that the operating system is reporting "real" CPU capacity and utilisation, and that there is no hidden knee in the response time as I increase the CPU usage under increasing workloads.



Source Code
The factor test is just running factor with input redirected from a list of very large prime numbers:
factor < factors 
I can run this and time how long it takes. I run multiple factors by putting the same command into a file and running that. Each factor runs at the same time (& runs it as a separate, child process).

#!/bin/sh
#
factor < factors &
factor < factors &
#
wait
The list of factors is the same set of large prime numbers repeated 10 times in the file:
999999999989
99999999999973
99999999999971
99999999999959
99999999999931
99999999999929
99999999999923
99999999999853
99999999999829
99999999999821
99999999999797
99999999999791
99999999999701
99999999999673
99999999999503
99999999999481
99999999999469
99999999999467
99999999999463
99999999999457

4 comments:

Flado said...

Given your conclusion that hyperthreading is best disabled, a legitimate question arises - why then do CPU makers even bother with the feature?
Here's my take on that: because the CPU core is by far the fastest thing in a computer.
So, unless you run a program that can fit entirely in the CPU conveyor and only uses data that can fit in the CPU's registers, your CPU core (or rather, its ALU) will spend most of its time waiting for the other components of the system to do their job. Wouldn't it be nice if the CPU core could execute a few instructions from some other thread while waiting for the next instruction or data piece for the current thread to be delivered from RAM (or L2/L3 cache)? And this without a context switch?
That's what hyperthreading is about.

Now, I don't think the Oracle instance does a lot of prime factoring (or square roots, or fast Fourier transforms, etc.). In fact, the only truly CPU-intensive operation I can think of is a spin-wait on a latch - and we hope to reduce that. Hence, depending on your workload, hyperthreading might just be beneficial in an Oracle database environment.

Cheers,
Flado

John Brady said...

Flado,

I agree with the point about using otherwise wasted CPU cycles. I used to work for Sun and read up on their multi-threaded Niagara / T1 CPU at the time, so I "get" the idea.

My point is that the operating systems are blind to this and misreport how much real CPU capacity is available and how much is being used. And this inaccuracy in the CPU utilization report will lead to people coming to the wrong conclusion and making the wrong decision as a result.

You can see from my examples that the operating system reported that there was 50% CPU capacity unused. Yet when I ran another program there was no "real CPU capacity" available, and all the programs ran slower as a result.

I just wish that the operating systems got their information from the CPU itself, to give a true picture of how many cycles were being used, and for which processes / threads.

John

clarf said...

Hi,

This is a nice read about the real core vs Thread topic. It could be interesting to have the same test in the same machine with the HT feature disabled.

Although the conclusion is oriented to the OS measures, the misinterpretation of the reads and the use of this information for future scalability. I´m sure a new test on such conditions could tell something about the real performance gain of a HT feature on high parallelism systems.

I mean the scheduler module on the OS will behavior different and we could have actually a better performance on your test when 4 factors are executed.

Regards,
Andrew

clarf said...

Hi,

This is a nice read about the real core vs Thread topic. It could be interesting to have the same test in the same machine with the HT feature disabled.

Although the conclusion is oriented to the OS measures, the misinterpretation of the reads and the use of this information for future scalability. I´m sure a new test on such conditions could tell something about the real performance gain of a HT feature on high parallelism systems.

I mean the scheduler module on the OS will behavior different and we could have actually a better performance on your test when 4 factors are executed.

Regards,
Andrew