Tanel Poder has been doing a series of posts on "RAM is the new disk", trying to show how the new "Oracle Database In-Memory" feature in Oracle 12c uses memory in a different and more efficient way than the normal, traditional buffer cache does. He hasn't finished the series of posts yet, but I started to draw some of my own conclusions from the data he published in his last post, and I thought I'd publish my thoughts.
I know a reasonable amount about how CPUs work internally, so I assumed that the main performance improvement of Oracle Database In-Memory would come from a better memory access pattern that significantly reduces the number of execution stalls inside the CPU. Tanel is one of the good Oracle bloggers, and in this series he has been sharing the details of the tests he has been doing as well as their results, so I can check whether my hypothesis about the new Database In-Memory holds up.
Tanel's Tests
The main results are in this post, where he gives the 6 SQL queries he ran and their results in terms of elapsed time and CPU execution cycles - some against the traditional buffer cache and some against the new In-Memory cache. No disk accesses were involved, as all the data was already in memory in the Oracle SGA.
I'm only interested in the main 2 SQL queries, which are identical to each other except that the first gets its data from the traditional data-block-based buffer cache in memory while the second gets its data from the new In-Memory column-storage-based cache, i.e. both execute the same "SELECT" against the same table, and all of the data needed is already in memory in the SGA.
3. FULL TABLE SCAN BUFFER CACHE (NO INMEMORY)

```sql
SELECT /*+ MONITOR FULL(c) NO_INMEMORY */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_id > 0;
```

4. FULL TABLE SCAN IN MEMORY WITH WHERE cust_id > 0

```sql
SELECT /*+ MONITOR FULL(c) INMEMORY */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_id > 0;
```
Tanel has also made all the low-level CPU execution measurements available in a Google online spreadsheet. Here is a cut-and-paste of the CPU measurements for just the 2 queries I am interested in:
| Metric | TABLE BUFCACHE | TABLE INMEMORY PRED |
| --- | ---: | ---: |
| task-clock-ms | 27374 | 1578 |
| cycles | 86428653040 | 4573793724 |
| instructions | 32115412877 | 7080326242 |
| branches | 7386220210 | 940579984 |
| branch-misses | 22056397 | 4637243 |
| stalled-cycles-frontend | 76697049420 | 2251325295 |
| stalled-cycles-backend | 58627393395 | 1328333827 |
| cache-references | 256440384 | 11507915 |
| cache-misses | 222036981 | 7316366 |
| LLC-loads | 234361189 | 9712269 |
| LLC-load-misses | 218570294 | 7272805 |
| LLC-stores | 18493582 | 1697666 |
| LLC-store-misses | 3233231 | 27797 |
| L1-dcache-loads | 7324946042 | 1069917316 |
| L1-dcache-load-misses | 305276341 | 85368159 |
| L1-dcache-prefetches | 36890302 | 25169253 |
The most obvious observation is that the In-Memory query execution is 17.34 times faster than the traditional buffer cache query execution (elapsed task-clock-ms time ratio of 27374 / 1578). I'm most interested in trying to explain what has caused this difference in elapsed time.
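For reference, here is a small Python sketch (mine, not Tanel's) that recomputes this speedup, and the other derived figures used through the rest of this post, directly from the raw counters above:

```python
# Raw perf counters from Tanel's spreadsheet for the two queries of interest.
bufcache = {"task_clock_ms": 27374, "cycles": 86428653040,
            "instructions": 32115412877, "stalled_backend": 58627393395,
            "l1d_prefetches": 36890302}
inmemory = {"task_clock_ms": 1578, "cycles": 4573793724,
            "instructions": 7080326242, "stalled_backend": 1328333827,
            "l1d_prefetches": 25169253}

# Overall speedup and the reduction in instructions executed.
print(bufcache["task_clock_ms"] / inmemory["task_clock_ms"])  # ~17.34x faster
print(bufcache["instructions"] / inmemory["instructions"])    # ~4.54x fewer instructions

# Per-query efficiency metrics discussed in the sections below.
for name, c in (("bufcache", bufcache), ("inmemory", inmemory)):
    ipc = c["instructions"] / c["cycles"]               # instructions per cycle
    stalls = c["stalled_backend"] / c["cycles"]         # backend stalls per cycle
    prefetch = c["l1d_prefetches"] / c["instructions"]  # prefetches per instruction
    print(f"{name}: ipc={ipc:.2f} stalls/cycle={stalls:.2f} prefetch/instr={prefetch:.5f}")
```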
My Hypothesis & CPU / Memory Access
At
the beginning of this post I stated that I assumed that the main
performance improvement would be a result of a reduction in the number
of execution stalls inside the CPU. Let me try and explain what I mean
by that.
In the very old days of Intel CPUs (from the original 8086 up to the first '486, I think) the internal clock of the CPU (the rate at which it executed instructions) was the same as the external clock of the system, which also governed memory access times. A read of a memory location by the CPU would take something like 3 or 4 external clock cycles (put the address on the system bus, wait for the memory to process it and produce its output, then put the data on the bus back to the CPU). During these clock cycles the CPU cannot proceed with the execution of that instruction, i.e. it stalls until the data it needs from memory has been read into the CPU.
During the '486's lifetime Intel introduced technology that allowed the internal clock of the CPU to run faster than the external clock, i.e. at a multiple of the external system clock. Today we typically have external clocks of 100 MHz and CPU internal clocks of 3 GHz and more, i.e. the CPU runs internally at 30 or more times the external system clock rate.
A side effect of this change is that when a CPU now "stalls" to read needed data from memory, the stall lasts much longer in terms of CPU instruction cycles. As the external clock is now 30 or more times slower than the internal clock of the CPU, the CPU may end up doing nothing for over 100 execution cycles. NB. This is a simplification of what really happens, but it does describe a real effect.
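To put rough numbers on that (illustrative figures from the paragraphs above, not measurements from Tanel's system):

```python
external_clock_hz = 100e6   # typical external/system clock
internal_clock_hz = 3e9     # typical CPU internal clock
multiplier = internal_clock_hz / external_clock_hz  # ~30x

# A memory read of 3-4 external clock cycles, seen from inside the CPU:
memory_read_external_cycles = 4
stall_internal_cycles = memory_read_external_cycles * multiplier
print(stall_internal_cycles)  # ~120 internal cycles spent doing nothing
```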
We can see this effect in the separately reported
statistics for "CPU cycles" and "CPU instructions completed" - the first
one will be much higher than the second one. The "efficiency" of how a
CPU executed a given piece of program code can be seen from the
calculated "instructions per cycle" value i.e. "CPU instructions
completed" divided by "CPU cycles". The closer this value is to 1 the
more efficient the execution has been within the CPU. The lower this
value is then the more "stalls" occurred within the CPU during execution
that stopped it completing execution of an instruction every clock
cycle.
Of course CPU designers have tried to work around or reduce the effect of such stalls in various ways, the details of which are not relevant here. It just means that a) there is various technology in the CPU to try to minimise the occurrence and impact of such "memory load stalls", and b) the net impact of such "memory load stalls" is much less than the 100 or so wasted execution cycles I've described. In fact, and this is also important, modern CPUs can actually issue more than one instruction for execution within the same clock cycle, e.g. an integer operation and a floating point operation. This can result in an "instructions completed per clock cycle" value larger than 1 when executing well optimised program code.
If my hypothesis is correct then we
should see a difference in the number of "stalls" in the CPU execution
statistics between the 2 SQL query executions, and the ratio between
these 2 sets of "stall" statistics should be close to the observed ratio
in execution times.
My Initial Conclusions
To cut to the
chase a bit, the observed difference in execution times between the 2
SQL queries is not wholly explained by just a reduction in the number of
"stalls" inside the CPU, but the reduction in "stalls"
is
a significant contributor to this reduction in elapsed time. The
reduction in elapsed time is roughly a 50:50 split between a reduction
in the total number of instructions executed (code efficiency) and CPU
memory access stalls (memory access efficiency).
We can see from the measurements that Tanel has reported that the number of instructions executed by the CPU decreased by a factor of 4.54 (32115412877 / 7080326242) to process the same amount of data (rows in the table). This is a significant improvement in code efficiency for processing this volume of data - probably a result of the In-Memory cache using far simpler internal data structures than the traditional buffer cache, ones that can in turn be processed much more easily.
This leaves an improvement factor of 3.82 (17.34 / 4.54) to be explained by something else. I propose that this other factor is due to a reduction in the number of "stalls" within the CPU when executing these instructions, as described earlier. This has the net effect of increasing the "instructions per cycle" value during the execution of the SQL query, and indeed we can see this in the measurements reported by Tanel:
- Buffer cache SQL query execution has CPU instructions per cycle of 32115412877 / 86428653040 = 0.37
- In-Memory cache SQL query execution has CPU instructions per cycle of 7080326242 / 4573793724 = 1.55
- This gives an improvement ratio of 1.55 / 0.37 = 4.17
And this value is very close to the 3.82 value for the other performance improvement factor.
Trying
to drill down further into this difference in "instructions per cycle"
and identifying "CPU memory stalls" due to data access is difficult if
not impossible for a number of reasons:
- Above the L1 cache on the CPU, the other CPU caches are shared by both instructions and data. We cannot tell which cache reads and cache misses were due to fetching an instruction to execute and which were due to fetching a piece of data needed by an instruction being executed. Both lead to CPU stalls, but for different reasons.
- The In-Memory cache SQL query executes a completely different code path within Oracle, with almost 78% fewer instructions executed in total (only 100 / 4.54 ≈ 22% as many). So we cannot assume that the instruction execution profile, and hence the instruction cache misses, are similar in both cases.
We can, however, see other evidence that this second improvement factor is due to a reduction in CPU stalls. Tanel has reported two other CPU execution statistics: "stalled-cycles-frontend" and "stalled-cycles-backend". Tanel gives descriptions for these in his part 2 post, which I will summarise as follows:
- Front-end stalled cycles occur before an instruction gets executed by the CPU, and are generally due to fetching an instruction from memory
- Back-end stalled cycles occur during instruction execution, and are generally due to fetching data from memory needed by that instruction
So the most relevant statistic for stalls due to reading data from memory is "stalled-cycles-backend". We cannot compare these directly between the two SQL query executions because they executed different numbers of instructions (remember the 4.54 reduction factor). What we can do is normalise them to a "stalled cycles per cycle" value and then compare:
- Buffer cache SQL query execution has backend stalls per cycle of 58627393395 / 86428653040 = 0.68
- In-Memory cache SQL query execution has backend stalls per cycle of 1328333827 / 4573793724 = 0.29
- This gives an improvement ratio of 0.68 / 0.29 = 2.34
Again,
this is a significant improvement and shows that the number of stalls
due to memory accesses by the CPU for data has been reduced.
Why Have Stalls Reduced?
I hope that I have shown that part of the improvement in the performance of the In-Memory cache comes from a reduction in the number of "memory access stalls" in the CPU when executing the SQL query, i.e. less time wasted waiting to read data in from main memory, with a corresponding increase in the number of instructions executed per clock cycle and less total elapsed time to execute those instructions. But how has Oracle actually achieved this? How has it managed to reduce the number of "memory access stalls" while executing the same SQL query, i.e. if it needs to read the same "set of data" to perform the same SQL query, how has it managed to read it more efficiently?
I am guessing that this is due to two specific factors:
- Data layout in memory - the layout is more "compact" which decreases the total amount of memory that must be read
- Sequential
access to that memory, causing the CPU to "prefetch" the next memory
line into the CPU cache, so it is already there when it is needed i.e.
overlapping the processing of one memory line with the fetching of the
next memory line
Note that this is based on what I have read about In-Memory, as I've not yet had a chance to do any real testing of it. Again, thanks to Tanel for both doing some real testing and for sharing all the results he got from those tests.
In the traditional buffer cache a data block will span multiple memory lines (assuming a 64-byte memory cache line, as Tanel states), and that block contains data for all the columns in a row. When you only need the data from one column, it means that you are skipping about in memory quite a lot, jumping from row to row, and not using most of the data in each block or memory line. In the new In-Memory cache the data is stored by column, so all the data for one column is stored together, next to each other in memory. This immediately reduces the total amount of memory that must be read to access the data for one column across all the rows in the table.
It will also increase the cache efficiency within the CPU, because one memory line in the cache will contain data for multiple rows, i.e. the CPU can process that column's data for multiple rows before it needs to read another memory line into the cache to get the next set of rows' data.
If the data for a column is physically laid out sequentially and contiguously in the system's memory, then Oracle will be doing a series of sequential memory reads to process all the data for one column. Modern CPUs can detect this kind of sequential memory access and have an optimisation to "prefetch" the next data line from memory while the previous data line is still being processed. The net result is that stalls are either eliminated, because the next line of data is already in the CPU cache, or their impact is significantly reduced, because the data is already on its way from memory to the CPU cache.
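As a rough illustration of this access-pattern difference, here is a sketch using numpy arrays as stand-ins for the two layouts (my example, not Oracle code; the table shape and sizes are made up):

```python
import time
import numpy as np

n_rows, n_cols = 2_000_000, 8  # hypothetical table with 8 numeric columns

# "Buffer cache" stand-in: all of a row's columns adjacent in memory,
# so scanning one column strides over the other seven.
row_store = np.random.rand(n_rows, n_cols)   # C order: rows contiguous

# "In-Memory column store" stand-in: one column contiguous on its own.
col_store = np.ascontiguousarray(row_store[:, 0])

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.4f}s (sum={result:.1f})")

# Strided scan: 64 bytes between consecutive values (8 columns x 8 bytes),
# i.e. a new cache line for every row, defeating sequential prefetch.
timed("row store   ", lambda: row_store[:, 0].sum())

# Contiguous scan: sequential cache lines the hardware prefetcher can stream.
timed("column store", lambda: col_store.sum())
```

On most machines the contiguous scan comes out noticeably faster, for exactly the reasons described above.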
We can see evidence for this occurring in the "L1-dcache-prefetches" CPU statistic. The "L1 data cache" is the first on-chip cache immediately used by the CPU core, and is small but very fast. All data needed by the CPU goes through this cache. Again, normalising to a "prefetch per instruction executed" value, we can see that the number of data prefetches by the CPU increased significantly during the In-Memory cache SQL query execution:
- Buffer cache SQL query execution has L1 dcache prefetch per instruction executed of 36890302 / 32115412877 = 0.00115
- In-Memory cache SQL query execution has L1 dcache prefetch per instruction executed of 25169253 / 7080326242 = 0.00355
- This gives an improvement ratio of 0.00355 / 0.00115 = 3.09
The End
Well, that's been a longer post than I intended, but I wanted to get everything together into one post, as I'm sure Tanel will finish his series soon and probably make similar observations. I also wanted to see whether he would come to exactly the same conclusions or slightly different ones, and writing this post first was the only way to do that. Thanks again to Tanel for doing the tests and sharing all the data with us.