Tuesday 29 December 2015

GRUB, os-prober and Red Hat / Oracle Linux

I've been successfully using VirtualBox to have test environments to run Oracle Linux and Oracle Database in from some time, but there are limitations to what you can do. So I decided that I wanted to install Oracle Linux onto another disk partition on my PC so I could dual boot into it for some more advanced Oracle Database tests. Well the Oracle Linux installation itself went ahead trouble free - I just had to remember to disable GRUB bootloader installation as GRUB was already installed from my other Linux (Arch Linux) - but then I ran into some problems trying to get this newly installed Linux added properly to the main GRUB boot menu I was using. This post describes why this occurred (after much digging around on my part), and a very quick and simple solution for it.

Detecting other Linux installations on other disk partitions to add to the GRUB boot menu is done on Arch Linux by installing the "os-prober" package which adds some extra scripts used by "grub-mkconfig". The specific problem is that while "os-prober" did detect the Oracle Linux kernels, they were being added to the GRUB menu in the wrong order - it looked like an alphabetical ordering rather than a numeric ordering by kernel version number. This meant that the first Oracle Linux kernel listed in the GRUB menu was not the latest one installed, and in fact more likely to be the oldest one instead or a rescue kernel.

To cut a long story short the problem is due to a combination of the code in the "/usr/lib/linux-boot-probes/mounted/40grub2" detection script and the contents of the "/boot/grub2/grub.cfg" file in the Oracle Linux installation. The "grub.cfg" file in the Oracle Linux installation uses some keywords that are not detected by the "40grub2" script in Arch Linux, so the bootable Linux kernels are not listed in the same order as they are in the source "grub.cfg" file. Instead it is the "90fallback" script that detects the bootable Linux kernels when it is run afterwards by "os-prober". Actually it is run by "linux-boot-prober" and it does a direct listing of Linux kernel files in the "/boot" directory of the other Linux disk partition, and adds each of these to the local GRUB configuration file. And the result of this is that the other Linux kernels are detected and added in alphabetical order.

Details on the Problem

The "40grub2" script works by opening the "/boot/grub2/grub.cfg" file from another Linux installation and looking for the entries for bootable Linux kernels. The idea is that "40grub2" will find Linux kernels in the same order they are in the "grub.cfg" on the other Linux installation, and they will be added to the local "grub.cfg" file in the same order. The benefit of this method is that the first Linux kernel listed for this other installation in the main GRUB boot menu will be the same one as listed by the other Linux installation itself. Which in turn means that if it sorts the Linux kernels in any way or puts a specific Linux kernel first as the default in its "grub.cfg" configuration file, then this is also reflected in the local GRUB configuration file of my main Linux installation.

The "40grub2" script works by opening the "/boot/grub2/grub.cfg" file of the other Linux installation and then reads each line in turn looking for ones that begin "menuentry", "linux" or "initrd". I believe that these are "standard" keywords that GRUB should be using. Unfortunately Oracle Linux is using keywords of "linux16" and "initrd16" instead, which means that these lines are not matched at all by the "40grub2" script and no bootable Linux kernels are matched at all. It seems that Red Hat on which Oracle Linux is based uses these keywords for some obscure, historical reason or other. Either way, they are used and they do not match what "40grub2" is looking for.

Instead the bootable Linux kernels are detected by the "90fallback" script when it runs afterwards, and they are detected in alphabetical naming order as mentioned before.

Solutions

There is a quick, easy and good enough solution you can do yourself, and then there is a more official solution.

First, you can just manually edit your local "40grub2" file and change two lines in it. Add a "16" variation to the lines in the "case" block that test for "linux" and "initrd". Here is the output from "diff" showing the before (<) and after (>) versions of the two lines I changed.

67c67
<    linux)
---
>    linux | linux16 )
80c80
<    initrd)
---
>    initrd | initrd16 )
Once edited run "grub-mkconfig" again to regenerate your "grub.cfg" file, and it should correctly pick up those entries from the other Linux installation now.

Second, it does not look like there is actually an official solution, which can often be the case with open source software. I found some bug reports about this problem but there was some finger pointing going on both ways between the GRUB people and the Red Hat people. It looked like the GRUB people felt that the official keywords were "linux" and "initrd", so it was a Red Hat problem to solve; while the Red Hat people felt that "linux16" and "initrd16" were valid in a GRUB configuration file and did work so it was a GRUB problem with the "40grub2" script.

One person did raise the question on how the main Linux that is adding these entries to its local "grub.cfg" file should be treating these entries with the "16" suffix from the other Linux. Should it ignore them and just use the normal keywords in its own "grub.cfg" file, or should it use exactly the same keywords? The latter solution is a problem because the keywords found in the other "grub.cfg" file are NOT returned back to the "os-prober" script i.e. it is assumed they are only "linux" and "initrd". Making "40grub2" return these extra keywords as extra data fields would need a lot of changes in other places - both "40grub2" and "os-prober" at least, and possibly others too if there is a common format used for passing around information on bootable Linux kernels.

So you can see how something that looks simple can grow into something much bigger, and could have significant changes to something as important as GRUB. And GRUB is a very critical piece of software used at system boot time, so no "obvious solution" should be rushed through without a lot of extra thought and testing. Unfortunately I don't know when we will get any kind of "official solution" to this.

Monday 28 December 2015

Oracle In Memory and CPU Execution Efficiency

Tanel Poder has been doing a series of posts on "RAM is the new disk" trying to show how the new "Oracle Database In-Memory" feature in Oracle 12c uses memory in a different and more efficient way than the normal, traditional buffer cache does. He hasn't finished the series of posts yet, but I started to draw some of my own conclusions from the data he published in his last post and I'd thought I'd publish my thoughts.

I know a reasonable amount about how CPU's work internally, so I assumed that the main performance improvement of Oracle Database In-Memory would be from a better memory access pattern that would significantly reduce the number of execution stalls inside the CPU. Tanel is one of the good Oracle bloggers, and he has been sharing the details of the tests he has been doing in this series of blog posts as well as the results from them so I can see if my hypothesis on the new Database In-Memory holds up.

Tanels' Tests

The main results are in this post where he gives the 6 SQL queries he ran and their results in terms of elapsed time and CPU execution cycles - some against the traditional buffer cache and some against the new In-Memory cache. No disk accesses were involved as all the data was already in memory in the Oracle SGA.

I'm only interested in the main 2 SQL queries that are the same as each other, except the first one gets its data from the traditional data block based buffer cache in memory while the second gets its data from the new In Memory column storage based cache i.e. both execute the same "SELECT" against the same table and all of the data needed is already in memory in the SGA.
3. FULL TABLE SCAN BUFFER CACHE (NO INMEMORY)
SELECT /*+ MONITOR FULL(c) NO_INMEMORY */ COUNT(cust_valid)  
FROM customers_nopart c WHERE cust_id > 0;

4. FULL TABLE SCAN IN MEMORY WITH WHERE cust_id > 0
SELECT /*+ MONITOR FULL(c) INMEMORY */ COUNT(cust_valid) 
FROM customers_nopart c WHERE cust_id > 0;

Tanel has also made all the low level CPU execution measurements available in a Google online spreadsheet. Here is a cut and paste of the CPU measurements for just the 2 queries I am interested in:

MetricTABLE BUFCACHETABLE INMEMORY PRED
task-clock-ms273741578
cycles864286530404573793724
instructions321154128777080326242
branches7386220210940579984
branch-misses220563974637243
stalled-cycles-frontend766970494202251325295
stalled-cycles-backend586273933951328333827
cache-references25644038411507915
cache-misses2220369817316366
LLC-loads2343611899712269
LLC-load-misses2185702947272805
LLC-stores184935821697666
LLC-store-misses323323127797
L1-dcache-loads73249460421069917316
L1-dcache-load-misses30527634185368159
L1-dcache-prefetches3689030225169253

The most obvious observation is that the In-Memory query execution is 17.34 times faster than the traditional buffer cache query execution (elapsed task-clock-ms time ratio of 27374 / 1578). I'm most interested in trying to explain what has caused this difference in elapsed time.

My Hypothesis & CPU / Memory Access

At the beginning of this post I stated that I assumed that the main performance improvement would be a result of a reduction in the number of execution stalls inside the CPU. Let me try and explain what I mean by that.

In the very old days of Intel CPU's (from the original 8086 up to the first '486 I think) the internal clock of the CPU (the rate at which it executed instructions) was the same as the external clock of the system which also governed memory access times. A read from the CPU of a memory location would take something like 3 or 4 external clock cycles (put the address on the system bus, the memory to process that to its output, put the data out on the bus to the CPU). During these clock cycles the CPU cannot proceed with the execution of that instruction i.e. it stalls until the data it needs from memory has been read into the CPU.

During the '486 lifetime Intel introduced technology that allowed the internal clock of the CPU to run faster than the external clock i.e. at a multiple of the external system clock. Now we have external clocks of 100 MHz typically and CPU internal clocks of 3 GHz and more i.e. the CPU is running internally at 30 or more times the external system clock rate.

A side effect of this change is that now when a CPU "stalls" to read needed data from memory, the stall is for much longer in terms of CPU instruction cycles. As the external clock is now 30 or more times slower than the internal clock of the CPU, the CPU may end up waiting doing nothing for over 100 execution cycles. NB. This is a simplification of what really happens, but it does describe a real effect.

We can see this effect in the separately reported statistics for "CPU cycles" and "CPU instructions completed" - the first one will be much higher than the second one. The "efficiency" of how a CPU executed a given piece of program code can be seen from the calculated "instructions per cycle" value i.e. "CPU instructions completed" divided by "CPU cycles". The closer this value is to 1 the more efficient the execution has been within the CPU. The lower this value is then the more "stalls" occurred within the CPU during execution that stopped it completing execution of an instruction every clock cycle.

Of course CPU designers have tried to work around or reduce the effect of such stalls in various ways, the details of which are not relevant here. It just means that a) there is various technology in the CPU to try and minimise the occurrence and impact of such "memory load stalls", and b) the net impact of such "memory load stalls" is much less than the 100 or so wasted execution cycles I've described. In fact, and this is also important, modern CPU's can actually issue more than one instruction at the same time for execution within the same clock cycle e.g. an integer operation and a floating point operation. This can result in an "instructions completed per clock cycle" value larger than 1, which can occur when executing well optimised program code.

If my hypothesis is correct then we should see a difference in the number of "stalls" in the CPU execution statistics between the 2 SQL query executions, and the ratio between these 2 sets of "stall" statistics should be close to the observed ratio in execution times.

My Initial Conclusions

To cut to the chase a bit, the observed difference in execution times between the 2 SQL queries is not wholly explained by just a reduction in the number of "stalls" inside the CPU, but the reduction in "stalls" is a significant contributor to this reduction in elapsed time. The reduction in elapsed time is roughly a 50:50 split between a reduction in the total number of instructions executed (code efficiency) and CPU memory access stalls (memory access efficiency).

We can see from the measurements that Tanel has reported that the number of instructions executed by the CPU decreased by a factor of 4.54 (32115412877 / 7080326242) to process the same amount of data (rows in the table). This is a significant improvement in code efficiency to process this volume of data - probably a result of the In-Memory cache using far simpler internal data structures than the traditional buffer cache that can in turn be processed much easier.

This leaves an improvement factor of 3.82 (17.34 / 4.54) that needs to be explained by something else. I propose that this other factor is due to a reduction in the number of "stalls" within the CPU when executing these instructions as described earlier. This will have the net effect of increasing the "instructions per cycle" value during the execution of the SQL query, and indeed we can see this from the measurements reported by Tanel:
  • Buffer cache SQL query execution has CPU instructions per cycle of 32115412877 / 86428653040 = 0.37
  • In-Memory cache SQL query execution has CPU instructions per cycle of 7080326242 / 4573793724 = 1.55
  • This gives an improvement ratio of 1.55 / 0.37 = 4.17
And this value is very close to the 3.82 value for the other performance improvement factor.

Trying to drill down further into this difference in "instructions per cycle" and identifying "CPU memory stalls" due to data access is difficult if not impossible for a number of reasons:
  • Above the L1 cache on the CPU the other CPU caches are shared by both instructions and data. We cannot tell which cache reads and which cache misses were due to fetching an instruction to execute or fetching a piece of data needed by an instruction being executed. Both lead to CPU stalls but for different reasons.
  • The In-Memory cache SQL query is executing a completely different code path within Oracle with almost 78% fewer instructions being executed in total (100 / 4.54). So we cannot assume that the instruction execution profile and also instruction cache misses are similar in both cases.
We can however see other evidence that this other factor for the improvement in performance is due to a reduction in CPU stalls. Tanel has reported two other CPU execution statistics - "stalled-cycles-frontend" and "stalled-cycles-backend". Tanel gives descriptions for these in his part 2 post, which I will summarise as follows:
  • Stalled cycles front end occurs before an instruction gets executed by the CPU and is generally due to fetching an instruction from memory
  • Stalled cycles back end occurs during instruction execution and is generally due to fetching data from memory needed by that instruction
So the most relevant statistic for stalls due to reading data from memory is "stalled-cycles-backend". We cannot compare these directly between the two SQL query executions because they executed a different number of instructions (remember the 4.54 reduction factor). What we can do is normalise these to a "stalled cycle per cycle" value and then compare them:
  • Buffer cache SQL query execution has backend stalls per cycle of 58627393395 / 86428653040 = 0.68
  • In-Memory cache SQL query execution has backend stalls per cycle of 1328333827 / 4573793724 = 0.29
  • This gives an improvement ratio of 0.68 / 0.29 = 2.34
Again, this is a significant improvement and shows that the number of stalls due to memory accesses by the CPU for data has been reduced.

Why Have Stalls Reduced?

I hope that I have shown that part of the improvement in the performance of the In-Memory cache has been the reduction in the number of "memory access stalls" in the CPU when executing the SQL query i.e. less time wasted waiting to read data in from main memory with a corresponding increase in the number of instructions executed per clock cycle and less total elapsed to execute those instructions. But how has Oracle actually managed to achieve this? How has it managed to reduce the number of "memory access stalls" while executing the same SQL query i.e. if it needs to read the same "set of data" to perform the same SQL query, how has it managed to read it more efficiently?

I am guessing that this is due to two specific factors:
  1. Data layout in memory - the layout is more "compact" which decreases the total amount of memory that must be read
  2. Sequential access to that memory, causing the CPU to "prefetch" the next memory line into the CPU cache, so it is already there when it is needed i.e. overlapping the processing of one memory line with the fetching of the next memory line
Note that this is based on what I have read on In-Memory as I've not yet had a chance to do any real testing of it yet. Again, thanks to Tanel for both doing some real testing and for sharing all the results he got from those tests.

In the traditional buffer cache a data block will span multiple memory lines (assuming a 64 byte memory cache line, as Tanel states) and this will contain data for all the columns in a row. When you only need the data from one column it means that you are skipping about in memory quite a lot, jumping from row to row, and not using most of the data in each block or memory line. In the new In-Memory cache the data is stored by column, so all the data for one column is stored together, next to each other in memory. This immediately reduces the total amount of memory that must be read to access the data for one column across all the rows in the table.

It also will increase the cache efficiency within the CPU, because one memory line in the cache will contain data for multiple rows i.e. the CPU can process the data for that column from multiple rows before it needs to read in another memory line into cache to get the next set of rows' data.

If the data for a column is physically laid out sequentially and contiguously in the system's memory, then it means that Oracle will be doing a series of sequential memory reads to process all the data for one column. Modern CPU's can detect this kind of sequential memory access and have an optimisation to "prefetch" the next data line from memory while the previous data line is still being processed. The net result is that stalls are either eliminated because the next line of data is already in the CPU cache or the impact of stalls is significantly reduced because the data is already on its way from memory to the CPU cache.

We can see evidence for this occurring in the "L1-dcache-prefetches" CPU statistic. The "L1 data cache" is the first on chip cache immediately used by the CPU core, and is small but very fast. All data needed by the CPU goes through this cache. Again, normalising to a "prefetch per instruction executed" we can see that the number of data prefetches by the CPU increased significantly during the In-Memory cache SQL query execution:
  • Buffer cache SQL query execution has L1 dcache prefetch per instruction executed of 36890302 / 32115412877 = 0.00115
  • In-Memory cache SQL query execution has L1 dcache prefetch per instruction executed of 25169253 / 7080326242 = 0.00355
  • This gives an improvement ratio of 0.00355 / 0.00115 = 3.09

The End

Well that's been a longer post than I wanted, but I wanted to get everything together into one post as I'm sure Tanel will finish his series of posts soon and probably make similar observations. And I wanted to see if he would come to exactly the same conclusions or slightly different ones, and doing this post was the only way to do that. Thanks again to Tanel for doing the tests and sharing all the data with us.

Monday 14 December 2015

Python for the DBA (2) - Data Unloading

I've mentioned before that I like the Python programming language. Not only is it a "clean" language for writing programs in with a rich set of data types and structures for manipulating data, it also has a standard API for database access making it "database neutral". This makes it relatively easy to get data that is in a database into and out of your Python code where you can use the power of Python for whatever kind of data manipulation you need.

To show how easy it is to get data into a Python program I'll show a little utility to unload data from one table in a database to a flat, text file as one record per line with each data field separated by special delimiters.

Assuming the following:
  • Your Python program already has a database connection
    • This is relatively easy, but does vary by specific database driver being used
  • And it has opened the output file to unload to i.e. an open file handle and not the file name
  • And it has the name of the table, the field delimiter characters and the record terminator characters in local variables
Then you can write a simple function to call that will unload the data from that table to the specified file:
def unload_table_to_file (db_conn, table_name, out_data_file, field_separator, record_terminator) :
    # Preparation steps before doing the real work
    # Replace literals of special characters in the separator and terminator
    record_terminator = record_terminator.replace ('\\n', '\n')
    field_separator = field_separator.replace ('\\t', '\t')
    
    # Simplest SQL to get the data, assume column ordering is correct
    select_SQL = "select * from " + table_name
    
    # Now do the work - read in all records and output them
    print ("=== Unloading . . .")
    select_cursor = db_conn.cursor ()
    select_cursor.execute (select_SQL)
    for data_row in select_cursor.fetchall () :
        # Convert Python native data types to simple string representations
        output_fields = []
        for data_field in data_row :
            output_fields.append (str(data_field))
        out_data_file.write (field_separator.join (output_fields) + record_terminator)
    print ("=== Finished")
The main lines are:
  • First make sure that if any "escaped characters" are in the delimiters then they are the real escape characters
    • "\t" means the TAB character, and "\n" means the NEWLINE character
  • We just use a "select * from table" to get all the data into the Python code
    • The String concatenate operator is just the "+" operator i.e. "+" is overloaded depending on data type
  • And we execute this "select" via a Cursor created from the database connection passed in
  • The real work happens in 2 nested loops - for each data row, for each data field in each data row
    • The fields from one record are put into a list (essentially equivalent to an array)
    • This is so we can use a single action at the end to concatenate them all together with each data field value separated by the "field separator"
    • Then we "write" out this concatenated list of fields to the output file
  • The "str" function will return the String representation of a data value e.g. convert an integer to its String representation
    • This assumes that there exists a data type specific "str" method for each possible data type
  • The "join" method of a String variable iterates over a list of data, and returns one String of the list contents concatenated together with the value of the String variable between each one.
I think this shows some of the power and elegance of Python that reading all the data from a table can be done with so few lines of code, and that the code itself is so readable. And this code is flexible enough to deal with different values for the field separator characters and the record terminator characters. You could just do a classic comma separated list (',' as the field separator) or something more complicated to avoid issues where a comma may appear in the data itself e.g. "|__|" as the field separator (vertical bar, underscore, underscore, vertical bar). And this flexibility is all handled by this Python code.

Caveats

This is probably the most simple Python code to achieve the goal of reading all the data in a table and writing out to a flat, text file. There are several things it does not do at all, and other things that could be enhanced with it.
  • It is assumed that all data types being used can be cleanly converted to valid String representations using "str()"
    • This may not be true for some data types, or you may wish to handle some data types differently e.g. Dates
  • Dates should probably be handled differently, to ensure the full date and time values are output
    • You could execute an "ALTER SESSION" SQL statement to set the value of NLS_DATE_FORMAT to something suitable
    • Python has a native Date / Time data type, "datetime", which Oracle DATE columns are mapped to, and you could test for this
  • You could add extra functionality such as debugging / tracing messages (use the "logging" module) or other progress messages
    • You may wish to add extra counters to count the number of records and fields per record