Thursday 23 January 2014

RAC high block sharing causes high log sync times

Some time ago I dealt with a system that was experiencing high "log file sync" times, as well as other RAC related waits ("gc buffer busy" and "gc cr block busy"). Initially I assumed that the "log file sync" waits and the RAC Global Cache waits were independent of each other, but it turned out that they were interlinked. This post is mainly to make other people aware that on a RAC system there is a relationship between high inter-node block sharing traffic (Global Cache) and "log file sync", and that the "log file sync" waits are a direct side effect of that high inter-node block sharing. I'll also provide some evidence to support this, in the form of AWR reports from the system.

The system was a 2 node RAC cluster running 10.2.0.4.0. An AWR report from an hour of peak day time activity showed:
Elapsed:   59.58 (mins)    
DB Time:  469.97 (mins)
So a database busy time of nearly 470 minutes in an elapsed period of nearly 60 minutes, meaning an average of almost 8 active sessions during that period. This is high for a system with only 4 CPUs. Theoretically there would only be 240 minutes of CPU capacity available on a 4 CPU node in a 60 minute period.
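
As an aside, the same time model figures can be sanity checked outside of an AWR report via V$SYS_TIME_MODEL. This is just a sketch: the values are cumulative since instance startup and in microseconds, so strictly you would snapshot the view twice and take the difference, which is effectively what AWR does.

-- Rough check of DB Time vs DB CPU on one node; V$SYS_TIME_MODEL values
-- are cumulative since instance startup and reported in microseconds.
SELECT stat_name,
       ROUND(value / 1000000 / 60, 1) AS minutes
FROM   v$sys_time_model
WHERE  stat_name IN ('DB time', 'DB CPU');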

The top 5 wait events were:
Event                     Waits  Time(s)  Avg Wait(ms) % Total Call Time  Wait Class
CPU time                          8,450                              30.0  
log file sync            77,234   4,930         64                   17.5  Commit
gc buffer busy           45,532   4,902        108                   17.4  Cluster
db file sequential read 269,971   1,587          6                    5.6  User I/O
gc cr block busy         45,388   1,401         31                    5.0  Cluster
8,450 seconds is about 140 minutes, just over half of the 240 minutes of CPU capacity available on the 4 CPU node. Set against the 470 minutes of Database Time, however, it means that far more of that time was spent waiting than actively executing SQL statements. And we can see that the other 4 top wait events alone add up to over 12,800 seconds, which is far more than the CPU time - again, more waiting than doing real work.
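
The equivalent information can also be pulled live from the wait event views if you want to watch it between AWR snapshots. A rough sketch, looking at the biggest non-idle waits across both instances (these counters are cumulative since instance startup, not for an interval like the AWR report):

-- A live approximation of the "Top 5 Timed Events" section across both
-- RAC instances, using cumulative counters from GV$SYSTEM_EVENT.
SELECT inst_id, event, wait_class, total_waits,
       ROUND(time_waited_micro / 1000000)                          AS time_waited_secs,
       ROUND(time_waited_micro / 1000 / NULLIF(total_waits, 0), 1) AS avg_wait_ms
FROM   gv$system_event
WHERE  wait_class <> 'Idle'
ORDER  BY time_waited_micro DESC;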

As I said, I initially focussed on trying to tune the "log file sync" events, but this led nowhere. However, a blog post I read when Googling this combination (sorry, I cannot seem to find the link now) stated that the log file sync is in fact a consequence of the "gc buffer busy" event, and that there are other clues in the RAC specific section of the AWR report. And indeed, on looking in the "RAC Statistics" under "Global Cache and Enqueue Services - Workload Characteristics" I saw the following:
Avg global enqueue get time (ms):                       0.6
Avg global cache cr block receive time (ms):           11.2
Avg global cache current block receive time (ms):       7.2
Avg global cache cr block build time (ms):              0.0
Avg global cache cr block send time (ms):               0.2
Global cache log flushes for cr blocks served %:       10.8
Avg global cache cr block flush time (ms):             72.4
Avg global cache current block pin time (ms):           0.1
Avg global cache current block send time (ms):          0.2
Global cache log flushes for current blocks served %:   0.5
Avg global cache current block flush time (ms):       227.5
The standout statistic is the last one - 227.5 milliseconds on average to flush a "global cache current block". I would expect an inter-node network message to take on the order of a few milliseconds, and anything involving disk writes to take on the order of tens of milliseconds, so hundreds of milliseconds is way too high.
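
For what it's worth, these averages are derived from raw Global Cache counters in V$SYSSTAT, so you can also watch them move between AWR snapshots. A minimal sketch: the counters are cumulative since instance startup, and the time based ones are in hundredths of a second in this release as far as I know, so check the units before converting to milliseconds.

-- The raw Global Cache block counters and timers that the averages above
-- are derived from, e.g. "gc current block flush time" and
-- "gc current blocks served". Cumulative since instance startup.
SELECT name, value
FROM   v$sysstat
WHERE  name LIKE 'gc cr block%'
OR     name LIKE 'gc current block%'
ORDER  BY name;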

Further digging around revealed that in some cases a Global Cache request for a block from another node may cause that node to flush its work so far to its redo log before the block is copied across to the requesting node i.e. a "gc buffer" request from one node may cause a "log file sync" to occur on the other node. And I found this confirmed in this Pythian blog post on Tuning "log file sync" Event Waits:
In a RAC cluster, this problem is magnified, since both CR- and CUR-mode buffers need the log flush to complete before these blocks can be transferred to another instance’s cache.
At this point I switched focus onto the causes of those Global Cache requests and the high inter-node traffic.

In the "SQL Statistics" section of the AWR report is "SQL Ordered by Cluster Wait Time", and I could see a set of SQL statements all acting on the same table and with very high Cluster Wait Times as a percentage of elapsed execution time (CWT%). The percentage cluster wait time was around 85% for all of these related SQL statements i.e. it was spending almost 6 times longer waiting than doing the actual work of the SQL statement.

All of these SQL statements acted on one table that was being both heavily inserted into, and selected from. This is kind of the classic "hot table" or "hot block" problem seen in some OLTP applications on RAC databases. This was further confirmed by the "Segments Statistics" section of the AWR report, in "Segments by Global Cache Buffer Busy". The only objects mentioned are the one table and some of the indexes on it.
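
Again there is a live counterpart to this AWR section, in V$SEGMENT_STATISTICS. A minimal sketch, assuming the statistic name "gc buffer busy" used in this release (check V$SEGSTAT_NAME for the exact name in yours):

-- Which segments are suffering the Global Cache contention; the live
-- equivalent of "Segments by Global Cache Buffer Busy" in the AWR report.
SELECT owner, object_name, object_type,
       value AS gc_buffer_busy_waits
FROM   v$segment_statistics
WHERE  statistic_name = 'gc buffer busy'
ORDER  BY value DESC;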

In this case there were 2 suggestions I could make:
  1. Make the primary key unique index a Reverse Key index, so that new records would not be inserted into the same last index leaf block.
    • The primary key on this table was a generated sequential number, so its values were always increasing and new entries were always inserted into the last leaf block of a normal index.
  2. Create a new multi-column index on the table that would better satisfy most of the queries being run.
    • The current execution plan used a less selective index and had to also retrieve a number of data blocks from the table to further filter the data according to the constraints in the "WHERE" clause.
These 2 changes reduced both the number of data blocks being accessed in the table for each query execution and the specific contention on the last leaf block of the primary key index. The net effect was to reduce the total number of blocks being continually shared between the nodes in the Global Cache for these SQL statements.
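
For illustration, the two changes would look something like the following. The table, index and column names here (ORDERS, ORDERS_PK, CUST_ID, STATUS) are made up stand-ins for the real ones, and a reverse key index is not a free lunch - it prevents index range scans on the leading key - so it needs checking against how the key is actually used.

-- 1. Rebuild the sequence-populated primary key index as a reverse key
--    index, so concurrent inserts are spread across many leaf blocks
--    instead of all hitting the current right-hand (last) leaf block.
ALTER INDEX orders_pk REBUILD REVERSE;

-- 2. Add a multi-column index matching the common WHERE clause, so each
--    query visits far fewer table blocks and therefore ships far fewer
--    blocks across the interconnect.
CREATE INDEX orders_cust_status_ix ON orders (cust_id, status);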

Suffice it to say that performance improved significantly and the users were happy after these changes. The "Avg global cache current block flush time" came down to under 5 milliseconds, a huge reduction from the original 227.5. The original high wait events all but disappeared, leaving disk I/O as the top wait event, and the total Database Time went down significantly because sessions were no longer waiting so much and so were executing more efficiently.

The key lesson for me was that on a RAC database high inter-node global cache traffic can give rise to redo log buffer flushes and consequently high "log file sync" wait times. And this is primarily due to a high volume of blocks being continually shared between the nodes in the cluster. Rather than trying to directly address the log file sync event, you instead need to address the global cache buffer sharing and reduce the total number of blocks being shared somehow. In my case I was able to create a better index that enabled the execution to access fewer data blocks in total, reducing the global cache traffic volume and contention for hot blocks.

2 comments:

Unknown said...

Very helpful post!
This is a very good reference for exactly the issue I am now experiencing. Thank you.

Unknown said...

Glad that we're not the only ones saying this. The problem gets even worse when updates start to occur on blocks which have to travel between nodes.

Solution ... proper application design.
Didn't find a "one solution fits all" for this behavior on rac + sync dataguard + updates on both nodes.