The system was a 2-node RAC cluster running Oracle 10.2.0.4.0. An AWR report from an hour of peak daytime activity showed:
Elapsed: 59.58 (mins)
DB Time: 469.97 (mins)
So that is a database busy time of nearly 470 minutes in an elapsed period of nearly 60 minutes, meaning an average of almost 8 active sessions during that period (469.97 / 59.58 ≈ 7.9). This is high for a system with only 4 CPUs: theoretically there would only be 240 minutes of CPU capacity available on a 4 CPU node in a 60 minute period.

The top 5 wait events were:
| Event                   | Waits   | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
|-------------------------|---------|---------|--------------|-------------------|------------|
| CPU time                |         | 8,450   |              | 30.0              |            |
| log file sync           | 77,234  | 4,930   | 64           | 17.5              | Commit     |
| gc buffer busy          | 45,532  | 4,902   | 108          | 17.4              | Cluster    |
| db file sequential read | 269,971 | 1,587   | 6            | 5.6               | User I/O   |
| gc cr block busy        | 45,388  | 1,401   | 31           | 5.0               | Cluster    |
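As an aside, the same figures can be pulled straight from the AWR history views rather than the formatted report. A minimal sketch, not from the original analysis, using placeholder snapshot IDs (note that "CPU time" is not a wait event and comes from the time model views instead):

```sql
-- Wait time per event over one AWR interval; snapshot IDs 100 and 101 are
-- placeholders for the begin and end snapshots of the hour being examined.
SELECT e.event_name,
       e.wait_class,
       e.total_waits - b.total_waits                            AS waits,
       ROUND((e.time_waited_micro - b.time_waited_micro) / 1e6) AS time_s
FROM   dba_hist_system_event b,
       dba_hist_system_event e
WHERE  b.snap_id         = 100   -- begin snapshot (placeholder)
AND    e.snap_id         = 101   -- end snapshot (placeholder)
AND    b.dbid            = e.dbid
AND    b.instance_number = e.instance_number
AND    b.event_id        = e.event_id
ORDER  BY time_s DESC;
```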
8,450 seconds is about 140 minutes. This is just over half of the 240 minutes of CPU capacity available on the 4 CPU node, indicating that waiting is excessive i.e. the total Database Time is made up of more waiting than actively executing SQL statements. And we can see that the other 4 top wait events add up to over 12,800 seconds, which is far more than the CPU time - again more waiting than doing real work.

As I said, I initially focussed on trying to tune the "log file sync" events, but this led nowhere. However a blog post I read when Googling this combination (sorry, I cannot seem to find the link now) stated that the log file sync is in fact a consequence of the "gc buffer busy" event, and that there are other clues in the RAC-specific section of the AWR report. And indeed, on looking in the "RAC Statistics" under "Global Cache and Enqueue Services - Workload Characteristics" I saw the following:
Avg global enqueue get time (ms): 0.6
Avg global cache cr block receive time (ms): 11.2
Avg global cache current block receive time (ms): 7.2
Avg global cache cr block build time (ms): 0.0
Avg global cache cr block send time (ms): 0.2
Global cache log flushes for cr blocks served %: 10.8
Avg global cache cr block flush time (ms): 72.4
Avg global cache current block pin time (ms): 0.1
Avg global cache current block send time (ms): 0.2
Global cache log flushes for current blocks served %: 0.5
Avg global cache current block flush time (ms): 227.5
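These averages are derived from cumulative global cache counters that each instance exposes in GV$SYSSTAT. A minimal sketch of looking at the raw counters, not from the original post (the values are cumulative since instance startup, so it is the deltas over an interval that matter):

```sql
-- Raw global cache serving and flush counters per instance; the AWR
-- averages above are computed from the deltas of these between snapshots.
SELECT inst_id, name, value
FROM   gv$sysstat
WHERE  name IN ('gc cr blocks served',
                'gc cr block flush time',
                'gc current blocks served',
                'gc current block flush time')
ORDER  BY inst_id, name;
```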
The standout statistic is the last one - 227.5 milliseconds on average to flush a "global cache current block". This is way too high: I would expect an inter-node network message to be of the order of a few milliseconds, and anything involving disk writes to be of the order of tens of milliseconds, so an average of hundreds of milliseconds is excessive.

Further digging around revealed that in some cases a Global Cache request for a block from another node may cause that node to flush its work so far to its redo log before the block is copied across to the other node i.e. a "gc buffer" request from one node may cause a "log file sync" to occur on the other node. And I found this confirmed in this Pythian Blog post on Tuning "log file sync" Event Waits:
"In a RAC cluster, this problem is magnified, since both CR- and CUR-mode buffers need the log flush to complete before these blocks can be transferred to another instance’s cache."

At this point I switched focus onto the causes of those Global Cache requests and the high inter-node traffic.
In the "SQL Statistics" section of the AWR report is "SQL Ordered by Cluster Wait Time", and I could see a set of SQL statements all acting on the same table and with very high Cluster Wait Time as a percentage of elapsed execution time (CWT%). The percentage cluster wait time was around 85% for all of these related SQL statements i.e. each was spending almost 6 times longer waiting than doing the actual work of the SQL statement.
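A similar picture is available outside AWR from the cursor cache; a minimal sketch (not from the original post) of ranking statements by cluster wait time:

```sql
-- Statements with the highest cluster wait time, and that wait as a
-- percentage of elapsed time (both V$SQL columns are in microseconds).
SELECT inst_id,
       sql_id,
       ROUND(cluster_wait_time / 1e6, 1)                        AS cluster_wait_secs,
       ROUND(100 * cluster_wait_time / NULLIF(elapsed_time, 0)) AS cluster_wait_pct
FROM   gv$sql
WHERE  cluster_wait_time > 0
ORDER  BY cluster_wait_time DESC;
```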
All of these SQL statements acted on one table that was being both heavily inserted into and selected from. This is kind of the classic "hot table" or "hot block" problem seen in some OLTP applications on RAC databases. This was further confirmed by the "Segment Statistics" section of the AWR report, in "Segments by Global Cache Buffer Busy": the only objects mentioned were that one table and some of the indexes on it.
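The same hot segments can be confirmed directly from the segment statistics views; a minimal sketch, assuming this 10g release where the segment-level statistic is named 'gc buffer busy' (later releases split it into acquire and release variants):

```sql
-- Segments with the most 'gc buffer busy' waits recorded against them.
SELECT inst_id, owner, object_name, object_type, value AS gc_buffer_busy_waits
FROM   gv$segment_statistics
WHERE  statistic_name = 'gc buffer busy'
AND    value > 0
ORDER  BY value DESC;
```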
In this case there were 2 suggestions I could make (a sketch of the corresponding DDL follows the list):
- Make the primary key unique index a Reverse Key index, so that new records would not all be inserted into the same last index leaf block.
  - The primary key on this table was a generated sequential number, so it will always be increasing in value, and inserted at the end of a normal index.
- Create a new multi-column index on the table that would better satisfy most of the queries being run.
  - The current execution plan used a less selective index and also had to retrieve a number of data blocks from the table to further filter the data according to the constraints in the "WHERE" clause.
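To make those two suggestions concrete, here is a sketch of the kind of DDL involved. The table, column and index names are invented for illustration only, not the real application objects:

```sql
-- 1. Rebuild the sequence-based primary key index as a reverse key index,
--    so that consecutive key values no longer all go into the same
--    right-hand (last) leaf block from both instances.
ALTER INDEX orders_pk REBUILD REVERSE;

-- 2. Add a more selective multi-column index matching the common queries'
--    WHERE clauses, so they no longer need to visit extra table blocks
--    just to filter rows out.
CREATE INDEX orders_cust_status_date_ix
    ON orders (customer_id, status, order_date);
```

One trade-off to note: a reverse key index cannot be used for index range scans on the key, so that change only makes sense where the primary key is looked up by exact value.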
Suffice it to say that performance improved significantly and the users were happy after these changes. The "Avg global cache current block flush time" came down to under 5 milliseconds, from the original 227.5 milliseconds. The original high wait events all but disappeared, so that disk I/O became the top wait event, and the total Database Time went down significantly because sessions were no longer waiting so much and so were executing more efficiently.
The key lesson for me was that on a RAC database high inter-node global cache traffic can give rise to redo log buffer flushes and consequently high "log file sync" wait times. And this is primarily due to a high volume of blocks being continually shared between the nodes in the cluster. Rather than trying to directly address the log file sync event, you instead need to address the global cache buffer sharing and reduce the total number of blocks being shared. In my case I was able to create a better index that enabled the queries to access fewer data blocks in total, reducing the global cache traffic volume and the contention for hot blocks.