## Tuesday, 2 March 2010

### Which Disk Is Faster?

Recently, on a Solaris system, I got the following disk statistics from “sar -d” (non-busy disks have been removed for clarity):
```
device  %busy   avque   r+w/s  blks/s  avwait  avserv
sd3        82     0.9     134     879     0.0     6.6
sd4        90     0.9     133     879     0.0     6.9
sd5        73    17.1     777   12366     0.0    22.0
```

You can draw conflicting conclusions from this data:
• On the one hand disk “sd5” seems to be performing slower than disk “sd4”, at 22.0 milliseconds per disk I/O request from Solaris versus only 6.9 milliseconds for “sd4”.
• But on the other hand “sd5” is clearly doing more work than “sd4”: 777 I/Os per second versus 133 I/Os per second (about 6 times more), and 12,366 blocks per second versus 879 (14 times more). Does this make “sd5” actually faster than “sd4” overall?
So which conclusion is right? Which of the two disks is actually faster than the other for the individual disk I/Os themselves, ignoring any time spent queuing before being executed? Clearly “sd5” will have longer queues than “sd4” because it is processing a far greater number of I/Os per second. This is confirmed by the average queue length value reported by “sar” – only 0.9 for “sd4” (less than 1 I/O at a time), while it is 17.1 for “sd5”.

I actually think that “sd5” is faster, given how many more disk I/Os it is doing per second, and the size of its average queue length. How can I prove this one way or the other? Well, our old friend “Queuing Theory” can help, with its set of formulae describing how such things work.

A key point to realise is that modern disks have internal queues, and will accept more than one request from the operating system at a time. From the operating system’s perspective it can send a new I/O request to a disk before all the previous ones have finished. From the disk’s perspective it has an internal queue in front of the real disk, and the real disk can still only do one I/O at a time. We can see that this is the case in Solaris because the average queue length is 17.1 for “sd5”. Also the average wait time is 0.0 for all disks, because there is no waiting or queuing within Solaris, which is what this measures. Solaris was always able to immediately issue a new I/O request to the disk, and never exceeded any limit on concurrent requests to the disks.

So although “sd5” looks slow at 22.0 milliseconds service time, this is the full service time measured by Solaris, which includes any queuing time within the disk device itself. And with 17.1 concurrent requests on average, this could be quite a large queue, meaning that the 22.0 milliseconds reported by Solaris could include a significant amount of time waiting within the disk before the I/O was actually performed and the data returned.

Queuing Theory can help us “look inside the disk device” and see how big its queue is on average, and what the “real service time” of an I/O is within the disk when it performs it.

What do we know about the disks’ behaviour?
• Average completed requests per second (r+w/s) are 777 for sd5 and 133 for sd4.
• External service times (avserv) are 22.0 ms for sd5 and 6.9 ms for sd4.
• Average requests in the disk device (avque) are 17.1 for sd5 and 0.9 for sd4.
Even from this we should be able to see that 22.0 ms does not make sense for individual disk access times on sd5, because it managed to do 777 of them per second. Assuming that the disk was 100% busy during a one second interval, if it did 777 I/Os then each must have taken no more than 1 / 777 = 0.001287 seconds = 1.287 milliseconds. This further confirms that the 22.0 ms reported by Solaris is mainly queuing time within the disk device itself.
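The same sanity check can be written as a few lines of Python. This is only a sketch of the arithmetic above, assuming the disk serves one I/O at a time and was busy for the whole second; the function name is my own invention.

```python
# Throughput figures (r+w/s) taken from the sar output above.
ios_per_second = {"sd4": 133, "sd5": 777}


def max_ms_per_io(iops: float) -> float:
    """Upper bound on the time per I/O, in milliseconds.

    Assumes the device completes one I/O at a time and was
    100% busy for the whole one second interval.
    """
    return 1000.0 / iops


for disk, iops in ios_per_second.items():
    print(f"{disk}: at most {max_ms_per_io(iops):.3f} ms per I/O")
```

For sd5 this gives the 1.287 ms ceiling worked out above, far below the 22.0 ms that sar reports.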

We would like to know the actual service time within the disk, separate from the queue time within the disk. We can use a formula from Queuing Theory for this:
• S = R / (1 + N)
Here S is the service time within the disk, R is the response time as seen from outside the disk (i.e. from Solaris), and N is the average number of overlapping concurrent requests (queue length) submitted to the disk. We have both R and N from sar:
• sd5: S = 22.0 / (1 + 17.1) = 22.0 / 18.1 = 1.215 milliseconds
• sd4: S = 6.9 / (1 + 0.9) = 6.9 / 1.9 = 3.632 milliseconds
There we have it – disk sd5 has a far lower true service time than that of sd4. sd5 is actually almost 3 times faster than sd4 at performing each individual disk access! It is just the large queue of outstanding requests that causes the total disk access time as measured from Solaris to be so high at 22.0. We can now see that on average each disk I/O to sd5 spends (22.0 – 1.215) or 20.785 milliseconds waiting within the internal disk device queue before it is then executed, which then takes only 1.215 milliseconds.
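As a quick check of the arithmetic, here is a minimal Python sketch of the S = R / (1 + N) calculation for both disks, along with the time each I/O spends queued inside the disk. The function and variable names are just illustrative; the input values are the avserv and avque figures from sar.

```python
def internal_service_time(response_ms: float, avg_requests: float) -> float:
    """Apply S = R / (1 + N) to estimate the service time inside the disk."""
    return response_ms / (1.0 + avg_requests)


# (R, N) per disk: avserv in milliseconds and avque, from the sar output.
disks = {"sd4": (6.9, 0.9), "sd5": (22.0, 17.1)}

for name, (response_ms, avg_requests) in disks.items():
    service_ms = internal_service_time(response_ms, avg_requests)
    queue_wait_ms = response_ms - service_ms  # time spent queued inside the disk
    print(f"{name}: S = {service_ms:.3f} ms, internal queue wait = {queue_wait_ms:.3f} ms")
```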

In terms of the utilisation of each disk, the Queuing Theory formula is U = X * S, so:
• sd5: U = 777 * 0.001215 = 0.944 = 94.4%
• sd4: U = 133 * 0.003632 = 0.483 = 48.3%
This indicates that disk “sd5” is operating at a high utilisation level, and any further increase in utilisation will lead to a disproportionately large increase in response time (the service time as measured by Solaris). Disk “sd4”, however, is at just under 50% utilisation, which is consistent with its average queue length being just under 1 (0.9).
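The utilisation figures can be reproduced with a short Python sketch of U = X * S. The only wrinkle is the unit conversion, since X is per second and S is in milliseconds. The names used here are my own.

```python
def utilisation(iops: float, service_ms: float) -> float:
    """Apply U = X * S, converting the service time from ms to seconds."""
    return iops * (service_ms / 1000.0)


# (X, S) per disk: r+w/s from sar and the internal service time derived earlier.
disks = {"sd4": (133, 3.632), "sd5": (777, 1.215)}

for name, (iops, service_ms) in disks.items():
    print(f"{name}: U = {utilisation(iops, service_ms):.1%}")
```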

In this scenario I would suggest moving some of the I/O workload off “sd5” and onto other disks. Any reduction in the workload on “sd5” would dramatically reduce the number of concurrent requests (average queue length) and so dramatically reduce the service time as measured by Solaris. In other words, “sd5” is a hot and busy disk.
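To get a rough feel for how much a lighter workload might help, S = R / (1 + N) can be combined with Little's Law (N = X * R, a standard Queuing Theory result that I have not used above) to give R = S / (1 - X * S). The Python sketch below applies this to sd5. It assumes that the internal service time stays at 1.215 ms whatever the load, and that this simple single queue model holds. The lower throughput figures are hypothetical what-if values, not measurements.

```python
def predicted_response_ms(iops: float, service_ms: float) -> float:
    """Estimate the response time seen by Solaris using R = S / (1 - X * S).

    Assumes a constant internal service time and a single internal queue.
    """
    utilisation = iops * service_ms / 1000.0
    if utilisation >= 1.0:
        raise ValueError("utilisation must be below 100% for a stable queue")
    return service_ms / (1.0 - utilisation)


SD5_SERVICE_MS = 1.215  # internal service time for sd5 derived earlier

# 777 is the measured rate; the lower rates are hypothetical.
for iops in (777, 700, 600, 500):
    response_ms = predicted_response_ms(iops, SD5_SERVICE_MS)
    print(f"{iops} I/Os per second -> about {response_ms:.1f} ms response time")
```

At the measured 777 I/Os per second the model gives a little under 22 ms, which is close to what sar reported. At a hypothetical 600 I/Os per second it gives roughly 4.5 ms, which shows how sharply the response time could fall under these assumptions.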