Monday, 29 March 2010

DTrace - Great if you can get it to work at all

On the face of it I think DTrace is a phenomenal piece of technology from Sun: on the fly you can turn on instrumentation for kernel events and system calls, and write your own scripts to count calls, sum elapsed time, see which files receive the most I/Os, and much more. All of this with no external change to Solaris itself - all existing commands and utilities run the same - and with minimal impact on performance when not enabled. Truly amazing.
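The classic example is a one-liner like this, which counts system calls by program name until you press Ctrl-C:

# dtrace -n 'syscall:::entry { @[execname] = count(); }'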

For the record I should confess that I was a Sun employee when DTrace was initially announced. However I was an Oracle Consultant and had nothing to do with Solaris kernel development or DTrace. My viewpoint was and remains that of a user of Solaris, and I did not gain any inside knowledge about DTrace one way or another.

I did see the demonstrations that are commonly repeated, where they start off a rogue process and then use DTrace scripts to see what is going on, and which process is causing all the extra work on the system. It looks so simple and so direct - a few iterations and you have identified the process in question and got some of the details of what it is doing.

I have been waiting since then to use DTrace in anger on a real-life investigation - it has taken some time for Solaris 10 to be fully rolled out by most organizations. That opportunity finally arrived the other day: a chance to use DTrace to see what an application was doing, and in which functions it was spending most of its time.

A quick trawl of the Web finds the DTrace Toolkit, which includes something called dappprof.d that does roughly what I want - elapsed time per function call. I write my own D script using similar probes and timestamp calculations, and am ready to look inside this application. Problem solved? No!

Unfortunately, DTrace does not work! Or at least that is what seemed to be happening when I tried to use it, which is why I am writing this blog post. Sorry to be so negative, but either it does what it is supposed to do or it does not - and for me it simply did not. I have just spent 2 days banging my head against a brick wall trying to get DTrace to do what it should be doing. The good news is that there is an "answer" to the problem I had, which I will get to later on. But this could all have been avoided if some of the "details" of DTrace had been done properly by Sun - useful and helpful error messages, and configuration options that make sense.

First, unlike other software products, which tell you WHY something went wrong so that you can fix it, DTrace simply says "I'm not going to do that" and stops. Shades of HAL from 2001: A Space Odyssey spring to mind. Generally, whenever something goes wrong with Oracle or a programming language compiler I get a specific message - "syntax error at line xxx", "missing keyword expected", "out of memory". With DTrace you get a message that, one way or another, amounts to "I did not like that so I am stopping".

My first error message was:
dtrace: processing aborted: Abort due to systemic unresponsiveness

I managed to find a blog entry by Jon Haslam of Sun which says that there is nothing you can do about this, other than disabling certain protections within DTrace. And generally speaking those protections are there for a reason, so disabling them is not recommended.

Okay, maybe I was sampling too often, or matching too many probes, or something - as I say, DTrace never tells you WHAT caused the failure. So I cut down my script, and instead of 20,000 probes I now match fewer than 100. Just 33, in fact. Does it run properly? No! Still the same problem:
dtrace: script './gpsmintrace.d' matched 33 probes
dtrace: processing aborted: Abort due to systemic unresponsiveness

The script only has 4 probes as far as I am concerned, even though DTrace ends up matching more than this:
pid$target:a.out::entry
pid$target:a.out::return
profile:::tick-5sec
dtrace:::END

So: the entry and return points of functions in the program file itself (a.out), a tick every 5 seconds, and finally the end of the script when it stops. How can these be causing "systemic unresponsiveness"?
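For what it is worth, the core of my cut-down script looked something along these lines. The probe names are the real ones listed above, but the body shown here is a from-memory sketch of the timestamp and aggregation logic rather than the exact script:

pid$target:a.out::entry
{
        self->ts = timestamp;
}

pid$target:a.out::return
/self->ts/
{
        /* sum nanoseconds spent in each function */
        @elapsed[probefunc] = sum(timestamp - self->ts);
        self->ts = 0;
}

profile:::tick-5sec
{
        /* print and reset the running totals every 5 seconds */
        printa(@elapsed);
        trunc(@elapsed);
}

dtrace:::END
{
        printa(@elapsed);
}

It would be run with something like "dtrace -s gpsmintrace.d -p <pid>", so that $target is set to the process being investigated.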

I try different combinations - a different process, taking out the tick-5sec probe, naming specific functions to match so that only "11 probes" are matched - and always get the same result:
dtrace: processing aborted: Abort due to systemic unresponsiveness

At this point the only conclusion I can come to is that DTrace is useless. What else can I conclude? Every time I run my minimal D script it sits there with no output messages and then aborts. Why do I not see the output from the 5 second ticks? Why does it never work?

As you can imagine, I am just about ready to give up on DTrace - forever. I have been through the trouble of designing and writing an initial D script to help investigate my application's behaviour, and have requested what appear to be the appropriate privileges from the system administrators - dtrace_user and dtrace_proc - yet after 2 days of continuous trying I have got zero information out of DTrace, useful or not.

I am ready to throw in the towel and write DTrace off as another piece of great but fundamentally flawed Sun technology when I have one more thought. Why do I never see any output from the 5 second tick probes? What if they are never firing at all? Then the running totals in my aggregations would just keep growing. Could that be causing some kind of overflow, making DTrace simply give up?

A quick Google on the terms "dtrace tick probe privilege" turns up a posting at http://www.mail-archive.com/dtrace-discuss@opensolaris.org/msg02621.html with the same symptoms - the tick probe never fires. And the solution is confirmed in the reply at http://www.mail-archive.com/dtrace-discuss@opensolaris.org/msg02622.html - you need the "dtrace_kernel" privilege for the tick probes to fire and output anything. So I make another request to the system administrators for this privilege, and voila! My D script now works as it should.

So it seems that to do anything useful with DTrace you need the maximum privilege level of dtrace_kernel, regardless of what the manual tells you. Yes, to "use" the tick probe of the profile provider in a D script you only need the "dtrace_user" privilege. However, such a tick probe will never fire unless you also have the "dtrace_kernel" privilege. In other words, "dtrace_user" gives you the "right" to name a tick probe in your script, but not the "access" to that probe at run time - for that you also need "dtrace_kernel".
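While waiting for the administrators, you can at least check which DTrace privileges your own shell actually holds with ppriv. The usermod command below is a sketch of the kind of change to request - "jbloggs" is a placeholder user name, and your site may well manage privileges differently:

$ ppriv $$
(look for dtrace_user, dtrace_proc and dtrace_kernel in the E (effective) set)
# usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel jbloggs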

I feel this has been extremely poorly designed and documented by Sun. It is not covered in the standard DTrace manuals - I know, because I looked very closely at them for how to make the tick probe work. It must be a problem that people run into time and time again on production systems when trying to work with the least necessary privileges. Yet my Google search found only one explicit reference to it, a forum posting by someone with the same problem. All the other hits on "dtrace tick probe privilege" were standard descriptions of using probes in DTrace. Nothing from Sun themselves on their own web sites.

Wednesday, 10 March 2010

OpenSolaris is not Solaris

I've been playing with an installation of OpenSolaris on my Intel x86 PC within a VirtualBox virtual machine. (VirtualBox is good, but I'll say more about it separately). The good news is that I managed to install OpenSolaris (2009.06) easily enough and get logged into it for normal use. The bad news is that OpenSolaris is not Solaris, which introduces a whole bunch of incompatibility issues. And given that I did this in order to install and test Oracle, it creates a whole series of problems that would not occur if it really were Solaris.

What is OpenSolaris?
The simplistic answer is that OpenSolaris is an open sourced subset of Solaris i.e. the subset that Sun owned the rights to and could open source. But the fuller truth is that OpenSolaris is the open sourced next generation version of Solaris i.e. Solaris 11. And indeed the version number of OpenSolaris is 11 (or 5.11, technically speaking). Sun have changed a lot of administration-side things between Solaris 10 and 11, and they are still changing things, as OpenSolaris is a work in progress. A lot can change even between releases of OpenSolaris. So remember that OpenSolaris is not Solaris. There is no statement yet of any level of compatibility between Solaris 10 and OpenSolaris.

Clearly OpenSolaris shares the same kernel core as Solaris 10. But outside of that, in terms of the various applications that can be installed, things change. A lot. Solaris has always had its own "packages", installed via the "pkgadd" command. However this lacked a GUI (at least I was not aware of any before now) and did not handle dependencies between packages (A uses things from B, so B should be installed first). You also had to obtain the package distribution file yourself somehow - download or CD media - before it could be installed.
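For comparison, the old flow went something like this - fetch or mount the package media yourself, then add the package by name (the path here is just the usual location on the Solaris 10 install media):

# pkgadd -d /cdrom/Solaris_10/Product SUNWarc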

The major Linux distributions have all addressed these issues, so that installing most application software is just a case of firing up the GUI utility, clicking on the application name and then the "Install" button. OpenSolaris has now tried to do the same thing, via its Image Packaging System (IPS), which seems to share the same concepts as Synaptic in Ubuntu. There are repositories of packages, and the packages include lists of dependencies on other packages. Now you can install a package via a few simple clicks, and the package distribution is downloaded from the repository.

Nice. Except... there is only one repository: Sun's own repository of the packages included as part of Solaris. So there is no third party software to choose from. And Sun continues to name all its packages with a very short name and a prefix of "SUNW". Would you know that "SUNWarc" was the "Lint Libraries"? Or that the "Solaris Bundled tools" were in package "SUNWsprot", while "SUNWbtool" was in fact the "CCS tools bundled with SunOS"? Not clear or straightforward, to say the least.

Which leads to the second problem: even if you know there is a package containing the program or utility you need, finding out the name of that package can be quite hard work. There is not a lot of useful information on OpenSolaris yet that you can find easily via Google, and Sun has provided very little documentation.
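The IPS command line does help a little. "pkg search" queries the package index in the configured repository, so you can look for a file you know should exist, and "pkg list -a" will confirm whether a package name you have guessed at is actually available (this assumes the default opensolaris.org repository is configured):

$ pkg search -r ld
$ pkg list -a SUNWmfrun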

Okay, so can I just use the existing Solaris 10 packages? No! OpenSolaris is not Solaris. You could try to install a Solaris 10 package, and it might work, but it is not supported and the results are not guaranteed. This is because OpenSolaris is really Solaris 11, and things can change in the kernel and interface libraries. So you should only use OpenSolaris packages, not Solaris 10 ones.

Are there any other repositories, of any shape or form? Yes, there is one at Blastwave. Can I use its packages? Yes, but they are very old. Clearly someone did port various open source software across at some point, and these packages have been gathered together at Blastwave. But when I went looking for some things, I could only find 3 year old versions from 2007. In the end I downloaded the source code of the current version of the utility I needed and compiled it myself.

Oracle 10gR2 Issues
First there is the issue of required packages, and I've already discussed the new IPS on OpenSolaris. For Oracle 10gR2, as per the Oracle 10gR2 on Solaris installation notes, you will need to install:
  • SUNWarc SUNWbtool SUNWhea SUNWlibm SUNWlibms SUNWsprot SUNWtoo
but not:
  • SUNWi1of SUNWi1cs SUNWi15cs SUNWxwfnt
This is because these are closed source font libraries, and Sun cannot open source them or include them in OpenSolaris under its license terms. But they will not cause an issue, as other fonts will be used automatically instead.

You also need to install SUNWmfrun, which contains the Motif runtime libraries. Yes, the Oracle 10gR2 graphical utilities use Motif behind the scenes, and Motif is no longer installed as standard on OpenSolaris.
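With IPS all of the above can then be pulled in with a single command as root, dependencies included (assuming the default repository is reachable):

# pkg install SUNWarc SUNWbtool SUNWhea SUNWlibm SUNWlibms SUNWsprot SUNWtoo SUNWmfrun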

So can we now install Oracle 10gR2? No, because the Oracle installer sees the operating system version as 11 (or rather 5.11, as given by "uname -r") and stops because it is not certified against it. Which is not surprising, as you can only certify against officially released operating systems, and OpenSolaris is still a work in progress. But we will assume that OpenSolaris (11) is sufficiently compatible with Solaris 10, and use the "-ignoreSysPrereqs" flag to make the Oracle installer ignore this mismatch in the operating system version.

Then the installer will fail with missing Java libraries e.g.
Exception java.lang.UnsatisfiedLinkError:
/tmp/OraInstall<...>/jre/1.4.2/lib/i386/motif21/libmawt.so:
ld.so.1: java: fatal: libdps.so.5: open failed: No such file or directory occurred..

This is because the Oracle installer uses the Java 1.4.2 runtime shipped as part of the Oracle 10gR2 distribution itself, and OpenSolaris does not have the right bits for backwards compatibility with it - hence the missing library message. Again we can use an installer command line option to tell it to use the Java Runtime Environment already on OpenSolaris (1.6), which will not be missing anything it needs:
  • ./runInstaller -ignoreSysPrereqs -jreLoc /usr/java/jre
And voila, an Oracle installer window eventually appears and we can proceed with the installation.

Well, again no. There were various errors about certain files not existing. It seems that the installer cannot create symbolic links between files on OpenSolaris. Using an Oracle installation on another system I was able to work out what these files should be linked to, and created the links manually at the end of the installation.

And finally - yes - I was able to start Oracle and create a database.

Tuesday, 2 March 2010

Which Disk Is Faster?

Recently on a Solaris system I got the following disk statistics from “sar -d” (non-busy disks have been removed for clarity):
device  %busy  avque  r+w/s  blks/s  avwait  avserv
sd3        82    0.9    134     879     0.0     6.6
sd4        90    0.9    133     879     0.0     6.9
sd5        73   17.1    777   12366     0.0    22.0

You can draw conflicting conclusions from this data:
  • On the one hand disk “sd5” seems to be performing slower than disk “sd4”, at 22.0 milliseconds per disk I/O request from Solaris versus only 6.9 milliseconds for “sd4”.
  • But on the other hand “sd5” is clearly doing more work than “sd4” – 777 I/Os per second versus 133 I/Os per second (6 times more), and 12,366 blocks per second versus 879 (14 times more). Does this make “sd5” actually faster than “sd4” overall?
So which conclusion is right? Which of the two disks is actually faster than the other for the individual disk I/Os themselves, ignoring any time spent queuing before being executed? Clearly “sd5” will have longer queues than “sd4” because it is processing a far greater number of I/Os per second. This is confirmed by the average queue length value reported by “sar” – only 0.9 for “sd4” (less than 1 I/O at a time), while it is 17.1 for “sd5”.

I actually think that “sd5” is faster, given how many more disk I/Os it is doing per second, and the size of its average queue length. How can I prove this one way or the other? Well, our old friend “Queuing Theory” can help, with its set of formulae describing how such systems behave.

A key point to realise is that modern disks have internal queues, and will accept more than one request from the operating system at a time. From the operating system’s perspective it can send a new I/O request to a disk before all the previous ones have finished. From the disk’s perspective it has an internal queue in front of the real disk, which can still only perform one I/O at a time. We can see that this is the case here because the average queue length reported by Solaris is 17.1 for “sd5”. Also the average wait time is 0.0 for all disks: this measures waiting or queuing within Solaris itself, so Solaris was always able to issue a new I/O request to the disk immediately, and never exceeded any limit on concurrent requests to the disks.

So although “sd5” looks slow at 22.0 milliseconds service time, this is the full service time measured by Solaris, which includes any queuing time within the disk device itself. And with 17.1 concurrent requests on average, this could be quite a large queue, meaning that the 22.0 milliseconds reported by Solaris could include a significant amount of time waiting within the disk before the I/O was actually performed and the data returned.

Queuing Theory can help us “look inside the disk device” and see how big its queue is on average, and what the “real service time” of an I/O is within the disk when it performs it.

What do we know about the disks' behaviour?
  • Average completed requests per second: 777 for sd5, 133 for sd4.
  • External service times (as measured by Solaris): 22.0 ms for sd5, 6.9 ms for sd4.
  • Average requests in the disk device: 17.1 for sd5, 0.9 for sd4.
Even from this we should be able to see that 22.0 ms does not make sense as an individual disk access time on sd5, because it managed 777 of them per second. Assuming that the disk was 100% busy during a one second interval, if it did 777 I/Os then each must on average have taken 1 / 777 = 0.001287 seconds = 1.287 milliseconds. Which further confirms that the 22.0 ms reported by Solaris is mainly queuing time within the disk device itself.

We would like to know the actual service time within the disk, separate from the queue time within the disk. We can use a formula from Queuing Theory for this:
  • S = R / (1 + N)
Here R is the response time as seen from outside the disk i.e. from Solaris, and N is the average number of overlapping, concurrent requests (the queue length) submitted to the disk. The reasoning is that a newly arriving request must on average wait for the N requests already in the device before getting its own service time S, so R = (1 + N) * S, which rearranges to the formula above. We have both R and N from sar:
  • sd5: S = 22.0 / (1 + 17.1) = 22.0 / 18.1 = 1.215 milliseconds
  • sd4: S = 6.9 / (1 + 0.9) = 6.9 / 1.9 = 3.632 milliseconds
There we have it – disk sd5 has a far lower true service time than that of sd4. sd5 is actually almost 3 times faster than sd4 at performing each individual disk access! It is just the large queue of outstanding requests that causes the total disk access time as measured from Solaris to be so high at 22.0. We can now see that on average each disk I/O to sd5 spends (22.0 – 1.215) or 20.785 milliseconds waiting within the internal disk device queue before it is then executed, which then takes only 1.215 milliseconds.

In terms of the utilisation of each disk, the Queuing Theory formula is U = X * S, where X is the throughput (I/Os per second) and S is the service time, so:
  • sd5: U = 777 * 0.001215 = 0.944 = 94.4%
  • sd4: U = 133 * 0.003632 = 0.483 = 48.3%
This indicates that disk “sd5” is operating at a very high utilisation level, and any further increase in utilisation will lead to a steep, non-linear increase in response time (the service time as measured by Solaris). Disk “sd4” however is at just under 50% utilisation, which correlates with its average queue length being just under 1 (0.9).
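As a quick sanity check, both formulae can be evaluated straight from the sar numbers with a scrap of awk. The input columns are just device name, avserv (R), avque (N) and r+w/s (X) from the table above:

printf 'sd5 22.0 17.1 777\nsd4 6.9 0.9 133\n' | awk '{ S = $2 / (1 + $3); U = $4 * S / 1000; printf "%s: S = %.3f ms, U = %.1f%%\n", $1, S, U * 100 }'

This prints "sd5: S = 1.215 ms, U = 94.4%" and "sd4: S = 3.632 ms, U = 48.3%", matching the figures above.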

In this scenario I would suggest trying to move some of the I/O workload off “sd5” and onto some other disks somehow. Any reduction in the workload on “sd5” would dramatically reduce the number of concurrent requests (average queue length) and so dramatically reduce the service time as measured by Solaris. In other words, “sd5” is a hot and busy disk.