Sunday 8 March 2009

System Activity Data - What is the System up to?

I'm always amazed at how many people do not know about the SAR utility on UNIX and Linux systems. It is a great little utility for collecting a lot of data about what is going on in a system. It is not the best possible tool for every kind of measurement, but it comes as standard on most UNIX systems and will give you most of what you want for very little effort on your part.

The problem we often deal with is that when running an application and taking application level measurements, we are not sure what is happening lower down on the system, under the covers as it were. When application performance is slow it would be nice to know whether the computer system itself is heavily loaded too, or whether the cause of the slowness lies somewhere else. Such relatively simple data about activity and resource usage on the system can help us establish what class of problem we have - a saturated resource in the system itself, or a badly behaving application.

To know what the computer system is up to we need to know about all of the resources in it - CPU, Memory, and Disks. And we need several different measurements on each resource - number of requests, elapsed times, and queue lengths. We also need an accurate timestamp with each of these measurements, to know when it occurred, and the period of time it covered.

All of this can be relatively easily achieved with SAR - the System Activity Reporter. It will collect and record a wide range of measurements on system resources, and the time when they occurred.

Behind the scenes SAR uses SADC - the System Activity Data Collector. And in fact it is SADC we are more concerned with than SAR, as SAR simply extracts and formats the data you want from an SADC data set. We can run SADC to collect all the data it can about the system, at the interval we want between measurements and for as long as we want it to run. SADC takes a very simple set of arguments:
  • /usr/lib/sa/sadc t n file
where
  • t is the time interval in seconds between successive measurements i.e. how often a sample is taken
  • n is the number of measurements to make in total
  • file is the name of a file to write these measurements to
To stop SADC tying up a terminal window, it is often run in the background by putting '&' at the end of the command line - control then returns to the terminal immediately. The same technique works in command files and scripts.
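For example, assuming SADC is installed in /usr/lib/sa (its usual location on Linux and Solaris), something like the following would take a measurement every 10 seconds for one hour (360 samples) and record them to a file, running in the background - the file name here is just an illustration:
  • /usr/lib/sa/sadc 10 360 /tmp/mysystem.sad &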

The file produced by SADC contains binary data, not text data, and needs SAR to read it and format it as readable text.
  • sar [-data_option] -f file
If no data option is specified it defaults to CPU (-u). The common data options are:
  • -u CPU
  • -d Disk I/O
  • -r Physical Memory
  • -p Virtual Memory Paging
(The exact option letters vary a little between UNIX variants and Linux's "sysstat" package - on Linux, for instance, paging activity is reported with -B - so check the sar manual page on your system.)
Remember that the SADC file contains all the data collected, so you can go back and analyse any sub-set of it at any later point in time. That is the benefit of collecting all of the possible measurements together at the same time and saving them to a file.
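As a sketch, assuming the data was collected to the /tmp/mysystem.sad file from the earlier example, you could look at the CPU figures and then drill into disk activity from the same file:
  • sar -u -f /tmp/mysystem.sad
  • sar -d -f /tmp/mysystem.sad
You can also restrict a report to part of the collection period with the -s and -e options, which take start and end times such as '-s 09:00:00 -e 10:00:00'.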

You can obviously use SADC to collect data whenever you have performance problems. But it is far better to run SADC all the time, so that you already have the data to analyse when someone reports a problem. One way to do this is to put an entry into a "crontab" file for use by the "cron" utility, which runs jobs at regular intervals. An entry in the system administrator's "crontab", scheduled for midnight, could run a command like the following for 24 hours, collecting measurements every minute (a full crontab entry is sketched below). If once a minute is too frequent, just adjust the time and number values by the same ratio, so that the two multiplied together still give 86400 seconds (24 hours).
  • /usr/lib/sa/sadc 60 1440 /usr/adm/sa/`uname -n`_`date +%y%m%d`.sad &
This will create a file named after the system and the current date in a central directory. Make sure there is enough space in that location, or change it to somewhere else. You will also need to do some housekeeping once in a while to purge old data sets; this could be another cron job that deletes files older than one or two months.
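As a sketch, the full crontab entries for both of these might look like the following - note that '%' characters have a special meaning to cron and must be escaped as '\%', and that the trailing '&' is unnecessary under cron. The first five fields are minute, hour, day of month, month, and day of week, and the directory and 60-day age are just illustrations:
  • 0 0 * * * /usr/lib/sa/sadc 60 1440 /usr/adm/sa/`uname -n`_`date +\%y\%m\%d`.sad
  • 0 1 * * 0 find /usr/adm/sa -name '*.sad' -mtime +60 -exec rm {} \;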

I have found SADC and SAR invaluable in a number of circumstances where I have been able to show that it was the underlying hardware that had performance issues, and not the application itself. And I have used them the opposite way too, to show that a poorly performing application was neither starved of system resources nor being slowed down by them. On most UNIX systems the impact of running SADC is minimal, as the UNIX kernel is already collecting all of these measurements anyway. The only overhead is SADC reading them from memory and writing them to disk at the requested frequency, which is generally negligible compared to that of the application software itself.