Tuning memory resources

Case study: memory-bound software development system

This case study examines a system on which a large number of software developers repeatedly edit, compile, and run programs. It is therefore a system on which a relatively small number of CPU-intensive jobs are constantly run. These jobs are likely to require a considerable amount of I/O, memory, and CPU time: compilers are usually large programs that access many files, create large data structures, and can be a source of memory contention problems.

System configuration

The system's configuration is as follows:

The system is isolated from other machines on the company's LAN with the connection primarily being used for e-mail traffic. No remote filesystems are mounted or exported.

Defining a performance goal

At installation, the system administrator knew the type of work that the machine would need to process, and so set the goal of maximizing I/O throughput. This was achieved by tuning the system so that the amount of time the machine spent performing disk I/O would be as low as possible. The system administrator also set up an entry in root's crontab file to record system data at one-minute intervals during the working day:

   * 8-18 * * 1-5 /usr/lib/sa/sa1
Recently, through observation and complaints from others, it has become apparent that the system is slowing down: in particular, the system's users have experienced slow response time. The goal is to restore the system to its initial performance.

Collecting data

The system administrator runs sar to examine the statistics that have been collected for the system. The following output is an example showing the CPU utilization at peak load when the system was first tuned:

   08:00:00    %usr    %sys    %wio   %idle
   14:06:00      53      25       2      20
   14:07:00      55      23       1      21
   14:08:00      52      20       3      25
Examining the situation now shows a much different pattern of usage:
   08:00:00    %usr    %sys    %wio   %idle
   10:51:00      35      37      28       0
   10:52:00      29      44      26       1
   10:53:00      32      38      30       0
The %wio figure is high (consistently greater than 15%) which indicates a possible I/O bottleneck. The cause of this could be related to the demands of the applications being run, or it could also be caused by swapping activity if the system is short of memory.
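A report like this can be scanned mechanically. The following sketch flags any sample whose %wio exceeds a threshold; the awk filter and the 15% cutoff are illustrative, not part of sar, and the recent figures are embedded here directly where in practice the filter would read sar's output:

```shell
# Illustrative filter: print sar -u samples whose %wio (field 4)
# exceeds a threshold. The data is the sar output shown above;
# NR > 1 skips the header line.
awk -v limit=15 'NR > 1 && $4 > limit { print $1 " %wio=" $4 }' <<'EOF'
08:00:00    %usr    %sys    %wio   %idle
10:51:00      35      37      28       0
10:52:00      29      44      26       1
10:53:00      32      38      30       0
EOF
```

With these figures all three samples are flagged, confirming that the high %wio is sustained rather than a momentary spike.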

Formulating a hypothesis

The sar -u report shows that the system is spending a greater proportion of its time waiting for I/O and in system mode.

If the system is memory bound, this may also be causing a disk I/O bottleneck. Alternatively, if the problem is predominantly I/O based, the slowness could be caused by uneven activity across the system's disks, or by slow controllers and disk drives being unable to keep up with demand. Another possibility is that the buffer cache may not be large enough to cope with the number of different files being compiled and the number of libraries being loaded.

If the problem is lack of memory, it could be that the system is constantly paging and swapping. Paging out to the swap areas need not be a major cause of performance degradation, but swapping out is usually an indication that there is a severe memory shortage. As a consequence, disk I/O performance can degrade rapidly if the disks are busy handling paging and swapping requests. In this way, high memory usage can lead very quickly to disk I/O overload. It also requires the kernel to expend more CPU time handling the increased activity. Preventing memory shortages helps to improve disk I/O performance and increases the proportion of CPU time available to user processes.

Getting more specifics

To confirm the hypothesis that the system is memory bound, the system administrator next examines the performance of the memory and I/O subsystems.

Memory investigation

The system administrator uses sar -r to report on the number of memory pages and swap file disk blocks that are currently unused:

   08:00:00 freemem freeswp
   10:51:00      44    2056
   10:52:00      42    1720
   10:53:00      41    1688
Since the number of free pages in the freemem column is consistently near the value defined for the GPGSHI kernel parameter (40), the page stealing daemon is probably active. Sharp drops in the amount of free swap space, freeswp, also indicate that the stack and modified data pages of processes are being moved to disk.

The average value of freemem indicates that there is only about 170KB of free memory available on the system (roughly 42 pages of 4KB each). This is very low considering that the total physical memory size is 24MB. The system is also dangerously close to running out of swap space: the average value of freeswp indicates that only about 910KB remains on the swap device. The shortage of swap space is even more apparent when the swap -l command is run:

   path               dev  swaplo blocks   free
   /dev/swap          1,41      0  96000   1688
Only 1688 disk blocks remain unused out of the 96000 configured on the swap device.
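The sizes quoted here can be reproduced with a little shell arithmetic. A minimal sketch, assuming a 4KB page size and 512-byte disk blocks (both conventional values, but worth verifying on the system in question):

```shell
# Back-of-the-envelope conversions for the figures above. The 4KB page
# size is an assumption (verify it for the hardware in use); freeswp and
# the swap -l counts are both in 512-byte disk blocks.
freemem=42       # average free pages reported by sar -r
freeswp=1821     # average free swap blocks reported by sar -r
total=96000      # swap device size in blocks, from swap -l
free=1688        # free swap blocks, from swap -l
echo "free memory: $(( freemem * 4096 / 1024 ))KB"
echo "free swap:   $(( freeswp * 512 / 1024 ))KB"
echo "swap in use: $(( (total - free) * 100 / total ))%"
```

This yields roughly 168KB of free memory, 910KB of free swap, and a swap device that is about 98% full.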

More evidence of swapping is found by running sar -q:

   08:00:00 runq-sz %runocc swpq-sz %swpocc
   10:51:00     2.7      98     1.0      36
   10:52:00     2.0      63     3.0      31
   10:53:00     2.0      58     1.0      49
The non-zero values in the swpq-sz and %swpocc columns indicate that processes which are ready to run have been swapped out.

To see evidence of swapping activity, the administrator uses sar -w:

   08:00:00 swpin/s bswin/s swpot/s bswot/s pswch/s
   10:51:00    0.52    12.1    1.01    19.2      72
   10:52:00    1.21    22.5    3.02    37.4      55
   10:53:00    0.71    15.2    0.97     7.3      83
The values of swpot/s and bswot/s are both well above zero. This shows that the system was frequently swapping during the sampling period.
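To put a number on ``well above zero'', the swap-out column can be averaged with awk; a sketch using the figures above (in practice the data would come straight from sar -w):

```shell
# Average the swpot/s column (field 4) of the sar -w samples above;
# NR > 1 skips the header line. Any sustained non-zero average means
# the system is swapping processes out.
awk 'NR > 1 { swpot += $4; n++ }
     END    { printf "average swpot/s = %.2f\n", swpot / n }' <<'EOF'
08:00:00 swpin/s bswin/s swpot/s bswot/s pswch/s
10:51:00    0.52    12.1    1.01    19.2      72
10:52:00    1.21    22.5    3.02    37.4      55
10:53:00    0.71    15.2    0.97     7.3      83
EOF
```

Here the average is 1.67 swap-outs per second, sustained across the whole sampling period.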

The evidence confirms the comments of the users, and the suspicions of the administrator, that the system has a memory shortage.

I/O investigation

To investigate further, the system administrator uses sar -b to display statistics about the buffer cache:

   08:00:00 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
   10:51:00     239     723      67       7      16      58       0       0
   10:52:00     448    1280      65      10      22      56       0       0
   10:53:00     374    1100      66      11      25      57       0       0
``%rcache'' averaging 66% and ``%wcache'' averaging 57% indicate low hit rates in the buffer cache. Although these figures are low, the priority must be to relieve the memory shortage in the system. Tuning the buffer cache hit rate must be left to a later stage, as increasing the buffer cache size would further reduce the amount of available memory.
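sar derives these hit rates from the logical and physical transfer counts: %rcache = 100 * (lread - bread) / lread, and %wcache is computed analogously from the write columns. The relationship can be checked with awk using the 10:51:00 sample; small differences from the printed report arise because sar averages over the whole sampling interval:

```shell
# %rcache = 100 * (lread - bread) / lread
# %wcache = 100 * (lwrit - bwrit) / lwrit
# Figures taken from the 10:51:00 sar -b sample above.
awk 'BEGIN {
    bread = 239; lread = 723
    bwrit = 7;   lwrit = 16
    printf "%%rcache = %.0f\n", 100 * (lread - bread) / lread
    printf "%%wcache = %.0f\n", 100 * (lwrit - bwrit) / lwrit
}'
```

With these figures the read hit rate works out at 67% and the write hit rate at 56%, close to the values sar reports.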

Finally, the system administrator checks disk I/O performance using sar -d:

   08:00:00 device   %busy    avque    r+w/s   blks/s   avwait   avserv
   10:51:00 Sdsk-0   85.42     1.89    39.39   166.28    80.26    25.24
   10:52:00 Sdsk-0   86.00     1.79    38.73   163.64    82.35    25.87
            Sdsk-1   10.01     1.16    12.37    23.11     3.24    20.19
   10:53:00 Sdsk-0   87.00     1.92    38.07   171.95    78.32    26.32
The value of ``avque'' for the root disk (Sdsk-0) is consistently greater than 1.0, and the disk is continually more than 80% busy. This indicates that the device is spending too much time servicing transfer requests, and that the average number of outstanding requests is too high. This activity is the combined result of paging and swapping, and of disk access by user processes. The second SCSI disk (Sdsk-1) does not contain a swap area and is much less active.

Making adjustments to the system

Since the system is both memory and I/O bound, it is likely that disk I/O performance is being made worse by the constant paging and swapping, so the sensible approach is to attack the memory problems first. As the system is almost completely out of memory and swap space, increasing these resources is a priority.

There are several ways to increase the amount of memory available to user processes on this system:

Solutions that will not help include:

© 2003 Caldera International, Inc. All rights reserved.
SCO OpenServer Release 5.0.7 -- 11 February 2003