A multiuser system in a company serves approximately 30 employees running a variety of packages including a simple database application (non-relational), an accounting package, and word processing software. At peak usage, there are complaints from users that response is slow and that characters are echoed with a noticeable time delay.
The system configuration is:
The system administrator is tasked with improving the interactive performance of the system. Funds are available for upgrading the machine's subsystems if sufficient need is demonstrated. Any change to the system must be undertaken with minimal disruption to the users.
The administrator ensures that system accounting is enabled using sar_enable(ADM), and produces reports of system activity at five-minute intervals during the working week by placing the following line in root's crontab(C) file:
    0 8-18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:00 -i 300 -A
The administrator notes the times at which users report that the system response is slow and examines the corresponding operating system activity in the report (effectively using sar -u):
    08:00:00    %usr    %sys    %wio   %idle
    ...
    11:10:00      42      46       4       8
    11:15:00      40      49       6       5
    11:20:00      38      50       7       5
    11:25:00      41      47       5       7
    ...
The system is spending a large amount of time in system mode and little time idle or waiting for I/O.
The length of the run queue shows that an unacceptably large number of user processes are lined up for running (sar -q statistics):
    08:00:00 runq-sz %runocc swpq-sz %swpocc
    ...
    11:10:00     4.3      85
    11:15:00     7.8      98
    11:20:00     5.0      88
    11:25:00     3.5      72
    ...
An acceptable number of processes on the run queue would be two or fewer.
At times when the system response seems acceptable, the system activity has the following pattern:
    08:00:00    %usr    %sys    %wio   %idle
    ...
    16:40:00      55      20       0      25
    16:45:00      52      25       2      21
    16:50:00      59      20       1      20
    16:55:00      54      21       2      23
    ...
This shows that the system spends little time waiting for I/O and a large proportion of time in user mode. The %idle figure shows at least 20% spare CPU capacity on the system.
The run queue statistics also show that user processes are getting fair access to run on the CPU:
    08:00:00 runq-sz %runocc swpq-sz %swpocc
    ...
    16:40:00     1.0      22
    16:45:00     2.1      18
    16:50:00     1.6       9
    16:55:00     1.1      12
    ...
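The contrast between the slow and the acceptable periods can be summarized numerically. The following is a minimal sketch in Python, not part of the sar toolset: the sample values are copied from the sar -u and sar -q reports above, the limit of two processes on the run queue comes from the text, and the names `summarize` and `run_queue_too_long` are hypothetical helpers introduced only for illustration.

```python
from statistics import mean

# (time, %usr, %sys, %wio, %idle, runq-sz), copied from the reports above.
SLOW = [
    ("11:10:00", 42, 46, 4, 8, 4.3),
    ("11:15:00", 40, 49, 6, 5, 7.8),
    ("11:20:00", 38, 50, 7, 5, 5.0),
    ("11:25:00", 41, 47, 5, 7, 3.5),
]
ACCEPTABLE = [
    ("16:40:00", 55, 20, 0, 25, 1.0),
    ("16:45:00", 52, 25, 2, 21, 2.1),
    ("16:50:00", 59, 20, 1, 20, 1.6),
    ("16:55:00", 54, 21, 2, 23, 1.1),
]

MAX_RUNQ = 2.0  # "two or fewer" processes on the run queue, as stated in the text


def summarize(samples):
    """Return the mean %sys, %idle and runq-sz over a list of samples."""
    return {
        "sys": mean(s[2] for s in samples),
        "idle": mean(s[4] for s in samples),
        "runq": mean(s[5] for s in samples),
    }


def run_queue_too_long(samples):
    """True if the mean run queue length exceeds the acceptable limit."""
    return summarize(samples)["runq"] > MAX_RUNQ


for label, samples in (("slow", SLOW), ("acceptable", ACCEPTABLE)):
    s = summarize(samples)
    print(f"{label:10s} %sys={s['sys']:.1f} %idle={s['idle']:.1f} "
          f"runq-sz={s['runq']:.2f} too_long={run_queue_too_long(samples)}")
```

Averaging over the period, rather than testing each sample, avoids flagging a single interval such as 16:45:00, where runq-sz is only marginally above two.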
From the CPU utilization statistics, it looks as though the system is occasionally spending too much time in system mode. This could be caused by memory shortages or too much overhead placed on the CPU by peripheral devices. The low waiting on I/O figures imply that memory shortage is not a problem. If the system were swapping or paging, this would usually generate much more disk activity.
The administrator next examines the memory, disk and serial I/O subsystems to check on their performance.
The memory usage figures for the period when the proportion of time spent in system mode (%sys) was high show the following pattern (sar -r statistics):
    08:00:00 freemem freeswp
    ...
    11:10:00    1570  131072
    11:15:00    1612  131072
    11:20:00    1598  131072
    11:25:00    1598  131072
    ...
The value of GPGSHI for this system is 300, well below the freemem figures, and none of the swap space is allocated to processes; there is no apparent evidence of swapping or paging to disk. This is confirmed by examining the reports for sar -w:
    08:00:00 swpin/s bswin/s swpot/s bswot/s pswch/s
    ...
    11:10:00    0.04     0.2    0.00     0.0      51
    11:15:00    0.02     0.1    0.00     0.0      63
    11:20:00    0.00     0.0    0.00     0.0      56
    11:25:00    0.01     0.1    0.00     0.0      66
    ...
The zero values for swpot/s and bswot/s indicate that there was no swapping out activity.
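The memory check described above amounts to two tests per interval: free memory must stay above GPGSHI, and the swap-out rates must be zero. A minimal sketch of that check follows, using the values from the sar -r and sar -w reports above. The function name `memory_pressure` is a hypothetical helper, and treating "at or below GPGSHI" as the warning condition is an assumption made for this illustration.

```python
GPGSHI = 300  # value quoted in the text for this system

# (time, freemem, swpot/s, bswot/s), copied from the sar -r and sar -w reports above.
SAMPLES = [
    ("11:10:00", 1570, 0.00, 0.0),
    ("11:15:00", 1612, 0.00, 0.0),
    ("11:20:00", 1598, 0.00, 0.0),
    ("11:25:00", 1598, 0.00, 0.0),
]


def memory_pressure(samples, gpgshi=GPGSHI):
    """Return the times at which freemem is at or below gpgshi,
    or at which any swap-out activity was recorded."""
    return [
        time
        for time, freemem, swpot, bswot in samples
        if freemem <= gpgshi or swpot > 0 or bswot > 0
    ]


print(memory_pressure(SAMPLES))
```

An empty list means that none of the sampled intervals shows signs of memory shortage.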
Examining the sar -q, sar -r and sar -w reports at other times shows occasional short periods of paging activity but these are correlated with batch payroll runs. It should be possible to reduce the impact of these on the system by rescheduling the jobs to run overnight.
The administrator next examines the buffer cache usage statistics for the same period (sar -b statistics):
    08:00:00 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
    ...
    11:10:00      27     361      93       5      16      68       0       0
    11:15:00      35     320      89       7      22      66       0       0
    11:20:00      22     275      92       5      15      65       0       0
    11:25:00      22     282      96       9      27      67       0       0
    ...
These figures show hit rates on the buffer cache of about 90% for reads and 65% for writes. Approximately 30KB of data (bread/s + bwrit/s) is being read from or written to disk per second.
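The hit rates and the disk throughput can be approximated from the other columns of the same report. The sketch below assumes the usual sar definition of a cache hit rate, 1 - (physical / logical) expressed as a percentage, and assumes 1KB blocks, which is what the "30KB" figure in the text implies. The function names are hypothetical. Because the per-second rates in the report are rounded, the recomputed percentages can differ by a few points from the %rcache and %wcache columns.

```python
BLOCK_KB = 1  # assumption: bread/s and bwrit/s count 1KB blocks

# (time, bread/s, lread/s, bwrit/s, lwrit/s), copied from the sar -b report above.
SAMPLES = [
    ("11:10:00", 27, 361, 5, 16),
    ("11:15:00", 35, 320, 7, 22),
    ("11:20:00", 22, 275, 5, 15),
    ("11:25:00", 22, 282, 9, 27),
]


def hit_rate(logical, physical):
    """Percentage of logical requests satisfied from the buffer cache."""
    if logical == 0:
        return 0.0  # no requests in the interval, so no hits to report
    return 100.0 * (1 - physical / logical)


def disk_kb_per_sec(bread, bwrit):
    """Data moved between the buffer cache and disk, in KB per second."""
    return (bread + bwrit) * BLOCK_KB


for time, bread, lread, bwrit, lwrit in SAMPLES:
    print(f"{time} read hit {hit_rate(lread, bread):.0f}% "
          f"write hit {hit_rate(lwrit, bwrit):.0f}% "
          f"disk {disk_kb_per_sec(bread, bwrit)}KB/s")
```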
Disk performance is examined next using the statistics provided by sar -d:
    08:00:00 device   %busy  avque  r+w/s  blks/s  avwait  avserv
    11:10:00 Sdsk-0    0.91   3.70   2.37   13.15   12.42    4.60
             Sdsk-1   25.01   1.62  11.39   55.21    3.26    5.30
    11:15:00 Sdsk-0    0.57   2.58   1.37    6.98   13.05    8.26
             Sdsk-1   24.10   1.43  10.93   50.42    3.11    7.23
    11:20:00 Sdsk-0    0.81   2.42   1.98   11.01    9.55    6.72
             Sdsk-1   21.77   1.85   6.05   39.11    4.54    5.37
    11:25:00 Sdsk-0    0.76   3.90   2.00    9.52   14.18    4.89
             Sdsk-1   20.24   2.07   5.83   34.87   10.60    9.91
These results show that the busiest disk (Sdsk-1) has acceptable performance with a reasonably short request queue, acceptable busy values, and low wait and service times. The pattern of activity on the root disk (Sdsk-0) is such that the request queue is longer since requests are tending to arrive in bursts. There is no evidence that the system is disk I/O bound, though it may be possible to improve the interactive performance of some applications by increasing the buffer cache hit rates.
Based on the evidence given above, the system would benefit from increasing the number of buffers in the buffer cache. Although the system does not show much sign of being disk I/O bound (sar -u shows %wio less than 15% at peak load), applications are placing a reasonably heavy demand on the second SCSI disk (Sdsk-1). This will affect the interactive response of programs which have to sleep if the data being requested cannot be found in the buffer cache. As the system does not appear to be short of memory at peak load, the system administrator may wish to experiment with doubling the size of the buffer cache by setting NBUF to 6000. Based on the evidence from sar -r that approximately 6MB (1500 4KB pages) of memory are free at peak load, doubling the size of the buffer cache will reduce this value to about 3MB.
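The memory estimate above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the current NBUF of 3000 is inferred from "doubling ... to 6000", each buffer is assumed to occupy 1KB (which is what the drop from about 6MB to about 3MB implies), and any per-buffer header overhead is ignored. The function name is hypothetical.

```python
PAGE_KB = 4    # page size given in the text
BUFFER_KB = 1  # assumption: each buffer occupies 1KB of memory
GPGSHI = 300   # value quoted in the text for this system


def free_pages_after_resize(free_pages, old_nbuf, new_nbuf):
    """Estimate free memory (in pages) after changing the number of buffers."""
    extra_kb = (new_nbuf - old_nbuf) * BUFFER_KB
    return free_pages - extra_kb / PAGE_KB


# 1500 free pages at peak load; NBUF raised from an inferred 3000 to 6000.
remaining = free_pages_after_resize(1500, 3000, 6000)
print(f"{remaining:.0f} pages free, about {remaining * PAGE_KB / 1024:.1f}MB")
print("above GPGSHI" if remaining > GPGSHI else "at or below GPGSHI")
```

Under these assumptions the estimated free memory remains above GPGSHI, which supports treating the change as an experiment to be monitored rather than an obvious risk.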
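If the size of the buffer cache is increased, the system should be monitored to see:
    - whether the time spent waiting for I/O changes (%wio reported by sar -u)
    - whether the buffer cache hit rates change (%rcache and %wcache reported by sar -b)
    - whether disk activity changes (%busy reported by sar -d)
    - whether memory is becoming short (freemem is dropping near to or below the value of GPGSHI)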
If the interactive performance of applications is still less than desired, another possibility is to use intelligent serial I/O cards to relieve the processing overhead on the CPU. The serial multiport cards use 16450 UARTs and were previously used in two less powerful systems. It is possible that the CPU is spending too much time moving characters out to the serial lines on behalf of the serial cards. The CPU will do this whenever the applications need to refresh terminal screens to update database forms, word processor displays and so on.