Automating frequent tasks

How programs perform

A rule of thumb in programming, borne out by long experience, is that in most programs the computer spends about 90% of its time executing about 10% of the code. A second is that as programs age and are maintained, the changes made to them tend to add complexity to the original structure and reduce its efficiency. In this section, we'll look at program performance and ways of improving it.

The flow of control within a program is determined by two types of construct: the loop and the branch. In batch programs such as filters, the two are used together, so that the program does something like this:

   # generic filter program
   #
   read command line arguments
   using getopts, for each flag {
   	set a variable
   	}
   open input and output files
   while (input != FALSE) {
   	read in some data
   	do something with it
   	write it to the output file
   	if an error occurred, exit with a message
   	}
   close input and output files
   exit

The first action taken by this generic program is to check its command line for flags. Using a loop, it reads through each argument in turn and sets up any internal variables it needs. This loop is only used by the program when it starts up; for this reason it is called initialization code.
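As a rough illustration of initialization code, the flag-parsing loop might look like the following in a POSIX shell. This is only a sketch: the function name parse_args, the program name filter, and the flags -v and -o are hypothetical and stand in for whatever options a real program accepts.

```shell
# Initialization code: runs once, before any data is processed.
# parse_args, -v and -o are hypothetical names used for illustration.
parse_args() {
    verbose=0
    outfile=""
    OPTIND=1                      # reset so the function can be called again
    while getopts "vo:" flag; do
        case "$flag" in
            v) verbose=1 ;;       # a flag with no argument
            o) outfile=$OPTARG ;; # a flag that takes an argument
            *) echo "usage: filter [-v] [-o outfile] [file]" >&2
               return 2 ;;
        esac
    done
    shift $((OPTIND - 1))         # discard the flags already handled
    infile=${1:-}                 # first remaining argument, if any
}

parse_args -v -o out.txt in.txt
echo "verbose=$verbose outfile=$outfile infile=$infile"
# prints: verbose=1 outfile=out.txt infile=in.txt
```

However many lines of input follow, this loop runs only once per flag, so it contributes very little to the total running time.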

Having "parsed" its arguments, the program now opens its data files. An input and an output file are the lowest common denominator; some programs open several files each for input and output, but this is a simple, generic example. Again, opening the files is only carried out once. Note that in a real program each attempt to open a file will be enclosed in an if construct that checks for errors; if the attempt fails, the else part of the if construct usually causes the program to exit with an error message.
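The error check around each open can be sketched in the shell as follows. The helper name open_input and the choice of file descriptor 3 are assumptions made for the example. The sketch tests readability before redirecting because some shells exit outright when a redirection on exec fails.

```shell
# Open a file for reading on descriptor 3, or fail with a message.
# open_input is a hypothetical helper name.
open_input() {
    if [ -r "$1" ]; then
        exec 3< "$1"              # descriptor 3 now reads from the file
    else
        echo "filter: cannot open $1 for reading" >&2
        return 1                  # a real script would usually 'exit 1' here
    fi
}
```

A caller might write open_input "$infile" || exit 1, then read from the file with read -r line <&3 and close it with exec 3<&- when finished.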

The program now enters a loop, reading data from the input file, doing something to it, and writing it to the output file, for as long as input is available. (By convention, an operation that succeeds usually returns a value of 0; in the shell, a loop test treats an exit status of 0 as true, so the loop continues for as long as the read succeeds.) This is the meat of the program: it is where the activity for which the program was written takes place, and it is repeated a number of times proportional to the amount of data in the input files.
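A minimal shell sketch of such a main loop is shown here. The function name process is hypothetical, and converting each line to upper case merely stands in for "do something with it". The read command returns 0 while it can fetch a line and a non-zero status at end of input, which is what ends the loop.

```shell
# Main loop: read a line, transform it, write it, stop on error.
# process is a hypothetical name; upper-casing is a placeholder operation.
process() {
    while IFS= read -r line; do
        out=$(printf '%s\n' "$line" | tr '[:lower:]' '[:upper:]') || {
            echo "filter: processing failed" >&2
            return 1
        }
        printf '%s\n' "$out" || return 1   # give up if the write fails
    done
}

printf 'one\ntwo\n' | process
# prints:
# ONE
# TWO
```

Everything inside the while loop runs once per input line, so this is the part of the script whose cost grows with the size of the input. Note that this sketch starts a new tr process for every line, which is simple to read but far from cheap.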

When the program can no longer read any more input, it exits the main loop and executes its termination code. Termination code is used to tidy up after the main loop: to close open files and write a final message to the output. (The command wc, which counts words, uses its termination code to print out a final sum of all the words it counted in its main loop.) This section of the program, like the initialization code, is only executed once.
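The division between main loop and termination code can be sketched with a small word counter in the spirit of wc -w. This is an illustration only, not how wc itself is implemented, and the function name count_words is hypothetical.

```shell
# Count words on standard input; print the total only after the loop ends.
# The body runs in a subshell so that 'set -f' does not leak to the caller.
count_words() (
    set -f                        # disable filename expansion of * and ?
    total=0
    while read -r line; do
        set -- $line              # unquoted on purpose: split into words
        total=$((total + $#))
    done
    # Termination code: executed once, when no more input can be read.
    echo "$total"
)

printf 'one two\nthree\n' | count_words
# prints: 3
```

The running total is updated on every pass through the loop, but the echo that reports it runs exactly once, however large the input.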

This program structure is not universal, but it is sufficiently common to be worth using as a model to demonstrate how to tune your programs, and it accounts for the vast majority of shell scripts and non-interactive filters. While shell scripts rarely open data files and process them directly, they frequently invoke other programs which do just that; consequently, the same general techniques for improving performance are applicable to them.