Normally, awk reads its input one line, or record, at a time. A record is, by default, a sequence of characters ending with a newline character. awk then splits each record into fields; by default, a field is a string of non-blank, non-tab characters.
As input for many of the awk programs in this chapter, we use the file countries. Each record contains the name of a country, its area in thousands of square miles, its population in millions, and the continent on which it is found. (Data is from 1978; the CIS (former USSR) has been arbitrarily placed in Asia.) The white space between fields is a tab in the original input; a single blank separates both "North" and "South" from "America". The contents of the file are as follows:
CIS	8650	262	Asia
Canada	3852	24	North America
China	3692	866	Asia
USA	3615	219	North America
Brazil	3286	116	South America
Australia	2968	14	Australia
India	1269	637	Asia
Argentina	1072	26	South America
Sudan	968	19	Africa
Algeria	920	18	Africa
This file is typical of the kind of data awk is good at processing: a mixture of words and numbers separated into fields by blanks and tabs.
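To follow along, the file can be recreated from the shell. The commands below are a minimal sketch, not part of the original text: they assume a POSIX printf, which reuses its format string for each group of four arguments, and they write the file countries in the current directory with a single tab between fields.

```shell
# Recreate the countries file with one tab between fields.
# printf repeats the format for every group of four arguments.
printf '%s\t%s\t%s\t%s\n' \
    CIS 8650 262 Asia \
    Canada 3852 24 'North America' \
    China 3692 866 Asia \
    USA 3615 219 'North America' \
    Brazil 3286 116 'South America' \
    Australia 2968 14 Australia \
    India 1269 637 Asia \
    Argentina 1072 26 'South America' \
    Sudan 968 19 Africa \
    Algeria 920 18 Africa > countries

# NR holds the number of records read so far; at END it is the total.
awk 'END { print NR }' countries    # prints 10
```

Quoting 'North America' and 'South America' keeps the blank inside a single printf argument, so it ends up inside the fourth tab-separated field.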
The number of fields in a record is determined by the field separator. Fields are normally separated by sequences of blanks and/or tabs, so the first record of countries has four fields, the second five, and so on. It is possible to set the field separator to just tab, so each line has four fields, matching the meaning of the data. We explain how to do this shortly. For the time being, let's use the default: fields separated by blanks or tabs. The first field within a line is called $1, the second $2, and so forth. The entire record is called $0.
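The effect of the field separator can be checked on the first two records of countries. This is a small sketch under stated assumptions: the file name sample is hypothetical, and the -F option is used here only as one way to set a tab-only separator. NF is awk's built-in count of the fields in the current record.

```shell
# Two records from countries, with a single tab between fields.
printf 'CIS\t8650\t262\tAsia\nCanada\t3852\t24\tNorth America\n' > sample

# Default separator (runs of blanks and/or tabs):
# "North America" is split into two fields.
awk '{ print NF, $1, $2 }' sample
# 4 CIS 8650
# 5 Canada 3852

# Tab-only separator: every record has four fields.
awk -F'\t' '{ print NF, $4 }' sample
# 4 Asia
# 4 North America

# $0 is the entire record, unchanged.
awk '{ print $0 }' sample
```

With the default separator the continent of the second record is spread over $4 and $5; with a tab-only separator it is always $4.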