Using awk

Fields

Normally, awk reads its input one line, or record, at a time. A record is, by default, a sequence of characters ending with a newline character. awk then splits each record into fields; by default, a field is a string of non-blank, non-tab characters.

As input for many of the awk programs in this chapter, we use the file countries. Each record contains the name of a country, its area in thousands of square miles, its population in millions, and the continent on which it is found. (The data is from 1978; the CIS, the former USSR, has been arbitrarily placed in Asia.) In the original input, the white space between fields is a tab; a single blank separates "North" and "South" from "America". The file contains the following records:

   CIS              8650	262	Asia
   Canada           3852	24	North America
   China            3692	866	Asia
   USA              3615	219	North America
   Brazil           3286	116	South America
   Australia        2968	14	Australia
   India            1269	637	Asia
   Argentina        1072	26	South America
   Sudan             968	19	Africa
   Algeria           920	18	Africa

This file is typical of the kind of data awk is good at processing: a mixture of words and numbers separated into fields by blanks and tabs.

The number of fields in a record is determined by the field separator. Fields are normally separated by sequences of blanks and/or tabs, so the first record of countries has four fields, the second has five (the blank in "North America" splits the continent into two fields), and so on. It is possible to set the field separator to just a tab, so that each record has four fields, matching the meaning of the data. We explain how to do this shortly. For the time being, let's use the default: fields separated by blanks or tabs. The first field within a record is called $1, the second $2, and so forth. The entire record is called $0.
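
As a minimal sketch of the field notation just described, the following shell commands build a three-record excerpt of countries and print fields from it. The file name countries.sample and the use of printf to create it are assumptions made for illustration only. NF, awk's built-in count of the fields in the current record, is used here simply to check the field counts given above.

```shell
# Build a three-record excerpt of countries (hypothetical file name);
# \t stands for the tab between fields, as in the original input.
printf 'CIS\t8650\t262\tAsia\n'            >  countries.sample
printf 'Canada\t3852\t24\tNorth America\n' >> countries.sample
printf 'Sudan\t968\t19\tAfrica\n'          >> countries.sample

# Print the first and second fields of every record.
awk '{ print $1, $2 }' countries.sample
# CIS 8650
# Canada 3852
# Sudan 968

# Print the first field and the number of fields in every record.
awk '{ print $1, NF }' countries.sample
# CIS 4
# Canada 5
# Sudan 4

# $0 is the whole record, so this reproduces the file unchanged.
awk '{ print $0 }' countries.sample
```

With the default separator, the Canada record reports five fields because the blank in "North America" is treated as a field boundary, just as a tab is.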