A string constant is created by enclosing a sequence of characters inside quotation marks, as in ``abc'' or ``hello, everyone''. String constants can contain the C programming language escape sequences for special characters listed in ``Regular expressions''.
String expressions are created by concatenating constants, variables, field names, array elements, functions, and other expressions. The following program prints each record preceded by its record number and a colon, with no blanks:
    { print NR ":" $0 }
This concatenates the three strings representing the record number, the colon, and the record, and prints the resulting string.
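
The difference between concatenation and a comma-separated print list can be sketched as follows. The shell function name show_concat and the sample records are made up for illustration; the sketch assumes the default output field separator (a single blank).

```sh
# Hypothetical helper: prints each record twice, once using
# concatenation and once using a comma-separated print list.
show_concat() {
    awk '{
        print NR ":" $0      # concatenation: no blanks are inserted
        print NR, ":", $0    # commas: items are separated by OFS
    }'
}

# Made-up sample input.
printf 'alpha\nbeta\n' | show_concat
```

With the default OFS, the second print statement places a blank on each side of the colon, which is why the original program uses concatenation instead.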
awk provides the built-in string functions shown in ``awk string functions''. In this table, r represents a regular expression, s and t are string expressions, and n and p are integers.
awk string functions
Function | Description |
---|---|
getline | reads next line of input |
gsub(r,s) | substitutes s for r globally in current record, returns number of substitutions |
gsub(r,s,t) | substitutes s for r globally in string t, returns number of substitutions |
index(s,t) | returns position of string t in s, 0 if not present |
length(s) | returns length of s |
match(s,r) | returns the position in s where r occurs, 0 if not present; see built-in variables RSTART and RLENGTH |
split(s,a) | splits s into array a on FS, returns number of fields |
split(s,a,r) | splits s into array a on r, returns number of fields |
sprintf(fmt,expr-list) | returns expr-list formatted according to format string fmt |
sub(r,s) | substitutes s for first r in current record, returns number of substitutions |
sub(r,s,t) | substitutes s for first r in t, returns number of substitutions |
substr(s,p) | returns suffix of s starting at position p |
substr(s,p,n) | returns substring of s of length n starting at position p |
tolower(s) | returns s translated into lowercase |
toupper(s) | returns s translated into uppercase |
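
As a minimal sketch of three functions from the table that are not discussed further in this section, the following splits a record on a colon and changes the case of the first and last pieces. The function name describe_fields, the array name part, and the input line are assumptions chosen for illustration.

```sh
# Hypothetical helper: reports the number of pieces, the first piece
# in uppercase, and the last piece in lowercase.
describe_fields() {
    awk '{
        n = split($0, part, ":")    # part[1] .. part[n] hold the pieces
        print n, toupper(part[1]), tolower(part[n])
    }'
}

# Made-up sample input.
printf 'Asia:Europe:Africa\n' | describe_fields
```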
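The getline function reads the next record of input, as in the following program:
    {
        print "skipping record for ", $1
        getline
        print "going to record for ", $1
    }
This code reads a record, prints the first message, and then calls getline, which reads the next record into $0 without restarting the pattern-action cycle, so the second message names that next record: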
    skipping record for CIS
    going to record for Canada
    skipping record for China
    ...
For more information on getline, see ``Multiline records and the getline function''.
The functions sub and gsub are patterned after the substitute command in the text editor ed(C). The function gsub(r,s,t) replaces successive occurrences of substrings matched by the regular expression r with the replacement string s in the target string t. (As in ed, the left-most match is used and is made as long as possible.) gsub returns the number of substitutions made. The function gsub(r,s) is a synonym for gsub(r,s,$0). For example, the following program transcribes its input, replacing occurrences of ``USA'' with ``United States'':
    { gsub(/USA/, "United States"); print }
Note that passing the gsub call directly to print has an unexpected effect:
    { print gsub(/USA/, "United States", $0) }
Instead of the modified record, the return value of gsub (the number of substitutions made) is printed for each record:
    0
    0
    0
    1
    0
    0
    0
    0
    0
    0
In this case, only the fourth record of countries contains the string ``USA''; every other record yields a return value of 0.
The sub functions are similar to gsub, except that they only replace the first matching substring in the target string.
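
The difference between sub and gsub, and the counts they return, can be sketched as follows. The function name compare_sub_gsub and the variable names first and all are hypothetical; each function is applied to a copy of the record so that $0 itself is left unchanged.

```sh
# Hypothetical helper: applies sub and gsub to copies of each record
# and prints the number of substitutions next to the result.
compare_sub_gsub() {
    awk '{
        first = $0; all = $0           # copies of the record
        n1 = sub(/a/, "A", first)      # replaces only the first match
        n2 = gsub(/a/, "A", all)       # replaces every match
        print n1, first
        print n2, all
    }'
}

# Made-up sample input.
printf 'banana\n' | compare_sub_gsub
```

For the record ``banana'', sub reports one substitution and gsub reports three.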
The function index(s,t) returns the left-most position where the string t begins in s, or zero if t does not occur in s. The first character in a string is at position 1. For example, the following command returns 2:
    { print index("banana", "an") }
The length function returns the number of characters in its argument string; thus, the following prints each record, preceded by its length:
    { print length($0), $0 }
($0 does not include the input record separator.) The following program prints the longest country name (``Australia''):
    length($1) > max { max = length($1); name = $1 }
    END { print name }
The match(s,r) function returns the position in string s where regular expression r occurs, or 0 if it does not occur. This function also sets two built-in variables, RSTART and RLENGTH. RSTART is set to the starting position of the match in the string; this is the same value as the returned value. RLENGTH is set to the length of the matched string. (If a match does not occur, RSTART is 0, and RLENGTH is -1.) For example, the following program finds the first occurrence of the letter ``i,'' followed by at most one character, followed by the letter ``a'' in a record:
    { if (match($0, /i.?a/)) print RSTART, RLENGTH, $0 }
This program produces the following output from the file countries:
    16 2 CIS 8650 262 Asia
    26 3 Canada 3852 24 North America
    3 3 China 3692 866 Asia
    24 3 USA 3615 219 North America
    27 3 Brazil 3286 116 South America
    8 2 Australia 2968 14 Australia
    4 2 India 1269 637 Asia
    7 3 Argentina 1072 26 South America
    17 3 Sudan 968 19 Africa
    6 2 Algeria 920 18 Africa
Note that the match function matches the left-most longest matching string. For example, if you use the string ``AsiaaaAsiaaaaan'' as an input record, the following program matches the first string of a's and sets RSTART to 4 and RLENGTH to 3:
    { if (match($0, /a+/)) print RSTART, RLENGTH, $0 }
The function sprintf(format, expr1, expr2, ..., exprn) returns (without printing) a string containing expr1, expr2, ..., exprn, formatted according to the printf specifications in the string format.
For a complete specification of these format conventions, see ``The printf statement''.
The following statement assigns to x the string produced by formatting the values of $1 and $2:
    x = sprintf("%10s %6d", $1, $2)
The value of $1 is formatted as a string in a field at least 10 characters wide, and the value of $2 as a decimal number in a field of width at least six; x can be used in any subsequent computation or display operation. For example:
    { x = sprintf("%10s%6d", $1, $2); print x }
This program produces the following output:
           CIS  8650
        Canada  3852
         China  3692
           USA  3615
        Brazil  3286
     Australia  2968
         India  1269
     Argentina  1072
         Sudan   968
       Algeria   920
The function substr(s,p,n) returns the substring of s that begins at position p and is at most n characters long. If substr(s,p) is used, the substring goes to the end of s; that is, it consists of the suffix of s beginning at position p. For example, we could abbreviate the country names in countries to their first three characters by invoking the following program:
    { $1 = substr($1, 1, 3); print }
This produces the following output:
    CIS 8650 262 Asia
    Can 3852 24 North America
    Chi 3692 866 Asia
    USA 3615 219 North America
    Bra 3286 116 South America
    Aus 2968 14 Australia
    Ind 1269 637 Asia
    Arg 1072 26 South America
    Sud 968 19 Africa
    Alg 920 18 Africa
Note that setting $1 in the program forces awk to recompute $0 and, therefore, the fields are separated by blanks (the default value of OFS), not by tabs. Attempting to change the setting of OFS back to a tab character with the command { OFS="\t" } has the following result (only the first two lines are shown):
    CIS     8650    262     Asia
    Can     3852    24      North   America
Note that this has had the undesirable effect of tab-separating ``North'' and ``America'' as well as the genuine fields.
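
One possible way to avoid this effect, sketched here on the assumption that the fields of countries are separated by single tab characters, is to set the input field separator FS to a tab as well, so that ``North America'' is read as one field. The function name abbreviate_tabbed and the sample record are illustrative only.

```sh
# Hypothetical helper: abbreviates the first field of tab-separated
# records while keeping tabs as the only separator.
abbreviate_tabbed() {
    awk 'BEGIN { FS = OFS = "\t" }    # split and rejoin on tabs only
         { $1 = substr($1, 1, 3); print }'
}

# One made-up tab-separated record in the style of countries.
printf 'Canada\t3852\t24\tNorth America\n' | abbreviate_tabbed
```

Because the blank inside ``North America'' is no longer a field separator, the rebuilt record keeps it as a blank.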
Strings are stuck together (concatenated) by writing them one after another in an expression. For example, consider the following program:
    { s = s substr($1, 1, 3) " " }
    END { print s }
When invoked on the file countries, the program prints the following by building s up, one piece at a time, from an initially empty string:
    CIS Can Chi USA Bra Aus Ind Arg Sud Alg
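
Because the program above appends a blank after every piece, the string it builds ends with a trailing blank. A small variation, sketched here with the hypothetical names join_prefixes and sep, places the separator before each piece instead and leaves it empty for the first record.

```sh
# Hypothetical helper: joins the three-character prefixes of the
# first field with single blanks and no trailing blank.
join_prefixes() {
    awk '{ s = s sep substr($1, 1, 3); sep = " " }   # sep is empty at first
         END { print s }'
}

# Made-up excerpt in the style of countries.
printf 'CIS 8650\nCanada 3852\nChina 3692\n' | join_prefixes
```

This relies on an uninitialized awk variable behaving as an empty string when used in concatenation.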