lex

Rules section

Each rule consists of a pattern to be searched for in the input, followed on the same line by an action to be performed when the pattern is matched. Because lexical analyzers are often used in conjunction with parsers, as in programming language compilation and interpretation, the patterns can be said to define the classes of tokens that may be found in the input.

Regular expressions

The patterns describing the classes of strings to be searched for are written using regular expressions in a notation similar to that used in awk(C) and sed(C). The terms ``pattern'' and ``regular expression'' are often used interchangeably. A regular expression is formed by concatenating characters and, usually, certain operators. The notation used with lex is summarized in the following list:

A string of text characters with no operators at all just matches the literal string. To match the word ``orange'', use:
```
   orange
```
To match a literal string that contains spaces or tabs, surround the expression with double quotes. To match the phrase ``red apple'', use the expression:
```
   "red apple"
```
An expression, followed by the ``*'' operator, matches 0 or more occurrences of that expression. To match a string containing any number of ``m'''s, or the null string, use the expression:
```
   m*
```
An expression, followed by the ``+'' operator, matches one or more occurrences of that expression. To match a string containing one or more ``m'''s, but not the null string, use the expression:
```
   m+
```
An expression, followed by the ``?'' operator, matches 0 or 1 occurrence(s) of that expression. This is equivalent to saying that the expression is optional. To match one occurrence of the letter ``m'', or the null string, use the expression:
```
   m?
```
The period character, (.), matches any single character except a newline. To match any five-character string starting with ``m'' and ending with ``y'', use the expression:
```
   m...y
```
Alternation in regular expressions is supported using the vertical bar, (|). To match either of the strings ``love'' and ``money'', use the expression:
```
   love|money
```
Expressions may be grouped using parentheses, '(' and ')'. To match a string that consists of any number of a's and b's, followed by a ``c'', use the expression:
```
   (a|b)*c
```
The circumflex, (^), followed by a pattern, signifies that the pattern must match at the beginning of a line. The following rule matches the word ``First'' at the beginning of a line:
```
   ^First
```
The dollar sign, ($), is appended to a pattern to indicate that it must match at the end of a line. The following rule matches the word ``cow'' at the end of a line:
```
   cow$
```
To indicate that a regular expression should be matched a specific number of times, follow that expression with a number enclosed in curly braces, '{' and '}'. To match three repetitions of ``cd'', that is, ``cdcdcd'', use the expression:
```
   (cd){3}
```
To specify a range of repetitions, follow the expression by two numbers, separated by a comma and enclosed in curly braces. To match three, four, or five repetitions of ``ab'', that is, ``ababab'', ``abababab'', or ``ababababab'', use the expression:
```
   (ab){3,5}
```
A sequence of characters inside square brackets, '[' and ']', matches any one character in the sequence. To match any one of ``d'', ``g'', ``k'', and ``a'', use the expression:
```
   [dgka]
```
If the circumflex, (^), is the first character inside the square brackets, then the pattern matches any character that does not appear inside the brackets. In this context, the circumflex does not signify the start of a line, as it does when prepended to a pattern. To match any character other than ``a'', ``b'', and ``c'', use the expression:
```
   [^abc]
```
Ranges within a standard alphabetic or numeric order are indicated with a hyphen, (-). The following expression matches any digit, uppercase letter, or lowercase letter:
```
   [0-9A-Za-z]
```
Regular expressions can be concatenated. The resulting expression matches whatever the first expression matches followed by whatever the second expression matches. The following regular expression matches an identifier in many programming languages. An identifier, thus defined, is a letter followed by zero or more letters or digits:
```
   [a-zA-Z][0-9a-zA-Z]*
```
To treat an otherwise special character as a literal character, enclose the character in double quotation marks or precede it with a backslash (\). Either of the following expressions could be used to match an asterisk followed by one or more digits:
```
   \*[0-9]+
   "*"[0-9]+
```
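To recognize a backslash itself, either of these expressions could be used. Because a backslash keeps its escaping role inside double quotes, it must be doubled there as well:
```
   \\
   "\\"
```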
lex understands the standard C escape sequences, such as \n for a newline.
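
Most of the operators listed above also exist in POSIX extended regular expressions, so a pattern can be tried out from C before it is placed in a lex specification. The following is a minimal sketch under that assumption, not a part of lex: matches_whole is a hypothetical helper name, and it relies on regcomp and regexec from <regex.h>. The lex double-quote form of quoting is not available there, so only the backslash form can be tested this way.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper, not part of lex: returns 1 if the whole of `text`
 * matches the POSIX extended regular expression `pattern`, 0 if it does
 * not, and -1 if the pattern is too long or fails to compile. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Wrap the pattern as ^(pattern)$ so that it must cover the entire
     * string; the parentheses keep alternation inside the anchors. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, matches_whole("m...y", "marry") and matches_whole("(cd){3}", "cdcdcd") both report a match, while matches_whole("[a-zA-Z][0-9a-zA-Z]*", "9lives") does not, because an identifier must begin with a letter.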

Actions

An action is a block of C code that is executed whenever the corresponding pattern in the lex specification is matched. Once the lex-generated lexical analyzer matches the regular expression in a rule, it looks to the right of the pattern for the action to be performed. Actions typically involve operations such as a transformation of the matched string, returning a token to a parser, or compiling statistics on the input.

The simplest action contains no statements at all. Input text that matches the pattern associated with a null action is ignored. A sequence of characters that does not match any pattern in the rules section is written to the standard output without being modified in any way. To cause lex to generate a lexical analyzer that prints everything in the input text with the exception of the word ``orange'', which is ignored, the following rules section is used:

   %%
   orange  ;

Note that there must be some white space (spaces or tabs) between the pattern and the semicolon.
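
The combined effect of a null action and the default copying of unmatched text can be pictured with a hand-written C function. This is only a sketch of the behavior described above, not the code that lex generates; drop_word and its parameters are hypothetical names chosen for the illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not generated by lex: copies `in` to `out`,
 * leaving out every occurrence of `word`. Returns the number of
 * occurrences dropped. The output is truncated if `out` is too small,
 * and nothing is done if `outsz` is 0. */
size_t drop_word(const char *in, const char *word, char *out, size_t outsz)
{
    size_t wlen = strlen(word);
    size_t used = 0;
    size_t dropped = 0;

    if (outsz == 0)
        return 0;

    while (*in != '\0') {
        if (wlen > 0 && strncmp(in, word, wlen) == 0) {
            in += wlen;              /* null action: matched text is discarded */
            dropped++;
        } else {
            if (used + 1 < outsz)
                out[used++] = *in;   /* unmatched character is copied unchanged */
            in++;
        }
    }
    out[used] = '\0';
    return dropped;
}
```

Called with the input ``an orange cat'' and the word ``orange'', the function leaves ``an  cat'' in the output buffer. As with the lex rule, the word is also removed when it occurs inside a longer word such as ``oranges''.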

You may want to print out a message noting that a string of text was found, or a message transforming the text in some way. To recognize the expression ``Amelia Earhart'', the following rule can be used:

   "Amelia Earhart"   printf("found Amelia's bookcase!\n");

To replace a lengthy medical term with its acronym, a rule such as this is called for:

   Electroencephalogram    printf("EEG");

In order to count the lines in a text file, the analyzer must recognize end-of-lines and increment a counter. The following rule is used for this purpose:

   %{
   int lineno=0;
   %}
   %%
   \n   lineno++;

NOTE: If an action consists of two or more C statements spread over two or more lines, the code must be enclosed in curly braces, '{' and '}'.

yytext, yyleng

When a character string matches some pattern in the lex specification, it is stored in a character array called yytext. The contents of this array may be operated on by the action associated with the pattern: it can be printed or manipulated as necessary. lex also provides a variable yyleng, which gives the number of characters matched by the pattern.

For example, the following rule directs the lexical analyzer to count the digit strings in the input and, as soon as each one is found, to print the running total followed by the text of the string:

   %{
   int digstringcount=0;
   %}
   %%
   [-+]?[0-9]+     {
                           digstringcount++;
                           printf("%d %s\n",digstringcount,yytext);
                   }

This specification matches negative digit strings, and positive strings whether or not they are preceded by a plus sign; the ``?'' indicates that the preceding sign is optional.
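
The roles of yytext and yyleng in this rule can be imitated in plain C. The following is a hand-written sketch of the same scan, not the analyzer that lex would generate; count_digit_strings is a hypothetical name, and the local variables text and len merely stand in for yytext and yyleng. The 64-byte buffer is an arbitrary choice for the illustration.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not generated by lex: scans `in` for strings of
 * the form [-+]?[0-9]+, prints a running count and the matched text for
 * each one, and returns the final count. */
int count_digit_strings(const char *in)
{
    char text[64];                  /* stands in for yytext */
    size_t len;                     /* stands in for yyleng */
    int count = 0;

    while (*in != '\0') {
        const char *p = in;

        if (*p == '-' || *p == '+')
            p++;                    /* optional sign */
        if (!isdigit((unsigned char)*p)) {
            putchar(*in++);         /* no match here: echo one character */
            continue;
        }
        while (isdigit((unsigned char)*p))
            p++;                    /* take the longest run of digits */

        len = (size_t)(p - in);
        if (len >= sizeof text)
            len = sizeof text - 1;  /* truncate very long matches */
        memcpy(text, in, len);
        text[len] = '\0';

        count++;
        printf("%d %s\n", count, text);
        in = p;                     /* resume scanning after the match */
    }
    return count;
}
```

For the input ``a -12 b +7 c 300'' the function returns 3, having matched ``-12'', ``+7'', and ``300'' in turn. A sign that is not followed by a digit is not part of any match and is simply echoed.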

ECHO

The macro ECHO is a shorthand way of printing out the text of the token. The two rules in the next example have the same effect:

   Jim|James       { ECHO; }
   Jim|James       { printf("%s",yytext); }

The following lex specification draws together several of the points discussed previously.

  1 %{
  2 int subprogcount = 0;
  3 int gstringcount = 0;
  4 %}
  5 %%
  6 -[0-9]+                printf("negative integer\n");
  7 "+"?[0-9]+             printf("positive integer\n");
  8 -0\.[0-9]+             printf("negative real number, no whole number part\n");
  9 rail[ ]+road           printf("railroad is one word\n");
 10 crook                  printf("Here's a crook!\n");
 11 function               subprogcount++;
 12 G[a-zA-Z]*             {
 13                        printf("may have a G word here: %s\n", yytext);
 14                        gstringcount++;
 15                        }

The first three rules (lines 6-8) recognize negative integers, positive integers, and negative real numbers between 0 and -1. The fourth rule (line 9) matches cases where one or more blanks intervene between the two syllables of the word ``railroad''. The fifth rule (line 10) matches the word ``crook'' and prints a useful warning. The sixth rule (line 11) recognizes ``function'' and increments a counter. The last rule (lines 12-15) illustrates a multiline action and the use of yytext.