lex(CP)


lex -- a lexical-analyzer generator

Syntax

lex [ -ctvnV -Q[y|n] ] [ file ] ...

Description

The lex command generates C code which implements a lexical analyzer--a routine which reads input text and separates it into tokens. The lex input specifications (that is, all the input files concatenated together) consist of three sections: declarations; a rules section consisting of regular expressions (patterns) which define the token classes and usually some C code to be executed when tokens are found; and subroutines. The first and third sections are optional. The sections are delimited by the sequence %%. The rules section must start with this delimiter.

lex generates a file of C code called lex.yy.c. This file must be compiled by the C compiler and linked with a main routine. The program should normally be linked with the lex library, using the -ll option to cc or ld; this library supplies a default main routine. The lexical analyzer routine produced is called yylex. This routine reads its input and, when a token is recognized, executes the code associated with the token class. The default action, taken for input that matches no rule, is to copy it to the standard output. The string matched by the regular expression defining the token class is placed in yytext, a character array. The variable yyleng gives the length of the matched string. The value of yytext may be copied into an external array to make it available to other routines.
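
The copy of yytext into an external array mentioned above might look like the following sketch. The names save_token, last_token, and TOKEN_MAX are hypothetical and do not come from lex, and the definitions of yytext and yyleng are stand-ins so that the fragment compiles by itself; in a real program they are defined in lex.yy.c.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the scanner's variables so that this sketch compiles on
   its own; in a real program lex.yy.c defines them, and other source
   files would declare them extern instead. */
char yytext[200] = "count42";
int  yyleng = 7;

#define TOKEN_MAX 32              /* hypothetical size limit */
char last_token[TOKEN_MAX];       /* external array read by other routines */

/* Hypothetical helper: copy the current match into last_token,
   truncating it if it does not fit. Returns the number of
   characters copied. */
int save_token(void)
{
    int n = yyleng;

    assert(n >= 0);
    if (n > TOKEN_MAX - 1)
        n = TOKEN_MAX - 1;
    memcpy(last_token, yytext, (size_t)n);
    last_token[n] = '\0';
    return n;
}
```

An action would call save_token() before returning, so that the text remains available after yylex has gone on to match something else.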

The regular expressions understood by lex contain many of the usual operators and special characters. The following table summarizes these:

                    string       the literal string
                       *         zero or more occurrences of
                                 the preceding pattern
                       +         one or more occurrences of
                                 the preceding pattern
                       ?         zero or one occurrence of
                                 the preceding pattern
                       .         any single character except
                                 newline
                       |         alternation
                      ( )        used for grouping
                       ^         beginning of an input line
                       $         end of an input line
                 pattern{n,m}    n to m occurrences of
                                 pattern
                  pattern{n}     n occurrences of pattern
                   [string]      any character in string
                   [^string]     any character not in string
                 [char1-char2]   any character in the range
                                 char1-char2

Special characters can be escaped or quoted if they are to be used as ordinary characters. The standard C escape sequences are understood. Regular expressions may be concatenated. The character ``/'' in an expression introduces trailing context: the expression before the slash is matched only when it is immediately followed by input matching the expression after the slash, and only the text matched by the part before the slash is placed in yytext.

The declarations section of a lex input file may contain variable declarations, #include statements, and abbreviations for regular expressions. The subroutines section contains user-defined functions used by the lexical analyzer.

Any line beginning with a blank is assumed to contain only C text and is copied to the file lex.yy.c; if it is in the declarations section, it is copied into the external definition area of the lex.yy.c file. Variable declarations and #include statements should be placed in a section delimited by %{ and %}. Abbreviations consist of a symbol on the left of the line and its replacement text to the right. When abbreviations are used they are surrounded by curly braces, {}.

Three I/O routines are defined: input() reads a character; unput(c) returns a character to the input stream; output(c) outputs a character. These routines may be redefined by the user.
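
As an illustration of what a user-supplied pair of input routines has to do, the following sketch reads from a string held in memory and keeps a small stack for returned characters. The names set_source, str_input, str_unput, and PUSHBACK_MAX are hypothetical, and how such routines are substituted for input() and unput(c) depends on the implementation. Returning 0 at the end of the input follows the traditional lex convention and is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 16                 /* hypothetical pushback limit */

static const char *src = "";            /* text still to be read */
static char pushback[PUSHBACK_MAX];     /* stack of returned characters */
static size_t npushed = 0;

/* Hypothetical helper: start reading from the given string. */
void set_source(const char *text)
{
    assert(text != NULL);
    src = text;
    npushed = 0;
}

/* Return the next character, taking returned characters first.
   Assumption: 0 signals the end of the input. */
int str_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Return c to the input so that the next str_input() call yields it. */
void str_unput(int c)
{
    assert(npushed < PUSHBACK_MAX);
    pushback[npushed++] = (char)c;
}
```

Characters come back in last-in, first-out order, which is what lets a routine read one character too many and then return it.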

Other built-in routines include the following: REJECT, on the right side of the rule, causes the match to be rejected and the next suitable match executed; the function yymore() accumulates additional characters into yytext; the function yyless(p) pushes back the portion of the string matched beginning at position p.

The variable names generated by lex all begin with the prefix yy or YY. Users should avoid defining variables starting with these prefixes.

The lexical analyzer is implemented as a finite state machine, whose table sizes can be configured in the declarations section. This is done with a declaration of the following form, where x is a key letter and n is an integer:

   %x n
The following parameters may be set in this way:

          Key letter   Meaning                              Default
              p        number of positions                   2500
              n        number of states                       500
              e        number of parse tree nodes            1000
              a        number of transitions                 2000
              k        number of packed character classes    1000
              o        size of output array                  3000
Setting one or more of these parameters automatically causes a summary of statistics to be printed. See the -v and -n options, below.

Options

The options must appear before any files.

-c
Indicates C actions and is the default.

-t
Causes the generated code to be written to the standard output rather than to lex.yy.c.

-v
Provides a one-line summary of statistics. This is flagged automatically if any finite state machine parameters are set.

-n
Suppresses the summary of statistics even if -v is turned on.

-V
Prints version information on standard error.

-Q[y|n]
With -Qy, writes version information to the output file lex.yy.c. With -Qn, which is the default, no version information is written.

Multiple files on the command line are concatenated and treated as a single file. If no files are given, standard input is used.

Notes

lex produces C code which may be compiled with C++. All functions produced and used by lex-produced code explicitly have "C" linkage. In particular, user-supplied versions of input(), output(), and unput() may be written in C++, but must have "C" linkage, and the function yylex must be declared extern "C" if called by C++ code.
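
One common way to meet the linkage requirement is a small header that both C and C++ callers can include. The following is only a sketch: the file name scanner.h and the guard macro SCANNER_H are hypothetical, and the declaration assumes the traditional form of yylex, which takes no arguments and returns int.

```c
/* scanner.h: hypothetical header shared by C and C++ callers */
#ifndef SCANNER_H
#define SCANNER_H

#ifdef __cplusplus
extern "C" {            /* seen only by a C++ compiler */
#endif

int yylex(void);        /* defined in lex.yy.c */

#ifdef __cplusplus
}
#endif

#endif /* SCANNER_H */
```

A C compiler skips the extern "C" lines, so the same header serves both languages.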

Examples

The following is an example of a lex specification. It shows the use of each of the three sections in the input.
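   %{
   #include "global.h"
   int count;
   void skipcommnts(void);
   %}
   D        [0-9]
   %%
   if       {
                   printf("IF statement\n");
                   count++;
            }
   [a-z]+    printf("tag, value %s\n",yytext);
   0{D}+      printf("octal number %s\n",yytext);
   {D}+       printf("decimal number %s\n",yytext);
   "++"       printf("unary op\n");
   "+"        printf("binary op\n");
   "/*"       skipcommnts();
   %%
   /* Discard input up to and including the closing comment delimiter.
      Assumes input() returns 0 at end of input, as in traditional lex. */
   void skipcommnts(void)
   {
           int c;

           for (;;)
           {
                   /* scan forward to the next '*' */
                   while ((c = input()) != '*' && c != 0)
                           ;
                   if (c == 0)
                           return;     /* input ended inside the comment */
                   if ((c = input()) == '/' || c == 0)
                           return;
                   unput(c);           /* c may itself be a '*' */
           }
   }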

See also

yacc(CP)

Standards conformance

lex is conformant with:

X/Open Portability Guide, Issue 3, 1989.


© 2003 Caldera International, Inc. All rights reserved.
SCO OpenServer Release 5.0.7 -- 11 February 2003