Context sensitivity

lex has a number of mechanisms that help deal with problems of context sensitivity.

Trailing context

A potential problem exists when the lexical analyzer must read characters beyond the pattern being sought because it cannot be sure it has found the pattern until some additional information is known about the ``context'' in which it appears. A classic example of this involves the DO statement in FORTRAN. Consider the following DO statement:

   DO 50 k = 1 , 20, 2

Consider the sequence consisting of the characters preceding the first comma in the statement above:

   DO 50 k = 1

Because FORTRAN ignores blanks, this sequence might be equivalent to the assignment statement:

   DO50k = 1

It is not possible to know that the ``1'' is the initial value of the index ``k'', and that the characters "DO" are a keyword, not the first two characters of an identifier, until the first comma is read. Therefore the lexical analyzer would not always interpret the string "DO" in the desired way if the lex specification contained the rule:

   DO      {return(DOTOK);}

The way to handle this is to use the slash (/), which signifies that what follows is ``trailing context''. Trailing context is a second pattern that is expected to follow the token that is being searched for. The token is not matched unless the trailing context is also matched. The string that matches the trailing context is not stored in yytext, because it is not part of the token itself. The pattern to recognize the FORTRAN DO statement could be:

   DO/[ ]*[0-9]+[ ]*[a-zA-Z0-9]+[ a-zA-Z0-9]*=[ ]*[a-zA-Z0-9]+[ a-zA-Z0-9]*[ ]*\,

(To simplify the example, the rule accepts an index name of any length.)

The ``$'' operator, discussed in the section on regular expressions, is a form of trailing context. Note that this operator is not exactly the same as a newline, \n. Consider the specification:

   hello\n         { printf("%s",yytext); }
   world$          { printf("%s",yytext); }

The token matched by the first rule is printed with the newline still attached as part of the token. The token matched by the second rule is printed without a newline; a newline matched by ``$'', like any trailing context, is not part of the token.
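
The lookahead that trailing context performs for the DO rule can be sketched by hand in C. This is only an illustration under stated assumptions, not the code lex generates: the names match_do_keyword, skip_blanks, and skip_operand are hypothetical, and the scan is greedy with no backtracking, so it may disagree with the regular expression on unusual inputs. The function reports a token length of 2 for DO only when a label, an index, an equal sign, an initial value, and a comma follow, and it never counts those following characters as part of the token.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: skip blanks, which FORTRAN ignores. */
static const char *skip_blanks(const char *p)
{
    while (*p == ' ')
        p++;
    return p;
}

/*
 * Hypothetical helper: consume a run that starts with a letter or digit
 * and continues with letters, digits, or blanks.
 * Returns NULL if the run does not start with a letter or digit.
 */
static const char *skip_operand(const char *p)
{
    if (!isalnum((unsigned char)*p))
        return NULL;
    while (isalnum((unsigned char)*p) || *p == ' ')
        p++;
    return p;
}

/*
 * Return the length of the DO token (2) if s begins a DO statement,
 * or 0 if the trailing context is absent. Everything after "DO" is
 * only inspected, never consumed, as with lex trailing context.
 */
size_t match_do_keyword(const char *s)
{
    const char *p;

    if (s[0] != 'D' || s[1] != 'O')
        return 0;
    p = skip_blanks(s + 2);

    /* Statement label: one or more digits. */
    if (!isdigit((unsigned char)*p))
        return 0;
    while (isdigit((unsigned char)*p))
        p++;
    p = skip_blanks(p);

    /* Index variable, which must be followed by '='. */
    p = skip_operand(p);
    if (p == NULL || *p != '=')
        return 0;
    p = skip_blanks(p + 1);

    /* Initial value, which must be followed by the first comma. */
    p = skip_operand(p);
    if (p == NULL || *p != ',')
        return 0;

    return 2;   /* only "DO" belongs to the token */
}
```

Under these assumptions, match_do_keyword("DO 50 k = 1 , 20, 2") returns 2, while match_do_keyword("DO50k = 1") returns 0 because no comma follows the initial value.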

Starting state

lex allows you to set a kind of flag called a starting state that designates certain rules as applying only when that state is active. Employing this mechanism takes three steps: declare the state, activate it with BEGIN, and preface the rules it governs with the state name.

For the following example, assume you are dealing with a programming language in which programs start with the keyword #go, end with the keyword #stop, and include the keyword while. Assume also that the input text contains pieces of code interspersed with blocks of text. In this case, there needs to be a way for the lexical analyzer to distinguish between the word while as a keyword and as an ordinary word.

  1 %{
  2 #include "defs.h"
  3 %}
  4 %start PROG
  5 %%
  6 #go             { BEGIN PROG; }
  7 <PROG>while     { return(WHILE); }
  8 ","             { return(','); }
  9 [a-zA-Z]*       { put_in_tab(yytext);}
 10 <PROG>#stop     { ECHO; BEGIN 0;}
 11 .
 12 .
 13 .
 14 %%
 15 int put_in_tab()
 16 {
 17 .
 18 .
 19 .
 20 }
The line in the declarations section beginning with %start (line 4) is necessary to define the state PROG. This state indicates to the lexical analyzer that it is reading code, not text.

The first rule (line 6) determines that when the analyzer reads the keyword #go it activates the state PROG by means of the BEGIN macro followed by the state name.

A rule is associated with a state by prefacing it with the state name enclosed in angle brackets, '<' and '>'. A rule that has been so designated is applied if and only if that state is active. According to rule two (line 7), when the state PROG is active and the character sequence while is seen, a token is returned indicating that it is a keyword.

A rule that is not prefaced by a state name is applied no matter which state, if any, is active. Rule three (line 8) is an example of this. Rule four (line 9) in the example is also not associated with any state. It will not, however, match the word while if the state PROG is active. This is because that word will already have matched the pattern in the earlier rule, rule two: when two rules match a string of the same length, lex applies the one that appears first, as discussed in ``Disambiguating rules''.

Rule five (line 10) in the specification deactivates the state PROG if that state is active and the keyword #stop is read. "BEGIN 0" deactivates the current state and does not make any other state active.

Programming techniques such as flag variables may also be used to mark context-sensitive conditions.
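
The flag-variable alternative can be sketched in plain C for the same #go, while, #stop example. This is a hedged illustration, not part of lex: the names in_prog, classify, and the token enumeration are hypothetical, and the function assumes it receives one already-separated word at a time. In a lex specification, the same test on the flag would sit inside the action for the while rule.

```c
#include <string.h>

/* Hypothetical token codes for this sketch. */
enum token { TOK_NONE, TOK_WORD, TOK_WHILE };

/* Flag variable: nonzero while the input is inside a program. */
static int in_prog = 0;

/*
 * Classify one word, using the flag to decide whether "while"
 * is a keyword or ordinary text.
 */
enum token classify(const char *word)
{
    if (strcmp(word, "#go") == 0) {
        in_prog = 1;            /* corresponds to BEGIN PROG */
        return TOK_NONE;
    }
    if (in_prog && strcmp(word, "#stop") == 0) {
        in_prog = 0;            /* corresponds to BEGIN 0 */
        return TOK_NONE;
    }
    if (in_prog && strcmp(word, "while") == 0)
        return TOK_WHILE;       /* keyword only inside a program */

    return TOK_WORD;            /* ordinary word everywhere else */
}
```

Calling classify("while") before any #go yields TOK_WORD; after classify("#go") the same call yields TOK_WHILE, and after classify("#stop") it yields TOK_WORD again. A start state expresses the same condition declaratively, whereas the flag leaves the test to each action.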


© 2003 Caldera International, Inc. All rights reserved.
SCO OpenServer Release 5.0.7 -- 11 February 2003