lex

Routines to reprocess input

lex provides several routines that let you handle sequences of characters that must be processed in more than one way.

yymore

The text matching a given pattern is stored in the yytext array. In general, after the action associated with the pattern is performed, the characters in yytext are overwritten with succeeding characters in the input stream to form the next match. The function yymore() prevents this overwriting and causes the characters matching the next pattern to be appended to those already in yytext. This allows you to process, in a single action, a set of characters associated with two (or more) successive pattern matches. It is useful when one string of characters is significant and a longer string that includes the first is significant as well.

Consider a character string that begins and ends with ``B'' and contains one more ``B'' at an arbitrary location in between. For example:

   BoomBoxB

You may want to count the number of characters between the first and second ``B'' and add it to the number of characters between the second and third ``B'', and print the result. (The last ``B'' is not to be counted.)

Code to do this is:

   %{
   int flag=0;
   %}
   %%
   B[^B]*    { if (flag == 0) {
                   flag = 1;
                   yymore();
               }
               else {
                   flag = 0;
                   printf("%d\n", yyleng);
               }
             }

The variable flag is used to distinguish the character sequence terminating just before the second ``B'' from that terminating just before the third. Using the example input ``BoomBoxB'', the pattern first matches the string ``Boom'', causing that string to be put into yytext. Next, the pattern matches the string ``Box'', which would normally cause only that string to be put into yytext. However, yymore() was called in the action associated with the previous pattern match, so yytext contains ``BoomBox'', and yyleng consequently equals 7.
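
The accumulation described above can be traced with a small hand-written model in plain C. This is only a sketch, not code that lex generates: the function name model_yymore, the out buffer (standing in for yytext), and the more variable (standing in for the effect of calling yymore()) are all hypothetical, and the model assumes the input consists only of segments that begin with ``B''.

```c
#include <stddef.h>

/* Hypothetical model, not lex output: scans `input` with the single rule
   B[^B]* and the flag/yymore() action from the specification above.
   `out` stands in for yytext.  Returns the value the action would print
   (yyleng), or -1 if the second match never happens or `out` is too small. */
int model_yymore(const char *input, char *out, size_t outsz)
{
    size_t len = 0;   /* stands in for yyleng */
    int flag = 0;     /* same role as flag in the specification */
    int more = 0;     /* nonzero after the action "calls yymore()" */

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    while (*input == 'B') {
        if (!more)
            len = 0;  /* normal case: the new match overwrites yytext */
        more = 0;

        /* Match B[^B]*, appending at the current end of the buffer. */
        do {
            if (len + 1 >= outsz)
                return -1;
            out[len++] = *input++;
        } while (*input != '\0' && *input != 'B');
        out[len] = '\0';

        if (flag == 0) {
            flag = 1;
            more = 1;           /* keep this text for the next match */
        } else {
            return (int)len;    /* the action would print this value */
        }
    }
    return -1;
}
```

Called with the input ``BoomBoxB'' and a sufficiently large buffer, the model returns 7 and leaves ``BoomBox'' in out, which matches the walk-through above.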

yyless

The function yyless() resets the end point of the current token. yyless() takes a single integer argument: yyless(n) causes the current token to consist of the first n characters of what was originally matched (that is, up to yytext[n-1]). The remaining yyleng-n characters are returned to the input stream. Consider the following specification:

   %%
   [A-Z][a-z]*     {if (yyleng>5)
                           yyless(5);
                   }
   [a-z]*          printf("%s",yytext);

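For any word that starts with an uppercase letter and is longer than five characters, the lexical analyzer generated from this specification discards the first five characters and returns the rest to the input stream, where the second rule matches and prints them. A capitalized word of five characters or fewer is matched in full by the first rule and discarded, because that rule's action prints nothing.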
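
To make the effect of yyless(5) concrete, the following plain C sketch models what the two rules do to a line of text. It is a hand-written approximation rather than lex output: the name model_yyless and its parameters are hypothetical, it assumes the C locale, and it assumes that characters matched by neither rule are copied through unchanged, as the default lex action does.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical model, not lex output: writes to `out` what the two-rule
   specification above would print for the text in `in`. */
void model_yyless(const char *in, char *out, size_t outsz)
{
    size_t o = 0;

    if (outsz == 0)
        return;

    while (*in != '\0' && o + 1 < outsz) {
        if (isupper((unsigned char)*in)) {
            /* Rule 1: [A-Z][a-z]* takes the longest possible match. */
            size_t n = 1;
            while (islower((unsigned char)in[n]))
                n++;
            if (n > 5)
                n = 5;      /* yyless(5): keep 5 characters, push back the rest */
            in += n;        /* the token itself is discarded, nothing is printed */
        } else if (islower((unsigned char)*in)) {
            /* Rule 2: [a-z]* prints the text it matched. */
            while (islower((unsigned char)*in) && o + 1 < outsz)
                out[o++] = *in++;
        } else {
            out[o++] = *in++;   /* assumed default action: echo the character */
        }
    }
    out[o] = '\0';
}
```

For example, the model turns ``Television is On'' into ``ision is '': the first five characters of ``Television'' are dropped, ``is'' is printed unchanged, and the short word ``On'' disappears entirely.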

REJECT

REJECT allows the lexical analyzer to try to match the current token against the remaining patterns in the specification. Its function is the same as if yyless(0) were executed (that is, all the characters in the token were returned to the input stream), except that pattern matching resumes at the pattern following the current one, rather than at the first pattern in the specification. If you want to count the number of occurrences of both the regular expression ``snapdragon'' and its subexpression ``dragon'', the following works:

   snapdragon     {countflowers++; REJECT;}
   dragon         countmonsters++;

As an example of one pattern overlapping another, the following counts the number of occurrences of the expressions ``comedian'' and ``diana'', even where the input text has sequences such as ``comediana'':

   comedian      {comiccount++; REJECT;}
   diana         princesscount++;

Note that the actions here may be considerably more complicated than simply incrementing a counter.
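
As a cross-check on what the REJECT rules are meant to compute, the overlapping counts can be reproduced in plain C by testing for each word at every starting position. This is a sketch for comparison only, not how lex implements REJECT; the struct and function names are hypothetical, and it relies on the observation that, for these two particular words, neither word can begin inside an occurrence of ``diana''.

```c
#include <string.h>

/* Hypothetical result type; the field names follow the counters above. */
struct counts {
    int comiccount;
    int princesscount;
};

/* Counts every occurrence of "comedian" and "diana" in `text`, including
   occurrences that overlap one another. */
struct counts count_overlapping(const char *text)
{
    struct counts c = {0, 0};
    const char *p;

    for (p = text; *p != '\0'; p++) {
        if (strncmp(p, "comedian", 8) == 0)
            c.comiccount++;
        if (strncmp(p, "diana", 5) == 0)
            c.princesscount++;
    }
    return c;
}
```

For the input ``comediana'' this yields one ``comedian'' and one ``diana''. Without REJECT, a scanner would consume ``comedian'' whole and never see the overlapping ``diana''.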

End-of-file processing with yywrap

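The routine yywrap() handles end-of-file processing. The lexical analyzer calls yywrap() whenever it reaches the end of its input. The default version in the lex library simply returns 1, which tells the analyzer that there is no more input and that it should finish. A user-defined yywrap() may be substituted to take some other action at the end of input; if it arranges for more input (for example, by pointing yyin at another file), it returns 0 and scanning continues. A user-defined yywrap() should be linked before the lex library so that it is used in place of the default version.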
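
One common use of a user-defined yywrap() is to chain several input files together. The sketch below assumes the usual convention that a return value of 0 means yyin now refers to further input and a return value of 1 means the input is exhausted. The helper set_input_files() and the next_file list are hypothetical names introduced for illustration. The block defines yyin itself only so that it compiles on its own; in a real scanner, lex supplies yyin and the definition here would be replaced by an extern declaration.

```c
#include <stdio.h>

/* Defined here only to keep the sketch self-contained.  In a real
   scanner, use `extern FILE *yyin;` because lex defines the variable. */
FILE *yyin;

/* Hypothetical: NULL-terminated list of file names still to be read. */
static const char **next_file;

/* Hypothetical helper: registers the files to scan after the current one. */
void set_input_files(const char **names)
{
    next_file = names;
}

/* Called by the lexical analyzer at end of file.  Returns 0 after
   switching yyin to the next readable file, or 1 when none remain. */
int yywrap(void)
{
    while (next_file != NULL && *next_file != NULL) {
        FILE *f = fopen(*next_file++, "r");
        if (f != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);   /* finished with the previous file */
            yyin = f;
            return 0;           /* more input: keep scanning */
        }
        /* A file that cannot be opened is skipped in this sketch. */
    }
    return 1;                   /* no more input: wrap up */
}
```

A main() routine would typically call set_input_files() with the remaining command-line arguments before invoking yylex(). Skipping unreadable files silently is a simplification; a real program would probably report the error.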