
regcomp(S)


regcomp, regexec, regerror, regfree -- regular expression matching

Syntax

cc . . . -lc

#include <sys/types.h>
#include <regex.h>

int regcomp(regex_t *preg, const char *pattern, int cflags);

int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);

size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);

void regfree(regex_t *preg);

Description

This is a suite of routines that interpret the Basic and Extended Regular Expressions, as described in X/Open CAE Specification, System Interface Definitions, Issue 4, 1992 (XBD), Chapter 7, Regular Expressions. A regular expression, pattern, is matched against string. The result of this matching is stored in the structure array pmatch. These routines do not match filename patterns. For such purposes, use the fnmatch(S) routine instead.

regcomp(S) compiles pattern and stores the compiled results in the structure pointed to by preg.

regexec(S) compares the null-terminated string with the compiled regular expression preg, and stores the information about matches in the array pmatch.

regerror(S) maps the error return values from regcomp( ) and regexec( ) to more meaningful error messages.

regfree(S) frees any memory allocated by regcomp( ) for preg, but does not free the preg structure itself.

Two structure types are used in regular expression matching: the structure type regex_t and the structure type regmatch_t. The structure type regex_t is used to hold compiled regular expressions and must have the following members:

 +------------+-------------+----------------------------+
 |Member Type | Member Name | Description                |
 +------------+-------------+----------------------------+
 |size_t      | re_nsub     | Number of parenthesized    |
 |            |             | subexpressions in pattern. |
 +------------+-------------+----------------------------+

The structure type regmatch_t is used to hold the information about the substrings in string that match subexpressions in pattern and must have at least the following two members:

 +------------+-------------+-------------------------------------+
 |Member Type | Member Name | Description                         |
 +------------+-------------+-------------------------------------+
 |regoff_t    | rm_so       | Byte offset, start of string        |
 |            |             | to start of substring.              |
 +------------+-------------+-------------------------------------+
 |regoff_t    | rm_eo       | Byte offset of character            |
 |            |             | immediately after end of substring, |
 |            |             | from the start of string.           |
 +------------+-------------+-------------------------------------+

Any pattern to be matched must first be compiled by regcomp( ). regcomp( ) compiles pattern and stores the compiled regular expressions in the structure pointed to by preg. regcomp( ) sets re_nsub of the structure to the number of parenthesized subexpressions found in pattern.

Subexpressions in pattern are enclosed by escaped parentheses as \( ... \) in basic regular expressions or by parentheses as ( ... ) in extended regular expressions. The i-th subexpression begins at the i-th matched open parenthesis from the left, with i counting from 1. The pattern itself as a whole is also a subexpression and is labeled as the 0-th subexpression.
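
The difference between the two parenthesis syntaxes can be observed through re_nsub. The following is a minimal sketch, not part of the interface described here: count_subexpressions is a hypothetical helper name, and it assumes REG_NOSUB is not set in cflags, since re_nsub is not meaningful in that case.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper (not part of the regex interface): compile pattern
 * with the given cflags and return the re_nsub value stored by regcomp(),
 * or -1 if the pattern does not compile.
 */
long count_subexpressions(const char *pattern, int cflags)
{
    regex_t preg;
    long nsub;

    if (regcomp(&preg, pattern, cflags) != 0)
        return -1;              /* preg is undefined here; do not free it */

    nsub = (long)preg.re_nsub;  /* the whole pattern (0-th) is not counted */
    regfree(&preg);
    return nsub;
}
```

Called with the pattern ((a)|(c))* and REG_EXTENDED, the helper is expected to report three subexpressions. The same unescaped parentheses compiled as a basic regular expression are ordinary characters and should not be counted.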

regexec( ) compares the null-terminated string pointed to by string with preg, the compiled regular expression generated by regcomp( ) from pattern. If there is a match, regexec( ) returns zero and records the match in pmatch, an array of structures. If there is no match or if an error occurred, regexec( ) returns non-zero.

The array pointed to by pmatch consists of at least nmatch elements. Each element of the array is a structure of the type regmatch_t. If nmatch is zero, pmatch will be ignored by regexec( ) entirely. The flag REG_NOSUB (see below) also causes regexec( ) to ignore its pmatch argument.

The routine regexec( ) fills in the i-th element of the array with the offsets of the substring that corresponds to the i-th parenthesized subexpression of pattern. (Only one matched substring is recorded for each subexpression. See rule 1 below.) The offsets in pmatch[0] identify the substring matching the entire regular expression, which is itself treated as a subexpression. If pattern contains fewer than nmatch subexpressions (counting the pattern as a whole), regexec( ) fills the unused elements of the array, up to pmatch[nmatch-1], with -1 (pmatch may point to an array larger than nmatch entries). If pattern contains more than nmatch subexpressions, only the first nmatch matched substrings are recorded.
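
As an illustration of how the offsets in pmatch are typically used, the sketch below copies the text of one subexpression out of string. The helper name copy_group and the limit MAX_GROUPS are assumptions made for this example only; they are not defined by <regex.h>.

```c
#include <sys/types.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/*
 * Hypothetical helper: match string against the extended regular
 * expression pattern and copy the text of subexpression `group` into out.
 * Returns the number of bytes copied, or -1 if there is no match, the
 * subexpression did not participate, an error occurred, or out is too
 * small to hold the text and its terminating null.
 */
int copy_group(const char *pattern, const char *string, size_t group,
               char *out, size_t out_size)
{
    regex_t preg;
    regmatch_t pmatch[MAX_GROUPS];
    int len = -1;

    if (group >= MAX_GROUPS || regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&preg, string, (size_t)MAX_GROUPS, pmatch, 0) == 0 &&
        pmatch[group].rm_so != -1) {
        /* rm_eo is one past the last byte, so the difference is the length */
        size_t n = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);

        if (n < out_size) {
            memcpy(out, string + pmatch[group].rm_so, n);
            out[n] = '\0';
            len = (int)n;
        }
    }
    regfree(&preg);
    return len;
}
```

Because unused elements of pmatch are filled with -1, asking for a subexpression number larger than the pattern contains is reported the same way as a subexpression that did not participate in the match.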

A subexpression is said not to participate in the match when

When matching a basic regular expression or extended regular expression against string, any particular parenthesized subexpression of pattern might participate in the match of more than one substring, or it might not match any substring at all, even though the pattern as a whole still matches.

In order to determine which substring's byte offsets are to be recorded in pmatch when information is stored about matches, the rules below are followed by regexec( ) when matching regular expressions:

  1. If the i-th subexpression is not part of another subexpression and it participated in at least one match, the last such match is recorded in pmatch[i] through its byte offsets.

  2. If the i-th subexpression is not part of another subexpression and did not participate in the match even though the whole pattern matches the string, byte offsets in pmatch[i] are set to -1.

  3. If the i-th subexpression is part of the j-th subexpression but does not belong to any other subexpression also contained in the j-th subexpression, and if a match of this j-th subexpression has been reported in pmatch[j], the contents of pmatch[i] depend on whether the i-th subexpression itself participated in the match reported in pmatch[j]: if the i-th subexpression participated in at least one match then pmatch[i] records the byte offsets of its last such match; if it did not participate in the match then the byte offsets in pmatch[i] are -1.

  4. If the i-th subexpression is part of the j-th subexpression and if the j-th subexpression did not participate in the match, i.e., the byte offsets in pmatch[j] are already set to -1, then the byte offsets in pmatch[i] will also be set to -1.

  5. If the i-th subexpression matched an empty (zero-length) string, both byte offsets in pmatch[i] will be the same. If the empty string is not at the end of the string, the byte offset is that of the next character after the empty string; if the empty string is at the end of the string then the byte offset is that of the null terminator.

For example, for the pattern ((a)|(c))* matched against the string "aa", the subexpression (a) participates in two matches, but only the second match is recorded. In addition, although the pattern matches the whole string, the subexpression (c) does not match anything. The contents of pmatch would therefore be:

       /* the whole expression */
   pmatch[0].rm_so = 0;    pmatch[0].rm_eo = 2;
       /* the outermost parentheses */
   pmatch[1].rm_so = 1;    pmatch[1].rm_eo = 2;
       /* the (a) subexpression */
   pmatch[2].rm_so = 1;    pmatch[2].rm_eo = 2;
       /* the (c) subexpression */
   pmatch[3].rm_so = -1;   pmatch[3].rm_eo = -1;

As another example, consider matching the pattern (a)*b against the string "b". The subexpression (a) matches an empty string at offset 0 of string "b" and thus:

   pmatch[1].rm_so = 0;    pmatch[1].rm_eo = 0;

If the pattern were b(a)* instead, the subexpression (a) would match the empty string at the end of "b" and the contents of pmatch would be:

   pmatch[1].rm_so = 1;    pmatch[1].rm_eo = 1;
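
The first example above can be checked with a short helper. This is a sketch only: match_offsets and print_offsets are hypothetical names, not part of the regex interface, and the offsets reported for subexpressions that match zero times may differ between implementations.

```c
#include <stdio.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: compile pattern as an extended regular expression,
 * match it against string, and fill pmatch[0] through pmatch[nmatch-1].
 * Returns the regexec() result, or the regcomp() error code if the
 * pattern does not compile.
 */
int match_offsets(const char *pattern, const char *string,
                  regmatch_t pmatch[], size_t nmatch)
{
    regex_t preg;
    int status;

    status = regcomp(&preg, pattern, REG_EXTENDED);
    if (status != 0)
        return status;

    status = regexec(&preg, string, nmatch, pmatch, 0);
    regfree(&preg);
    return status;
}

/* Hypothetical helper: print every recorded pair of byte offsets. */
void print_offsets(const regmatch_t pmatch[], size_t nmatch)
{
    size_t i;

    for (i = 0; i < nmatch; i++)
        printf("pmatch[%lu]: rm_so = %ld, rm_eo = %ld\n",
               (unsigned long)i, (long)pmatch[i].rm_so,
               (long)pmatch[i].rm_eo);
}
```

A caller would declare an array of four regmatch_t elements, pass it to match_offsets with the pattern ((a)|(c))* and the string "aa", and then pass the same array to print_offsets. The last element, for the (c) subexpression, should show -1 in both members.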

The default behavior for compiling regular expressions and the subsequent matching to the string can be modified by two groups of flags: cflags, the argument to regcomp( ), and eflags, the argument to regexec( ).

The value of cflags is formed from a bitwise inclusive OR of zero or more of the following flags, defined in <regex.h>:


REG_EXTENDED
Use extended regular expressions in the match. The default is to use the basic regular expression.

REG_ICASE
Ignore case in the match. The default is to treat uppercase letters as different from lowercase letters.

REG_NOSUB
Determine only if pattern matches string when regexec( ) is called; do not find out and report any further details.

REG_NEWLINE
Treat newline characters specially.
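
The effect of the compile-time flags can be seen by matching the same pattern with different values of cflags. The sketch below is illustrative only; matches is a hypothetical helper name, and it always adds REG_NOSUB because only a yes or no answer is wanted.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: returns 1 if pattern matches string, 0 if it does
 * not, and -1 if the pattern does not compile or regexec() reports an
 * error other than REG_NOMATCH.
 */
int matches(const char *pattern, const char *string, int cflags)
{
    regex_t preg;
    int status;

    if (regcomp(&preg, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    /* nmatch is zero, so the pmatch argument is ignored */
    status = regexec(&preg, string, (size_t)0, NULL, 0);
    regfree(&preg);

    if (status == 0)
        return 1;
    return (status == REG_NOMATCH) ? 0 : -1;
}
```

For instance, the pattern hello does not match "HELLO" with cflags of zero but does with REG_ICASE. The pattern ^ab+c$ matches "abbc" only with REG_EXTENDED, because + is not a repetition operator in basic regular expressions.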

The value of eflags is formed from the bitwise inclusive OR of zero or more of the following flags, also defined in <regex.h>:


REG_NOTBOL
Treat the character at the beginning of the string as not at the beginning of a line. If this flag is set, the beginning of the string will not be matched by the circumflex character ^ when it is used as a special character.

REG_NOTEOL
Treat the last character of the string as not at the end of a line. If this flag is set, the end of the string will not be matched by the dollar-sign character $ when it is used as a special character.
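
A common use of REG_NOTBOL is to search a string repeatedly, resuming after each match. Every search after the first starts in the middle of the line, so ^ must not match there. The sketch below assumes this usage; count_matches is a hypothetical helper name, and stepping one byte past an empty match is a simplification that ignores multibyte characters.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: count the non-overlapping matches of the extended
 * regular expression pattern in string.  Returns -1 if the pattern does
 * not compile.
 */
int count_matches(const char *pattern, const char *string)
{
    regex_t preg;
    regmatch_t m;
    const char *p = string;
    int count = 0;

    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Only the first search begins at the true start of the line. */
    while (regexec(&preg, p, (size_t)1, &m,
                   p == string ? 0 : REG_NOTBOL) == 0) {
        count++;
        if (m.rm_eo > 0)
            p += m.rm_eo;       /* resume after this match */
        else if (*p != '\0')
            p++;                /* empty match: step one byte forward */
        else
            break;              /* empty match at the end of string */
    }
    regfree(&preg);
    return count;
}
```

With the pattern ^a and the string "aaa", only the first a is counted. Without REG_NOTBOL on the later searches, each remaining a would also be treated as the start of a line.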

When the REG_NOSUB flag is set, regexec( ) will report only success or failure of a match, and ignore its pmatch argument. This flag also causes regcomp( ) to set re_nsub to an implementation-defined value when the regular expression is compiled. Applications should neither use this value nor change it by any other means.

If REG_NEWLINE is not set in cflags, a newline character in pattern or string will be treated as any other ordinary character. If REG_NEWLINE is set, the newline character (in pattern or string) will still be treated as an ordinary character except that it is also given the properties of a line delimiter. Specifically, the newline character will have some additional special properties in the following three situations when REG_NEWLINE is set:

  1. A newline character in string cannot be matched by a single character wildcard (i.e., a period outside a bracket expression). Any form of a non-matching list (a bracket expression starting with a circumflex) also will not match a newline character in string.

  2. When a circumflex ^ is used in pattern to represent the beginning of a line, the circumflex will match the zero-length string immediately after the newline character in string. The setting of the flag REG_NOTBOL is ignored in this case.

  3. When a dollar-sign $ is used in pattern to represent the end of a line, the dollar-sign will match the zero-length string immediately before the newline character in string. The flag REG_NOTEOL is ignored in this case.
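
The three situations above can be exercised with a string that contains embedded newline characters. This is a minimal sketch; match_start is a hypothetical helper name, and its extra_cflags parameter exists only so that REG_NEWLINE can be switched on and off for comparison.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: return the byte offset at which the extended
 * regular expression pattern first matches string, or -1 if there is no
 * match or an error occurs.  extra_cflags is ORed into the compile flags.
 */
long match_start(const char *pattern, const char *string, int extra_cflags)
{
    regex_t preg;
    regmatch_t m;
    long start = -1;

    if (regcomp(&preg, pattern, REG_EXTENDED | extra_cflags) != 0)
        return -1;

    if (regexec(&preg, string, (size_t)1, &m, 0) == 0)
        start = (long)m.rm_so;

    regfree(&preg);
    return start;
}
```

Given the string "a\nb\nc", the pattern ^b$ should match at offset 2 only when REG_NEWLINE is passed. Given "a\nb", the patterns a.b and a[^x]b should match at offset 0 without REG_NEWLINE and fail with it.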

The meanings of error return values from regcomp( ) and regexec( ) may not be obvious. regerror( ) maps these error codes to more meaningful error message strings. The message string produced by regerror( ) corresponds to errcode, the first argument to regerror( ). The value of errcode must be the last non-zero return value of either regcomp( ) or regexec( ) with the given value of preg, also passed to regerror( ). If any other value of errcode is passed to regerror( ), the contents of the generated string are undefined.

If preg is null but errcode is a value returned by a previous call to regcomp( ) or regexec( ), the routine regerror( ) still generates an error message string corresponding to the value of errcode, though the content might not be as detailed.

The buffer pointed to by errbuf is used to hold the generated string. The size of the buffer is errbuf_size bytes. If the string, including the terminating null, is longer than errbuf_size bytes, regerror( ) will truncate the string and null-terminate the result.

If errbuf_size is zero, regerror( ) returns the size of the entire error message string. Nothing is placed into the buffer and the errbuf argument is ignored entirely.
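
The truncation and size-query behavior described above can be sketched with a caller-supplied buffer. The helper name compile_error is hypothetical, and returning zero for a pattern that compiles is a convention of this example, not of regerror( ).

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: try to compile pattern as an extended regular
 * expression.  On failure, place the error message in errbuf (truncated
 * if necessary to errbuf_size bytes, including the terminating null) and
 * return the size regerror() reports for the complete message.
 * On success, return 0.
 */
size_t compile_error(const char *pattern, char *errbuf, size_t errbuf_size)
{
    regex_t preg;
    int status = regcomp(&preg, pattern, REG_EXTENDED);

    if (status == 0) {
        regfree(&preg);
        if (errbuf_size > 0)
            errbuf[0] = '\0';   /* no error, so no message */
        return 0;
    }

    /* With errbuf_size of zero, errbuf is ignored and only the size
       of the full message is returned. */
    return regerror(status, &preg, errbuf, errbuf_size);
}
```

Calling the helper with a null buffer and a size of zero reports how large a buffer is needed. Calling it with a four-byte buffer leaves at most three message bytes followed by a null, while the return value still gives the full size.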

Return values

If regcomp( ) can compile the regular expression successfully, it returns zero. Otherwise, a nonzero integer error code is returned and the content of preg is undefined. These error codes are defined in <regex.h> and are described in the ``Diagnostics'' section below.

If regexec( ) can find a match, it returns zero. Otherwise, it returns REG_NOMATCH when no match can be found (or REG_ENOSYS if the function is not supported). As an SCO OpenServer specific extension, regexec( ) may also return REG_BADPAT. This may happen when, for example, the data in the compiled regular expression is corrupted and passed to regexec( ).

The routine regerror( ) returns the number of bytes required to hold the entire corresponding error message string upon successful completion. Otherwise, it returns zero to indicate that the routine is not supported.

regfree( ) does not return any value.

Diagnostics

These non-zero integer error codes may be returned by regcomp( ) or regexec( ):


REG_NOMATCH
No match can be found by regexec( ).

REG_BADPAT
There is an error in the regular expression pattern.

REG_ECOLLATE
The referenced collating element is not valid.

REG_ECTYPE
The referenced character class type is not valid.

REG_EESCAPE
An escape character \ appeared at the end of pattern.

REG_ESUBREG
The number in \digit is out of range or is in error.

REG_EBRACK
The square brackets [ and ] do not balance in the pattern.

REG_ENOSYS
The routine is not supported.

REG_EPAREN
The parentheses, \( and \) in basic regular expression and ( and ) in extended regular expression, do not balance.

REG_EBRACE
The curly braces \{ and \} do not balance.

REG_BADBR
Contents enclosed in \{...\} are illegal: not a number, number too large, more than two numbers, first number larger than the second.

REG_ERANGE
An invalid endpoint is used in a range expression.

REG_ESPACE
No more memory available.

REG_BADRPT
One of the special characters ?, * or + is not preceded by a valid regular expression and is not escaped.

Warning

If the locale is changed after a regular expression has been compiled by regcomp( ), recompile the regular expression by calling regcomp( ) again. Otherwise, the result of regexec( ) is undefined.

If the routine regexec( ) or regfree( ) is given a preg not returned by regcomp( ), the result is undefined. This includes the situations when a preg has been freed by regfree( ) or returned by a failed regcomp( ).

The return value REG_BADPAT from regexec( ) is SCO OpenServer specific. To retain XPG4 compliance in application code that uses this feature, compile the code conditionally, as in:

      switch (regexec(&preg, string, nmatch, pmatch, eflags)) {
      case 0:
              /* code for match */
              break;
      case REG_NOMATCH:
              /* code for no match */
              break;
      .
      .
      .
   #ifdef _POSIX_SOURCE
      case REG_BADPAT:
              /* handle invalid regular expression error */
              break;
   #endif /* _POSIX_SOURCE */
      .
      .
      .
      }

See also

fnmatch(S), glob(S),
XBD, Chapter 7, Regular Expressions

Standards conformance

The routines regcomp( ), regerror( ), and regfree( ) are conformant with:
X/Open CAE Specification, System Interfaces and Headers, Issue 4, 1992.

The routine regexec(S) is conformant with:
X/Open CAE Specification, System Interfaces and Headers, Issue 4, 1992 with SCO OpenServer specific extensions that are maintained by The SCO Group.

Examples

The example below shows how to determine quickly whether a pattern matches a string, without actually finding all the detailed matches.
   #include <stddef.h>
   #include <sys/types.h>
   #include <regex.h>
   /*
    * match string against the extended regular expression in pattern
    *
    * return 0 for match, non zero for no match or error
    */
   int match(const char *string, char *pattern)
   {
           int status;
           regex_t preg;

           if ( (status=regcomp(&preg, pattern, REG_EXTENDED|REG_NOSUB)) != 0 ) {
                   return (status);   /* flag error */
           }

           status = regexec(&preg, string, (size_t) 0, NULL, 0);
           regfree(&preg);
           return (status);
   }

To produce a more descriptive error message, an application could call regerror(S) twice, the first time to determine the buffer size for the error message string and the second time to put the string into the buffer created for the string, as in:
   extern char *errbuf;
    .
    .
    .
   if ( (status=regcomp(&preg, pattern, REG_EXTENDED|REG_NOSUB)) != 0 ) {
           /* there is an error: status != 0;
              first, find out the size of the error message string */
     buf_size = regerror(status, &preg, (char *) NULL, (size_t) 0);
           /* then create a buffer for the message string */
     errbuf   = (char *)malloc(buf_size);
           /* and, if the allocation succeeded, put the string
              into the buffer before returning */
     if (errbuf != NULL)
             regerror(status, &preg, errbuf, buf_size);
           /* flag the error */
     return  (status);
   }
    .
    .
    .

© 2003 Caldera International, Inc. All rights reserved.
SCO OpenServer Release 5.0.7 -- 11 February 2003