LWRE(3)

LWRE(3)                                                                LWRE(3)



NAME
       lwre - lightweight regular expression matching

SYNOPSIS
       #include <lwre.h>

       struct re RE_INITIALISER(expression);

       struct re *re_new(char *expression);
       int  re_match(struct re *re, char *string);
       void re_reset(struct re *re, char *expression);
       void re_release(struct re *re);
       void re_free(struct re *re);

       char *re_escape(char *string, int liberal);

INTRODUCTION
       liblwre  is  a  lightweight regular expression library intended for the
       occasions when you want to search in text with a minimum of fuss, bloat
       and  complexity.   Compared  to  other common solutions it is ten times
       smaller than the POSIX regex(3) functions  and  more  than  thirty-five
       times  smaller  than  the  popular  pcre(3)  package.  It has far fewer
       bells, whistles, dials and guages, but in return  offers  the  simplest
       possible  interface,  very reasonable performance, and freedom from the
       exponential complexity that renders backtracking  implementations  use-
       less for certain kinds of regular expression.

DESCRIPTION
       The  re_new()  function  allocates a new regular expression (RE) object
       that will match the given regular expression and returns a  pointer  to
       it.  The RE object contains the following publicly-accessible fields:

           struct re {
               char  *expression;    /* the associated regular expression */
               char  *error_code;    /* an error code, or zero */
               char  *error_message; /* an error message, or NULL */
               char  *position;       /* character after error or match */
               char **matches;        /* pointers to submatches, or NULL */
               int    nmatches;       /* the number of submatches, or zero */
           };

       The re_match() function matches the given string against the expression
       associated with the re object and returns the number of characters that
       were  sucessfully  matched,  or -1 if no match was found.  After a suc-
       cessful match it sets the matches field to an array of character point-
       ers  identifying  the submatches found within the input string and sets
       nmatches to the number of pointers in the  matches  array.   (See  SUB-
       MATCHES for details.)

       The  re_reset()  function  associates  a new regular expression with an
       existing re object.

       The re_release() function releases all resources associated with an  re
       object but not the object itself.  (The object can be reused by calling
       re_reset.)

       The re_free() function releases all resources associated  with  the  re
       object, including re itself.

       The RE_INITIALISER() macro can be used to initialise the contents of an
       RE object that is allocated in static or automatic storage.   The  pro-
       gram  can call BR re_release () on such objects when they are no longer
       needed, and re_reset() to reuse them with a different  expression,  but
       it must never call re_free() for them.

       The  re_escape()  function rewrites the contents of the given string to
       replace any escape sequences with the character  that  they  represent.
       (It is not involved in the regular expression matching achinery but may
       prove useful when reading regular expressions from a  file  to  convert
       escape  sequences  into  non-printing  characters  within those expres-
       sions.)

RETURN VALUE
       The re_new() function returns  a  pointer  to  the  newly-allocated  RE
       object,  or  NULL if sufficient memory cannot be obtained from the sys-
       tem.

       The re_match() function returns a non-negative  result  indicating  the
       number  of  characters successfully matched (which might be zero if the
       regular expression permits that) or a negative error code (and sets the
       error  field  of  re  to a string describing the error and the position
       field of re to the location of the problematic character in the associ-
       ated expression).

       The  RE_INITIALISER()  macro expands to a structure initialiser expres-
       sion corresponding to an RE object containing the given regular expres-
       sion.

       The re_escape() function always returns its first argument.

DIAGNOSTICS
       The  re_match()  function  will  fail if the regular expression is mal-
       formed or if it does not match the given string.  In case of failure it
       returns a negative error code and sets the error field of the RE object
       to a string describing the problem.  Symbolic constants are defined for
       the error codes as follows:

       RE_ERROR_NONE
             No error (zero).

       RE_ERROR_NOMATCH
             The  regular expression did not match the string (-1).  The posi-
             tion field of the RE object will be set to the location in string
             at which a match became impossible.

       RE_ERROR_NOMEM
             A  system  memory  allocation routine (typically one of malloc(),
             calloc() or realloc()) failed and returned zero.

       RE_ERROR_CHARSET
             The final ']' in a character set was not found.   Either  it  was
             missing  entirely  or  a group delimited by parentheses or braces
             ended before the ']' was seen.

       RE_ERROR_SUBEXP
             The final ')' in a parenthesized  subexpression  was  not  found.
             Either  it  was  missing  entirely or a group delimited by braces
             ended before the ')' was seen.

       RE_ERROR_SUBMATCH
             The final '}'  in  a  brace-delimited  submatch  was  not  found.
             Either  it  was  missing entirely or a subexpression delimited by
             parentheses ended before the '}' was seen.

       RE_ERROR_ENGINE
             An internal error occurred in the matching  engine.   This  indi-
             cates  a  serious  problem  with the implementation and should be
             reported as a bug.

REGULAR EXPRESSIONS
       The regular expressions understood by lwre are composed of the  follow-
       ing elements:

       a     Any non-meta character matches itself.

       \a    An  escaped character matches the character (in this example 'a')
             even if it would otherwise be a  meta  character  (therefore  use
             '\\' to match a literal backslash).

       .     Dot matches any character (use '\.' to match a literal dot).

       [set] A character class matches any single character from the given set
             (use '\[' in  an  expression  to  match  a  literal  left  square
             bracket).   The  set consists of one or more character specifica-
             tions as follows:

             a     Any non-meta character represents itself literally.

             \a    An escaped character matches itself (even if it would  oth-
                   erwise be interpreted as a meta character).

             x-y   A  hyphen between two characters denotes the range of char-
                   acters that includes both characters and all characters  in
                   between  them.   Either  or  both  of the characters can be
                   escaped, but the hyphen must not be.  If the first  charac-
                   ter  lexicographically  follows  the  second then the range
                   contains only the first character.  (To  include  a  hyphen
                   character in the set either put the hypen first in the set,
                   or use '\-'.)

             ^     An initial circumflex negates  the  contents  of  the  set,
                   which  then implicitly contains every character not explic-
                   itly listed.  The set '[^0-9]' therefore matches  any  non-
                   digit  character.   (To include the circumflex character in
                   the set either place it anywhere but the beginning  of  the
                   set, or use '\^'.)

             \]    An  escaped  right  square  bracket  matches a right square
                   bracket (even in a position that  would  otherwise  delimit
                   the end of the set).

       Let  e  denote  any  regular expression, including the single-character
       expressions described above.  Expressions  are  combined  into  complex
       forms  using  the  following  operators (which are listed in precedence
       order from lowest to highest):

       e1e2  Concatenation  of  two  expressions  yields  an  expression  that
             matches  the  first  subexpression and then the second subexpres-
             sion.

       e1|e2 Alternation (the vertical bar character) matches either the first
             expression  or  the  second  expression.   If the two expressions
             would match simultaneously, the first has priority.  (Use '\|' to
             match a literal vertical bar character.)

       e?    The  postfix '?' operator causes the immediately-preceding subex-
             presion to be optional: it will match zero or once.

       e*    The postfix '*' operator causes the immediately-preceding  subex-
             presion to be matched any number of times (including none).

       e+    The  postfix '+' operator causes the immediately-preceding subex-
             presion to be matched one or more times.

       These three postfix operators are 'greedy'; if given a chance, starting
       with  the  leftmost,  they  each will consume as much input as possible
       providing the rest of the regular expression  continues  to  match  the
       input.   To  make them 'lazy' (consume as little of the input as possi-
       ble, provided the rest of the regular expression continues to match the
       input) follow them with an additional '?'  character; thus:

       e??   matches e zero or one time, preferring zero times if possible;

       e*?   matches e zero or more times, preferring zero times if possible;

       e+?   matches e one or more times, preferring once if possible.

       All six postfix operators have the same precedence.

       Expressions  can  be  grouped  either to modify precedence or to record
       submatch ranges.

       (e)   Parentheses group one or more expressions together.   The  result
             acts as a single atomic expression.

       {e}   Curly  braces  behave like parentheses except that the portion of
             the input string matched by the expression(s)  between  the  left
             and  right  brace  will  be recorded as a submatch in the matches
             field of the RE object.  (See SUBMATCHES for details.)

       Parentheses and braces can be nested and must be correctly balanced.

SUBMATCHES
       After a successful match, each successive adjecent pair of pointers  in
       the  matches field of the RE object identifies the portion of the input
       string corresponding to each successive submatch indicated in the regu-
       lar  expression.  These pairs of pointers follow the usual C convention
       of being inclusive of the first character in the string  and  exclusive
       of  the  last  character: the second pointer of each pair points to the
       character after the last chacracter of the input string belonging to  a
       submatch.  In other words, the nth submatch begins at

           matches[2 * n]

       and ends at the character before

           matches[2 * n + 1]

       and is of length

           matches[2 * n + 1] - matches[2 * n]

       for 0 <= n < nmatches/2.

       The  memory  pointed  to by matches is managed by the RE implementation
       and will be invalidated on the next call to any re_*() function.   This
       memory  may not be allocated using malloc() and should never be free()d
       explicitly by the calling program.  It can be returned  to  the  system
       implicitly  by  calling  re_release() or re_free() on the associated RE
       object.

       The following example illustrates the representation and management  of
       submatches.

EXAMPLE
       This  program  copies  stdin  to  stdout  while inserting ANSI terminal
       escape sequences to make each number (non-empty sequence of digits)  in
       the input be printed bold and in red on the output.

           #include <stdio.h>
           #include <re.h>

           int main(int argc, char **argv)
           {
             static struct re re= RE_INITIALISER("(.*?{[0-9]+})*");
             char buf[1024];
             while (fgets(buf, sizeof(buf), stdin)) {
               if (re_match(&re, buf) < 0) { /* error */
                 fprintf(stderr, "%s\n", re.error);
                 break;
               }
               else {
                 char *p= buf;
                 int n= 0;
                 while (*p) {
                   if (n < re.nmatches && p == re.matches[n]) {
                     if (n & 1)              /* end: clear attributes */
                       printf("\033[0m");
                     else
                       printf("\033[1;31m"); /* start: bold, red */
                     ++n;
                   }
                   putchar(*p++);
                 }
               }
             }
             re_release(&re);
             return 0;
           }

       The regular expression

           (.*?{[0-9]+})*

       begins by ignoring as few characters as possible (.*?)  before matching
       as many consecutive digits ([0-9]+) as possible.  Each sequence of dig-
       its  that  is matched is then grouped within curly braces ({[0-9]+}) to
       form a submatch that will be placed in the  matches  array  of  the  re
       object.   The whole pattern is parenthesized and repeated as many times
       as possible ((.?*{[0-9]+})*) so as to identify all the numbers  in  the
       input.

       After  a successful match the line is written to the output one charac-
       ter at a time.  If the current position corresponds to the  next  even-
       numbered  pointer  in  matches then the character begins a submatch and
       the terminal control sequence for bold, red text is output  before  the
       character.   If  the  current  position  equals  the  next odd-numbered
       pointer then it begins the text following (and not  including)  a  sub-
       match,  and  the terminal control sequence to clear all text attributes
       is output before the character.

UTILITIES
       The re_escape() function rewrites its string  argument  in-place,  con-
       verting  standard C escape sequences into the characters that they rep-
       resent.  It is provided as a convenient  means  to  place  non-printing
       characters into expressions read from a terminal or file.  The standard
       C escape sequences are recognised:

           \NNN        ASCII character in octal notation

           \\          backslash

           \a          alert or 'BEL' (and if this actually rings a  bell  for
                       you  then  you should seriously consider upgrading your
                       peripherals)

           \b          backspace

           \f          form feed

           \n          new line

           \r          carriage return

           \t          horizontal tab

           \UNNNNNNNN  32-bit (8-digit) Unicode character

           \uNNNN      16-bit (4-digit) Unicode character

           \v          vertical tab

           \xNN        ASCII character in hexadecimal notation

       The liberal argument to re_escape() affects its  treatment  of  escaped
       characters other than those shown above.  For any escaped character not
       in the above list, if liberal is zero then then both the escape charac-
       ter (the backslash) and the escaped character following it will be pre-
       served verbatim in the re-written string.  If liberal is non-zero  then
       the  backslash character will be deleted and only the character follow-
       ing it preserved verbatim in the result.   (For  the  sequences  listed
       above, the initial backslash will always be deleted from the re-written
       string regardless of the value of liberal.)

       For the two Unicode escape sequences, '\u' and '\U', if  the  resulting
       character  is wider than 7 bits (i.e., beyond the 127 ASCII characters)
       then its UTF-8 multibyte representation will be placed in the  re-writ-
       ten string.

USAGE AND CUSTOMISATION
       The  lwre  implementation  consist of a single source file and a single
       header file.  If lwre is installed as a library, add

           #include <lwre.h>

       in your source file and add

           -llwre

       to your compile or link command.  In this case no customisation is pos-
       sible.

       You  can  also  include  the  lwre  header  and source file in your own
       project sources, and then add

           #include <lwre.c>

       to one of your source files.  This will pull the  entire  implemenation
       into  your  file.   In  this  case several cpp(1) macros can be defined
       (before including lwre.c, obviously) to customise its behaviour.

       RE_MEMBERS
           If defined, this macro should expand to one or more structure  mem-
           ber  declarations.   These members will be added to the declaration
           of struct re for the convenience of the calling program.

       RE_MALLOC(RE, SIZE)
           If defined, this macro should expand to an expression that  behaves
           like  malloc(3).   The first argument is a pointer to the struct re
           associated with the ongoing compilation or  matching  process,  and
           the  second  is  the usual size_t argument indicating the number of
           bytes to allocate.

       RE_CALLOC(RE, NMEMB, SIZE)
           If defined, this macro should expand to an expression that  behaves
           like  calloc(3).   The  first  argument is a pointer to the current
           struct re object.

       RE_REALLOC(RE, PTR, SIZE)
           If defined, this macro should expand to an expression that  behaves
           like  realloc(3).   The  first argument is a pointer to the current
           struct re object.

       RE_FREE(RE)
           If defined, this macro should expand to a  statement  that  behaves
           like  free(3).   The  first  argument  is  a pointer to the current
           struct re object.

       Error recovery from within the above expressions can  be  performed  by
       invoking  the  macro RE_ERROR(RE, CODE, MESSAGE) where RE is the struct
       re object, CODE is a numeric error code, and MESSAGE is an  explanatory
       string.  The RE_ERROR macro can also be redefined by the client program
       before including lwre.c, if desired.  A pointer to a jmp_buf is  avail-
       able  in  RE->error_env  to terminate the active call to re_match early
       via longjmp(3) and return control to the client program.   The  default
       definition of RE_ERROR is

           {
             (RE)->error_code = (CODE);
             (RE)->error_message = (MESSAGE);
             longjmp(*(RE)->error_env, (CODE));
           }

       If  the  memory allocator macros are not defined then they will default
       to implementations that check for memory exhaustion (a zero result from
       any  of  the  allocation  functions) and then return immediately to the
       calling program by invoking RE_ERROR.  Note that in  case  of  error  a
       non-zero error code must be stored in re->error_code since parts of the
       implementation depend on it for detecting an error condition.

       Note also that the entire state associated with compiling and  matching
       regular  expressions  is  stored in the struct re object.  If your sys-
       tem's memory allocation routines (and/or those which  you  provide  via
       the  above  macros)  is thread-safe, then lwre should be re-entrant and
       thread-safe too.

CAVEATS
       The re_match() function does not interpret escape sequences other  than
       to  treat  a character following backslash literally.  You have to rely
       on the C compiler (or the re_escape() function) to create  control  and
       other non-printing characters for you within regular expression.  Occa-
       sionally the C compiler will attempt to  steal  backslashes  from  your
       string  literals in cases where you want them preserved.  This leads to
       situations where you may have to write '\\' (or even '\\\\') to specify
       a single literal backslash character in your regular expression.

       Searches  are  implicitly  anchored  at  the start of the input string.
       They will not find a match in the middle of the string unless '.*?'  is
       used  at  the  beginning of the regular expression to explicitly 'unan-
       chor' the search.  Conversely, the search is not anchored to the end of
       the  string;  if necessary the caller can inspect the value returned by
       re_match() to ensure that the entire string was matched.   If  you  are
       upset about either of these 'limitations', you should not be.

       Meta (operator and grouping) characters in regular expressions are only
       treated as meta characters when they occur in positions where they  are
       expected  and  make sense, otherwise they are treated as normal charac-
       ters.  The regular expression 'a**' therefore  matches  a  sequence  of
       zero  or  more  'a's followed by a single '*' character.  If you really
       want to repeat the repetition, which is pointless but mostly  harmless,
       you  could  write  '(a*)*'  instead.   (This  will  slow  down matching
       slightly since no  attempt  is  made  to  optimise  suboptimal  regular
       expressions.)

       The  regular  expression  is  stored  in the RE object as an accessible
       field.  It is compiled lazily on the first  call  to  re_match().   The
       memory  containing  the  expression  must  not  be modified or free(3)d
       between the initialisation of the RE  object  and  the  first  call  to
       re_match().   Changes made to the memory pointed to by expression after
       the first call to re_match() will have no effect on regular  expression
       matching.   To  force  recompilation of the expression in a reusable RE
       object the program must call re_reset().

BUGS
       There is no way to change the 're_' prefix on functions, or the name of
       the 're' structure, that does not involve editing the source code.

       The  curly brace syntax for submatches conflicts with potential support
       for the common syntax 'e{N,M}' (meaning 'at  least  N  and  at  most  M
       occurences of e', subsuming the '?', '*' and '+' operators).

       Both  the  regular expression and the string to match within are termi-
       nated with a NUL character.  It would be more general  (allowing  arbi-
       trary  binary  data to be matched) if expression and string were speci-
       fied as a base address plus a length.

       There is no wide character interface.  (UTF-8 sequences within  expres-
       sions and input strings should work as expected.)

       Please report anything suspicious (with this software) to the author at
       the address below.  (Anything else that seems suspicious is almost cer-
       tainly  best  kept  to  yourself.  The fearmongers really don't need to
       know about the weird stuff your  neighbours  do  for  thrills,  or  the
       bizarre  contraptions  they leave lying about.  Your worrying energy is
       far better reserved for the real bogeymen, who wait under your  bed  to
       sneak  out  and steal your socks after your parents turn out the lights
       at night.  The usual workaround is to install a padlock on  your  chest
       of drawers.)

QUOTATIONS
       "I  define  Unix  as 30 definitions of regular expressions living under
       one roof."  (Donald Knuth, Digital Typography, 1999)

SEE ALSO
       regex(3), pcre(3), utf-8(7)

AUTHOR
       The software and this manual page were written by Ian Piumarta  (first-
       name at last-name dot com) while investigating how a modern programming
       language might be made as interactive, fun and easy to use  for  begin-
       ners  as was (for example) BASIC on a typical home microcomputer of the
       early 1980s.

       Please send bug reports and suggestions for improvements to the  author
       at the above address.



Version 0.0                      January 2014                          LWRE(3)