LWRE(3) LWRE(3) NAME lwre - lightweight regular expression matching SYNOPSIS #include <lwre.h> struct re RE_INITIALISER(expression); struct re *re_new(char *expression); int re_match(struct re *re, char *string); void re_reset(struct re *re, char *expression); void re_release(struct re *re); void re_free(struct re *re); char *re_escape(char *string, int liberal); INTRODUCTION liblwre is a lightweight regular expression library intended for the occasions when you want to search in text with a minimum of fuss, bloat and complexity. Compared to other common solutions it is ten times smaller than the POSIX regex(3) functions and more than thirty-five times smaller than the popular pcre(3) package. It has far fewer bells, whistles, dials and guages, but in return offers the simplest possible interface, very reasonable performance, and freedom from the exponential complexity that renders backtracking implementations use- less for certain kinds of regular expression. DESCRIPTION The re_new() function allocates a new regular expression (RE) object that will match the given regular expression and returns a pointer to it. The RE object contains the following publicly-accessible fields: struct re { char *expression; /* the associated regular expression */ char *error_code; /* an error code, or zero */ char *error_message; /* an error message, or NULL */ char *position; /* character after error or match */ char **matches; /* pointers to submatches, or NULL */ int nmatches; /* the number of submatches, or zero */ }; The re_match() function matches the given string against the expression associated with the re object and returns the number of characters that were sucessfully matched, or -1 if no match was found. After a suc- cessful match it sets the matches field to an array of character point- ers identifying the submatches found within the input string and sets nmatches to the number of pointers in the matches array. (See SUB- MATCHES for details.) The re_reset() function associates a new regular expression with an existing re object. The re_release() function releases all resources associated with an re object but not the object itself. (The object can be reused by calling re_reset.) The re_free() function releases all resources associated with the re object, including re itself. The RE_INITIALISER() macro can be used to initialise the contents of an RE object that is allocated in static or automatic storage. The pro- gram can call BR re_release () on such objects when they are no longer needed, and re_reset() to reuse them with a different expression, but it must never call re_free() for them. The re_escape() function rewrites the contents of the given string to replace any escape sequences with the character that they represent. (It is not involved in the regular expression matching achinery but may prove useful when reading regular expressions from a file to convert escape sequences into non-printing characters within those expres- sions.) RETURN VALUE The re_new() function returns a pointer to the newly-allocated RE object, or NULL if sufficient memory cannot be obtained from the sys- tem. The re_match() function returns a non-negative result indicating the number of characters successfully matched (which might be zero if the regular expression permits that) or a negative error code (and sets the error field of re to a string describing the error and the position field of re to the location of the problematic character in the associ- ated expression). The RE_INITIALISER() macro expands to a structure initialiser expres- sion corresponding to an RE object containing the given regular expres- sion. The re_escape() function always returns its first argument. DIAGNOSTICS The re_match() function will fail if the regular expression is mal- formed or if it does not match the given string. In case of failure it returns a negative error code and sets the error field of the RE object to a string describing the problem. Symbolic constants are defined for the error codes as follows: RE_ERROR_NONE No error (zero). RE_ERROR_NOMATCH The regular expression did not match the string (-1). The posi- tion field of the RE object will be set to the location in string at which a match became impossible. RE_ERROR_NOMEM A system memory allocation routine (typically one of malloc(), calloc() or realloc()) failed and returned zero. RE_ERROR_CHARSET The final ']' in a character set was not found. Either it was missing entirely or a group delimited by parentheses or braces ended before the ']' was seen. RE_ERROR_SUBEXP The final ')' in a parenthesized subexpression was not found. Either it was missing entirely or a group delimited by braces ended before the ')' was seen. RE_ERROR_SUBMATCH The final '}' in a brace-delimited submatch was not found. Either it was missing entirely or a subexpression delimited by parentheses ended before the '}' was seen. RE_ERROR_ENGINE An internal error occurred in the matching engine. This indi- cates a serious problem with the implementation and should be reported as a bug. REGULAR EXPRESSIONS The regular expressions understood by lwre are composed of the follow- ing elements: a Any non-meta character matches itself. \a An escaped character matches the character (in this example 'a') even if it would otherwise be a meta character (therefore use '\\' to match a literal backslash). . Dot matches any character (use '\.' to match a literal dot). [set] A character class matches any single character from the given set (use '\[' in an expression to match a literal left square bracket). The set consists of one or more character specifica- tions as follows: a Any non-meta character represents itself literally. \a An escaped character matches itself (even if it would oth- erwise be interpreted as a meta character). x-y A hyphen between two characters denotes the range of char- acters that includes both characters and all characters in between them. Either or both of the characters can be escaped, but the hyphen must not be. If the first charac- ter lexicographically follows the second then the range contains only the first character. (To include a hyphen character in the set either put the hypen first in the set, or use '\-'.) ^ An initial circumflex negates the contents of the set, which then implicitly contains every character not explic- itly listed. The set '[^0-9]' therefore matches any non- digit character. (To include the circumflex character in the set either place it anywhere but the beginning of the set, or use '\^'.) \] An escaped right square bracket matches a right square bracket (even in a position that would otherwise delimit the end of the set). Let e denote any regular expression, including the single-character expressions described above. Expressions are combined into complex forms using the following operators (which are listed in precedence order from lowest to highest): e1e2 Concatenation of two expressions yields an expression that matches the first subexpression and then the second subexpres- sion. e1|e2 Alternation (the vertical bar character) matches either the first expression or the second expression. If the two expressions would match simultaneously, the first has priority. (Use '\|' to match a literal vertical bar character.) e? The postfix '?' operator causes the immediately-preceding subex- presion to be optional: it will match zero or once. e* The postfix '*' operator causes the immediately-preceding subex- presion to be matched any number of times (including none). e+ The postfix '+' operator causes the immediately-preceding subex- presion to be matched one or more times. These three postfix operators are 'greedy'; if given a chance, starting with the leftmost, they each will consume as much input as possible providing the rest of the regular expression continues to match the input. To make them 'lazy' (consume as little of the input as possi- ble, provided the rest of the regular expression continues to match the input) follow them with an additional '?' character; thus: e?? matches e zero or one time, preferring zero times if possible; e*? matches e zero or more times, preferring zero times if possible; e+? matches e one or more times, preferring once if possible. All six postfix operators have the same precedence. Expressions can be grouped either to modify precedence or to record submatch ranges. (e) Parentheses group one or more expressions together. The result acts as a single atomic expression. {e} Curly braces behave like parentheses except that the portion of the input string matched by the expression(s) between the left and right brace will be recorded as a submatch in the matches field of the RE object. (See SUBMATCHES for details.) Parentheses and braces can be nested and must be correctly balanced. SUBMATCHES After a successful match, each successive adjecent pair of pointers in the matches field of the RE object identifies the portion of the input string corresponding to each successive submatch indicated in the regu- lar expression. These pairs of pointers follow the usual C convention of being inclusive of the first character in the string and exclusive of the last character: the second pointer of each pair points to the character after the last chacracter of the input string belonging to a submatch. In other words, the nth submatch begins at matches[2 * n] and ends at the character before matches[2 * n + 1] and is of length matches[2 * n + 1] - matches[2 * n] for 0 <= n < nmatches/2. The memory pointed to by matches is managed by the RE implementation and will be invalidated on the next call to any re_*() function. This memory may not be allocated using malloc() and should never be free()d explicitly by the calling program. It can be returned to the system implicitly by calling re_release() or re_free() on the associated RE object. The following example illustrates the representation and management of submatches. EXAMPLE This program copies stdin to stdout while inserting ANSI terminal escape sequences to make each number (non-empty sequence of digits) in the input be printed bold and in red on the output. #include <stdio.h> #include <re.h> int main(int argc, char **argv) { static struct re re= RE_INITIALISER("(.*?{[0-9]+})*"); char buf[1024]; while (fgets(buf, sizeof(buf), stdin)) { if (re_match(&re, buf) < 0) { /* error */ fprintf(stderr, "%s\n", re.error); break; } else { char *p= buf; int n= 0; while (*p) { if (n < re.nmatches && p == re.matches[n]) { if (n & 1) /* end: clear attributes */ printf("\033[0m"); else printf("\033[1;31m"); /* start: bold, red */ ++n; } putchar(*p++); } } } re_release(&re); return 0; } The regular expression (.*?{[0-9]+})* begins by ignoring as few characters as possible (.*?) before matching as many consecutive digits ([0-9]+) as possible. Each sequence of dig- its that is matched is then grouped within curly braces ({[0-9]+}) to form a submatch that will be placed in the matches array of the re object. The whole pattern is parenthesized and repeated as many times as possible ((.?*{[0-9]+})*) so as to identify all the numbers in the input. After a successful match the line is written to the output one charac- ter at a time. If the current position corresponds to the next even- numbered pointer in matches then the character begins a submatch and the terminal control sequence for bold, red text is output before the character. If the current position equals the next odd-numbered pointer then it begins the text following (and not including) a sub- match, and the terminal control sequence to clear all text attributes is output before the character. UTILITIES The re_escape() function rewrites its string argument in-place, con- verting standard C escape sequences into the characters that they rep- resent. It is provided as a convenient means to place non-printing characters into expressions read from a terminal or file. The standard C escape sequences are recognised: \NNN ASCII character in octal notation \\ backslash \a alert or 'BEL' (and if this actually rings a bell for you then you should seriously consider upgrading your peripherals) \b backspace \f form feed \n new line \r carriage return \t horizontal tab \UNNNNNNNN 32-bit (8-digit) Unicode character \uNNNN 16-bit (4-digit) Unicode character \v vertical tab \xNN ASCII character in hexadecimal notation The liberal argument to re_escape() affects its treatment of escaped characters other than those shown above. For any escaped character not in the above list, if liberal is zero then then both the escape charac- ter (the backslash) and the escaped character following it will be pre- served verbatim in the re-written string. If liberal is non-zero then the backslash character will be deleted and only the character follow- ing it preserved verbatim in the result. (For the sequences listed above, the initial backslash will always be deleted from the re-written string regardless of the value of liberal.) For the two Unicode escape sequences, '\u' and '\U', if the resulting character is wider than 7 bits (i.e., beyond the 127 ASCII characters) then its UTF-8 multibyte representation will be placed in the re-writ- ten string. USAGE AND CUSTOMISATION The lwre implementation consist of a single source file and a single header file. If lwre is installed as a library, add #include <lwre.h> in your source file and add -llwre to your compile or link command. In this case no customisation is pos- sible. You can also include the lwre header and source file in your own project sources, and then add #include <lwre.c> to one of your source files. This will pull the entire implemenation into your file. In this case several cpp(1) macros can be defined (before including lwre.c, obviously) to customise its behaviour. RE_MEMBERS If defined, this macro should expand to one or more structure mem- ber declarations. These members will be added to the declaration of struct re for the convenience of the calling program. RE_MALLOC(RE, SIZE) If defined, this macro should expand to an expression that behaves like malloc(3). The first argument is a pointer to the struct re associated with the ongoing compilation or matching process, and the second is the usual size_t argument indicating the number of bytes to allocate. RE_CALLOC(RE, NMEMB, SIZE) If defined, this macro should expand to an expression that behaves like calloc(3). The first argument is a pointer to the current struct re object. RE_REALLOC(RE, PTR, SIZE) If defined, this macro should expand to an expression that behaves like realloc(3). The first argument is a pointer to the current struct re object. RE_FREE(RE) If defined, this macro should expand to a statement that behaves like free(3). The first argument is a pointer to the current struct re object. Error recovery from within the above expressions can be performed by invoking the macro RE_ERROR(RE, CODE, MESSAGE) where RE is the struct re object, CODE is a numeric error code, and MESSAGE is an explanatory string. The RE_ERROR macro can also be redefined by the client program before including lwre.c, if desired. A pointer to a jmp_buf is avail- able in RE->error_env to terminate the active call to re_match early via longjmp(3) and return control to the client program. The default definition of RE_ERROR is { (RE)->error_code = (CODE); (RE)->error_message = (MESSAGE); longjmp(*(RE)->error_env, (CODE)); } If the memory allocator macros are not defined then they will default to implementations that check for memory exhaustion (a zero result from any of the allocation functions) and then return immediately to the calling program by invoking RE_ERROR. Note that in case of error a non-zero error code must be stored in re->error_code since parts of the implementation depend on it for detecting an error condition. Note also that the entire state associated with compiling and matching regular expressions is stored in the struct re object. If your sys- tem's memory allocation routines (and/or those which you provide via the above macros) is thread-safe, then lwre should be re-entrant and thread-safe too. CAVEATS The re_match() function does not interpret escape sequences other than to treat a character following backslash literally. You have to rely on the C compiler (or the re_escape() function) to create control and other non-printing characters for you within regular expression. Occa- sionally the C compiler will attempt to steal backslashes from your string literals in cases where you want them preserved. This leads to situations where you may have to write '\\' (or even '\\\\') to specify a single literal backslash character in your regular expression. Searches are implicitly anchored at the start of the input string. They will not find a match in the middle of the string unless '.*?' is used at the beginning of the regular expression to explicitly 'unan- chor' the search. Conversely, the search is not anchored to the end of the string; if necessary the caller can inspect the value returned by re_match() to ensure that the entire string was matched. If you are upset about either of these 'limitations', you should not be. Meta (operator and grouping) characters in regular expressions are only treated as meta characters when they occur in positions where they are expected and make sense, otherwise they are treated as normal charac- ters. The regular expression 'a**' therefore matches a sequence of zero or more 'a's followed by a single '*' character. If you really want to repeat the repetition, which is pointless but mostly harmless, you could write '(a*)*' instead. (This will slow down matching slightly since no attempt is made to optimise suboptimal regular expressions.) The regular expression is stored in the RE object as an accessible field. It is compiled lazily on the first call to re_match(). The memory containing the expression must not be modified or free(3)d between the initialisation of the RE object and the first call to re_match(). Changes made to the memory pointed to by expression after the first call to re_match() will have no effect on regular expression matching. To force recompilation of the expression in a reusable RE object the program must call re_reset(). BUGS There is no way to change the 're_' prefix on functions, or the name of the 're' structure, that does not involve editing the source code. The curly brace syntax for submatches conflicts with potential support for the common syntax 'e{N,M}' (meaning 'at least N and at most M occurences of e', subsuming the '?', '*' and '+' operators). Both the regular expression and the string to match within are termi- nated with a NUL character. It would be more general (allowing arbi- trary binary data to be matched) if expression and string were speci- fied as a base address plus a length. There is no wide character interface. (UTF-8 sequences within expres- sions and input strings should work as expected.) Please report anything suspicious (with this software) to the author at the address below. (Anything else that seems suspicious is almost cer- tainly best kept to yourself. The fearmongers really don't need to know about the weird stuff your neighbours do for thrills, or the bizarre contraptions they leave lying about. Your worrying energy is far better reserved for the real bogeymen, who wait under your bed to sneak out and steal your socks after your parents turn out the lights at night. The usual workaround is to install a padlock on your chest of drawers.) QUOTATIONS "I define Unix as 30 definitions of regular expressions living under one roof." (Donald Knuth, Digital Typography, 1999) SEE ALSO regex(3), pcre(3), utf-8(7) AUTHOR The software and this manual page were written by Ian Piumarta (first- name at last-name dot com) while investigating how a modern programming language might be made as interactive, fun and easy to use for begin- ners as was (for example) BASIC on a typical home microcomputer of the early 1980s. Please send bug reports and suggestions for improvements to the author at the above address. Version 0.0 January 2014 LWRE(3)