diff options
Diffstat (limited to 'doc/tre-syntax.html')
-rw-r--r-- | doc/tre-syntax.html | 433 |
1 files changed, 433 insertions, 0 deletions
diff --git a/doc/tre-syntax.html b/doc/tre-syntax.html new file mode 100644 index 0000000000000..13ccb5494db28 --- /dev/null +++ b/doc/tre-syntax.html @@ -0,0 +1,433 @@ +<h1>TRE Regexp Syntax</h1> + +<p> +This document describes the POSIX 1003.2 extended RE (ERE) syntax and +the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions +to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) style +notation is used to describe the grammar. +</p> + +<h2>ERE Syntax</h2> + +<h3>Alternation operator</h3> +<a name="alternation"></a> +<a name="extended-regexp"></a> + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a> + | <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a> +</pre> +</td></tr> +</table> +<p> +An extended regexp (ERE) is one or more <i>branches</i>, separated by +<tt>|</tt>. An ERE matches anything that matches one or more of the +branches. +</p> + +<h3>Catenation of REs</h3> +<a name="catenation"></a> +<a name="branch"></a> + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>branch</i> ::= <i>piece</i> + | <i>branch</i> <i>piece</i> +</pre> +</td></tr> +</table> +<p> +A branch is one or more <i>pieces</i> concatenated. It matches a +match for the first piece, followed by a match for the second piece, +and so on. +</p> + + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>piece</i> ::= <i>atom</i> + | <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a> + | <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a> +</pre> +</td></tr> +</table> +<p> +A piece is an <i>atom</i> possibly followed by a repeat operator or an +expression controlling approximate matching parameters for the <i>atom</i>. +</p> + + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b> + | <a href="#bracket-expression"><i>bracket-expression</i></a> + | <b>"."</b> + | <a href="#assertion"><i>assertion</i></a> + | <a href="#literal"><i>literal</i></a> + | <a href="#backref"><i>back-reference</i></a> + | <b>"(?#"</b> <i>comment-text</i> <b>")"</b> + | <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i> + | <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b> +</pre> +</td></tr> +</table> +<p> +An atom is either an ERE enclosed in parenthesis, a bracket +expression, a <tt>.</tt> (period), an assertion, or a literal. +</p> + +<p> +The dot (<tt>.</tt>) matches any single character. +If the <code>REG_NEWLINE</code> compilation flag (see <a +href="api.html">API manual</a>) is specified, the newline +character is not matched. +</p> + +<p> +<tt>Comment-text</tt> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is +completely ignored by the regex parser and it used solely for readability purposes. +</p> + +<h3>Repeat operators</h3> +<a name="repeat-operator"></a> + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>repeat-operator</i> ::= <b>"*"</b> + | <b>"+"</b> + | <b>"?"</b> + | <i>bound</i> + | <b>"*?"</b> + | <b>"+?"</b> + | <b>"??"</b> + | <i>bound</i> <b>?</b> +</pre> +</td></tr> +</table> + +<p> +An atom followed by <tt>*</tt> matches a sequence of 0 or more matches +of the atom. <tt>+</tt> is similar to <tt>*</tt>, matching a sequence +of 1 or more matches of the atom. An atom followed by <tt>?</tt> +matches a sequence of 0 or 1 matches of the atom. +</p> + +<p> +A <i>bound</i> is one of the following, where <i>m</i> and <i>m</i> +are unsigned decimal integers between <tt>0</tt> and +<tt>RE_DUP_MAX</tt>: +</p> + +<ol> +<li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li> +<li><tt>{</tt><i>m</i><tt>,}</tt></li> +<li><tt>{</tt><i>m</i><tt>}</tt></li> +</ol> + +<p> +An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i> +(inclusive) matches of the atom. An atom followed by [2] +matches a sequence of <i>m</i> or more matches of the atom. An atom +followed by [3] matches a sequence of exactly <i>m</i> matches of the +atom. +</p> + + +<p> +Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or +non-greedy. Normally a repeated expression is greedy, that is, it matches as +many characters as possible. A non-greedy subexpression matches as few +characters as possible. Note that this does not (always) mean the same thing +as matching as many or few repetitions as possible. Also note +that <strong>minimal repetitions are not currently supported for approximate +matching</strong>. +</p> + +<h3>Approximate matching settings</h3> +<a name="approx-settings"></a> + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b> + +<i>count-limits</i> ::= <b>"+"</b> <i>number</i>? + | <b>"-"</b> <i>number</i>? + | <b>"#"</b> <i>number</i>? + | <b>"~"</b> <i>number</i>? + +<i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"<"</b> <i>number</i> + +<i>cost-term</i> ::= <i>number</i> <b>"i"</b> + | <i>number</i> <b>"d"</b> + | <i>number</i> <b>"s"</b> + +</pre> +</td></tr> +</table> + +<p> +The approximate matching settings for a subpattern can be changed +by appending <i>approx-settings</i> to the subpattern. Limits for +the number of errors can be set and an expression for specifying and +limiting the costs can be given. +</p> + +<p> +The <i>count-limits</i> can be used to set limits for the number of +insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions +(<tt>#</tt>), and total number of errors (<tt>~</tt>). If the +<i>number</i> part is omitted, the specified error count will be +unlimited. +</p> + +<p> +The <i>cost-equation</i> can be thought of as a mathematical equation, +where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of +insertions, deletions, and substitutions, respectively. The equation +can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and +<tt>s</tt>. The multiplier is the cost of the error, and the number +after <tt><</tt> is the maximum allowed cost of a match. Spaces +and pluses can be inserted to make the equation readable. In fact, when +specifying only a cost equation, adding a space after the opening <tt>{</tt> +is <strong>required</strong>. +</p> + +<p> +Examples: +<dl> +<dt><tt>{~}</tt></dt> +<dd>Sets the maximum number of errors to unlimited.</dd> +<dt><tt>{~3}</tt></dt> +<dd>Sets the maximum number of errors to three.</dd> +<dt><tt>{+2~5}</tt></dt> +<dd>Sets the maximum number of errors to five, and the maximum number +of insertions to two.</dd> +<dt><tt>{<3}</tt></dt> +<dd>Sets the maximum cost to three. +<dt><tt>{ 2i + 1d + 2s < 5 }</tt></dt> +<dd>Sets the cost of an insertion to two, a deletion to one, a +substitution to two, and the maximum cost to five. +</dl> + + +<h3>Bracket expressions</h3> +<a name="bracket-expression"></a> + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b> + | <b>"[^"</b> <i>item</i>+ <b>"]"</b> +</pre> +</td></tr> +</table> + +<p> +A bracket expression specifies a set of characters by enclosing a +nonempty list of items in brackets. Normally anything matching any +item in the list is matched. If the list begins with <tt>^</tt> the +meaning is negated; any character matching no item in the list is +matched. +</p> + +<p> +An item is any of the following: +</p> +<ul> +<li>A single character, matching that character.</li> +<li>Two characters separated by <tt>-</tt>. This is shorthand for the +full range of characters between those two (inclusive) in the +collating sequence. For example, <tt>[0-9]</tt> in ASCII matches any +decimal digit.</li> +<li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>, +matching the collating element. This can be used to include a literal +<tt>-</tt> or a multi-character collating element in the list.</li> +<li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an +equivalence class), matching all collating elements with the same +primary collation weight as that element, including the element +itself.</li> +<li>The name of a character class enclosed in <tt>[:</tt> and +<tt>:]</tt>, matching any character belonging to the class. The set +of valid names depends on the <code>LC_CTYPE</code> category of the +current locale, but the following names are valid in all locales: +<ul> +<li><tt>alnum</tt> - alphanumeric characters</li> +<li><tt>alpha</tt> - alphabetic characters</li> +<li><tt>blank</tt> - blank characters</li> +<li><tt>cntrl</tt> - control characters</li> +<li><tt>digit</tt> - decimal digits (0 through 9)</li> +<li><tt>graph</tt> - all printable characters except space</li> +<li><tt>lower</tt> - lower-case letters</li> +<li><tt>print</tt> - printable characters including space</li> +<li><tt>punct</tt> - printable characters not space or alphanumeric</li> +<li><tt>space</tt> - white-space characters</li> +<li><tt>upper</tt> - upper case letters</li> +<li><tt>xdigit</tt> - hexadecimal digits</li> +</ul> +</ul> +<p> +To include a literal <tt>-</tt> in the list, make it either the first +or last item, the second endpoint of a range, or enclose it in +<tt>[.</tt> and <tt>.]</tt> to make it a collating element. To +include a literal <tt>]</tt> in the list, make it either the first +item, the second endpoint of a range, or enclose it in <tt>[.</tt> and +<tt>.]</tt>. To use a literal <tt>-</tt> as the first +endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>. +</p> + + +<h3>Assertions</h3> +<a name="assertion"></a> + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>assertion</i> ::= <b>"^"</b> + | <b>"$"</b> + | <b>"\"</b> <i>assertion-character</i> +</pre> +</td></tr> +</table> + +<p> +The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and +"right anchor", respectively. The left anchor matches the empty +string at the beginning of the string. The right anchor matches the +empty string at the end of the string. The behaviour of both anchors +can be varied by specifying certain execution and compilation flags; +see the <a href="api.html">API manual</a>. +</p> + +<p> +An assertion-character can be any of the following: +</p> + +<ul> +<li><tt><</tt> - Beginning of word +<li><tt>></tt> - End of word +<li><tt>b</tt> - Word boundary +<li><tt>B</tt> - Non-word boundary +<li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li> +<li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li> +<li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li> +<li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li> +<li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li> +<li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li> +</ul> + + +<h3>Literals</h3> +<a name="literal"></a> + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>literal</i> ::= <i>ordinary-character</i> + | <b>"\x"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2} + | <b>"\x{"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b> + | <b>"\"</b> <i>character</i> +</pre> +</td></tr> +</table> +<p> +A literal is either an ordinary character (a character that has no +other significance in the context), an 8 bit hexadecimal encoded +character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character +(e.g. <tt>\x{263a}</tt>), or an escaped character. An escaped +character is a <tt>\</tt> followed by any character, and matches that +character. Escaping can be used to match characters which have a +special meaning in regexp syntax. A <tt>\</tt> cannot be the last +character of an ERE. Escaping also allows you to include a few +non-printable characters in the regular expression. These special +escape sequences include: +</p> + +<ul> +<li><tt>\a</tt> - Bell character (ASCII code 7) +<li><tt>\e</tt> - Escape character (ASCII code 27) +<li><tt>\f</tt> - Form-feed character (ASCII code 12) +<li><tt>\n</tt> - New-line/line-feed character (ASCII code 10) +<li><tt>\r</tt> - Carriage return character (ASCII code 13) +<li><tt>\t</tt> - Horizontal tab character (ASCII code 9) +</ul> + +<p> +An ordinary character is just a single character with no other +significance, and matches that character. A <tt>{</tt> followed by +something else than a digit is considered an ordinary character. +</p> + + +<h3>Back references</h3> +<a name="backref"></a> + +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>] +</pre> +</td></tr> +</table> +<p> +A back reference is a backslash followed by a single non-zero decimal +digit <i>d</i>. It matches the same sequence of characters +matched by the <i>d</i>th parenthesized subexpression. +</p> + +<p> +Back references are not defined for POSIX EREs (for BREs they are), +but many matchers, including TRE, implement back references for both +EREs and BREs. +</p> + +<h3>Options</h3> +<a name="options"></a> +<table bgcolor="#e0e0f0" cellpadding="10"> +<tr><td> +<pre> +<i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)? +</pre> +</td></tr> +</table> + +Options allow compile time options to be turned on/off for particular parts of the +regular expression. The options equate to several compile time options specified to +the regcomp API function. If the option is specified in the first section, it is +turned on. If it is specified in the second section (after the <tt>-</tt>), it is +turned off. +<ul> +<li>i - Case insensitive. +<li>n - Forces special handling of the new line character. See the REG_NEWLINE flag in +the <a href="tre-api.html">API Manual</a>. +<li>r - Causes the regex to be matched in a right associative manner rather than the normal +left associative manner. +<li>U - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended. +</ul> +<h2>BRE Syntax</h2> + +<p> +The obsolete basic regexp (BRE) syntax differs from the ERE syntax as +follows: +</p> + +<ul> +<li><tt>|</tt> is an ordinary character, and there is no equivalent +for its functionality. <tt>+</tt>, and <tt>?</tt> are ordinary +characters.</li> +<li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with +<tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li> +<li>The parentheses for nested subexpressions are <tt>\(</tt> and +<tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary +characters.</li> +<li><tt>^</tt> is an ordinary character except at the beginning of the +RE or the beginning of a parenthesized subexpression. Similarly, +<tt>$</tt> is an ordinary character except at the end of the +RE or the end of a parenthesized subexpression.</li> +</ul> |