==== RegExpr: (not just) a C# wrapper for System.Text.RegularExpressions.Regex

The purpose of this module is to allow:

1. Defining regular expressions as C# code instead of plain-old-strings;

2. Marking elements of regular expressions as tokens, allowing captured text to be accessed
   and manipulated through token IDs;

3. Creating token production rules that specify how to process the captured tokens.


== 0. "TL;DR"

* Regular expressions can be written as C# statements without any additional pre-processing.
* A token definition within a regular expression allows matched text to be captured.
* Tokens can include production rules that calculate an output object when matching the token.
* Only one rule from the list of available rules in a token will be selected during parsing.
* A rule can define a list of actions to be executed in sequence when that rule is selected.
* Parser output will include all objects created by production rule actions.


== 1. Regular expressions as C# statements

The classes in this module can be instantiated using C# statements to specify regular expressions
that are checked at compile time, unlike plain-old-strings. Specifying regular expressions directly
in C# can also make them more readable and maintainable.
|
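One way such an API can be built is with operator overloading and implicit conversions. The
following is a minimal, self-contained sketch of that idea only. The classes Rx, Lit, Seq, Alt
and Rep are hypothetical stand-ins written for this illustration; they are not the module's
actual classes, and the real implementation may work differently.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical miniature of the approach; NOT the actual RegExpr classes.
public abstract class Rx
{
    // Render this node as a System.Text.RegularExpressions pattern string.
    public abstract string Pattern { get; }

    // A plain string is treated as a literal character sequence.
    public static implicit operator Rx(string text) => new Lit(text);

    // '&' composes in sequence, '|' composes alternatives.
    public static Rx operator &(Rx left, Rx right) => new Seq(left, right);
    public static Rx operator |(Rx left, Rx right) => new Alt(left, right);

    // Zero or more repetitions of this pattern.
    public Rx Repeat() => new Rep(this);
}

public sealed class Lit : Rx
{
    private readonly string text;
    public Lit(string text) { this.text = text; }
    // Escape the text so that it only matches itself.
    public override string Pattern => Regex.Escape(text);
}

public sealed class Seq : Rx
{
    private readonly Rx left, right;
    public Seq(Rx left, Rx right) { this.left = left; this.right = right; }
    public override string Pattern => left.Pattern + right.Pattern;
}

public sealed class Alt : Rx
{
    private readonly Rx left, right;
    public Alt(Rx left, Rx right) { this.left = left; this.right = right; }
    public override string Pattern => "(?:" + left.Pattern + ")|(?:" + right.Pattern + ")";
}

public sealed class Rep : Rx
{
    private readonly Rx inner;
    public Rep(Rx inner) { this.inner = inner; }
    public override string Pattern => "(?:" + inner.Pattern + ")*";
}

public static class RxDemo
{
    public static void Main()
    {
        Rx sequence = new Lit("a") & "xyz";   // the string converts implicitly
        Rx choice = new Lit("a") | "xyz";

        Console.WriteLine(sequence.Pattern);  // axyz
        Console.WriteLine(choice.Pattern);    // (?:a)|(?:xyz)
    }
}
```

Because the pattern is assembled from typed objects, a misspelled method or an operator applied
to the wrong operand type is rejected by the C# compiler instead of surfacing at run time.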
|
The following class hierarchy provides abstract representations of regular expressions:

    abstract RegExpr . . . . . . . . . Base class of the regular expression abstraction
        ^
        |
        +--+ abstract CharClass  . . . Match one character of a class of characters
        |    ^
        |    |
        |    +--+ CharClassLiteral . . Match one character of a list of characters
        |    |
        |    +--+ CharClassRange . . . Match one character of a range of characters
        |    |
        |    +--+ CharClassSet . . . . Match one character of a set of character classes
        |
        +--+ RegExprLiteral  . . . . . Match a sequence of characters
        |
        +--+ RegExprRepeat . . . . . . Match the same pattern repeatedly
        |
        +--+ RegExprSequence . . . . . Match several patterns in sequence
        |
        +--+ RegExprChoice . . . . . . Match one of several alternative patterns
        |
        +--+ RegExprAssert . . . . . . Assert a pattern but do not consume any characters
        |
        +--+ Token . . . . . . . . . . Capture and process text matched by a pattern
|
|
The following syntax can be used to specify regular expressions in C# using RegExpr classes
(the notation _T_ represents an instance of type T):

    Expression                             Type              Description
    ----------------------------------------------------------------------------------------------
    CharWord                               CharClassLiteral  Word character (\w)
    CharCr                                 CharClassLiteral  Carriage return character (\r)
    CharLf                                 CharClassLiteral  Line feed character (\n)
    CharSpace                              CharClassLiteral  Whitespace character (\s)
    CharNonSpace                           CharClassLiteral  Non-whitespace character (\S = [^\s])
    CharVertSpace                          CharClassSet      Vertical space ([\r\n])
    CharHorizSpace                         CharClassSet      Horizontal space ([^\S\r\n])
    Char [ _char_ ]                        CharClassLiteral  Literal character class
    Char [ _string_ ]                      CharClassLiteral  Literal character class
    Char [ _char_ , _char_ ]               CharClassRange    Range character class
    ~ _CharClass_                          CharClassSet      Negated character class
    CharSet [ _CharClass_ + _CharClass_ ]  CharClassSet      Combined character class
    CharSet [ _CharClass_ - _CharClass_ ]  CharClassSet      Character class subtraction
    CharSetRaw [ _string_ ]                CharClassLiteral  Raw (unescaped) character class
    AnyChar                                RegExprLiteral    Match any character (.)
    StartOfLine                            RegExprLiteral    Anchor for start of line (^)
    EndOfFile                              RegExprLiteral    Anchor for end of input string ($)
    LineBreak                              RegExprSequence   Match a line break (\r?\n)
    RegX ( _string_ )                      RegExprLiteral    Literal character sequence
    _RegExpr_ .Optional()                  RegExprRepeat     Optional match
    _RegExpr_ .Repeat()                    RegExprRepeat     Match zero or more times
    _RegExpr_ .Repeat ( atLeast: _int_ )   RegExprRepeat     Match at least N times
    _RegExpr_ .Repeat ( atMost: _int_ )    RegExprRepeat     Match at most N times
    _RegExpr_ .Repeat ( _int_ )            RegExprRepeat     Match exactly N times
    _RegExpr_ .Repeat ( _int_ , _int_ )    RegExprRepeat     Match between N and M times
    _RegExpr_ & _RegExpr_                  RegExprSequence   Sequential composition
    _RegExpr_ | _RegExpr_                  RegExprChoice     Alternating composition
    Assert [ _RegExpr_ ]                   RegExprAssert     Positive assertion (look ahead)
    !Assert [ _RegExpr_ ]                  RegExprAssert     Negative assertion (look ahead)
    Assert [ _RegExpr_ ] > _RegExpr_       RegExprSequence   Assert look ahead before match
    _RegExpr_ > Assert [ _RegExpr_ ]       RegExprSequence   Assert look ahead after match
    Assert [ _RegExpr_ ] < _RegExpr_       RegExprSequence   Assert look behind before match
    _RegExpr_ < Assert [ _RegExpr_ ]       RegExprSequence   Assert look behind after match
    !Assert [ _RegExpr_ ] > _RegExpr_      RegExprSequence   Negative look ahead before match
    _RegExpr_ > !Assert [ _RegExpr_ ]      RegExprSequence   Negative look ahead after match
    !Assert [ _RegExpr_ ] < _RegExpr_      RegExprSequence   Negative look behind before match
    _RegExpr_ < !Assert [ _RegExpr_ ]      RegExprSequence   Negative look behind after match
    RegXRaw ( _string_ )                   RegExprLiteral    Raw (unescaped) Regex string
|
|
Examples of regular expressions as C# statements:

    RegExpr (C#)                                Regular Expression
    --------------------------------------------------------------
    Char["abc"]                                 [abc]
    Char['a', 'z']                              [a-z]
    ~Char["abc"]                                [^abc]
    CharSet[Char["abc"] + Char['x', 'z']]       [abcx-z]
    CharSet[Char['a', 'z'] - Char["aeiou"]]     [a-z-[aeiou]]
    CharSetRaw["a-z-[aeiou]"]                   [a-z-[aeiou]]
    RegX("abc")                                 abc
    RegX(@"\a\b\c")                             \\a\\b\\c
    RegXRaw(@"\S\r\n")                          \S\r\n
    RegX("a").Optional()                        a?
    RegX("a").Repeat()                          a*
    RegX("a").Repeat(atLeast: 1)                a+
    RegX("a").Repeat(atLeast: 2)                a{2,}
    RegX("a").Repeat(atMost: 3)                 a{0,3}
    RegX("a").Repeat(4)                         a{4}
    RegX("a").Repeat(5, 6)                      a{5,6}
    RegX("a") & "xyz"                           axyz
    RegX("a") | "xyz"                           (?:a)|(?:xyz)
    Assert["abc"]                               (?=abc)
    Assert[RegX("abc")] > CharWord.Repeat()     (?=abc)\w*
    !Assert[RegX("abc")] > CharWord.Repeat()    (?!abc)\w*
    RegX("abc") > Assert[RegX("xyz")]           abc(?=xyz)
    Assert[RegX("abc")] < RegX("xyz")           (?<=abc)xyz
    CharWord.Repeat() < Assert[RegX("xyz")]     \w*(?<=xyz)
    CharWord.Repeat() < !Assert[RegX("xyz")]    \w*(?<!xyz)
|
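The regular expressions in the right-hand column are ordinary .NET patterns, so they can be
checked directly against System.Text.RegularExpressions.Regex. The sketch below does this for a
few rows using only the standard Regex class; the PatternCheck class, the First helper and the
sample input strings are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Return the text of the first match of 'pattern' in 'input' ("" if none).
    public static string First(string input, string pattern)
        => Regex.Match(input, pattern).Value;

    public static void Main()
    {
        // Character class subtraction: lower-case letters except vowels.
        Console.WriteLine(Regex.IsMatch("b", "^[a-z-[aeiou]]$"));    // True
        Console.WriteLine(Regex.IsMatch("e", "^[a-z-[aeiou]]$"));    // False

        // Look-ahead assertions test the input without consuming it.
        Console.WriteLine(First("abcdef", @"(?=abc)\w*"));           // abcdef
        Console.WriteLine(First("abcxyz", @"abc(?=xyz)"));           // abc

        // Look-behind assertion placed before the match.
        Console.WriteLine(First("abcxyz", @"(?<=abc)xyz"));          // xyz

        // Escaped literal: each backslash in the input is matched literally.
        Console.WriteLine(Regex.IsMatch(@"\a\b\c", @"^\\a\\b\\c$")); // True
    }
}
```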
|
|
== 2. Tokens

The following statement creates a token based on a RegExpr:

    new Token ( _string_ , _RegExpr_ )

The string param will be used as an ID to reference the text captured by the RegExpr.
The whitespace immediately before a token can be automatically skipped. A RegExpr to match the
whitespace can be provided as a default for all tokens, or given specifically to each token:

    new Token ( _string_ , _RegExpr_ , _RegExpr_ )

The first RegExpr param specifies the pattern of leading whitespace to be skipped for this token.
Whitespace skipping can be disabled for specific tokens:

    new Token ( _string_ , SkipWs_Disable , _RegExpr_ )

The following are examples of token definitions:

    new Token("NUM", Char['0', '9'].Repeat())

    new Token("WORD", CharSpace.Repeat(), CharWord.Repeat())

    new Token("STRING", SkipWs_Disable, (~Char['\"']).Repeat())
|
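For readers more familiar with plain Regex, a loosely similar effect can be obtained with a
named group preceded by a whitespace pattern. The sketch below is only an analogy: it assumes
that a token ID plays a role comparable to a group name, and it does not show how Token is
implemented. The NamedGroupAnalogy class and the sample input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupAnalogy
{
    // Skip leading whitespace, then capture a run of word characters as "WORD".
    private static readonly Regex Word = new Regex(@"\s*(?<WORD>\w+)");

    // Return the text captured under the ID "WORD" ("" if nothing matched).
    public static string CaptureWord(string input)
    {
        Match m = Word.Match(input);
        return m.Success ? m.Groups["WORD"].Value : "";
    }

    public static void Main()
    {
        // The three leading spaces are skipped and are not part of the capture.
        Console.WriteLine(CaptureWord("   hello world")); // hello
    }
}
```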
|
|
== 3. Production rules

By default, a token will output the string that was captured by the specified RegExpr. It is
possible to define production rules for a token, indicating how to instantiate an arbitrary output
object based on content captured by that token or other tokens. This uses the following syntax
(the notation (_T1_, _T2_, ...) => _T_ represents a callback delegate with param types T1, T2, etc.,
and return type T; in case of a void callback, the return type is given as _void_ ):

    new Token ( _string_ , _RegExpr_ )
    {
        new Rule < T > (
            priority: _int_ ,
            select: (_Token_) => _bool_ ,
            pre: (_Token_) => _bool_ )
        {
            _RuleAction_,
            ...
        },
        ...
    }

'T' stands for the output type when the token is matched. In this case, use of the Rule<T> class
means that the token will produce an object of type T based only on the captured string. More
complex rules can be specified, which will allow the analysis of syntaxes with recursively delimited
expressions (e.g. expressions with nested parentheses), and infix, prefix or postfix operators:

    PrefixRule < T , TOperand >  . . . . . . . . . . . Prefix operator
    PostfixRule < TOperand , T > . . . . . . . . . . . Postfix operator
    InfixRule < TLeftOperand , T , TRightOperand > . . Infix operator
    LeftDelimiterRule < T >  . . . . . . . . . . . . . Left delimiter (e.g. open parenthesis)
    RightDelimiterRule < TLeftDelim , TExpr , T >  . . Right delimiter (e.g. close parenthesis)

When capturing text, only one of the rules in the token definition will be applied. The conditions
for token rule selection are specified by rule selector predicates given in the "select:" param.
If a selector predicate fails, the parser will try to select another rule for the token.

A rule pre-condition can also be specified. This will be tested after the rule is selected and
just before it is executed. If the pre-condition fails, a parse error is generated.
|
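The order of evaluation described above (selector first, then pre-condition, then production)
can be pictured with a small stand-alone model. This is a sketch under stated assumptions, not
the module's code: RuleSketch and RuleSelection are hypothetical types, rules are simply tried
in list order (the role of "priority:" is not modelled), and throwing FormatException when no
rule is selected is an assumption made for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a production rule; NOT the module's Rule<T> class.
public sealed class RuleSketch
{
    public Func<string, bool> Select = _ => true;   // rule selector predicate
    public Func<string, bool> Pre = _ => true;      // rule pre-condition
    public Func<string, object> Produce = s => s;   // default: captured string
}

public static class RuleSelection
{
    // Apply at most one rule to the captured text and return its production.
    public static object Apply(string captured, IEnumerable<RuleSketch> rules)
    {
        foreach (RuleSketch rule in rules) {
            if (!rule.Select(captured))
                continue;                 // selector failed: try another rule
            if (!rule.Pre(captured))      // selected, but pre-condition failed
                throw new FormatException("Parse error at '" + captured + "'");
            return rule.Produce(captured);
        }
        // Assumption of this sketch: no selectable rule is also an error.
        throw new FormatException("No rule selected for '" + captured + "'");
    }

    public static void Main()
    {
        var rules = new List<RuleSketch> {
            new RuleSketch {
                Select = s => s.StartsWith("0x"),
                Pre = s => s.Length > 2,
                Produce = s => Convert.ToInt32(s.Substring(2), 16)
            },
            new RuleSketch { Produce = s => int.Parse(s) }
        };
        Console.WriteLine(Apply("0x1F", rules)); // 31
        Console.WriteLine(Apply("42", rules));   // 42
    }
}
```

In this model the input "0x" selects the first rule but fails its pre-condition, so it is
reported as an error rather than falling through to the second rule.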
|
To obtain the actual production, i.e. instantiation of an output object, a production rule needs
to provide one or more actions that describe how to create or manipulate the production object.
There are 4 types of possible actions that can be specified inside a production rule:

    Capture < T > ( _(string)_ => _T_ )           New production from token capture
    Create < T , ... > ( _(...)_ => _T_ )         New production from operand productions
    Transform < T , ... > ( _(T,...)_ => _T_ )    New production from current value and operands
    Update < T , ... > ( _(T,...)_ => _void_ )    Update current production value

An action may also specify a condition predicate for its params; if the predicate fails, the action
will not be taken and rule execution will continue with the next action in the list.

A non-producing action 'Error' can also be specified within a rule to provide additional syntax
verification. When an Error predicate is satisfied, this action will produce a string corresponding
to a parse error message; parsing stops at this point.
|
|
|
== 4. Examples

The following token will match a decimal constant and output its int value:

    new Token("NUM", Char['0', '9'].Repeat())
    {
        new Rule<int>
        {
            Capture(value => int.Parse(value))
        }
    }

The following token will match the operator '+' in the context of an int expression:

    new Token("PLUS", "+")
    {
        new PrefixRule<int, int>(
            priority: PRIORITY_PREFIX,
            select: t => (t.IsFirst || t.LookBehind().First().Is("LEFT_PAR"))
                      && t.LookAhead().First().Is("NUM", "LEFT_PAR"))
        {
            Create((int x) => +x)
        },

        new InfixRule<int, int, int>(priority: PRIORITY_INFIX)
        {
            Create((int x, int y) => x + y)
        }
    };

More detailed examples of the use of the RegExpr module can be found in the provided auto-tests
(project Test_QtVsTools.RegExpr).