 
==== RegExpr: (not just) a C# wrapper for System.Text.RegularExpressions.Regex 
 
The purpose of this module is to allow: 
 
    1.  Defining regular expressions as C# code instead of plain-old-strings; 
 
    2.  Marking elements of regular expressions as tokens, allowing captured text to be accessed 
        and manipulated through token IDs; 
 
    3.  Creating token production rules that specify how to process the captured tokens. 
 
 
== 0. "TL;DR" 
 
  * Regular expressions can be written as C# statements without any additional pre-processing. 
 
  * A token definition within a regular expression allows matched text to be captured. 
 
  * Tokens can include production rules that calculate an output object when matching the token. 
 
  * Only one rule from the list of available rules in a token will be selected during parsing. 
 
  * A rule can define a list of actions to be executed in sequence when that rule is selected. 
 
  * Parser output will include all objects created by production rule actions. 
 
 
== 1. Regular expressions as C# statements 
 
The classes in this module can be instantiated using C# statements to specify regular expressions
whose structure is checked at compile-time, unlike plain-old-strings. Specifying regular expressions
directly in C# can also make them more readable and maintainable.
 
The following class hierarchy provides abstract representations of regular expressions: 
 
    abstract RegExpr . . . . . . . . Base class of the regular expression abstraction 
    ^ 
    | 
    +--+ abstract CharClass  . . . . Match one character of a class of characters 
    |    ^ 
    |    | 
    |    +--+ CharClassLiteral . . . Match one character of a list of characters 
    |    | 
    |    +--+ CharClassRange . . . . Match one character of a range of characters 
    |    | 
    |    +--+ CharClassSet . . . . . Match one character of a set of character classes 
    | 
    +--+ RegExprLiteral  . . . . . . Match a sequence of characters 
    | 
    +--+ RegExprRepeat . . . . . . . Match the same pattern repeatedly 
    | 
    +--+ RegExprSequence . . . . . . Match several patterns in sequence 
    | 
    +--+ RegExprChoice . . . . . . . Match one of several alternative patterns 
    | 
    +--+ RegExprAssert . . . . . . . Assert a pattern but do not consume any characters 
    | 
    +--+ Token . . . . . . . . . . . Capture and process text matched by a pattern 
 
The following syntax can be used to specify regular expressions in C# using RegExpr classes 
(the notation _T_ represents an instance of type T): 
 
    Expression                               Type                Description 
---------------------------------------------------------------------------------------------------- 
    CharWord                                 CharClassLiteral    Word character (\w) 
 
    CharCr                                   CharClassLiteral    Carriage return character (\r) 
 
    CharLf                                   CharClassLiteral    Line feed character (\n) 
 
    CharSpace                                CharClassLiteral    Whitespace character (\s)

    CharNonSpace                             CharClassLiteral    Non-whitespace character (\S = [^\s])
 
    CharVertSpace                            CharClassSet        Vertical space ([\r\n]) 
 
    CharHorizSpace                           CharClassSet        Horizontal space ([^\S\r\n]) 
 
    Char [ _char_ ]                          CharClassLiteral    Literal character class 
 
    Char [ _string_ ]                        CharClassLiteral    Literal character class 
 
    Char [ _char_ , _char_ ]                 CharClassRange      Range character class 
 
    ~ _CharClass_                            CharClassSet        Negated character class 
 
    CharSet [ _CharClass_ + _CharClass_ ]    CharClassSet        Combined character class 
 
    CharSet [ _CharClass_ - _CharClass_ ]    CharClassSet        Character class subtraction 
 
    CharSetRaw [ _string_ ]                  CharClassLiteral    Raw (unescaped) character class 
 
    AnyChar                                  RegExprLiteral      Match any character (.) 
 
    StartOfLine                              RegExprLiteral      Anchor for start of line (^) 
 
    EndOfFile                                RegExprLiteral      Anchor for end of input string ($) 
 
    LineBreak                                RegExprSequence     Match a line break (\r?\n) 
 
    RegX ( _string_ )                        RegExprLiteral      Literal character sequence 
 
    _RegExpr_ .Optional()                    RegExprRepeat       Optional match 
 
    _RegExpr_ .Repeat()                      RegExprRepeat       Match zero or more times 
 
    _RegExpr_ .Repeat ( atLeast: _int_ )     RegExprRepeat       Match at least N times 
 
    _RegExpr_ .Repeat ( atMost: _int_ )      RegExprRepeat       Match at most N times 
 
    _RegExpr_ .Repeat ( _int_ )              RegExprRepeat       Match exactly N times 
 
    _RegExpr_ .Repeat ( _int_ , _int_ )      RegExprRepeat       Match between N and M times 
 
    _RegExpr_ & _RegExpr_                    RegExprSequence     Sequential composition 
 
    _RegExpr_ | _RegExpr_                    RegExprChoice       Alternating composition 
 
    Assert [ _RegExpr_ ]                     RegExprAssert       Positive assertion (look ahead) 
 
    !Assert [ _RegExpr_ ]                    RegExprAssert       Negative assertion (look ahead) 
 
    Assert [ _RegExpr_ ] > _RegExpr_         RegExprSequence     Assert look ahead before match 
 
    _RegExpr_ > Assert [ _RegExpr_ ]         RegExprSequence     Assert look ahead after match 
 
    Assert [ _RegExpr_ ] < _RegExpr_         RegExprSequence     Assert look behind before match 
 
    _RegExpr_ < Assert [ _RegExpr_ ]         RegExprSequence     Assert look behind after match 
 
    !Assert [ _RegExpr_ ] > _RegExpr_        RegExprSequence     Negative look ahead before match 
 
    _RegExpr_ > !Assert [ _RegExpr_ ]        RegExprSequence     Negative look ahead after match 
 
    !Assert [ _RegExpr_ ] < _RegExpr_        RegExprSequence     Negative look behind before match 
 
    _RegExpr_ < !Assert [ _RegExpr_ ]        RegExprSequence     Negative look behind after match 
 
    RegXRaw ( _string_ )                     RegExprLiteral      Raw (unescaped) Regex string 
 
Examples of regular expressions as C# statements: 
 
    RegExpr (C#)                                Regular Expression 
---------------------------------------------------------------------- 
    Char["abc"]                                 [abc] 
 
    Char['a', 'z']                              [a-z] 
 
    ~Char["abc"]                                [^abc] 
 
    CharSet[Char["abc"] + Char['x', 'z']]       [abcx-z] 
 
    CharSet[Char['a', 'z'] - Char["aeiou"]]     [a-z-[aeiou]] 
 
    CharSetRaw["a-z-[aeiou]"]                   [a-z-[aeiou]]
 
    RegX("abc")                                 abc 
 
    RegX(@"\a\b\c")                             \\a\\b\\c 
 
    RegXRaw(@"\S\r\n")                          \S\r\n 
 
    RegX("a").Optional()                        a? 
 
    RegX("a").Repeat()                          a* 
 
    RegX("a").Repeat(atLeast: 1)                a+ 
 
    RegX("a").Repeat(atLeast: 2)                a{2,} 
 
    RegX("a").Repeat(atMost: 3)                 a{0,3}
 
    RegX("a").Repeat(4)                         a{4} 
 
    RegX("a").Repeat(5, 6)                      a{5,6} 
 
    RegX("a") & "xyz"                           axyz 
 
    RegX("a") | "xyz"                           (?:a)|(?:xyz) 
 
    Assert["abc"]                               (?=abc) 
 
    Assert[RegX("abc")] > CharWord.Repeat()     (?=abc)\w* 
 
    !Assert[RegX("abc")] > CharWord.Repeat()    (?!abc)\w*
 
    RegX("abc") > Assert[RegX("xyz")]           abc(?=xyz) 
 
    Assert[RegX("abc")] < RegX("xyz")           (?<=abc)xyz 
 
    CharWord.Repeat() < Assert[RegX("xyz")]     \w*(?<=xyz) 
 
    CharWord.Repeat() < !Assert[RegX("xyz")]    \w*(?<!xyz) 
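
The patterns in the right-hand column can be tried out directly with the .NET Regex class. The
following is a minimal sketch that uses only System.Text.RegularExpressions, not the RegExpr module;
the names PatternCheck, FullMatch and FirstMatch are hypothetical and exist only for illustration.

    using System;
    using System.Text.RegularExpressions;

    static class PatternCheck
    {
        // True if the pattern matches the whole input string.
        public static bool FullMatch(string pattern, string input)
        {
            return Regex.IsMatch(input, "^(?:" + pattern + ")$");
        }

        // Returns the text of the first match in the input, or null if there is none.
        public static string FirstMatch(string pattern, string input)
        {
            Match m = Regex.Match(input, pattern);
            return m.Success ? m.Value : null;
        }

        static void Main()
        {
            // Character class subtraction: lowercase letters except vowels
            Console.WriteLine(FullMatch(@"[a-z-[aeiou]]+", "xyz"));    // True
            Console.WriteLine(FullMatch(@"[a-z-[aeiou]]+", "xyza"));   // False

            // Look-ahead: "abc" only when followed by "xyz"; "xyz" is not consumed
            Console.WriteLine(FirstMatch(@"abc(?=xyz)", "abcxyz"));    // abc

            // Look-behind: "xyz" only when preceded by "abc"
            Console.WriteLine(FirstMatch(@"(?<=abc)xyz", "abcxyz"));   // xyz

            // Negative look-ahead: a word that does not start with "abc"
            Console.WriteLine(FullMatch(@"(?!abc)\w*", "abcdef"));     // False
            Console.WriteLine(FullMatch(@"(?!abc)\w*", "xyzdef"));     // True
        }
    }

Assertions do not consume characters, which is why the look-ahead example returns only "abc".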
 
 
== 2. Tokens 
 
The following statement creates a token based on a RegExpr: 
 
    new Token ( _string_ , _RegExpr_ ) 
 
The string param will be used as an ID to reference the text captured by the RegExpr. 
The whitespace immediately before a token can be automatically skipped. A RegExpr to match the 
whitespace can be provided as a default for all tokens, or given specifically to each token: 
 
    new Token ( _string_ , _RegExpr_ , _RegExpr_ ) 
 
The first RegExpr param specifies the pattern of leading whitespace to be skipped for this token. 
Whitespace skipping can be disabled for specific tokens: 
 
    new Token ( _string_ , SkipWs_Disable , _RegExpr_ ) 
 
The following are examples of token definitions: 
 
    new Token("NUM", Char['0', '9'].Repeat()) 
 
    new Token("WORD", CharSpace.Repeat(), CharWord.Repeat())
 
    new Token("STRING", SkipWs_Disable, (~Char['\"']).Repeat())
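
One way to picture a token in terms of the underlying Regex class is a named capture group preceded
by a pattern for the whitespace to skip. The sketch below is an analogy only: it assumes nothing
about the pattern the module actually generates, and TokenSketch and Capture are hypothetical names.

    using System;
    using System.Text.RegularExpressions;

    static class TokenSketch
    {
        // Hypothetical helper: skips leading text matched by skipWs, then captures the
        // token text in a named group whose name plays the role of the token ID.
        public static string Capture(string id, string skipWs, string pattern, string input)
        {
            var regex = new Regex("^(?:" + skipWs + ")(?<" + id + ">" + pattern + ")");
            Match m = regex.Match(input);
            return m.Success ? m.Groups[id].Value : null;
        }

        static void Main()
        {
            // Leading whitespace is skipped and is not part of the captured text
            Console.WriteLine(Capture("WORD", @"\s*", @"\w+", "   hello world"));   // hello

            // With an empty skip pattern, leading whitespace prevents the match
            Console.WriteLine(Capture("WORD", "", @"\w+", "   hello") == null);     // True
        }
    }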
 
 
== 3. Production rules 
 
By default, a token will output the string that was captured by the specified RegExpr. It is
possible to define production rules for a token, indicating how to instantiate an arbitrary output
object based on content captured by that token or other tokens. This uses the following syntax
(the notation (_T1_, _T2_, ...) => _T_ represents a callback delegate with param types T1, T2, etc.,
and return type T; for a void callback, the return type is given as _void_):
 
    new Token ( _string_ , _RegExpr_) 
    { 
        new Rule < T > ( 
            priority: _int_ , 
            select: (_Token_) => _bool_ , 
            pre: (_Token_) => _bool_ ) 
        { 
            _RuleAction_, 
            ... 
        }, 
        ... 
    } 
 
'T' stands for the output type when the token is matched. In this case, use of the Rule<T> class 
means that the token will produce an object of type T based only on the captured string. More 
complex rules can be specified, which will allow the analysis of syntaxes with recursively delimited 
expressions (e.g. expressions with nested parentheses), and infix, prefix or postfix operators: 
 
    PrefixRule < T , TOperand > . . . . . . . . . . . . Prefix operator 
 
    PostfixRule < TOperand , T >  . . . . . . . . . . . Postfix operator 
 
    InfixRule < TLeftOperand , T , TRightOperand >  . . Infix operator 
 
    LeftDelimiterRule < T > . . . . . . . . . . . . . . Left delimiter (e.g. open parenthesis) 
 
    RightDelimiterRule< TLeftDelim , TExpr , T >  . . . Right delimiter (e.g. close parenthesis)
 
When capturing text, only one of the rules in the token definition will be applied. The conditions 
for token rule selection are specified by rule selector predicates given in the "select:" param. 
If a selector predicate fails, the parser will try to select another rule for the token. 
 
A rule pre-condition can also be specified. It is tested after the rule is selected and just
before the rule is executed. If the pre-condition fails, a parse error is generated.
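
The difference between a failed selector and a failed pre-condition can be sketched in plain C#,
independent of the module. RuleSketch and RuleSelection are hypothetical stand-ins; the predicates
here take the captured string rather than a Token, and two points are assumptions made for the
sketch: that rules are tried in declaration order, and that selecting no rule is an error.

    using System;
    using System.Collections.Generic;

    // Hypothetical stand-in for a production rule: a selector and a pre-condition
    class RuleSketch
    {
        public string Name;
        public Func<string, bool> Select = text => true;
        public Func<string, bool> Pre = text => true;
    }

    static class RuleSelection
    {
        // Returns the name of the single rule applied to the captured text.
        // FormatException stands in for a parse error.
        public static string Apply(IEnumerable<RuleSketch> rules, string captured)
        {
            foreach (RuleSketch rule in rules)
            {
                if (!rule.Select(captured))
                    continue;   // selector failed: try another rule
                if (!rule.Pre(captured))
                    throw new FormatException("pre-condition failed: " + rule.Name);
                return rule.Name;   // only one rule is applied
            }
            throw new FormatException("no rule selected");   // assumption, see above
        }

        public static List<RuleSketch> SampleRules()
        {
            return new List<RuleSketch>
            {
                new RuleSketch
                {
                    Name = "hex",
                    Select = t => t.StartsWith("0x"),
                    Pre = t => t.Length > 2
                },
                new RuleSketch { Name = "dec" }
            };
        }

        static void Main()
        {
            Console.WriteLine(Apply(SampleRules(), "0x1F"));   // hex
            Console.WriteLine(Apply(SampleRules(), "42"));     // dec
            // Apply(SampleRules(), "0x") throws: "hex" is selected, but its pre-condition fails
        }
    }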
 
To obtain the actual production, i.e. instantiation of an output object, a production rule needs 
to provide one or more actions that describe how to create or manipulate the production object. 
There are 4 types of possible actions that can be specified inside a production rule: 
 
    Capture   < T >       ( _(string)_ => _T_    )    New production from token capture 
 
    Create    < T , ... > ( _(...)_    => _T_    )    New production from operand productions 
 
    Transform < T , ... > ( _(T,...)_  => _T_    )    New production from current value and operands 
 
    Update    < T , ... > ( _(T,...)_  => _void_ )    Update current production value 
 
An action may also specify a condition predicate for its params; if the predicate fails, the action
is not taken and rule execution continues with the next action in the list.

A non-producing action 'Error' can also be specified within a rule to provide additional syntax
verification. When an Error predicate is satisfied, the action produces a string containing a
parse error message, and parsing stops at that point.
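
The sequencing of actions can be sketched in plain C#, independent of the module. ActionSketch and
ActionSequence are hypothetical stand-ins, and the sketch is simplified: the production is an int,
and each condition predicate tests the current production value rather than operand params.

    using System;
    using System.Collections.Generic;

    // Hypothetical stand-in for a rule action operating on an int production
    class ActionSketch
    {
        public Func<int, bool> Condition = value => true;   // optional condition predicate
        public Func<int, int> Apply;                        // computes the next production value
    }

    static class ActionSequence
    {
        // Runs a capture step followed by the remaining actions, in list order.
        // An action whose condition fails is skipped.
        public static int Run(
            string captured, Func<string, int> capture, IEnumerable<ActionSketch> actions)
        {
            int production = capture(captured);   // like Capture: new production from text
            foreach (ActionSketch action in actions)
            {
                if (!action.Condition(production))
                    continue;   // condition failed: go on to the next action
                production = action.Apply(production);   // like Transform: value from current value
            }
            return production;
        }

        public static List<ActionSketch> SampleActions()
        {
            return new List<ActionSketch>
            {
                new ActionSketch { Condition = v => v < 0, Apply = v => -v },   // negatives only
                new ActionSketch { Apply = v => v * 2 }                         // always applied
            };
        }

        static void Main()
        {
            Console.WriteLine(Run("-21", s => int.Parse(s), SampleActions()));   // 42
            Console.WriteLine(Run("4", s => int.Parse(s), SampleActions()));     // 8
        }
    }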
 
 
== 4. Examples 
 
The following token will match a decimal constant and output its int value: 
 
    new Token("NUM", Char['0', '9'].Repeat()) 
    { 
        new Rule<int> 
        { 
            Capture(value => int.Parse(value)) 
        } 
    } 
 
The following token will match the operator '+' in the context of an int expression: 
 
    new Token("PLUS", "+") 
    { 
        new PrefixRule<int, int>( 
            priority: PRIORITY_PREFIX, 
            select: t => (t.IsFirst || t.LookBehind().First().Is("LEFT_PAR")) 
                    && t.LookAhead().First().Is("NUM", "LEFT_PAR")) 
        { 
            Create((int x) => +x) 
        }, 
 
        new InfixRule<int, int, int>(priority: PRIORITY_INFIX) 
        { 
            Create((int x, int y) => x + y) 
        } 
    }; 
 
More detailed examples of the use of the RegExpr module can be found in the provided auto-tests 
(project Test_QtVsTools.RegExpr).