This parser isn't quite a PEG parser because it doesn't allow a NOT predicate. In other words, it isn't possible to say that a token "can not be of some form". When parsing MUMPS, this isn't needed, but it might be useful in some cases. For example, let's say we have a grammar that delimits lists using spaces, and breaks the delimiting when it hits a reserved command keyword like "IF" or "FOR". An example of a line of code in such a language is
LET list=one two three PRINT list
It would be easier to create a grammar array to parse this type of syntax if we could say something like "NOT Keyword".
Another use case might be to say that a number can start with "a digit that is NOT 0". That would be a bit easier than my current strategy of creating a "nonzero digit" token to include the numbers 1-9.
The way a NOT predicate would work is as follows:
For "Token1 NOT Token2", the parser would first try to parse the next input characters as if they were Token2. If that parsing succeeded, then the parser would immediately stop and return a negative number, representing a parse error. Otherwise, the parse return value would be reset to 0 and the parser would try to parse the next input characters as if they were Token1, returning the result of this parsing Token1.
Similarly, we could define an AND predicate. Usually, the AND predicate is de-sugared from the NOT predicate as "AND:=NOT NOT", but that won't quite work for us because of the way the parser returns values. Instead, we could define the AND predicate as follows:
for "Token1 AND Token2", the parser would first try to parse the next input characters as if they were Token2. If that parsing failed, then the parser would immediately stop and return the negative number that represented the error encountered when parsing Token2. Otherwise, the parse return value would be reset to 0 and the parser would try to parse the next input characters as if they were Token1; returning the result of this parsing Token1.
Allowing the AND and NOT predicates makes a few aspects of the grammar easier to encode, but makes it possible to create grammar rules that are not expressed by a railroad track syntax diagram.
As an example, consider the language that consists of strings like the following:
{empty}
abc
aabbcc
aaabbbccc
aaaabbbbcccc
...
a^n b^n c^n (for n>=0)
This language is not the product of any context-free grammar, but with the AND predicates we could build it from the following rules:
TokenA = a+TokenA OR a (strings of the form "aaaa...a")
TokenB = a+TokenB+b OR a+b (strings like "aabb" or "aaaabbbb" with equal numbers of "a" and "b")
TokenBcomplete = TokenB AND (TokenB+c) (strings composed of a TokenB immediately followed by a "c" character)
TokenC = b+TokenC+c OR b+c (strings like "bbcc" or "bbbbcccc" with equal numbers of "b" and "c")
TokenD = TokenA+TokenC AND TokenBcomplete (non-empty strings with equal numbers of "a", "b", and "c")
TokenE = {empty} OR TokenD + {end} (the empty string or a string with equal numbers of "a", "b", and "c")
With these rules, a string like "aaabbcc" would fail to parse as TokenD (or any other token we've given), since the leading string "aaabbc" is not parseable as a case of TokenBcomplete. The string "aabbcc" would parse as follows:
aabbcc
-> TokenA
-> -> a+TokenA
-> -> a
-> TokenC
-> -> b+TokenC+c
-> -> -> b+c
I'm not sure if the work required to augment the parser is justified by the theoretical gains.
I'll have to let this idea season a while before committing to it.