neils-s / mumps-parser Goto Github PK

View Code? Open in Web Editor NEW

6.0 6.0 3.0 288 KB

A parser of the ANSI '95 standard MUMPS syntax written in ANSI-compliant MUMPS and tested using GT.M

License: GNU General Public License v2.0

M 100.00%

mumps-parser's Introduction

MUMPS-parser

A parser of the ANSI '95 standard MUMPS syntax written in ANSI-compliant MUMPS and tested using GT.M

mumps-parser's People

Contributors

Stargazers

Watchers

Forkers

whitten efuzy g9020

mumps-parser's Issues

Sufficiency of Parsing by this Grammar ?

Neil,
Thank you for putting this code on github.
I am a member of the M Development Committee, and several folks are interested in creating a new M Standard for the 2020s. I am functioning as the chair of the Testing Subcomittee, and thus in charge of the MVTS (M Validation Test Suite).

I am encouraged by the existence of this code.

I would like to use your parser to recognize whether something that says it is static M code, actually passes the EBNF, at least before resolving the metalanguage symbols. Those are a pain, but part of what makes M so powerful.

I would be more comfortable using it if you gave me permission.

I'll end up needing to understand how you created this code, because I actually need a parser for each of the Standards, ie: those from: 1977, 1984, 1990, 1995. There is another Standard, which used to be called the "Millenium" Standard which did not get approval from ANSI and ISO before the MDC went into hiatus.

I wonder what class of grammar this code implements, as the last time I tried to create an LALR(1) grammar, I failed horribly with many shift/reduce conflicts. It's possible it is LALR(1) at least at routine-save-time but I don't know. (It has been too many years since my effort) Did you hand-create it, or did you generate it from some other program ?

Thanks for your feedback,
Dave Whitten
713-870-3834
[email protected]

PS: I'm interested in your current upgrade.
What do you need to do ?
Are you just parsing Standard M, or are you enhancing it to parse COS or some other implementor's extensions?

Allow this to act as a semantic token provider for VS Code

In addition to normal textmate grammars (https://macromates.com/manual/en/language_grammars), VS Code has the ability to use a semantic token provider (https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide) to tokenize text. It would be nice to use this parser as such a token provider. This would allow VS Code to act as a sweet IDE for MUMPS.

Detect left recursion in the grammar

Like all PEG parsers, a left-recursion in the syntax map will cause this parser to fall into an infinite loop. To see why, imagine the following grammar rules:
; A generic expression of digit addition
@Map@("expression") = "options"
@Map@("expression",1) = "token"
@Map@("expression",1,"value")="digit"
@Map@("expression",2) = "token"
@Map@("expression",2,"value") = "sum"

; A sum of 2 expressions
@map@("sum")="subtreeChain"
@map@("sum",1)="token"
@map@("sum",1,"value")="expression"
@map@("sum",2)="literal"
@map@("sum",2,"value")="+"
@map@("sum",3)="token"
@map@("sum",3,"value")="expression"

These rules say that an "expression" is either a single digit (presumably 0-9) or a "sum"; while a "sum" is of the form "expression+expression".
The problem here is that if the parser tries to match a line of input to an "expression", it needs to check if that same line of input is a "sum", which requires that it checks if the line is an "expression" and so forth. This is an infinite recursion.

More formally, a "left-recursive" map is any map that admits a recursive definition without consuming at least one character of the input from the left end of the input string.

It would be nice to have some way to detect left-recursion in a syntax map. That would be a useful feature beyond parsing.

Better handling of ambiguously defined delimited lists and options

Currently, when parsing a delimited list the "maximal munch" strategy will treat as much of the input string as part of the delimited list as it possibly can. This is problematic, because in some grammars (not MUMPS) you could mistakenly consume so much of the input as part of the delimited list that it's not possible to consume the rest of the input.

For example, one could imagine a grammar where spaces are used as a list delimiter, so the following is a valid list stored in the variable x:
x=foo bar feep
If the same grammar allows whitespace to delimit key words like FOR and PRINT, then we have problem parsing something like this:
list=foo bar feep FOR(x in list) PRINT x
Presumably, this code is intended to print out foo, bar, and feep. The parser is going to have problems handling it, though.
When the parser tries to parse the list, it will treat the space between "feep" and "FOR" as a list delimiter, so the keyword FOR will end up as part of the list. This will screw up the rest of the parsing.

I need to make the delimlist parser smarter. It should be able to parse n-many list items, and then see if the rest of the input can be parsed.

Maybe add a NOT predicate

This parser isn't quite a PEG parser because it doesn't allow a NOT predicate. In other words, it isn't possible to say that a token "can not be of some form". When parsing MUMPS, this isn't needed, but it might be useful in some cases. For example, let's say we have a grammar that delimits lists using spaces, and breaks the delimiting when it hits a reserved command keyword like "IF" or "FOR". An example of a line of code in such a language is
LET list=one two three PRINT list
It would be easier to create a grammar array to parse this type of syntax if we could say something like "NOT Keyword".
Another use case might be to say that a number can start with "a digit that is NOT 0". That would be a bit easier than my current strategy of creating a "nonzero digit" token to include the numbers 1-9.

The way a NOT predicate would work is as follows:
For "Token1 NOT Token2", the parser would first try to parse the next input characters as if they were Token2. If that parsing succeeded, then the parser would immediately stop and return a negative number, representing a parse error. Otherwise, the parse return value would be reset to 0 and the parser would try to parse the next input characters as if they were Token1, returning the result of this parsing Token1.

Similarly, we could define an AND predicate. Usually, the AND predicate is de-sugared from the NOT predicate as "AND:=NOT NOT", but that won't quite work for us because of the way the parser returns values. Instead, we could define the AND predicate as follows:
for "Token1 AND Token2", the parser would first try to parse the next input characters as if they were Token2. If that parsing failed, then the parser would immediately stop and return the negative number that represented the error encountered when parsing Token2. Otherwise, the parse return value would be reset to 0 and the parser would try to parse the next input characters as if they were Token1; returning the result of this parsing Token1.

Allowing the AND and NOT predicates makes a few aspects of the grammar easier to encode, but makes it possible to create grammar rules that are not expressed by a railroad track syntax diagram.
As an example, consider the language that consists of strings like the following:
{empty}
abc
aabbcc
aaabbbccc
aaaabbbbcccc
...
a^n b^n c^n (for n>=0)
This language is not the product of any context-free grammar, but with the AND predicates we could build it from the following rules:
TokenA = a+TokenA OR a (strings of the form "aaaa...a")
TokenB = a+TokenB+b OR a+b (strings like "aabb" or "aaaabbbb" with equal numbers of "a" and "b")
TokenBcomplete = TokenB AND (TokenB+c) (strings composed of a TokenB immediately followed by a "c" character)
TokenC = b+TokenC+c OR b+c (strings like "bbcc" or "bbbbcccc" with equal numbers of "b" and "c")
TokenD = TokenA+TokenC AND TokenBcomplete (non-empty strings with equal numbers of "a", "b", and "c")
TokenE = {empty} OR TokenD + {end} (the empty string or a string with equal numbers of "a", "b", and "c")

With these rules, a string like "aaabbcc" would fail to parse as TokenD (or any other token we've given), since the leading string "aaabbc" is not parseable as a case of TokenBcomplete. The string "aabbcc" would parse as follows:
aabbcc
-> TokenA
-> -> a+TokenA
-> -> a
-> TokenC
-> -> b+TokenC+c
-> -> -> b+c

I'm not sure if the work required to augment the parser is justified by the theoretical gains.
I'll have to let this idea season a while before committing to it.

neils-s / mumps-parser Goto Github PK

mumps-parser's Introduction

MUMPS-parser

mumps-parser's People

Contributors

Stargazers

Watchers

Forkers

mumps-parser's Issues

Sufficiency of Parsing by this Grammar ?

Allow this to act as a semantic token provider for VS Code

Detect left recursion in the grammar

Better handling of ambiguously defined delimited lists and options

Maybe add a NOT predicate

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent