Git Product home page Git Product logo

mumps-parser's Introduction

MUMPS-parser

A parser of the ANSI '95 standard MUMPS syntax written in ANSI-compliant MUMPS and tested using GT.M

mumps-parser's People

Contributors

neils-s avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

Forkers

whitten efuzy g9020

mumps-parser's Issues

Sufficiency of Parsing by this Grammar ?

Neil,
Thank you for putting this code on github.
I am a member of the M Development Committee, and several folks are interested in creating a new M Standard for the 2020s. I am functioning as the chair of the Testing Subcomittee, and thus in charge of the MVTS (M Validation Test Suite).

I am encouraged by the existence of this code.

I would like to use your parser to recognize whether something that says it is static M code, actually passes the EBNF, at least before resolving the metalanguage symbols. Those are a pain, but part of what makes M so powerful.

I would be more comfortable using it if you gave me permission.

I'll end up needing to understand how you created this code, because I actually need a parser for each of the Standards, ie: those from: 1977, 1984, 1990, 1995. There is another Standard, which used to be called the "Millenium" Standard which did not get approval from ANSI and ISO before the MDC went into hiatus.

I wonder what class of grammar this code implements, as the last time I tried to create an LALR(1) grammar, I failed horribly with many shift/reduce conflicts. It's possible it is LALR(1) at least at routine-save-time but I don't know. (It has been too many years since my effort) Did you hand-create it, or did you generate it from some other program ?

Thanks for your feedback,
Dave Whitten
713-870-3834
[email protected]

PS: I'm interested in your current upgrade.
What do you need to do ?
Are you just parsing Standard M, or are you enhancing it to parse COS or some other implementor's extensions?

Detect left recursion in the grammar

Like all PEG parsers, a left-recursion in the syntax map will cause this parser to fall into an infinite loop. To see why, imagine the following grammar rules:
; A generic expression of digit addition
@Map@("expression") = "options"
@Map@("expression",1) = "token"
@Map@("expression",1,"value")="digit"
@Map@("expression",2) = "token"
@Map@("expression",2,"value") = "sum"

; A sum of 2 expressions
@map@("sum")="subtreeChain"
@map@("sum",1)="token"
@map@("sum",1,"value")="expression"
@map@("sum",2)="literal"
@map@("sum",2,"value")="+"
@map@("sum",3)="token"
@map@("sum",3,"value")="expression"

These rules say that an "expression" is either a single digit (presumably 0-9) or a "sum"; while a "sum" is of the form "expression+expression".
The problem here is that if the parser tries to match a line of input to an "expression", it needs to check if that same line of input is a "sum", which requires that it checks if the line is an "expression" and so forth. This is an infinite recursion.

More formally, a "left-recursive" map is any map that admits a recursive definition without consuming at least one character of the input from the left end of the input string.

It would be nice to have some way to detect left-recursion in a syntax map. That would be a useful feature beyond parsing.

Better handling of ambiguously defined delimited lists and options

Currently, when parsing a delimited list the "maximal munch" strategy will treat as much of the input string as part of the delimited list as it possibly can. This is problematic, because in some grammars (not MUMPS) you could mistakenly consume so much of the input as part of the delimited list that it's not possible to consume the rest of the input.

For example, one could imagine a grammar where spaces are used as a list delimiter, so the following is a valid list stored in the variable x:
x=foo bar feep
If the same grammar allows whitespace to delimit key words like FOR and PRINT, then we have problem parsing something like this:
list=foo bar feep FOR(x in list) PRINT x
Presumably, this code is intended to print out foo, bar, and feep. The parser is going to have problems handling it, though.
When the parser tries to parse the list, it will treat the space between "feep" and "FOR" as a list delimiter, so the keyword FOR will end up as part of the list. This will screw up the rest of the parsing.

I need to make the delimlist parser smarter. It should be able to parse n-many list items, and then see if the rest of the input can be parsed.

Maybe add a NOT predicate

This parser isn't quite a PEG parser because it doesn't allow a NOT predicate. In other words, it isn't possible to say that a token "can not be of some form". When parsing MUMPS, this isn't needed, but it might be useful in some cases. For example, let's say we have a grammar that delimits lists using spaces, and breaks the delimiting when it hits a reserved command keyword like "IF" or "FOR". An example of a line of code in such a language is
LET list=one two three PRINT list
It would be easier to create a grammar array to parse this type of syntax if we could say something like "NOT Keyword".
Another use case might be to say that a number can start with "a digit that is NOT 0". That would be a bit easier than my current strategy of creating a "nonzero digit" token to include the numbers 1-9.

The way a NOT predicate would work is as follows:
For "Token1 NOT Token2", the parser would first try to parse the next input characters as if they were Token2. If that parsing succeeded, then the parser would immediately stop and return a negative number, representing a parse error. Otherwise, the parse return value would be reset to 0 and the parser would try to parse the next input characters as if they were Token1, returning the result of this parsing Token1.

Similarly, we could define an AND predicate. Usually, the AND predicate is de-sugared from the NOT predicate as "AND:=NOT NOT", but that won't quite work for us because of the way the parser returns values. Instead, we could define the AND predicate as follows:
for "Token1 AND Token2", the parser would first try to parse the next input characters as if they were Token2. If that parsing failed, then the parser would immediately stop and return the negative number that represented the error encountered when parsing Token2. Otherwise, the parse return value would be reset to 0 and the parser would try to parse the next input characters as if they were Token1; returning the result of this parsing Token1.

Allowing the AND and NOT predicates makes a few aspects of the grammar easier to encode, but makes it possible to create grammar rules that are not expressed by a railroad track syntax diagram.
As an example, consider the language that consists of strings like the following:
{empty}
abc
aabbcc
aaabbbccc
aaaabbbbcccc
...
a^n b^n c^n (for n>=0)
This language is not the product of any context-free grammar, but with the AND predicates we could build it from the following rules:
TokenA = a+TokenA OR a (strings of the form "aaaa...a")
TokenB = a+TokenB+b OR a+b (strings like "aabb" or "aaaabbbb" with equal numbers of "a" and "b")
TokenBcomplete = TokenB AND (TokenB+c) (strings composed of a TokenB immediately followed by a "c" character)
TokenC = b+TokenC+c OR b+c (strings like "bbcc" or "bbbbcccc" with equal numbers of "b" and "c")
TokenD = TokenA+TokenC AND TokenBcomplete (non-empty strings with equal numbers of "a", "b", and "c")
TokenE = {empty} OR TokenD + {end} (the empty string or a string with equal numbers of "a", "b", and "c")

With these rules, a string like "aaabbcc" would fail to parse as TokenD (or any other token we've given), since the leading string "aaabbc" is not parseable as a case of TokenBcomplete. The string "aabbcc" would parse as follows:
aabbcc
-> TokenA
-> -> a+TokenA
-> -> a
-> TokenC
-> -> b+TokenC+c
-> -> -> b+c

I'm not sure if the work required to augment the parser is justified by the theoretical gains.
I'll have to let this idea season a while before committing to it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.