Retsac


Warning

This project is still in an early stage of development; the API may change frequently.

Text lexer and parser. Compiler frontend framework.

This can be used to quickly prototype your own programming language compiler/translator frontend, or to parse your domain-specific language.

Try it online in the playground.

Installation

yarn add retsac

Features

  • The Lexer yields tokens from the input text string.
    • Regex support. See the examples below.
    • Built-in utility functions.
      • JavaScript's string literal, numeric literal, integer literal, identifier, etc.
      • JSON's string literal and numeric literal.
    • Support for custom functions.
  • The Parser works with the lexer and produces an AST (Abstract Syntax Tree).
    • ELR (Expectational LR) parser.
      • Meta characters like +*? when defining a grammar rule.
      • Conflict detection, with automatic conflict resolution where possible.
      • Query child nodes with $('name') instead of children[index].
      • Top-down traversal of the AST.
      • Bottom-up data reduction.
      • Expect the lexer to yield a specific token kind and/or content.
      • Try to re-lex the input if parsing fails.
      • DFA serialization & hydration to accelerate future builds.
    • Serializable AST object to interoperate with other tools (e.g. compiler backend libs like LLVM).
  • Strict type checking with TypeScript.
    • This is amazing; try it out for yourself.

Examples

In this example, we use AdvancedBuilder to define grammar rules with +*?, define top-down traversers using traverser, and query nodes in grammar rules using $ and $$.

All conflicts are auto resolved.

import { ELR, Lexer } from "retsac";

const lexer = new Lexer.Builder()
  .ignore(Lexer.whitespaces()) // ignore blank characters
  .define({
    // built-in support for JSON
    string: Lexer.json.stringLiteral(),
    number: Lexer.json.numericLiteral(),
  })
  .define(Lexer.wordKind("true", "false", "null")) // token's kind name equals the literal value
  .anonymous(Lexer.exact(..."[]{},:")) // single char borders without a kind name
  .build();

export const builder = new ELR.AdvancedBuilder({ lexer })
  .data<unknown>()
  .define(
    { value: `string | number | true | false | null` },
    // eval the only child's text to get the value
    (d) => d.traverser(({ children }) => eval(children[0].text!)),
  )
  .define(
    { value: `object | array` },
    // call the only child's traverse method to get the object/array value
    (d) => d.traverser(({ children }) => children[0].traverse()),
  )
  .define(
    // `?` for zero or one, `*` for zero or more, use `()` to group
    // quote literal values with `'` or `"`
    { array: `'[' (value (',' value)*)? ']'` },
    // use `$$` to select all children with the given name
    // traverse all values in the array and return the result as an array
    (d) => d.traverser(({ $$ }) => $$(`value`).map((v) => v.traverse())),
  )
  .define(
    // use `@` to rename a grammar
    { object_item: `string@key ':' value` },
    (d) =>
      // return an object
      // use `$` to select the first child with the given name
      d.traverser(({ $ }) => ({
        // eval the key's value, then traverse child to get the entry's value
        [eval($(`key`)!.text!)]: $(`value`)!.traverse(),
      })),
  )
  .define({ object: `'{' (object_item (',' object_item)*)? '}'` }, (d) =>
    d.traverser(({ $$ }) =>
      // every object_item's traverse result is an object, we need to merge them.
      Object.assign({}, ...$$(`object_item`).map((item) => item.traverse())),
    ),
  );
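
A minimal usage sketch (hedged: the entry build option and the parseAll/traverse calls are assumptions about the builder API, based on the shapes shown elsewhere in this README and the playground):

// assumption: build() accepts the entry NT and returns the parser
const { parser } = builder.build({ entry: "value" });
const res = parser.parseAll(`{ "hello": ["world", 123] }`);
if (res.accept) {
  // traverse the AST top-down to evaluate the JSON value
  console.log(res.buffer[0].traverse()); // => { hello: [ 'world', 123 ] }
}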

In this example, we use reducer to define bottom-up data reducers, so we can get the result as soon as the AST is built.

These grammar rules introduce conflicts; we use the high-level resolver API priority to resolve them.

import { ELR, Lexer } from "retsac";

const lexer = new Lexer.Builder()
  .ignore(Lexer.whitespaces()) // ignore blank characters
  .define({ number: /[0-9]+(?:\.[0-9]+)?/ })
  .anonymous(Lexer.exact(..."+-*/()")) // operators
  .build();

export const builder = new ELR.ParserBuilder({ lexer })
  .data<number>()
  .define({ exp: "number" }, (d) =>
    // the result of the reducer will be stored in the node's value
    d.reducer(({ matched }) => Number(matched[0].text)),
  )
  .define({ exp: `'-' exp` }, (d) => d.reducer(({ values }) => -values[1]!))
  .define({ exp: `'(' exp ')'` }, (d) => d.reducer(({ values }) => values[1]))
  .define({ exp: `exp '+' exp` }, (d) =>
    d.reducer(({ values }) => values[0]! + values[2]!),
  )
  .define({ exp: `exp '-' exp` }, (d) =>
    d.reducer(({ values }) => values[0]! - values[2]!),
  )
  .define({ exp: `exp '*' exp` }, (d) =>
    d.reducer(({ values }) => values[0]! * values[2]!),
  )
  .define({ exp: `exp '/' exp` }, (d) =>
    d.reducer(({ values }) => values[0]! / values[2]!),
  )
  .priority(
    { exp: `'-' exp` }, // highest priority
    [{ exp: `exp '*' exp` }, { exp: `exp '/' exp` }],
    [{ exp: `exp '+' exp` }, { exp: `exp '-' exp` }], // lowest priority
  );
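
The same hedged usage sketch applies to the calculator (the build options and result fields are assumptions):

const { parser } = builder.build({ entry: "exp" });
const res = parser.parseAll("2+3*4");
if (res.accept) {
  // assumption: the reducers' results are exposed as the root node's data
  console.log(res.buffer[0].data); // => 14
}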

Contribute

All issues and pull requests are welcome.


Issues

Re-parse for unresolved conflict?

In LR(1) we only check the next grammar to resolve a conflict, but sometimes that's not enough.

Maybe we can add something like re-parse to roll back the parser, just like re-lex?

This may hurt performance, so consider adding a new build option reParse: boolean.
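
A hypothetical sketch of the proposed knob (reParse comes from this issue; the rest of the build call mirrors the examples above and is an assumption):

// hypothetical: opt in to parser rollback at build time
const { parser } = builder.build({ entry: "exp", reParse: true });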

Drop tokens.

Currently we store tokens in ASTNode.token and in Lexer.errors. In most cases these tokens are never used, which wastes a lot of memory.

For the ASTNode, a better way is to define a transformer that transforms a token into a terminator ASTNode, so we can then drop the token.

For Lexer.errors, we can define a callback to receive those error tokens instead of storing them.
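
A hypothetical shape for that callback (the onError name and the token field are illustrative, not an existing API):

// hypothetical: hand error tokens to a callback instead of keeping them in Lexer.errors
const lexer = new Lexer.Builder()
  .onError((token) => console.warn(`lexing error: ${token.content}`)) // illustrative only
  .build();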

Add decorator support for SelectedAction.

Currently, if we want to set data or a callback for a selected action, we have to write:

Action.from(/123/).data(...).then(...).kinds(...).map(...)

And the data/then callbacks can't access the kinds defined later in kinds.

Proposed:

Action.from(/123/).kinds(...).map(...).data(...).then(...)

It would be nice if we could set a different data type for each kind:

Action.from(...).kinds(...).map(...).data('k1', ()=>...).data('k2', ()=>...)

Placeholder conflict resolvers generated by the advanced parser builder may be invalid.

Consider the following grammar rule:

{ exps: `exp (',' exp)* ','?` }

Expanded & generated grammar rules:

{ exps: `exp | exp ',' | exp __0 | exp __0 ','` }
{ __0: `',' exp | ',' exp __0` }

One of the generated conflict resolvers:

{ __0: `',' exp` } vs { __0: `',' exp __0` }, next: *, accept: false

When parsing exp ',' exp ',', once we have digested exp ',' exp and try to reduce to exp __0, the reduction is rejected by the conflict resolver (since we want a greedy match), so the grammar rule can never be accepted.

In conclusion, a grammar snippet decorated with +*? shouldn't be followed by another grammar snippet if the decorated snippet starts with that following snippet.

Can we check this and auto generate correct resolvers?

Maybe #19 is a correct solution?

Clone lexer's context?

Usually the lexer should be stateless, but sometimes lexer actions depend on external state via closures. When we use the parser, the inner lexer may be cloned many times, so the external state gets messed up.

So we should add a state/context to the lexer (and maybe to the parser as well), which should implement a LexerContext interface and be accessible in actions.

interface LexerContext {
  clone(): this
}

// inject context in builder.
// the context type should be a part of the lexer's generic type.
// the builder should clone the context value and store as the default context value.
builder.context(...).define(...).build()

// access the context in actions
Lexer.Action.simple((input) => {
  console.log(input.context);
  return 0;
});

The context is usually an object, so maybe we should also provide a default clone method.
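
A minimal sketch of such a default clone, assuming contexts are plain objects that structuredClone can handle:

// a possible default clone via the runtime's structuredClone
// (available in Node.js >= 17 and modern browsers)
function defaultCloneContext<T>(ctx: T): T {
  return structuredClone(ctx);
}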

More lexer utils.

E.g. blank chars and comments are common requirements when building a compiler.

const lexer = new Lexer.Builder()
  .ignore(/^\s/) // ignore blank chars
  .ignore(Lexer.from_to("//", "\n", true)) // single line comment
  .ignore(Lexer.from_to("/*", "*/", true)) // multiline comment

Expectational lexing should be applied at the grammar rule level, instead of globally.

When we have many grammar rules, expectational lexing is slow: every grammar rule tries its own expectational lexing, and these lexings are unrelated to each other. Without expectational lexing, all grammar rules can share one lexing result.

The value of expectational lexing is not to improve performance, but to handle lexical errors, e.g. lexing regex literals in JavaScript.

To avoid the overhead of expectational lexing, we should add a property like expect to grammar rules when defining them. The field should indicate which grammar in the rule should be lexed with expectation.

When there is no expectation, grammar rules should just use the expectation-free lexing result (which can also be cached with the current caching mechanism). When there is an expectation, the grammar rule (actually the candidate) will use expectational lexing.

The DFA state should also maintain a map that records when expectational lexing is needed.
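
A hypothetical sketch of the proposed per-rule expectation (the expect call is illustrative, not an existing API):

builder.define(
  { exp: `exp '/' exp | regex_literal` },
  // hypothetical: only `regex_literal` in this rule is lexed with expectation;
  // the other grammars reuse the shared, expectation-free lexing result
  (d) => d.expect("regex_literal"),
);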

Make re-lex optional?

Should we make re-lex optional to improve runtime performance?
Is there a way to check whether re-lexing is actually required?

Complex conflict handling when we need to peek more than 1 token.

E.g. when we parse JavaScript:

f(({a, b}) => a + b);

with the following grammar rules:

exp := '(' '{' identifier (',' identifier)* '}' ')' '=>' exp
exp := '(' exp ')'
exp := object
object := '{' (object_entry (',' object_entry)*)? '}'
object_entry := identifier (':' exp)?

when we have digested f(({ a we don't know whether the a is an object entry or an arrow function param. We may have to peek many tokens to decide.

Solutions for this issue:

  • Re-parse, see #19 . Not recommended.
  • Optimize grammar rules to prevent this from happening, introducing more intermediate NTs. Bad user experience.
  • Allow grammar rules to do more than reduce AST nodes, e.g. override the parser buffer.

Serialize/Deserialize DFA

It's unnecessary to build the DFA from scratch each time.

Ideas:

  • Assign a unique id/index to each DFA candidate/state and store them as JSON. Re-connect the objects on load.
  • Also serialize follow sets, first sets, etc.

Issues:

  • What about user-defined functions?
  • Cache grammar rules to check whether they have changed?
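
A hypothetical round-trip sketch (the serialize/hydrate build options are assumptions for illustration, not confirmed API):

import * as fs from "fs";

// hypothetical: capture the computed DFA as JSON on the first build
const { serializable } = builder.build({ entry: "exp", serialize: true });
fs.writeFileSync("dfa.json", JSON.stringify(serializable));

// hypothetical: later builds re-connect the stored states instead of recomputing them
const data = JSON.parse(fs.readFileSync("dfa.json", "utf8"));
const { parser } = builder.build({ entry: "exp", hydrate: data });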

Lexer's ActionExec should take `buffer` & `start` as params to optimize performance.

For now, Lexer's ActionExec takes buffer as its param, which causes frequent String.slice calls that create temporary strings; this is slow.

By taking buffer & start as the parameters, it's as if we were creating a StringView, which is faster.

Many string methods support the start parameter, e.g. startsWith, indexOf.

For regex, we can set lastIndex to let the match start from a specified position if we enable the 'g' flag for the regex.

As a tradeoff, if an action needs a substring it has to call slice by itself. If many actions create temporary strings like this, there will still be many temp strings, and users need to manage them themselves to optimize performance.
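
A minimal sketch of the regex technique mentioned above (plain JavaScript semantics, no retsac API involved):

const buffer = "let x = 42;";
const start = 8;
// with the g (or sticky y) flag, exec honors lastIndex,
// so we can match at an offset without slicing the buffer
const re = /[0-9]+/y;
re.lastIndex = start;
const m = re.exec(buffer); // matches "42"; null if no token starts at `start`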

Make all actions RegExp?

RegExp is commonly used in downstream components like tmLanguage.

Maybe we should make all actions RegExp? Then it might be easy to transform the lexer into a tmLanguage.json.

Specify token kind in Lexer.action's output?

Currently, every action can only map to a single token kind.

If 2 kinds have similar lexing rules, we have to run the rules twice to yield the tokens.

To solve this, maybe Definition.kind should be a list of all possible kinds instead of a single string.
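
A hypothetical sketch of that shape, reusing the Action.from(...).kinds(...).map(...) chain from the decorator issue above (the define signature and the map callback are illustrative):

// hypothetical: one action, several possible kinds, decided per match
builder.define({
  kind: ["integer", "float"], // proposed: a list instead of a single string
  action: Action.from(/[0-9]+(?:\.[0-9]+)?/)
    .kinds("integer", "float")
    .map((content) => (content.includes(".") ? "float" : "integer")), // illustrative
});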

Panic mode in ELR parser

If no token can be lexed (with expectation) during parsing, the parser should call lexer.take(1) to eat one char, then try to continue.

Beware: when working with the parser, you shouldn't use lexerBuilder.ignore(/^./) to implement lexing error handling, since it will also be applied in trimStart.

Add dist build

Then we can use a CDN like jsDelivr to use this lib in the browser.
