Retsac


Warning

This project is still in an early stage of development; the API may change frequently.

Text lexer and parser. Compiler frontend framework.

This can be used to quickly prototype your own programming language compiler/translator frontend, or to parse your domain-specific language.

Try it online in the playground.

Installation

yarn add retsac

Features

  • The Lexer yields tokens from the input text string.
    • Regex support. See the examples below.
    • Built-in utility functions.
      • JavaScript's string literal, numeric literal, integer literal, identifier, etc.
      • JSON's string literal and numeric literal.
    • Support for custom functions.
  • The Parser works with the lexer and produces an AST (Abstract Syntax Tree).
    • ELR (Expectational LR) parser.
      • Meta characters like +*? when defining a grammar rule.
      • Conflict detection, with automatic conflict resolution where possible.
      • Query child nodes with $('name') instead of children[index].
      • Top-down traversal of the AST.
      • Bottom-up data reduction.
      • Expect the lexer to yield a specific token kind and/or content.
      • Try to re-lex the input if parsing fails.
      • DFA serialization & hydration to accelerate future builds.
    • Serializable AST object to interoperate with other tools (e.g. compiler backend libs like LLVM).
  • Strict type checking with TypeScript.
    • This is amazing; try it out for yourself.

Examples

In this example, we use AdvancedBuilder to define grammar rules with +*?, define top-down traversers using traverser, and query nodes in grammar rules using $ and $$.

All conflicts are auto resolved.

import { ELR, Lexer } from "retsac";

const lexer = new Lexer.Builder()
  .ignore(Lexer.whitespaces()) // ignore blank characters
  .define({
    // built-in support for JSON
    string: Lexer.json.stringLiteral(),
    number: Lexer.json.numericLiteral(),
  })
  .define(Lexer.wordKind("true", "false", "null")) // token's kind name equals the literal value
  .anonymous(Lexer.exact(..."[]{},:")) // single char borders without a kind name
  .build();

export const builder = new ELR.AdvancedBuilder({ lexer })
  .data<unknown>()
  .define(
    { value: `string | number | true | false | null` },
    // eval the only child's text to get the value
    (d) => d.traverser(({ children }) => eval(children[0].text!)),
  )
  .define(
    { value: `object | array` },
    // call the only child's traverse method to get the object/array value
    (d) => d.traverser(({ children }) => children[0].traverse()),
  )
  .define(
    // `?` for zero or one, `*` for zero or more, use `()` to group
    // quote literal values with `'` or `"`
    { array: `'[' (value (',' value)*)? ']'` },
    // use `$$` to select all children with the given name
    // traverse all values in the array and return the result as an array
    (d) => d.traverser(({ $$ }) => $$(`value`).map((v) => v.traverse())),
  )
  .define(
    // use `@` to rename a grammar
    { object_item: `string@key ':' value` },
    (d) =>
      // return an object
      // use `$` to select the first child with the given name
      d.traverser(({ $ }) => ({
        // eval the key's value, then traverse child to get the entry's value
        [eval($(`key`)!.text!)]: $(`value`)!.traverse(),
      })),
  )
  .define({ object: `'{' (object_item (',' object_item)*)? '}'` }, (d) =>
    d.traverser(({ $$ }) =>
      // every object_item's traverse result is an object, we need to merge them.
      Object.assign({}, ...$$(`object_item`).map((item) => item.traverse())),
    ),
  );
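
A minimal usage sketch (hedged: the entry build option and the parseAll/traverse calls are assumptions about the builder API, based on the shapes shown elsewhere in this README and the playground):

// assumption: build() accepts the entry NT and returns the parser
const { parser } = builder.build({ entry: "value" });
const res = parser.parseAll(`{ "hello": ["world", 123] }`);
if (res.accept) {
  // traverse the AST top-down to evaluate the JSON value
  console.log(res.buffer[0].traverse()); // => { hello: [ 'world', 123 ] }
}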

In this example, we use reducer to define bottom-up data reducers, so we can get the result as soon as the AST is built.

These grammar rules introduce conflicts; we use the high-level resolver API priority to resolve them.

import { ELR, Lexer } from "retsac";

const lexer = new Lexer.Builder()
  .ignore(Lexer.whitespaces()) // ignore blank characters
  .define({ number: /[0-9]+(?:\.[0-9]+)?/ })
  .anonymous(Lexer.exact(..."+-*/()")) // operators
  .build();

export const builder = new ELR.ParserBuilder({ lexer })
  .data<number>()
  .define({ exp: "number" }, (d) =>
    // the result of the reducer will be stored in the node's value
    d.reducer(({ matched }) => Number(matched[0].text)),
  )
  .define({ exp: `'-' exp` }, (d) => d.reducer(({ values }) => -values[1]!))
  .define({ exp: `'(' exp ')'` }, (d) => d.reducer(({ values }) => values[1]))
  .define({ exp: `exp '+' exp` }, (d) =>
    d.reducer(({ values }) => values[0]! + values[2]!),
  )
  .define({ exp: `exp '-' exp` }, (d) =>
    d.reducer(({ values }) => values[0]! - values[2]!),
  )
  .define({ exp: `exp '*' exp` }, (d) =>
    d.reducer(({ values }) => values[0]! * values[2]!),
  )
  .define({ exp: `exp '/' exp` }, (d) =>
    d.reducer(({ values }) => values[0]! / values[2]!),
  )
  .priority(
    { exp: `'-' exp` }, // highest priority
    [{ exp: `exp '*' exp` }, { exp: `exp '/' exp` }],
    [{ exp: `exp '+' exp` }, { exp: `exp '-' exp` }], // lowest priority
  );
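
The same hedged usage sketch applies to the calculator (the build options and result fields are assumptions):

const { parser } = builder.build({ entry: "exp" });
const res = parser.parseAll("2+3*4");
if (res.accept) {
  // assumption: the reducers' results are exposed as the root node's data
  console.log(res.buffer[0].data); // => 14
}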

Contribute

All issues and pull requests are welcome.


Issues

Re-parse for unresolved conflict?

In LR(1) we only check the next grammar to resolve a conflict, but sometimes that's not enough.

Maybe we can add something like re-parse to roll back the parser, just like re-lex?

This may hurt performance, so consider adding a new build option reParse: boolean.
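
A hypothetical sketch of the proposed knob (reParse comes from this issue; the rest of the build call mirrors the examples above and is an assumption):

// hypothetical: opt in to parser rollback at build time
const { parser } = builder.build({ entry: "exp", reParse: true });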

Drop tokens.

Currently we store tokens in ASTNode.token and in Lexer.errors. In most cases these tokens are never used, which wastes a lot of memory.

For the ASTNode, a better way is to define a transformer that transforms a token into a terminator ASTNode, so we can then drop the token.

For Lexer.errors, we can define a callback to receive those error tokens instead of storing them.
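
A hypothetical shape for that callback (the onError name and the token field are illustrative, not an existing API):

// hypothetical: hand error tokens to a callback instead of keeping them in Lexer.errors
const lexer = new Lexer.Builder()
  .onError((token) => console.warn(`lexing error: ${token.content}`)) // illustrative only
  .build();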

Add decorator support for SelectedAction.

Currently, if we want to set data or a callback for a selected action, we have to write:

Action.from(/123/).data(...).then(...).kinds(...).map(...)

And the data/then callbacks can't access the kinds defined later in kinds.

Proposed:

Action.from(/123/).kinds(...).map(...).data(...).then(...)

It would be nice if we could set a different data type for each kind:

Action.from(...).kinds(...).map(...).data('k1', ()=>...).data('k2', ()=>...)

Placeholder conflict resolvers generated by the advanced parser builder may be invalid.

Consider the following grammar rule:

{ exps: `exp (',' exp)* ','?` }

Expanded & generated grammar rules:

{ exps: `exp | exp ',' | exp __0 | exp __0 ','` }
{ __0: `',' exp | ',' exp __0` }

One of the generated conflict resolvers:

{ __0: `',' exp` } vs { __0: `',' exp __0` }, next: *, accept: false

When parsing exp ',' exp ',', once we have digested exp ',' exp and try to reduce to exp __0, the reduction is rejected by the conflict resolver (since we want a greedy match), so the grammar rule can never be accepted.

In conclusion, a grammar snippet decorated with +*? shouldn't be followed by another grammar snippet if the decorated snippet starts with that following snippet.

Can we check this and auto generate correct resolvers?

Maybe #19 is a correct solution?

Clone lexer's context?

Usually the lexer should be stateless, but sometimes lexer actions depend on external state via closures. When we use the parser, the inner lexer may be cloned many times, so the external state gets messed up.

So we should add a state/context to the lexer (and maybe to the parser as well), which should implement a LexerContext interface and be accessible in actions.

interface LexerContext {
  clone(): this
}

// inject context in builder.
// the context type should be a part of the lexer's generic type.
// the builder should clone the context value and store as the default context value.
builder.context(...).define(...).build()

// access the context in actions
Lexer.Action.simple((input) => {
  console.log(input.context);
  return 0;
});

The context is usually an object, so maybe we should also provide a default clone method.
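
A minimal sketch of such a default clone, assuming contexts are plain objects that structuredClone can handle:

// a possible default clone via the runtime's structuredClone
// (available in Node.js >= 17 and modern browsers)
function defaultCloneContext<T>(ctx: T): T {
  return structuredClone(ctx);
}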

More lexer utils.

E.g. blank chars and comments are common requirements when building a compiler.

const lexer = new Lexer.Builder()
  .ignore(/^\s/) // ignore blank chars
  .ignore(Lexer.from_to("//", "\n", true)) // single line comment
  .ignore(Lexer.from_to("/*", "*/", true)) // multiline comment

Expectational lexing should be applied at the grammar rule level, instead of globally.

When we have many grammar rules, expectational lexing is slow: every grammar rule tries its own expectational lexing, and these lexings are unrelated to each other. Without expectational lexing, all grammar rules can share one lexing result.

The value of expectational lexing is not to improve performance, but to handle lexical errors, e.g. lexing regex literals in JavaScript.

To avoid the overhead of expectational lexing, we should add a property like expect to grammar rules when defining them. The field should indicate which grammar in the rule should be lexed with expectation.

When there is no expectation, grammar rules should just use the expectation-free lexing result (which can also be cached with the current caching mechanism). When there is an expectation, the grammar rule (actually the candidate) will use expectational lexing.

The DFA state should also maintain a map that records when expectational lexing is needed.
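
A hypothetical sketch of the proposed per-rule expectation (the expect call is illustrative, not an existing API):

builder.define(
  { exp: `exp '/' exp | regex_literal` },
  // hypothetical: only `regex_literal` in this rule is lexed with expectation;
  // the other grammars reuse the shared, expectation-free lexing result
  (d) => d.expect("regex_literal"),
);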

Make re-lex optional?

Should we make re-lex optional to improve runtime performance?
Is there a way to check whether re-lexing is actually required?

Complex conflict handling when we need to peek more than 1 token.

E.g. when we parse JavaScript:

f(({a, b}) => a + b);

with the following grammar rules:

exp := '(' '{' identifier (',' identifier)* '}' ')' '=>' exp
exp := '(' exp ')'
exp := object
object := '{' (object_entry (',' object_entry)*)? '}'
object_entry := identifier (':' exp)?

when we have digested f(({ a we don't know whether the a is an object entry or an arrow function param. We may have to peek many tokens to decide.

Solutions for this issue:

  • Re-parse, see #19 . Not recommended.
  • Optimize grammar rules to prevent this from happening, introducing more intermediate NTs. Bad user experience.
  • Allow grammar rules to do more than reduce AST nodes, e.g. override the parser buffer.

Serialize/Deserialize DFA

It's unnecessary to build the DFA from scratch each time.

Ideas:

  • Assign a unique id/index to each DFA candidate/state and store them as JSON. Re-connect the objects on load.
  • Also serialize follow sets, first sets, etc.

Issues:

  • What about user-defined functions?
  • Cache grammar rules to check whether they have changed?
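
A hypothetical round-trip sketch (the serialize/hydrate build options are assumptions for illustration, not confirmed API):

import * as fs from "fs";

// hypothetical: capture the computed DFA as JSON on the first build
const { serializable } = builder.build({ entry: "exp", serialize: true });
fs.writeFileSync("dfa.json", JSON.stringify(serializable));

// hypothetical: later builds re-connect the stored states instead of recomputing them
const data = JSON.parse(fs.readFileSync("dfa.json", "utf8"));
const { parser } = builder.build({ entry: "exp", hydrate: data });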

Lexer's ActionExec should take `buffer` & `start` as params to optimize performance.

For now, Lexer's ActionExec takes buffer as its param, which causes frequent String.slice calls that create temporary strings; this is slow.

By taking buffer & start as the parameters, it's as if we were creating a StringView, which is faster.

Many string methods support the start parameter, e.g. startsWith, indexOf.

For regex, we can set lastIndex to let the match start from a specified position if we enable the 'g' flag for the regex.

As a tradeoff, if an action needs a substring it has to call slice by itself. If many actions create temporary strings like this, there will still be many temp strings, and users need to manage them themselves to optimize performance.
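
A minimal sketch of the regex technique mentioned above (plain JavaScript semantics, no retsac API involved):

const buffer = "let x = 42;";
const start = 8;
// with the g (or sticky y) flag, exec honors lastIndex,
// so we can match at an offset without slicing the buffer
const re = /[0-9]+/y;
re.lastIndex = start;
const m = re.exec(buffer); // matches "42"; null if no token starts at `start`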

Make all actions RegExp?

RegExp is commonly used in downstream components like tmLanguage.

Maybe we should make all actions RegExp? Then it might be easy to transform the lexer into a tmLanguage.json.

Specify token kind in Lexer.action's output?

Currently, every action can only map to a single token kind.

If 2 kinds have similar lexing rules, we have to run the rules twice to yield the tokens.

To solve this, maybe Definition.kind should be a list of all possible kinds instead of a single string.
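
A hypothetical sketch of that shape, reusing the Action.from(...).kinds(...).map(...) chain from the decorator issue above (the define signature and the map callback are illustrative):

// hypothetical: one action, several possible kinds, decided per match
builder.define({
  kind: ["integer", "float"], // proposed: a list instead of a single string
  action: Action.from(/[0-9]+(?:\.[0-9]+)?/)
    .kinds("integer", "float")
    .map((content) => (content.includes(".") ? "float" : "integer")), // illustrative
});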

Panic mode in ELR parser

If no token can be lexed (with expectation) during parsing, the parser should call lexer.take(1) to eat one char, then try to continue.

Beware: when working with the parser, you shouldn't use lexerBuilder.ignore(/^./) to implement lexing error handling, since it will also be applied in trimStart.

Add dist build

Then we can use a CDN like jsDelivr to use this lib in the browser.
