discretetom / retsac
Text lexer and parser. Compiler frontend framework.
Home Page: https://discretetom.github.io/retsac/
License: MIT License
Currently, if we want to set data or a callback for a selected action, we have to write:

```ts
Action.from(/123/).data(...).then(...).kinds(...).map(...)
```

And we can't access the kinds defined in `kinds(...)`.

Proposed:

```ts
Action.from(/123/).kinds(...).map(...).data(...).then(...)
```

It would be nice if we could set a different data type for each kind:

```ts
Action.from(...).kinds(...).map(...).data('k1', ()=>...).data('k2', ()=>...)
```
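A rough sketch of how the per-kind `data` overload could be typed; the interface and signatures here are hypothetical, not the library's current API:

```ts
// hypothetical typing sketch: each `data` call is keyed by a previously
// declared kind, and each factory's return type is added to the union
// of possible data types.
interface ActionWithKinds<Kinds extends string, Data = never> {
  kinds<K extends string>(...kinds: K[]): ActionWithKinds<Kinds | K, Data>;
  data<K extends Kinds, D>(kind: K, factory: () => D): ActionWithKinds<Kinds, Data | D>;
}
```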
When we have many grammar rules, expectational lexing will be slow: every grammar rule will try its own expectational lexing, and those lexings are unrelated to each other. Without expectational lexing, all grammar rules can share one lexing result.

The value of expectational lexing is not to improve performance, but to handle context-dependent lexing, e.g. lexing a regex literal in JavaScript.

To avoid the overhead of expectational lexing, we should add a property like `expect` to grammar rules when defining them. The field should indicate which grammar in the rule should be lexed with expectation.

When there is no expectation, grammar rules should just use the expectation-free lexing result (which can also be cached with the current caching mechanism). When there is an expectation, the grammar rule (actually the candidate) will use expectational lexing.

The DFA state should also maintain a map recording when expectational lexing is needed.
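A sketch of what the proposed `expect` field might look like; the options-object shape is hypothetical:

```ts
// hypothetical: mark which grammar in this rule needs expectational lexing
builder.define(
  { exp: `exp '/' exp` },
  // the '/' here could be mis-lexed as the start of a JavaScript regex
  // literal, so only it is lexed with expectation; all other grammars
  // in the rule share the expectation-free lexing result
  { expect: new Set(["'/'"]) }, // hypothetical field
);
```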
For now, the lexer's `ActionExec` takes `buffer` as its parameter, which causes frequent `String.prototype.slice` calls that create temporary strings, which is slow.

By passing `buffer` & `start` as the parameters, it's like we are creating a `StringView`, which is faster. Many string methods support a start parameter, e.g. `startsWith`, `indexOf`. For regexes, we can set `lastIndex` to make the match start from a specified position if we enable the `g` flag.

As the tradeoff, if an action needs a substring, it has to call `slice` by itself. If many actions create temporary strings like this, there will still be many temporary strings, and users need to manage those themselves to optimize performance.
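For example, these are all standard JavaScript capabilities, so no new API is assumed:

```ts
const buffer = "let x = 123;";
const start = 8;

// many string methods accept a start position, so no slice is needed:
buffer.startsWith("123", start); // true
buffer.indexOf(";", start);      // 11

// with the 'g' (or 'y') flag, a regex can start matching at a given position:
const re = /\d+/g;
re.lastIndex = start;
re.exec(buffer); // matches "123" at index 8
```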
Currently the serialized grammar parser data contains many string literals. Reducing those literals can reduce the bundle size.
Ideas:
E.g. when we parse JavaScript:

```ts
f(({a, b}) => a + b);
```

with the following grammar rules:

```
exp := '(' '{' identifier (',' identifier)* '}' ')' '=>' exp
exp := '(' exp ')'
exp := object
object := '{' (object_entry (',' object_entry)*)? '}'
object_entry := identifier (':' exp)?
```

when we have digested `f(({ a`, we don't know whether the `a` is an object entry or an arrow function parameter. We may have to peek many tokens to decide.
Solution for this issue:

In LR(1) we only check the next grammar to resolve a conflict, but sometimes that's not enough. Maybe we can add something like re-parse to roll back the parser, just like re-lex? This may impact performance, so consider adding a new build option `reParse: boolean`.
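A sketch of how the option might be passed; `reParse` is the proposal itself, not a current option:

```ts
// hypothetical: opt in to re-parse (roll back the parser and retry),
// analogous to the existing re-lex mechanism
builder.build({ reParse: true });
```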
Currently, an unclosed single-line string literal will include the trailing `\n` in `output.content`. But the `\n` shouldn't be a part of the string.
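For illustration, assuming a `stringLiteral`-style action and a token `content` field:

```ts
// lexing an unclosed single-line string literal
const input = '"abc\n';
// current behavior:  content === '"abc\n' (trailing newline included)
// expected behavior: content === '"abc'  (the newline terminates the
// literal but is not part of it)
```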
Format: `[class.method] content`, e.g. `[Lexer.feed] 123 chars`.
Currently we store tokens in `ASTNode.token` and in `Lexer.errors`. But in most cases these tokens are not used, which wastes a lot of memory.

For the `ASTNode`, a better way is to define a transformer that transforms a token into a terminator ASTNode, so we can drop the token. For `Lexer.errors`, we can define a callback to receive those error tokens.
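A sketch of the error-callback idea; `onError` is a hypothetical name, not an existing builder method:

```ts
// hypothetical: hand error tokens to a callback instead of
// accumulating them in Lexer.errors
const lexer = new Lexer.Builder()
  .onError((token) => console.warn(`invalid token: ${token.content}`)) // hypothetical method
  .build();
```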
Because `SomeType | never` equals `SomeType` instead of staying as `SomeType | never`: the `never` will be omitted, which makes the type union wrong.
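This is standard TypeScript behavior: `never` is absorbed in unions.

```ts
type A = string | never; // evaluates to `string`
type B = never | never;  // evaluates to `never`

// so using `never` as a marker member of a union silently loses it:
type Token<Data> = { kind: string; data: Data };
type T = Token<string | never>; // identical to Token<string>
```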
E.g. blank chars and comments are common requirements when building a compiler.

```ts
const lexer = new Lexer.Builder()
  .ignore(/^\s/) // ignore blank chars
  .ignore(Lexer.from_to("//", "\n", true)) // single-line comment
  .ignore(Lexer.from_to("/*", "*/", true)) // multiline comment
  .build();
```
Currently, every action can only map to a single token kind. If 2 kinds have similar lexing rules, we have to run these rules twice to yield the token. To solve this, maybe `Definition.kind` should be a list of all possible kinds instead of a single string.
Usually a lexer should be stateless, but sometimes lexer actions depend on external state via closures. When we use the parser, the inner lexer may be cloned many times, and thus the external state will be messed up.

So we should add a state/context for the lexer (maybe also for the parser), which should implement a `LexerContext` interface, and access it in actions.

```ts
interface LexerContext {
  clone(): this
}

// inject the context in the builder.
// the context type should be a part of the lexer's generic type.
// the builder should clone the context value and store it as the default context value.
builder.context(...).define(...).build()

// access the context in actions
Lexer.Action.simple((input) => {
  console.log(input.context);
  return 0;
});
```

Usually the context is an object, so maybe we should also implement a default clone method.
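A minimal sketch of such a default clone, assuming the context data is a structured-cloneable plain object and reusing the `LexerContext` interface above:

```ts
// default context wrapper: deep-copies its data via structuredClone
// so cloned lexers don't share mutable state
class DefaultLexerContext<T extends object> implements LexerContext {
  constructor(public data: T) {}
  clone(): this {
    return new DefaultLexerContext(structuredClone(this.data)) as this;
  }
}
```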
To control the behaviour or configure the logger of the lexer and parser.
Then we can use a CDN like jsDelivr to use this lib in the browser.
If no token can be lexed (with expectation) when parsing, the parser should call `lexer.take(1)` to eat one char, then try to continue.

Beware: when working with the parser, you shouldn't use `lexerBuilder.ignore(/^./)` to implement lexing error handling, since it will also be applied in `trimStart`.
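A minimal sketch of that recovery loop; `hasRest` and the shape of the `expect` argument are assumptions about the API, not verified signatures:

```ts
// proposed panic-mode recovery: eat one char at a time until a token
// can be lexed or the input is exhausted
function lexWithRecovery(lexer: any, expect: unknown) {
  let token = lexer.lex({ expect }); // `expect` shape is assumed
  while (token === null && lexer.hasRest()) { // `hasRest` is assumed
    lexer.take(1); // eat one char, then retry
    token = lexer.lex({ expect });
  }
  return token;
}
```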
E.g. we have a `pb1 = ParserBuilder<number>` and a `pb2 = ParserBuilder<number[]>`. We can make `pb2` use `pb1` through a bridge which converts a `number` to a `number[]`.

Besides, we can also bridge type names; in that case we can use 3rd-party ParserBuilders.
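A sketch of what such a bridge could look like; the `Bridge` type and the `use` call are hypothetical, not existing API:

```ts
// a bridge adapts values produced by one builder to the value type of another
type Bridge<From, To> = (value: From) => To;

const numberToArray: Bridge<number, number[]> = (n) => [n];

// hypothetical usage: pb2 reuses pb1's grammar, converting values at the seam
// pb2.use(pb1, numberToArray);
```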
RegExp is commonly used in downstream components like `tmLanguage`. Maybe we should make all actions RegExp? Then it might be easy to transform the lexer into a `tmLanguage.json`?
Make re-lex optional to improve the runtime performance?
Is there a way to check whether the re-lex is required?
E.g.

```ts
lexer.takeUntil('\n')
lexer.takeUntil('}')
```
Allow tokens to store custom typed data? Then we can use `stringLiteral` with a `close` pattern to match multiple open/close delimiters at the same time:

```ts
stringLiteral('`', { close: /`|\$\{/ })
```
Use a temporary action to lex the input?

```ts
lexer.lex({ action: new Action() })
```

Maybe this is useful in error handling?
It's unnecessary to build the DFA each time.
Ideas:
Issues:
For now only T/NT can be renamed using `@`. Literals should also be able to be renamed, e.g. ``define({ xx: `'123'@someName` })``.
Consider the following grammar rule:

```ts
{ exps: `exp (',' exp)* ','?` }
```

Expanded & generated grammar rules:

```ts
{ exps: `exp | exp ',' | exp __0 | exp __0 ','` }
{ __0: `',' exp | ',' exp __0` }
```

One of the generated conflict resolvers:

```
{ __0: `',' exp` } vs { __0: `',' exp __0` }, next: *, accept: false
```

When parsing `exp ',' exp ','`, once we have digested `exp ',' exp` and try to reduce it to `exp __0`, the reduction is rejected by the conflict resolver (since we want a greedy match), so the grammar rule can't be accepted.

In conclusion, a grammar snippet decorated with `+*?` shouldn't be followed by another grammar snippet that the decorated snippet itself starts with.

Can we check this and auto-generate correct resolvers? Maybe #19 is a correct solution?
Currently the conflict rejecter is implemented using a `Rejecter`, but it would be more efficient to implement it in `Candidate.tryReduce`.
Since the serialized data generated by the AdvancedParser contains the generated grammars, the data is not assignable to the `BuildOptions`. Currently the workaround is to cast the type. Maybe there are better ways? E.g. treat the cascade prefix as a generic parameter of the `IParserBuilder` and ignore these generated grammars.
Just like in regex, support `??`, `+?` and `*?`. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Quantifiers
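For example, a non-greedy `*?` could make the greedy-match conflict above resolvable declaratively; this syntax is the proposal itself, not something the library currently accepts:

```ts
// hypothetical non-greedy quantifier in a grammar rule,
// mirroring regex `*?` semantics: prefer the shortest repetition,
// so the trailing ','? can claim the last comma
builder.define({ exps: `exp (',' exp)*? ','?` });
```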