discretetom / retsac
Text lexer and parser. Compiler frontend framework.
Home Page: https://discretetom.github.io/retsac/
License: MIT License
Currently, if we want to set data or a callback for a selected action, we have to write:

```ts
Action.from(/123/).data(...).then(...).kinds(...).map(...)
```

And we can't access the kinds defined in `kinds(...)`.

Proposed:

```ts
Action.from(/123/).kinds(...).map(...).data(...).then(...)
```

It would be nice if we could set a different data type for each kind:

```ts
Action.from(...).kinds(...).map(...).data('k1', ()=>...).data('k2', ()=>...)
```
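A rough sketch of how the per-kind `data` overload could be typed; the interface and signatures here are hypothetical, not the library's current API:

```ts
// hypothetical typing sketch: each `data` call is keyed by a previously
// declared kind, and each factory's return type is added to the union
// of possible data types.
interface ActionWithKinds<Kinds extends string, Data = never> {
  kinds<K extends string>(...kinds: K[]): ActionWithKinds<Kinds | K, Data>;
  data<K extends Kinds, D>(kind: K, factory: () => D): ActionWithKinds<Kinds, Data | D>;
}
```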
When we have many grammar rules, expectational lexing will be slow: every grammar rule will try its own expectational lexing, and those lexings are unrelated to each other. Without expectational lexing, all grammar rules can share one lexing result.

The value of expectational lexing is not to improve performance, but to handle context-dependent lexing, e.g. lexing a regex literal in JavaScript.

To avoid the overhead of expectational lexing, we should add a property like `expect` to grammar rules when defining them. The field should indicate which grammar in the rule should be lexed with expectation.

When there is no expectation, grammar rules should just use the expectation-free lexing result (which can also be cached with the current caching mechanism). When there is an expectation, the grammar rule (actually the candidate) will use expectational lexing.

The DFA state should also maintain a map recording when expectational lexing is needed.
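A sketch of what the proposed `expect` field might look like; the options-object shape is hypothetical:

```ts
// hypothetical: mark which grammar in this rule needs expectational lexing
builder.define(
  { exp: `exp '/' exp` },
  // the '/' here could be mis-lexed as the start of a JavaScript regex
  // literal, so only it is lexed with expectation; all other grammars
  // in the rule share the expectation-free lexing result
  { expect: new Set(["'/'"]) }, // hypothetical field
);
```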
For now, the lexer's `ActionExec` takes `buffer` as its parameter, which causes frequent `String.prototype.slice` calls that create temporary strings, which is slow.

By passing `buffer` & `start` as the parameters, it's like we are creating a `StringView`, which is faster. Many string methods support a start parameter, e.g. `startsWith`, `indexOf`. For regexes, we can set `lastIndex` to make the match start from a specified position if we enable the `g` flag.

As the tradeoff, if an action needs a substring, it has to call `slice` by itself. If many actions create temporary strings like this, there will still be many temporary strings, and users need to manage those themselves to optimize performance.
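For example, these are all standard JavaScript capabilities, so no new API is assumed:

```ts
const buffer = "let x = 123;";
const start = 8;

// many string methods accept a start position, so no slice is needed:
buffer.startsWith("123", start); // true
buffer.indexOf(";", start);      // 11

// with the 'g' (or 'y') flag, a regex can start matching at a given position:
const re = /\d+/g;
re.lastIndex = start;
re.exec(buffer); // matches "123" at index 8
```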
Currently the serialized grammar parser data contains many string literals. Reducing those literals can reduce the bundle size.
Ideas:
E.g. when we parse JavaScript:

```ts
f(({a, b}) => a + b);
```

with the following grammar rules:

```
exp := '(' '{' identifier (',' identifier)* '}' ')' '=>' exp
exp := '(' exp ')'
exp := object
object := '{' (object_entry (',' object_entry)*)? '}'
object_entry := identifier (':' exp)?
```

when we have digested `f(({ a`, we don't know whether the `a` is an object entry or an arrow function parameter. We may have to peek many tokens to decide.
Solution for this issue:

In LR(1) we only check the next grammar to resolve a conflict, but sometimes that's not enough. Maybe we can add something like re-parse to roll back the parser, just like re-lex? This may impact performance, so consider adding a new build option `reParse: boolean`.
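A sketch of how the option might be passed; `reParse` is the proposal itself, not a current option:

```ts
// hypothetical: opt in to re-parse (roll back the parser and retry),
// analogous to the existing re-lex mechanism
builder.build({ reParse: true });
```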
Currently, an unclosed single-line string literal will include the trailing `\n` in `output.content`. But the `\n` shouldn't be a part of the string.
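For illustration, assuming a `stringLiteral`-style action and a token `content` field:

```ts
// lexing an unclosed single-line string literal
const input = '"abc\n';
// current behavior:  content === '"abc\n' (trailing newline included)
// expected behavior: content === '"abc'  (the newline terminates the
// literal but is not part of it)
```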
Format: `[class.method] content`, e.g. `[Lexer.feed] 123 chars`.
Currently we store tokens in `ASTNode.token` and in `Lexer.errors`. But in most cases these tokens are not used, which wastes a lot of memory.

For the `ASTNode`, a better way is to define a transformer that transforms a token into a terminator ASTNode, so we can drop the token. For `Lexer.errors`, we can define a callback to receive those error tokens.
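A sketch of the error-callback idea; `onError` is a hypothetical name, not an existing builder method:

```ts
// hypothetical: hand error tokens to a callback instead of
// accumulating them in Lexer.errors
const lexer = new Lexer.Builder()
  .onError((token) => console.warn(`invalid token: ${token.content}`)) // hypothetical method
  .build();
```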
Because `SomeType | never` equals `SomeType` instead of staying as `SomeType | never`: the `never` will be omitted, which makes the type union wrong.
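This is standard TypeScript behavior: `never` is absorbed in unions.

```ts
type A = string | never; // evaluates to `string`
type B = never | never;  // evaluates to `never`

// so using `never` as a marker member of a union silently loses it:
type Token<Data> = { kind: string; data: Data };
type T = Token<string | never>; // identical to Token<string>
```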
E.g. blank chars and comments are common requirements when building a compiler.

```ts
const lexer = new Lexer.Builder()
  .ignore(/^\s/) // ignore blank chars
  .ignore(Lexer.from_to("//", "\n", true)) // single-line comment
  .ignore(Lexer.from_to("/*", "*/", true)) // multiline comment
  .build();
```
Currently, every action can only map to a single token kind. If 2 kinds have similar lexing rules, we have to run these rules twice to yield the token. To solve this, maybe `Definition.kind` should be a list of all possible kinds instead of a single string.
Usually a lexer should be stateless, but sometimes lexer actions depend on external state via closures. When we use the parser, the inner lexer may be cloned many times, and thus the external state will be messed up.

So we should add a state/context for the lexer (maybe also for the parser), which should implement a `LexerContext` interface, and access it in actions.

```ts
interface LexerContext {
  clone(): this
}

// inject the context in the builder.
// the context type should be a part of the lexer's generic type.
// the builder should clone the context value and store it as the default context value.
builder.context(...).define(...).build()

// access the context in actions
Lexer.Action.simple((input) => {
  console.log(input.context);
  return 0;
});
```

Usually the context is an object, so maybe we should also implement a default clone method.
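A minimal sketch of such a default clone, assuming the context data is a structured-cloneable plain object and reusing the `LexerContext` interface above:

```ts
// default context wrapper: deep-copies its data via structuredClone
// so cloned lexers don't share mutable state
class DefaultLexerContext<T extends object> implements LexerContext {
  constructor(public data: T) {}
  clone(): this {
    return new DefaultLexerContext(structuredClone(this.data)) as this;
  }
}
```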
To control the behaviour or configure the logger of the lexer and parser.
Then we can use a CDN like jsDelivr to use this lib in the browser.
If no token can be lexed (with expectation) when parsing, the parser should call `lexer.take(1)` to eat one char, then try to continue.

Beware: when working with the parser, you shouldn't use `lexerBuilder.ignore(/^./)` to implement lexing error handling, since it will also be applied in `trimStart`.
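A minimal sketch of that recovery loop; `hasRest` and the shape of the `expect` argument are assumptions about the API, not verified signatures:

```ts
// proposed panic-mode recovery: eat one char at a time until a token
// can be lexed or the input is exhausted
function lexWithRecovery(lexer: any, expect: unknown) {
  let token = lexer.lex({ expect }); // `expect` shape is assumed
  while (token === null && lexer.hasRest()) { // `hasRest` is assumed
    lexer.take(1); // eat one char, then retry
    token = lexer.lex({ expect });
  }
  return token;
}
```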
E.g. we have a `pb1 = ParserBuilder<number>` and a `pb2 = ParserBuilder<number[]>`. We can make `pb2` use `pb1` through a bridge which converts a `number` to a `number[]`.

Besides, we can also bridge type names; in that case we can use 3rd-party ParserBuilders.
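A sketch of what such a bridge could look like; the `Bridge` type and the `use` call are hypothetical, not existing API:

```ts
// a bridge adapts values produced by one builder to the value type of another
type Bridge<From, To> = (value: From) => To;

const numberToArray: Bridge<number, number[]> = (n) => [n];

// hypothetical usage: pb2 reuses pb1's grammar, converting values at the seam
// pb2.use(pb1, numberToArray);
```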
RegExp is commonly used in downstream components like `tmLanguage`. Maybe we should make all actions RegExp? Then it might be easy to transform the lexer into a `tmLanguage.json`?
Make re-lex optional to improve the runtime performance?
Is there a way to check whether the re-lex is required?
E.g.

```ts
lexer.takeUntil('\n')
lexer.takeUntil('}')
```
Allow tokens to store custom typed data? Then we can use `stringLiteral` with a `close` pattern to match multiple open/close delimiters at the same time:

```ts
stringLiteral('`', { close: /`|\$\{/ })
```
Use a temporary action to lex the input?

```ts
lexer.lex({ action: new Action() })
```

Maybe this is useful in error handling?
It's unnecessary to build the DFA each time.
Ideas:
Issues:
For now only T/NT can be renamed using `@`. Literals should also be able to be renamed, e.g. ``define({ xx: `'123'@someName` })``.
Consider the following grammar rule:

```ts
{ exps: `exp (',' exp)* ','?` }
```

Expanded & generated grammar rules:

```ts
{ exps: `exp | exp ',' | exp __0 | exp __0 ','` }
{ __0: `',' exp | ',' exp __0` }
```

One of the generated conflict resolvers:

```
{ __0: `',' exp` } vs { __0: `',' exp __0` }, next: *, accept: false
```

When parsing `exp ',' exp ','`, once we have digested `exp ',' exp` and try to reduce it to `exp __0`, the reduction is rejected by the conflict resolver (since we want a greedy match), so the grammar rule can't be accepted.

In conclusion, a grammar snippet decorated with `+*?` shouldn't be followed by another grammar snippet that the decorated snippet itself starts with.

Can we check this and auto-generate correct resolvers? Maybe #19 is a correct solution?
Currently the conflict rejecter is implemented using a `Rejecter`, but it would be more efficient to implement it in `Candidate.tryReduce`.
Since the serialized data generated by the AdvancedParser contains the generated grammars, the data is not assignable to the `BuildOptions`. Currently the workaround is to cast the type. Maybe there are better ways? E.g. treat the cascade prefix as a generic parameter of the `IParserBuilder` and ignore these generated grammars.
Just like in regex, support `??`, `+?` and `*?`. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Quantifiers
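For example, a non-greedy `*?` could make the greedy-match conflict above resolvable declaratively; this syntax is the proposal itself, not something the library currently accepts:

```ts
// hypothetical non-greedy quantifier in a grammar rule,
// mirroring regex `*?` semantics: prefer the shortest repetition,
// so the trailing ','? can claim the last comma
builder.define({ exps: `exp (',' exp)*? ','?` });
```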