Wisent is a (WIP) UTF-8 Lexer and Parser generator supporting LL(1), SLR, LR(0), LALR(1) and LR(1) grammars.
The project uses ANTLR grammar syntax, and allows for actions (e.g. switching context with -> Push
and -> Pop
) and non-greedy matching.
Both type of grammars are supported in push mode (the parser wait a new token from the lexer) and pull mode (the parser automatically invokes the lexer to gather tokens).
This project is currently Work In Progress. In particular, while the generated lexers and parser itself works, the input grammar parser and bootstrapping mechanism is still being worked on.
The lexer generated by this project is an UTF-8 Deterministc Finite Automaton built on the fly (i.e. without using an intermediate Non-Deterministic one). Each UTF-8 character belonging to the input alphabet, after being grouped into sets depending on the DFA transactions, is assigned an integer. This allows the usage of the not operator [^ ]
and the any character operator .
without incurring in performance hits.
A DFA can be created from a grammar using the following syntax:
let grammar = Grammar::new(
&[("LETTER_A", "'a'").into(), ("LETTER_B", "'b'*").into()],
&[],
);
let dfa = MultiDfa::new(&grammar);
The grammar uses the ANTLR syntax and this software supports switching between lexer modes (a.k.a. contexts) using the ANTLR Push
and Pop
, hence the name MultiDfa
.
After creating a dfa, it can be saved to disk as bytes via serde.
At runtime, a DFASimulator
can be used to invoke the DFA and tokenize the input, in the following way:
let dfa = MultiDfa::from_bytes(&serialized);
let input = "any input";
let mut simulator = DfaSimulator::new(&dfa, BufReader::new(input.as_bytes()).bytes());
// retrieves tokens
while let Some(token) = simulator.next_token()? {
// use the token
}
The parser can be either a LL parser or a LR parser. For the family of LL parsers, only a parser of type LL(1) can be generated. For the family of LR parsers, a LR(0), SLR, LALR(1) or LR(1) parser can be generated. For both families a push parser or pull parser can be generated.
Note that each family requires a different simulator: a generated LL parser cannot be run at runtime with a LR simulator and vice-versa.
The LL/LR parsing table can be created with the following syntax:
LR: the result of xx_parsing_table() methods will contain hints if the grammar has SHIFT/REDUCE or REDUCE/REDUCE conflicts.
let g = "antlr grammar here";
let lr_grammar = LRGrammar::try_from(&g).unwrap();
let slrtable = lr_grammar.slr_parsing_table().unwrap(); // for SLR
let lr0 = lr_grammar.lr0_parsing_table().unwrap(); // for LR(0)
let lr1 = lr_grammar.lr1_parsing_table().unwrap(); // for LR(1)
let lalr1 = lr_grammar.lalr1_parsing_table().unwrap(); // for LALR(1)
similarly for LL, with the resulting type of parsing table containing hints if there are FIRST/FIRST or FIRST/FOLLOW conflicts.
let g = "antlr grammar here";
let ll_grammar = LLGrammar::try_from(&g).unwrap();
let table = ll_grammar.parsing_table().unwrap(); // for LL(1)
At runtime, to parse some text it is sufficied to create either a LLParser
or LRParser
with any of the generated parsing table, and calling the parse
method from the trait PullParser
and PushParser
.
The DFA, LLParsing and LRParsing are essentially finished. The project is currently undergoing some refactoring in the bootstrapping parser: the parser used by this software to understand the input grammar for which the parser generator will be created.