
Wisent

Wisent is a (WIP) UTF-8 lexer and parser generator supporting LL(1), SLR, LR(0), LALR(1) and LR(1) grammars. The project uses ANTLR grammar syntax and allows for actions (e.g. switching context with -> Push and -> Pop) and non-greedy matching.

Both grammar families are supported in push mode (the parser waits for new tokens from the lexer) and pull mode (the parser automatically invokes the lexer to gather tokens).

This project is currently a Work In Progress. In particular, while the generated lexers and parsers themselves work, the input grammar parser and bootstrapping mechanism are still being worked on.

Lexer

The lexer generated by this project is a UTF-8 Deterministic Finite Automaton built on the fly (i.e. without using an intermediate Non-Deterministic one). Each UTF-8 character belonging to the input alphabet, after being grouped into sets depending on the DFA transitions, is assigned an integer. This allows the usage of the not operator [^ ] and the any character operator . without incurring performance hits.
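
For illustration, a minimal sketch of the equivalence-class idea (not Wisent's actual API): for a grammar whose rules only mention 'a' and 'b', every other character behaves identically and can share a single class ID, so the transition tables are indexed by a small integer instead of by char.

// Illustrative only: maps characters to equivalence classes for a grammar
// that distinguishes just 'a', 'b', and "everything else".
fn equivalence_class(c: char) -> u32 {
    match c {
        'a' => 0,
        'b' => 1,
        _ => 2, // all remaining characters (matched by [^ ] or .) share one class
    }
}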

Lexer Usage

A DFA can be created from a grammar using the following syntax:

// two lexer rules in ANTLR syntax: LETTER_A matches 'a', LETTER_B matches zero or more 'b'
let grammar = Grammar::new(
    &[("LETTER_A", "'a'").into(), ("LETTER_B", "'b'*").into()],
    &[],
);
// build the (multi-mode) DFA directly from the grammar
let dfa = MultiDfa::new(&grammar);

The grammar uses ANTLR syntax, and the generated lexer supports switching between lexer modes (a.k.a. contexts) using the ANTLR Push and Pop actions, hence the name MultiDfa.

After creating a DFA, it can be serialized to bytes (and saved to disk) via serde.
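
A minimal sketch of the save step, assuming a to_bytes counterpart of the MultiDfa::from_bytes used in the runtime example below; the exact name of the serialization helper is not given in this README.

// Illustrative only: `to_bytes` is an assumed counterpart of `from_bytes`.
let serialized: Vec<u8> = dfa.to_bytes();
std::fs::write("lexer.dfa", &serialized)?;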

At runtime, a DfaSimulator can be used to invoke the DFA and tokenize the input in the following way:

use std::io::{BufReader, Read}; // Read is needed for the .bytes() adapter

let dfa = MultiDfa::from_bytes(&serialized);
let input = "any input";
let mut simulator = DfaSimulator::new(&dfa, BufReader::new(input.as_bytes()).bytes());
// retrieve tokens one at a time
while let Some(token) = simulator.next_token()? {
    // use the token
}

Parser

The parser can be either an LL parser or an LR parser. For the LL family, only an LL(1) parser can be generated. For the LR family, an LR(0), SLR, LALR(1) or LR(1) parser can be generated. For both families either a push parser or a pull parser can be generated.

Note that each family requires a different simulator: a generated LL parser cannot be run at runtime with an LR simulator and vice versa.

Parser Usage

The LL/LR parsing table can be created with the following syntax:

LR: the result of the xx_parsing_table() methods will contain hints if the grammar has SHIFT/REDUCE or REDUCE/REDUCE conflicts.

let g = "antlr grammar here";
let lr_grammar = LRGrammar::try_from(&g).unwrap();
let slrtable = lr_grammar.slr_parsing_table().unwrap(); // for SLR
let lr0 = lr_grammar.lr0_parsing_table().unwrap(); // for LR(0)
let lr1 = lr_grammar.lr1_parsing_table().unwrap(); // for LR(1)
let lalr1 = lr_grammar.lalr1_parsing_table().unwrap(); // for LALR(1)

Similarly for LL, with the resulting parsing table type containing hints if there are FIRST/FIRST or FIRST/FOLLOW conflicts.

let g = "antlr grammar here";
let ll_grammar = LLGrammar::try_from(&g).unwrap();
let table = ll_grammar.parsing_table().unwrap(); // for LL(1)

At runtime, to parse some text it is sufficient to create either an LLParser or an LRParser with any of the generated parsing tables and call the parse method from the PullParser or PushParser trait.
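
A minimal sketch of the pull-mode flow; the constructor and parse signatures below are assumptions made for illustration and are not taken from this README.

// Illustrative only: LRParser::new and the parse() signature are assumed.
let parser = LRParser::new(&lalr1);
let mut tokens = DfaSimulator::new(&dfa, BufReader::new("input text".as_bytes()).bytes());
let parse_tree = parser.parse(&mut tokens)?; // parse() provided by the PullParser trait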

What is left to do

The DFA, LLParsing and LRParsing components are essentially finished. The project is currently undergoing some refactoring in the bootstrapping parser: the parser used by this software to read the input grammar from which the final lexer and parser are generated.

wisent's Issues

Compact equivalence classes

Currently a HashMap is used for this (crate::lexer::SymbolTable). The HashMap was perfectly fine for ASCII inputs, but with a decently large UTF-8 range of characters it explodes in size (~500k for the ANTLR grammar lexer).

Since I already need to use a DFA to read UTF-8 characters, I can combine that DFA with the SymbolTable: reading byte by byte from the stream and giving out the equivalence class directly. Currently the UTF-8 decoder reads from the stream and builds the char, and then the char is looked up in the HashMap to retrieve the equivalence class. This is a waste of time, as both procedures can be combined. Worst case this should use a ~4K table (u32 * 256 entries * 1 table per byte position).

Consider also reducing the maximum number of equivalence classes to 2^16, so everything can fit inside a u16. I can't think of a single grammar requiring more than 2^16 tokens.
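
A hedged sketch of the data layout this points towards (illustrative only, not the crate's code): a small set of 256-entry rows, one per state of the byte-level automaton, so reading a byte advances the automaton and, on the last byte of a character, yields the u16 equivalence class in the same lookup.

// Illustrative only: not the actual SymbolTable replacement.
struct ByteClassifier {
    // one 256-entry row per state of the byte-level automaton; an entry
    // either selects the next row or, on the final byte of a character,
    // encodes the u16 equivalence class
    rows: Vec<[u16; 256]>,
}

impl ByteClassifier {
    // feed one byte of the input stream; `state` indexes the current row
    fn step(&self, state: usize, byte: u8) -> u16 {
        self.rows[state][byte as usize]
    }
}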

Tree DFS performance issues

The DFS and BFS functions on the tree perform the entire DFS/BFS and then return the resulting vector.
This is inefficient for methods like find that should stop as soon as a result is found.
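
A hedged sketch of the usual fix (illustrative tree type, not Wisent's): expose the traversal as a lazy Iterator so callers like find can short-circuit without materializing the whole vector.

// Illustrative only: Wisent's actual tree type is not reproduced here.
struct Node<T> {
    value: T,
    children: Vec<Node<T>>,
}

// Lazy pre-order DFS: yields nodes one at a time instead of building a Vec.
fn dfs<T>(root: &Node<T>) -> impl Iterator<Item = &T> {
    let mut stack = vec![root];
    std::iter::from_fn(move || {
        let node = stack.pop()?;
        // push children in reverse so the leftmost child is visited next
        stack.extend(node.children.iter().rev());
        Some(&node.value)
    })
}

// find() now stops at the first match instead of traversing the whole tree:
// let hit = dfs(&tree).find(|v| **v == target);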

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.