
Wisent

Wisent is a (WIP) UTF-8 lexer and parser generator supporting LL(1), SLR, LR(0), LALR(1) and LR(1) grammars. The project uses ANTLR grammar syntax and allows for actions (e.g. switching context with -> Push and -> Pop) and non-greedy matching.

Both grammar families are supported in push mode (the parser waits for new tokens from the lexer) and pull mode (the parser automatically invokes the lexer to gather tokens).

This project is currently a Work In Progress. In particular, while the generated lexers and parsers themselves work, the input grammar parser and bootstrapping mechanism are still being worked on.

Lexer

The lexer generated by this project is a UTF-8 Deterministic Finite Automaton built on the fly (i.e. without using an intermediate Non-Deterministic one). Each UTF-8 character belonging to the input alphabet, after being grouped into sets depending on the DFA transitions, is assigned an integer. This allows the usage of the not operator [^ ] and the any character operator . without incurring performance hits.
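
For illustration, a minimal sketch of the equivalence-class idea (not Wisent's actual API): for a grammar whose rules only mention 'a' and 'b', every other character behaves identically and can share a single class ID, so the transition tables are indexed by a small integer instead of by char.

// Illustrative only: maps characters to equivalence classes for a grammar
// that distinguishes just 'a', 'b', and "everything else".
fn equivalence_class(c: char) -> u32 {
    match c {
        'a' => 0,
        'b' => 1,
        _ => 2, // all remaining characters (matched by [^ ] or .) share one class
    }
}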

Lexer Usage

A DFA can be created from a grammar using the following syntax:

// two lexer rules in ANTLR syntax: LETTER_A matches 'a', LETTER_B matches zero or more 'b'
let grammar = Grammar::new(
    &[("LETTER_A", "'a'").into(), ("LETTER_B", "'b'*").into()],
    &[],
);
// build the (multi-mode) DFA directly from the grammar
let dfa = MultiDfa::new(&grammar);

The grammar uses ANTLR syntax, and the generated lexer supports switching between lexer modes (a.k.a. contexts) using the ANTLR Push and Pop actions, hence the name MultiDfa.

After creating a DFA, it can be serialized to bytes (and saved to disk) via serde.
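
A minimal sketch of the save step, assuming a to_bytes counterpart of the MultiDfa::from_bytes used in the runtime example below; the exact name of the serialization helper is not given in this README.

// Illustrative only: `to_bytes` is an assumed counterpart of `from_bytes`.
let serialized: Vec<u8> = dfa.to_bytes();
std::fs::write("lexer.dfa", &serialized)?;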

At runtime, a DfaSimulator can be used to invoke the DFA and tokenize the input in the following way:

use std::io::{BufReader, Read}; // Read is needed for the .bytes() adapter

let dfa = MultiDfa::from_bytes(&serialized);
let input = "any input";
let mut simulator = DfaSimulator::new(&dfa, BufReader::new(input.as_bytes()).bytes());
// retrieve tokens one at a time
while let Some(token) = simulator.next_token()? {
    // use the token
}

Parser

The parser can be either an LL parser or an LR parser. For the LL family, only an LL(1) parser can be generated. For the LR family, an LR(0), SLR, LALR(1) or LR(1) parser can be generated. For both families either a push parser or a pull parser can be generated.

Note that each family requires a different simulator: a generated LL parser cannot be run at runtime with an LR simulator and vice versa.

Parser Usage

The LL/LR parsing table can be created with the following syntax:

LR: the result of the xx_parsing_table() methods will contain hints if the grammar has SHIFT/REDUCE or REDUCE/REDUCE conflicts.

let g = "antlr grammar here";
let lr_grammar = LRGrammar::try_from(&g).unwrap();
let slrtable = lr_grammar.slr_parsing_table().unwrap(); // for SLR
let lr0 = lr_grammar.lr0_parsing_table().unwrap(); // for LR(0)
let lr1 = lr_grammar.lr1_parsing_table().unwrap(); // for LR(1)
let lalr1 = lr_grammar.lalr1_parsing_table().unwrap(); // for LALR(1)

Similarly for LL, with the resulting parsing table type containing hints if there are FIRST/FIRST or FIRST/FOLLOW conflicts.

let g = "antlr grammar here";
let ll_grammar = LLGrammar::try_from(&g).unwrap();
let table = ll_grammar.parsing_table().unwrap(); // for LL(1)

At runtime, to parse some text it is sufficient to create either an LLParser or an LRParser with any of the generated parsing tables and call the parse method from the PullParser or PushParser trait.
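
A minimal sketch of the pull-mode flow; the constructor and parse signatures below are assumptions made for illustration and are not taken from this README.

// Illustrative only: LRParser::new and the parse() signature are assumed.
let parser = LRParser::new(&lalr1);
let mut tokens = DfaSimulator::new(&dfa, BufReader::new("input text".as_bytes()).bytes());
let parse_tree = parser.parse(&mut tokens)?; // parse() provided by the PullParser trait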

What is left to do

The DFA, LLParsing and LRParsing components are essentially finished. The project is currently undergoing some refactoring in the bootstrapping parser: the parser used by this software to read the input grammar from which the final lexer and parser are generated.

wisent's Issues

Compact equivalence classes

Currently a HashMap is used for this (crate::lexer::SymbolTable). The HashMap was perfectly fine for ASCII inputs, but with a decently large UTF-8 range of characters it explodes in size (~500k for the ANTLR grammar lexer).

Since I already need to use a DFA to read UTF-8 characters, I can combine that DFA with the SymbolTable: reading byte by byte from the stream and giving out the equivalence class directly. Currently the UTF-8 decoder reads from the stream and builds the char, and then the char is looked up in the HashMap to retrieve the equivalence class. This is a waste of time, as both procedures can be combined. Worst case this should use a ~4K table (u32 * 256 entries * 1 table per byte position).

Consider also reducing the maximum number of equivalence classes to 2^16, so everything can fit inside a u16. I can't think of a single grammar requiring more than 2^16 tokens.
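
A hedged sketch of the data layout this points towards (illustrative only, not the crate's code): a small set of 256-entry rows, one per state of the byte-level automaton, so reading a byte advances the automaton and, on the last byte of a character, yields the u16 equivalence class in the same lookup.

// Illustrative only: not the actual SymbolTable replacement.
struct ByteClassifier {
    // one 256-entry row per state of the byte-level automaton; an entry
    // either selects the next row or, on the final byte of a character,
    // encodes the u16 equivalence class
    rows: Vec<[u16; 256]>,
}

impl ByteClassifier {
    // feed one byte of the input stream; `state` indexes the current row
    fn step(&self, state: usize, byte: u8) -> u16 {
        self.rows[state][byte as usize]
    }
}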

Tree DFS performance issues

The DFS and BFS functions on the tree perform the entire DFS/BFS and then return the resulting vector.
This is inefficient for methods like find that should stop as soon as a result is found.
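
A hedged sketch of the usual fix (illustrative tree type, not Wisent's): expose the traversal as a lazy Iterator so callers like find can short-circuit without materializing the whole vector.

// Illustrative only: Wisent's actual tree type is not reproduced here.
struct Node<T> {
    value: T,
    children: Vec<Node<T>>,
}

// Lazy pre-order DFS: yields nodes one at a time instead of building a Vec.
fn dfs<T>(root: &Node<T>) -> impl Iterator<Item = &T> {
    let mut stack = vec![root];
    std::iter::from_fn(move || {
        let node = stack.pop()?;
        // push children in reverse so the leftmost child is visited next
        stack.extend(node.children.iter().rev());
        Some(&node.value)
    })
}

// find() now stops at the first match instead of traversing the whole tree:
// let hit = dfs(&tree).find(|v| **v == target);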

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.