Git Product home page Git Product logo

centaurus's Introduction

Centaurus

Centaurus is a performance-oriented LL(*)-S parser generator for large-scale data processing. LL(*)-S (S stands for scannerless) grammars are a subset of token-level LL(*) grammars such that the corresponding character-level LL(*) ones exist; Centaurus thus generates scannerless parsers from token-level LL(*) grammars. Centaurus parsers run actions in parallel for scalability.

How to use

Unlike common parser generators like Bison and ANTLR, Centaurus is not packaged as a standalone tool. Instead of generating the source code of parsers to be included in the client code ahead of time, Centaurus generates the native code ready to run at runtime. Centaurus thus forms a linkable library.

To generate a parser with Centaurus, the client code simply has to pass a grammar file to the library. Then, Centaurus generates the executable parser code immediately in the form of an opaque object if it succeeds in LL(*)-S parser construction. After attaching semantic actions to the parser object, the parser object is ready to handle the input data.

Build instructions for the library as follows.

Operating principle and key features

Runtime parser generation

Centaurus compiles a CFG-style grammar written in EBNF to an executable parser code at runtime. The direct translation from the grammar to the executable code allows the optimizer of Centaurus to utilize the structural information of the grammar, which is difficult to obtain from the parser source code generated in a common manner.

One of the loop optimization techniques implemented in the current version is efficient regex matching using the SSE4.2 instruction set. When instructed, the runtime code generator will employ the PCMPISTRI instruction to match against an indefinite repetition of character classes, which is denoted using Kleene stars. This optimization will enable the generated program to read up to 16 characters in a single instruction, as opposed to one when the optimization does not take place.

Parallel reduction

The generated parser itself does not get so meaningful job done; all it does is test if the input data matches the grammar specification. To process the data along the grammar, we must attach semantic actions to it, which stipulate how the data are extracted and converted so the user program could utilize them.

With Centaurus, the semantic actions are defined in the form of callbacks within the user program, which are invoked from within the library as the parsing process goes on.

The implementation of callbacks quite resembles that of Bison and ANTLR, in that they are given a list of semantic values associated with each symbol on the RHS of a reduction rule, and they return the semantic value associated with the LHS symbol.

However, Centaurus differs from Bison and ANTLR in that these callbacks are invoked in parallel to improve the parsing performance. Because the parallelization happens within the runtime library in a transparent manner, the callbacks could be written exactly as they are written for serial reduction, except that they must not have any side effects for obvious reasons.

Centaurus is also capable of parallelizing the reduction workload across multiple processes instead of threads. This feature could be beneficial for interpreter platforms with GIL (global interpreter lock), including CPython and Ruby MRI, though only CPython is currently supported as an interpreter platform.

Supported platforms

Currently, Centaurus only supports Linux running on an AMD64 processor with SSE4.2, and it has to be compiled with g++ (Clang is not supported). The library code is actually designed to work with Windows MSVC and Cygwin g++, but the build system needs further modification to be compatible with these platforms.

The generated parser code conforms to the System V AMD64 ABI, while POSIX IPC (message queues, semaphores and shared memory) is utilized by the runtime library for communication between the workers.

Build instructions

Linux g++

CMake (>=3.1) along with g++ (>=4.9) is required to build the software.

 $ git clone https://github.com/satoshigeyuki/Centaurus.git
 $ cd Centaurus
 $ git submodule update --init --recursive
 $ mkdir build && cd build
 $ cmake .. -DCMAKE_BUILD_TYPE=Release
 $ make && make install

Tutorial

Tutorial 1. Build a calculator

In this tutorial, you will learn how to make a simple C++ calculator using Centaurus.

Step 1. Write the grammar

An example grammar for the calculator is shown below:

grammar CALC;

INPUT : EXPR ;
EXPR : TERM '+' EXPR | TERM ;
TERM : FACT '*' TERM | FACT ;
FACT : /[0-9]+/ | '(' EXPR ')' ;

There are few things to note in this example.

  • The grammar must be compatible with the LL grammar class, which means there cannot be any left recursion.
  • The scanner definition is merged into the grammar, because we don't have a scanner. The grammar will be compiled into a character-level PDA which parses the input one character at a time.
  • The first line denotes the name of the grammar, but it does not have any meaning at this moment. The directive is optional.

Step 2. Write the semantic actions

Once we defined the grammar, we have to write the semantic action to associate with each rule in the grammar. The semantic actions must be defined within the user program, not the grammar. This is because the grammar does not go through the host compiler, unlike Bison; it is only processed by the parser generator which only understands the grammar part.

Each semantic action has the following signature:

void *func(const SymbolContext<char>& ctx);

The only parameter, ctx, contains all the necessary information to implement the action, including the semantic value of each RHS symbol and the string representation of LHS/RHS symbols.

The function returns a void pointer which shall wrap the semantic value in an opaque manner. This could typically be a pointer to a user-defined class casted to void *.

The semantic actions for the calculator example is shown below:

#include <string>
#include <Context.hpp>

using namespace Centaurus;

struct Number
{
    int num;
    Number(int num) : num(num) {}
    Number(const std::string& str) : num(std::stoi(str)) {}
};
Number result;

void *parse_INPUT(const SymbolContext<char>& ctx)
{
    result = ctx.value<Number>(1)->num;
}
void *parse_EXPR(const SymbolContext<char>& ctx)
{
    if (ctx.len() == 1)
        return ctx.value<Number>(1);
    else
        return new Number{ctx.value<Number>(1)->num + ctx.value<Number>(2)->num};
}
void *parse_TERM(const SymbolContext<char>& ctx)
{
    if (ctx.len() == 1)
        return ctx.value<Number>(1);
    else
        return new Number{ctx.value<Number>(1)->num * ctx.value<Number>(2)->num};
}
void *parse_FACT(const SymbolContext<char>& ctx)
{
    if (ctx.start()[0] == '(')
        return ctx.value<Number>(1);
    else
        return new Number(ctx.read());
}

Step 3. Generate and run the parser

int main(int argc, const char *argv[])
{
    Context<char> context{"/path/to/the/grammar.txt"};

    context.attach(L"INPUT", parse_INPUT);
    context.attach(L"EXPR", parse_EXPR);
    context.attach(L"TERM", parse_TERM);
    context.attach(L"FACT", parse_FACT);

    context.parse(argv[1], 1);

    printf("Result = %d\n", result);

    return 0;
}

What is with the name?

Centaurus is a bright constellation low down in the southern sky (in Japan). The alpha star of the constellation (Alpha Centauri, -0.1 mv) is also known as the closest star to Earth, 4.37 light years away. The constellation was inspired from an imaginary animal which is a chimera between a human and a horse.

The name Centaurus of the software comes from the fact that it is a chimera between a parser and a lexer, like the imaginary animal. The software is also seen as a chimera of PEG based methods (e.g. Packrat parsing) and LL(k) based methods, being the first and only software (as of 2018) to generate a scannerless and deterministic parser.

The naming also follows the tradition in the field of parser development that a parser has to be named after a four-legged (and often domesticated) animal. Yacc (LALR) is taken from yak, Bison (LALR/GLR) and Elkhound (GLR) from the animal with the same names and ANTLR (ALL(*)) from antlers (deer horns).

centaurus's People

Contributors

ihr486 avatar

Stargazers

Second Datke avatar Zheng Luo avatar Robin Voetter avatar

Watchers

SATO Shigeyuki avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.