lagodiuk / earley-parser-js Goto Github PK

Tiny JavaScript implementation of context-free languages parser - Earley parser (including generation of the parsing-forest).

License: Apache License 2.0

HTML 22.66% JavaScript 51.58% CSS 25.76%

earley-parser syntax-parser context-free-grammar english-grammar german-grammar

earley-parser-js's Introduction

earley-parser-js

Tiny JavaScript implementation of context-free languages parser - Earley parser

General information about Earley parsing algorithm
Online demo
2.1 Parser of the tiny subset of German grammar
2.2 Parser of the tiny subset of English grammar
2.3 Parser of arithmetic expressions
Quick start
3.1 Grammar with hardcoded terminal symbols
3.2 Customizing logic of tokens classification into terminal symbols
3.3 Traversing parsed trees (parsing-forest)
3.4 Parsing tiny subset of English language grammar
Used Tools and Libraries
This library is used by...

General information

The Earley parser is an algorithm for parsing strings that belong to a given context-free language (the algorithm, named after its inventor, Jay Earley).

This algorithm is appealing because it can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages.

Complexity of Earley parsing algorithm (in terms of n - the length of the parsed string):

O(n³) - cubic time in the general case
O(n²) - quadratic time for unambiguous grammars
O(n) - linear time for almost all LR(k) grammars

Earley parser performs particularly well when the rules are written left-recursively.

Online demo

Parser of the tiny subset of German grammar

Parser of the tiny subset of German grammar: http://lagodiuk.github.io/nlp/deutsch.html

Parser of the tiny subset of English grammar

Parser of a tiny subset of English grammar: https://jsfiddle.net/2mb3w9c1/4/embedded/result/

Parser of arithmetic expressions

Parser of arithmetic expressions: https://jsfiddle.net/vsf982m9/embedded/result/

var grammar = new tinynlp.Grammar([
   'R -> N',
   'S -> S add_sub M | M',
   'M -> M mul_div T | T',
   'N -> S lt_gt S | S',
   'T -> num | ( S )',
]);

grammar.terminalSymbols = function(token) {
   if ('<' === token || '>' === token) return ['lt_gt'];
   if ('+' === token || '-' === token) return ['add_sub'];
   if ('*' === token || '/' === token) return ['mul_div'];
   if ('(' === token) return ['('];
   if (')' === token) return [')'];
   // Otherwise - token considered as a number:
   return ['num'];
}

Usage

Attach to your project - single file with implementation of Earley algorithm: https://rawgithub.com/lagodiuk/earley-parser-js/master/earley-oop.min.js

Grammar with hardcoded terminal symbols

// Define grammar
var grammar = new tinynlp.Grammar([
     // Define grammar production rules
     'R -> S',
     'S -> S add_sub M | M | num',
     'M -> M mul_div T | T | num',
     'T -> num',
     
     // Define terminal symbols
     'num -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0',
     'add_sub -> + | -',
     'mul_div -> * | /'
]);

// You have to tokenize input by yourself!
// Creating array of tokens
var tokens = '2 + 3 * 4'.split(' ');

// Parsing
var rootRule = 'R';
var chart = tinynlp.parse(tokens, grammar, rootRule);

// Get array with all parsed trees
// In case of ambiguous grammar - there might be more than 1 parsing tree
var trees =  chart.getFinishedRoot(rootRule).traverse();

// Iterate over all parsed trees and display them on HTML page
for (var i in trees) {
     console.log(JSON.stringify(trees[i]))
}

Customizing logic of tokens classification into terminal symbols

Potentially, this approach allows to extend parser with custom classifier of tokens - into terminal symbols (e.g. recognize terminal symbols using Regular expressions or more sophisticated classifiers):

var grammar = new tinynlp.Grammar([
    'R -> S',
    'S -> S add_sub M | M | num',
    'M -> M mul_div T | T | num',
    'T -> num',
]); 

// Define function, which will classify tokens into terminal types
grammar.terminalSymbols = function( token ) { 
    // Actually, this method might be customized 
    // to use some more sophisticated classification mechanisms
    
    if( '+' === token || '-' === token ) return ['add_sub'];
    if( '*' === token || '/' === token ) return ['mul_div'];
    if( token.match(/^\d+$/) ) return ['num'];
    // It is also possible that classifier returns 
    // more than one terminal symbol, e.g.: ['term1', 'term2']
    
    // Otherwise:
    throw new Error("Can't recognize token: " + token);
}   

// You have to tokenize input by yourself!
// Creating array of tokens
var tokens = '2 + 3 * 4'.split(' ');

// Parsing
var rootRule = 'R';
var chart = tinynlp.parse(tokens, grammar, rootRule);

// Get array with all parsed trees
// In case of ambiguous grammar - there might be more than 1 parsing tree
var trees =  chart.getFinishedRoot(rootRule).traverse();

// Iterate over all parsed trees and display them on HTML page
for (var i in trees) {
     console.log(JSON.stringify(trees[i]))
}

Traversing parsed trees

Following snippet shows how to transform parsed trees into nested HTML lists:

function toNestedList(tree) {
   if (!tree.subtrees || tree.subtrees.length == 0) {
       return '<li>' + tree.root + '</li>';
   }   
   var builder = []; 
   builder.push('<li>');
   builder.push(tree.root);
   builder.push('<ul>')
   for (var i in tree.subtrees) {
       builder.push(toNestedList(tree.subtrees[i]))
   }   
   builder.push('</ul>')
   builder.push('</li>')
   return builder.join('');
}

Example of usage:

// Iterate over all parsed trees and display them on HTML page
for (var i in trees) {
     htmlRepresentstion = '<ul>' + toNestedList(trees[i]) + '</ul>'
     // embed htmlRepresentstion into HTML page
}

Parsing tiny subset of English language grammar

Grammar taken from: https://en.wikipedia.org/wiki/CYK_algorithm#Example

var grammar = new tinynlp.Grammar([
    'S -> NP VP',
    'VP -> VP PP | V NP | V',
    'PP -> P NP',
    'NP -> Det N | N'
]);

grammar.terminalSymbols = function(token) {
    if(token == 'eats') return ['V'];
    if(token == 'fish') return ['N'];
    if(token == 'fork') return ['N'];
    if(token == 'she') return ['N'];
    if(token == 'a') return ['Det'];
    if(token == 'with') return ['P'];
    // otherwise:
    return [];
}

// Tokenizing sentence
var tokens = 'she eats a fish with a fork'.split(' ');

// Parsing
var rootRule = 'S';
var chart = tinynlp.parse(tokens, grammar, rootRule);

// Get array with all parsed trees
// In case of ambiguous grammar - there might be more than 1 parsing tree
var trees =  chart.getFinishedRoot(rootRule).traverse();

for (var i in trees) {
   // visit each parse tree
}

Used Tools and Libraries

Implementation of the Earley Parser is self-sufficient (https://github.com/lagodiuk/earley-parser-js/blob/master/earley-oop.js).

However, for producing of the minimized variant of library was used the Google Closure Compiler Service.

Additionally, some of the demo examples are using:

jQuery
D3.js

Some of the demo examples are currently hosted on https://jsfiddle.net/

This library is used by...

The web site of the book "Formale Sprachen, abstrakte Automaten und Compiler: Lehr- und Arbeitsbuch für Grundstudium und Fortbildung" (by Christian Wagenknecht (Autor), Michael Hielscher (Mitwirkende)): https://flaci.com/home/ The library is used by this web site for creation of the browser environment, where users can experiment with different Context Free Grammars (define grammars and produce the parsing trees on the fly)

earley-parser-js's People

Contributors

Stargazers

Watchers

Forkers

recursuc borispovod pj-gomez lagleki drjoeavg joe-suh d-bogachev ull-esit-pl fizzstudio flowerfield leqnam

earley-parser-js's Issues

Needed mechanism for incremental generation of parsing trees

Description of existing functionality:
Currently, parsing trees from the parsing forest - are being generated all at once. In the worst case - such behaviour might lead to exponential runtime complexity and exponential consumption of memory (in case of ambiguous grammar).

TODO:
It is possible to avoid exponential consumption of memory - by developing logic for incremental generation of parsing trees (e.g. using event-handler based approach). This change - requires slight reworking of parsing forest generation logic.

EPSILON case for right-hand-side

The parser can deal sometimes with empty rhs (EPSILON case) but sometimes it does not work properly. In the code and in all examples there is no EPSILON case mentioned so it is possible to use it?

For example with input "abba" (does work - results in 5 different trees):
X -> a Y | b Y
Y -> ɛ | X | X Y

without the second X (does not work at all):
X -> a Y | b Y
Y -> ɛ | X Y

You can also try it here: http://flaci.com/kfgedit just click on new grammar on the lower right corner.

Grammars with EPSILON issues

I just found an example where parsing with the new epsilon version breaks:

S -> T
T -> a T E | z
E -> EPSILON

the word "aaaaz" cannot be parsed successfully but if I use only "T -> a T | z" it works fine. This was the smallest grammar I could think of showing the problem.
Rules with single epsilon work great otherwise like this on with input "a" is fine:
S -> a A
A -> EPSILON

also switching "T" and "a" works with "zaaaa":
S -> T
T -> T a E | z
E -> EPSILON

The nonterminal E is at the same place with the same rule but this time it works. Here is another broken one which might be the same problem:

N0 -> N1 b | b
N1 -> b X | a X | a N1 X
X -> b X | EPSILON

the word "aaab" does not work but should. I know the grammars do not look useful at all :-)

I tried to figure out why this happens but I wasn't able to find it.

The circular structure appears inside the parsing chart

Error during parsing of the following grammar:

S -> A
A -> ( A ) | A A | _EPSILON_

The circular structure appears inside the parsing chart (the message from log: "earley-oop.min.js:5 Uncaught TypeError: Converting circular structure to JSON"). Despite the fact that the circular structure was detected during the log output, actually the algorithm must not produce the states with circular references (because these references are used during backtracking, for the purpose of reconstruction of the parsing forest).

Link to example: https://jsfiddle.net/2mb3w9c1/49/

However, the following grammar can be successfully parsed:

S -> A
A -> ( A ) | A A | ( )

Hence, the issue is related to handling of the epsilon transitions.

get index of rule in parse tree

how to get index of rule that used for parsing a subtree ( this is useful in case that a parser can not easily determine how to parse a subtree )

Idea regarding the exhaustive integration testing of the parser

As far as parser could handle potentially all possible context free grammars (CFG), and can parse all possible valid statements of the corresponding CFG, it was a bit unclear how to find the proper and exhaustive approach for testing the implementation of the parser.

Few days ago I've came up with an idea regarding the exhaustive testing of the parser.
Of course, as far as there is an infinite amount of CFGs, the idea of testing is based on randomization principle.

Hence, the algorithm:

Repeat multiple times:
1. Generate random CFG
1.1. Using generated CFG - produce the random valid statement, which belongs to the given CFG. It can be done using the recursive top-down approach (also, meanwhile, the top-down approach - helps to generate the correct parse tree as well).
1.2. Parse generated statement using the implementation of the parsing algorithm. Assert, that expected parse tree is present inside the generated parse forest (because, in case of ambiguous CFG - there might be generated multiple parse trees).
1.3. go to 1.1.

Adding and removing rules from a grammar

I'm developing an application that requires the grammar rules to be modified after the grammar is created. After creating a grammar like this one, would it be possible to add or modify individual rules in the grammar?

var grammar = new tinynlp.Grammar([
   'R -> N',
   'S -> S add_sub M | M',
   'M -> M mul_div T | T',
   'N -> S lt_gt S | S',
   'T -> num | ( S )',
]);

Would I be able to add a new rule to the grammar using grammar.add('A -> B') or something similar?

Implement the option for unification of grammatical features (useful for NLP)

It would be useful (for purpose of NLP) to have an option for unification of grammatical features, which is described in the book "Speech and Language Processing" by Jurafsky and Martin.

lookahead support?

I know that it was removed long ago in previous implementations but for those who keep in mind performance issues can lookahead and lookbehind be added as a feature?

Extract full parsing forest, even if root production rule consists of few alternatives

Description of existing behaviour:

Currently parser generates full parsing forest only in case, if root production rule doesn't have alternatives.

Described specificity requires, that root production rule should be defined without alternatives. For example:

Root -> X
X -> A | B | C
A -> ...
B -> ...
...

On the other hand, parser will generate only subset of the parsing forest - if grammar will be expressed in the following form:

Root -> A | B | C
A -> ...
B -> ...
...

TODO:
This limitation can be relaxed by careful refactoring of existing code (no algorithmic changes needed).

Parsing Error JSON

Tried use JSFiddle to parse this grammar:

S -> a S S | S S | a a | S

Word:

a a

This a ambiguos grammar example.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.