vnmakarov / yaep Goto Github PK
View Code? Open in Web Editor NEWYet Another Earley Parser
License: Other
Yet Another Earley Parser
License: Other
Hi there. Great work! (too many c++only runtime parser generators out there)
Is there any chance to easily create rules at runtime? I mean, earley_parse_grammar(g, description)
accepts string as grammar description for now, but I would like to fill some BNF-shaped (recursive) struct
and use it in hypothetic earley_load_grammar(g, myBNF)
. I could create a patch myself, but I'm not sure my parser-generator background is strong enough to just start without any hints. Thanks!
Currently, make install
installs three header files, allocate.h
, cocom-config.h
, and yaep.h
, directly under /usr/local/include
. Especially allocate.h
and cocom-config.h
might clash with headers of the same name from other packages, so I propose to create a dedicated subdirectory /usr/local/include/yaep/
and put all the headers there.
Hi,
I follow these steps to install your library:
/configure --srcdir= --prefix=
cd src
make
make test (optional)
make install
But, How can I run it in terminal? (I use OSX)
Hey there, I am really struggling to print out the tree that the parser does produce. The root node the parse method provides, does not afford any children node connections rather some info and the code of the terminal character. Could you please explain me how can I get access to all the derivation trees of any ambiguously parsed string?
While creating a Homebrew formula for YAEP, I noticed that building the software in parallel fails sometimes.
mkdir build
cd build
# I use Ninja (https://ninja-build.org) below, since it seems to
# fail consistently (unlike e.g. `make` using the switch `-j9`,
# which only fails sometimes).
cmake -GNinja ..
ninja
The build system creates the library without any problems.
Ninja prints the following error message:
[12/211] Building C object src/CMakeFiles/yaep_test.dir/yaep.c.o
FAILED: src/CMakeFiles/yaep_test.dir/yaep.c.o
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -Isrc -DYAEP_TEST -DYAEP_DEBUG -std=gnu90 -MD -MT src/CMakeFiles/yaep_test.dir/yaep.c.o -MF src/CMakeFiles/yaep_test.dir/yaep.c.o.d -o src/CMakeFiles/yaep_test.dir/yaep.c.o -c ../src/yaep.c
../src/yaep.c:3497:10: fatal error: 'sgramm.c' file not found
#include "sgramm.c"
^~~~~~~~~~
1 error generated.
[16/211] Building CXX object src/CMakeFiles/yaep++_test.dir/yaep.cpp.o
FAILED: src/CMakeFiles/yaep++_test.dir/yaep.cpp.o
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -Isrc -DYAEP_TEST -DYAEP_DEBUG -std=gnu++11 -MD -MT src/CMakeFiles/yaep++_test.dir/yaep.cpp.o -MF src/CMakeFiles/yaep++_test.dir/yaep.cpp.o.d -o src/CMakeFiles/yaep++_test.dir/yaep.cpp.o -c ../src/yaep.cpp
In file included from ../src/yaep.cpp:61:
In file included from ../src/yaep.c:3497:
build/src/sgramm.c:398:2: error: unterminated conditional directive
#if (! defined yyoverflow \
^
../src/yaep.cpp:99:10: error: use of undeclared identifier 'yaep_parse_grammar'
return yaep_parse_grammar (this->grammar, strict_p, description);
^
2 errors generated.
[18/211] [BISON][bison_ansic] Building parser with bison 3.2.2
ansic.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]
ninja: build stopped: subcommand failed.
.
10.14.1
3.13.1
When I parse a string with YAEP, I'm getting back an YAEP_NIL root node. I've enabled the debug logging, and it appears that it is parsing correctly, but is not returning the root of this tree. Here's the debug output:
Parsing start...
Set core = 0
1 $S : .error $eof, error, 0
2 $S : .S $eof, DT NN, 0
-----------
3 S : .NP0SBJ ADVP0TMP VP PUNCTUATION, DT NN, 0
4 S : .NP0SBJ VP PUNCTUATION, DT NN, 0
5 NP0SBJ : .DT NNS, DT, 0
6 NP0SBJ : .NN, NN, 0
Reading 0=NN(108), Current set=0
New set=1
Set core = 1
7 NP0SBJ : NN., RB VBP VBD, 1
9 S : NP0SBJ .VP PUNCTUATION, VBP VBD, 1
-----------
10 VP : .VBP PP0CLR, VBP, 0
11 VP : .VBD NP0PRD COMMA ADVP0CLR, VBD, 0
Reading 1=VBP(101), Current set=1
New set=2
Set core = 2
12 VP : VBP .PP0CLR, IN, 1
-----------
13 PP0CLR : .IN NP, IN, 0
Reading 2=IN(105), Current set=2
New set=3
Set core = 3
14 PP0CLR : IN .NP, CD NNS NNP JJ, 1
-----------
15 NP : .NP0QP PP0TMP, CD JJ, 0
16 NP : .NP PP, CD NNS NNP JJ, 0
17 NP : .NNP, NNP, 0
18 NP : .NNS, NNS, 0
19 NP : .NP0QP PP, CD JJ, 0
20 NP0QP : .CD CD, CD, 0
21 NP0QP : .JJ NNS, JJ, 0
Reading 3=NNS(102), Current set=3
New set=4
Set core = 4
22 NP : NNS., PUNCTUATION IN, 1
23 PP0CLR : IN NP., PUNCTUATION, 2
25 VP : VBP PP0CLR., PUNCTUATION, 3
26 S : NP0SBJ VP .PUNCTUATION, PUNCTUATION, 4
-----------
Reading 4=PUNCTUATION(104), Current set=4
New set=5
Set core = 5
27 S : NP0SBJ VP PUNCTUATION., $eof, 5
28 $S : S .$eof, $eof, 5
-----------
Reading 5=$eof(-1), Current set=5
New set=6
Set core = 6
29 $S : S $eof.,, 6
-----------
Translation:
0: EMPTY
Here's the grammar. I'm passing in string leifh
.
"\n"
"TERM DT = 97 COMMA = 98 RB = 99 CD = 100 VBP = 101 NNS = 102 NNP = 103 PUNCTUATION = 104 IN = 105 VBD = 106 JJ = 107 NN = 108;\n"
"S : NP0SBJ VP PUNCTUATION\n"
" | NP0SBJ ADVP0TMP VP PUNCTUATION\n"
" ;\n"
"NP : NP0QP PP\n"
" | NNS\n"
" | NNP\n"
" | NP PP\n"
" | NP0QP PP0TMP\n"
" ;\n"
"ADVP0CLR : RB PP\n"
" ;\n"
"VP : VBD NP0PRD COMMA ADVP0CLR\n"
" | VBP PP0CLR\n"
" ;\n"
"PP0CLR : IN NP\n"
" ;\n"
"PP : IN NP\n"
" ;\n"
"ADVP0TMP : RB\n"
" ;\n"
"PP0TMP : IN NP\n"
" ;\n"
"NP0SBJ : NN\n"
" | DT NNS\n"
" ;\n"
"NP0QP : JJ NNS\n"
" | CD CD\n"
" ;\n"
"NP0PRD : CD CD NNS\n"
" ;\n"
Everything looks OK to me during the parse. Am I doing something wrong? Thanks!
after make install
, i try
~$ echo '#include<yaep.h>' >test.c
~$ gcc test.c
In file included from test.c:1:0:
/usr/local/include/yaep.h:47:22: fatal error: allocate.h: No such file or directory
compilation terminated.
The Readme says to see the yaep.tst.in file, which does not seem to be part of the repository
It would be really nice if you could add a git tag, such as v1.0
, 1.0.0
or something similar, for the current version of YAEP. This would make it easier to create a binary package based on the repository code. For example, Homebrew requires a stable, tagged version for a formula (a description on how to build a Homebrew package, in a Ruby based Domain Specific Language).
Hi,
This project looks very interesting to me as I'm trying to build earley parsers with different strategies. @vnmakarov I was wondering if it's possible for you to briefly mention how the error handling algorithm works? Or a few pointers to where related code relies will also be useful!
Great project!
Thanks and Best!
Adding -Wall -Wextra
with gcc
give a few warnings and some of then I fixed as shown bellow:
diff --git a/src/yaep.c b/src/yaep.c
index 3719017..b68cce8 100644
--- a/src/yaep.c
+++ b/src/yaep.c
@@ -5263,8 +5263,8 @@ static vlo_t *tnodes_vlo;
static struct yaep_tree_node *
prune_to_minimal (struct yaep_tree_node *node, int *cost)
{
- struct yaep_tree_node *child, *alt, *next_alt, *result;
- int i, min_cost;
+ struct yaep_tree_node *child, *alt, *next_alt, *result = NULL;
+ int i, min_cost = 0;
assert (node != NULL);
switch (node->type)
@@ -5431,7 +5431,7 @@ make_parse (int *ambiguous_p)
struct yaep_tree_node *parent_anode, *anode, root_anode;
int parent_disp;
int saved_one_parse_p;
- struct yaep_tree_node **term_node_array;
+ struct yaep_tree_node **term_node_array = NULL;
#ifndef __cplusplus
vlo_t stack, orig_states;
#else
@@ -6420,7 +6420,7 @@ read_rule (const char ***rhs, const char **anode, int *anode_cost,
/* The following variable is the current number of next input
token. */
-static int ntok;
+static size_t ntok;
/* The following function imported by Earley's algorithm (see comments
in the interface file). */
Hi,
I am considering using yaep to add limited user-customizable syntax to a domain-specific language. In #24 (comment), you express concern that a non-trivial expertise is required to avoid unintended ambguities in grammars. If I understand correctly, this is because Earley parsing is much slower on ambiguous grammars. In that case, would it be feasible to have yaep abort parsing once ambiguity is detected, ideally with an informative error message?
Thanks,
Andres
It is stated that YAEP doesn't use Joop Leo's optimization, and that right recursion is fast anyways. And because of the lookahead I do believe this to be true for some grammars, but not all LR(k) grammars. For example, the following LR(2) grammar:
S -> A 'a' 'b'
A -> 'a' A
A ->
Generates increasingly large set cores when parsing "aaa...aaab"
, leading to (at least) quadratic behavior.
Perhaps an optimization similar to Joop Leo's could be made, making YAEP linear time for all LR(k) grammars?
There is some confusion about which license(s) apply to YAEP.
/usr/share/common-licenses/(GPL|LGPL)
, which, in my case, are symlinks to the newest versions (3.0) of the respective licenses.@vnmakarov could you please clarify:
I can make a pull request fixing those headers/copyright files for you once you indicate what you want.
https://link.springer.com/content/pdf/10.1007%2F3-540-61053-7_68.pdf
I hope this is useful.
Clone the Repository
git clone git clone https://github.com/vnmakarov/yaep.git
Compile and Install YAEP using Clang
./configure
cd src
make
make test
sudo make install
The last command should work without problems.
The last command fails printing the following output:
bison -y ./sgramm.y
mv y.tab.c sgramm.c
clang -c -g -Ofast -DHAVE_CONFIG_H -I.. -I. -I. -DNDEBUG -fPIC -o C/yaep.o ../src/../src/yaep.c
make: clang: Command not found
make: *** [C/yaep.o] Error 127
OS: Ubuntu Trusty Tahr
Compiler: Clang 5.0
What is the best way to make sure my application memory management doesn't interfere with the speed (or function) of yaep's memory management?
I'm not sure the best way to work with your memory manager (OS_CREATE etc)...
yaep_parse has parse_alloc and parse_free functions (i read from yaep.txt that these functions are only used for the translation (*root) memory.
i note your test code (yaep.tst.in) defines
static void *
test_parse_alloc (int size)
{
void *result;
OS_TOP_EXPAND (mem_os, size);
result = OS_TOP_BEGIN (mem_os);
OS_TOP_FINISH (mem_os);
return result;
}
and you pass yaep_parse and NULL for the yaep_parse functions
it seems you've optimized memory management and i don't feel as though i should use malloc/free in my own code in case i change your memory management characteristics.
but in the end, i'm going to link libyaep.a into code that uses malloc/realloc/free anyway, so...
what is the best way to make sure my application memory management doesn't interfere with the speed (or function) of yaep's memory management?
The following mode has problems (ignoring the glaring obvious issues with memory cleanup):
yaep_set_one_parse_flag(g,0);
yaep_set_cost_flag(g,1);
The attached code yields:
Translation:
0: ALTERNATIVE: node=1, next=2
1: ABSTRACT: a2 ( 3 4 )
3: TERMINAL: code=97, repr='a'
4: TERMINAL: code=98, repr='b'
2:
while it should yield something like
Translation:
0: ALTERNATIVE: node=1, next=2
1: ABSTRACT: a1 ( 3 4 )
3: TERMINAL: code=97, repr='a'
4: TERMINAL: code=98, repr='b'
2: ALTERNATIVE: node=5, next=nil
5: ABSTRACT: a2 ( 3 4 )
While preparing #16, I noticed that the build/testing system is a bit onerous to work with. Specifically, the various *.in
files appear to be some intermediate products (from autoconf?), probably the result of YAEP being part of a different software suite before it was split off as a standalone package. A particular headache were the bulky *.tst.in
test files.
So before I continue working on #12, my next step would be to overhaul the building and testing process. I'm thinking CMake/CTest: it's compatible to more environments than autoconf, supports separate out-of-tree builds for C and C++ easily, and is generally less pain to work with. I'll also split the *.tst.in
files into smaller files and put them in a separate directory.
YAEP currently manipulates a lot of static variables during its operation. This makes it impossible to use multiple parsers asynchronously.
One simple preliminary fix would be to declare all these variables thread_local
. While this does not make libyaep reentrant, it would at least become possible to use multiple parsers concurrently, as long as each parser has a thread of its own and stays on that thread. This preliminary fix would, strictly speaking, break the API, though, because it would no longer be possible to use the same parser from different threads.
The better fix would be to make libyaep fully reentrant by eliminating all static variables. Unless extra parameters are required for this for exposed functions, this could be implemented without breaking the API. Parsers could then be used asynchronously and, as long the caller adheres to a few simple rules, would also be thread-safe.
I made an Earley engine that is much faster in benchmarks than YAEP: gearley. Here are the results for 88248 tokens (10K lines of C).
gearley: bench_parse_c ... bench: 53,125,029 ns/iter (+/- 3,652,318)
gearley with CompactBocage: bench_parse_c ... bench: 87,122,025 ns/iter (+/- 2,744,845)
YAEP: parse time 0.38
Memory use
gearley: all:12669488b forest:12664888b
gearley with CompactBocage: all:2.217MB forest:2.212MB
YAEP: memory=12.724MB
So, the Earley sets take up only 4600 bytes. That's one byte every 19 tokens. Gearley with a compact forest takes only 17.4% of the memory YAEP does. However, a compact forest slows down the benchmark by 34 milliseconds. With normal forest, the parse takes 14% time compared to a parse with YAEP, and the same amount of memory.
Compact forest takes 25 bytes per token. Gearley with the default forest takes 602 ns per token.
I invite you to make your own benchmarks.
The following very simple program:
#include<yaep.h>
int main( void ) {
struct grammar * g1 = yaep_create_grammar();
yaep_free_grammar( g1 );
struct grammar * g2 = yaep_create_grammar();
yaep_free_grammar( g2 );
}
leaks memory. It's not quite clear to me what happens here. Apparently, yaep_free_grammar()
deallocates some substructures of struct grammar
, but not the memory for struct grammar
itself (the first two lines of main()
are sufficient to expose this problem). But in addition, yaep_create_grammar()
and yaep_free_grammar()
seem to handle a global grammar
object (which kind of defeats the purpose of having multiple grammars) and thus some memory becomes lost permanently.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.