Comments (10)
Hey, and thanks for your report!
You're right: unless the current behaviour is documented, this should probably be considered a bug.
The reason for the current behaviour is that the epsilon derivations of a rule `r` (those that don't consume any input) are computed assuming that `r` has no epsilon derivations. That is why the part that uses `r` recursively in your grammars doesn't contribute to the result.

The first idea that might spring to mind, then, is to tie the knot when computing epsilon derivations: assume that `r`'s epsilon derivations are the result of that very computation while computing them. The problem with this approach is that it makes left-recursive grammars loop.
So instead I'm experimenting with an idea of productivity: while traversing the production looking for epsilon derivations, keep track of whether we've produced any results yet. When encountering `r` again after we've already produced some results, it's safe to tie the knot.
A problem with this is that it's potentially order-dependent: `r <- rule $ pure 1 <|> r` would return `[1, 1, 1, ...]` because `r` comes after `pure 1`, while `r <- rule $ r <|> pure 1` would return `[1]` because we haven't produced anything yet when encountering `r`.
Maybe we could reorder the alternatives to make this one return an infinite list as well, but I'm worried that might not work in general, because you don't know what order to use if there are several productive alternatives.
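As a sanity check, this productivity scheme can be modelled with a small toy (hypothetical `Alt`/`epsilons` names, not the library's internals); it reproduces exactly the order dependence above:

```haskell
-- Toy model: a rule body is a list of alternatives, each either a plain
-- epsilon result or a recursive reference to the rule itself.
data Alt a = Val a | Rec

-- Compute epsilon derivations, tying the knot only once at least one
-- result has already been produced.
epsilons :: [Alt a] -> [a]
epsilons alts = result
  where
    result = go False alts
    go _ []             = []
    go _ (Val x : rest) = x : go True rest
    go produced (Rec : rest)
      | produced  = result ++ go produced rest  -- productive: tie the knot
      | otherwise = go produced rest            -- nothing produced yet: skip
```

Here `epsilons [Val 1, Rec]` is `[1, 1, 1, ...]` while `epsilons [Rec, Val 1]` is just `[1]`, mirroring the two grammars above.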
Do you have any input or ideas? Do you have any more good test cases or important use cases?
Also, what do you mean by "I don't think it's possible to have this work and ... not produce an infinite list of results"? If the first one should return `[1, 2, 3, ...]`, shouldn't the second return `[1, 1, 1, ...]`?
Cheers!
from earley.
> If the first one should return `[1, 2, 3, ...]`, shouldn't the second return `[1, 1, 1, ...]`?

That's what I mean.
It should be possible to expose multiple variants of `rule`:

- use a HashMap or StableNames etc. to dedup results
- return infinite results
- the current behavior
- just loop
- something else??

I'm working on a rewrite of the parser where all but 2 should be easy to implement. Any ideas for other variants?
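For the dedup variant, here is a minimal sketch using an `Ord`-based set; the HashMap/StableNames options above would avoid the `Ord` constraint, and `dedupResults` is a hypothetical name, not part of the library:

```haskell
import qualified Data.Set as Set

-- Lazily deduplicate a (possibly infinite) stream of parse results,
-- keeping the first occurrence of each value.
dedupResults :: Ord a => [a] -> [a]
dedupResults = go Set.empty
  where
    go _ [] = []
    go seen (x : xs)
      | x `Set.member` seen = go seen xs
      | otherwise           = x : go (Set.insert x seen) xs
```

Because it is lazy, a finite prefix of an infinite result stream still works: `take 2 (dedupResults (cycle [1, 2]))` is `[1, 2]`.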
My rewrite doesn't compute epsilon derivations or interpret `Prod`s, and instead has instances directly on

```haskell
newtype Parser s e i a = Parser {
  unParser ::
    ParseEnv s e i ->
    (a -> ParseEnv s e i -> ST s (ParseEnv s e i)) ->
    ST s (ParseEnv s e i)
}
```

including `Monad` and a `fix :: (Parser s e i a -> Parser s e i a) -> Parser s e i a` function.
So far, computing epsilon derivations has been avoidable, but I think it might be unavoidable for infinite results.
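To make the knot-tying concrete, here is a compilable toy version of that type with a lazy `fix` (the placeholder `ParseEnv` and the names `pureP`, `fixParser`, `runTrivial` are hypothetical; the real environment presumably carries positions and continuation queues, and the real `fix` presumably needs memoization to terminate on left recursion):

```haskell
{-# LANGUAGE RankNTypes #-}
import Control.Monad.ST
import Data.STRef

-- Placeholder environment; the real one presumably tracks the input
-- position and registered continuations.
data ParseEnv s e i = ParseEnv

newtype Parser s e i a = Parser
  { unParser
      :: ParseEnv s e i
      -> (a -> ParseEnv s e i -> ST s (ParseEnv s e i))
      -> ST s (ParseEnv s e i)
  }

pureP :: a -> Parser s e i a
pureP x = Parser $ \env k -> k x env

-- Knot-tying via lazy recursion; on its own this loops for genuinely
-- left-recursive definitions, so a real implementation presumably
-- memoizes entry into the rule via ParseEnv.
fixParser :: (Parser s e i a -> Parser s e i a) -> Parser s e i a
fixParser f = let p = f p in p

-- Tiny runner: feed a continuation that records the first result.
runTrivial :: (forall s. Parser s e i a) -> Maybe a
runTrivial p = runST $ do
  ref <- newSTRef Nothing
  _ <- unParser p ParseEnv (\a env -> writeSTRef ref (Just a) >> pure env)
  readSTRef ref
```

For example, `runTrivial (fixParser (const (pureP 42)))` gives `Just 42`.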

---

Hmm, I'm not too fond of the idea of exposing several variants of `rule`, unless I can be convinced that there are important use cases for such variants. I think I'd prefer this library just to do the Right Thing, provided we can figure out what that should be. Of course, that doesn't stop you from doing these things in your projects. :)
I think I've come up with something better than what I outlined in my previous post. What do you think of the following?
1. Compute epsilon derivations for `r` assuming that `r`'s epsilons are `[]` (i.e. the current behaviour).
2. If step 1 produces any results, compute the epsilon derivations again, but tie the knot.
This scheme supports left-recursion without looping provided the rule doesn't have any epsilon derivations, and additionally makes your two grammars produce infinite results. It makes `r <- r <|> pure 1` loop if you try to force the results, but you can still get a `Report`, which I think is reasonable.
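The two-step scheme can also be modelled with a small toy (hypothetical names, not the library's internals; a rule body is a list of alternatives that are either plain epsilon results or recursive references):

```haskell
-- Toy model of the two-pass scheme.
data Alt a = Val a | Rec

-- Pass 1: epsilon derivations assuming the rule's own epsilons are [].
pass1 :: [Alt a] -> [a]
pass1 alts = [ x | Val x <- alts ]

-- Pass 2: only if pass 1 produced anything, recompute and tie the knot.
epsilons :: [Alt a] -> [a]
epsilons alts
  | null (pass1 alts) = []
  | otherwise         = result
  where
    result = concatMap expand alts
    expand (Val x) = [x]
    expand Rec     = result
```

In this toy, `epsilons [Val 1, Rec]` is infinite, `epsilons [Rec]` is `[]` (left recursion without epsilons doesn't loop), and `epsilons [Rec, Val 1]` loops when its results are forced, matching the behaviour described above.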
So to me it seems promising; I just have to convince myself that this is the right thing to do in general. I have an implementation here, and the commit before it has an implementation of the scheme from my previous message.
Your rewrite sounds intriguing. I'd be interested to hear how you're handling things like left-recursion in it! :)

---

A WIP version of my rewrite is at https://gist.github.com/fread2281/256e47aff8903d7da98d9ea6b4cff63f. A couple of things:

- Currently it doesn't compute epsilons and instead builds up a `Results` structure differently. I'd like to compute epsilons and expose a `bindResults :: Parser s e i a -> ([a] -> Parser s e i b) -> Parser s e i b` so the user can capture ambiguity errors in subexpressions and give nicer error messages (e.g. a language with infix operators that prints the possible parses when an operator expression is ambiguous). I think epsilons should be possible to do final-encoded, but I haven't worked it out yet.
- It provides instances directly instead of building and interpreting a GADT. GHC should be able to optimize it more, but it's a different interface.
- I haven't yet looked at the order in which it returns results.

---

Here's a problem(?) with my implementation that I'm trying to figure out how to fix. I think Earley has the same problem? If not, it'd be very helpful if you could try to explain how Earley avoids it. I've benchmarked my implementation and Earley, and they both have the same order of magnitude of function calls in the `.prof` files (of the most commonly called functions, but it's still within `n^3`, I think), so I think it still has this problem.
We can merge results at the same rule with different start positions and the same end position! (If `a` calls `b` at positions `i` and `j`, and both `i-k` and `j-k` return results, we only need to return once, with merged results.)

However, with fully binarized/memoized grammars, ignoring alts, we can assume that `a = b <*> c`, but then `c` must be a rule, which does the merging for us. But this still adds too many callbacks to `c` (if `b` returns at `i-k` and `j-k`, then `c` gets two instances of `a` in its conts), and callbacks in `c` cost per position where `c` succeeds starting from `k`.
Practical, General Parser Combinators (Meerkat) avoids this by using a HashSet of Conts in rules.
However, this might not cause worst-case complexity to go over cubic with fully binarized grammars. According to the Meerkat paper:

> In Appendix A.3 we show that the execution of memoized CPS recognizers of Figure 2 and 6 can require O(n^(m+1)) operations, where m is the length of the longest rule in the grammar. The reason for such unbounded polynomial behaviour is that the same continuation can be called multiple times at the same input position. As illustrated in Appendix A.3, this happens when the same continuation is added to different continuation lists that are associated with calls made to the same recognizer but at different input positions. If the recognizer produces the same input position starting from different input positions, duplicate calls are made. The duplicate calls further result in adding the same continuations to the same continuation lists multiple times.
I think it still affects non-worst-case complexity, though. :(

I don't know if this is avoidable without using StableNames or something similar to compare Conts.

---

If I understand you correctly, the question is whether we can merge the continuations of invocations of the same rule that start at different positions and end at the same position. Is this what you're asking?
This library doesn't do that, but it doesn't seem to me to be a sound optimisation. Maybe I misunderstood what you meant, though.

---

There are two ways to deal with the issue in the example:

- Merge the continuations of invocations of the same rule that start at different positions, end at the same position, and come from the same position in their parent rule: this requires extra bookkeeping per rule, and some way to compare where in the grammar conts return to.
- Merge continuations of a rule that return to the same position in the grammar and start from the same position: this still needs some way to compare where in the grammar conts return to.

In the example, 2 would be merging the continuations after they get added to `c`, so that when `c` returns results in the future it costs less, while 1 would be merging the continuations as `b` returns (before `c` gets called).
The Meerkat paper does 2, but I think 1 would work too.
I think they should have the same asymptotics, except with a Monad instance, where 1 can do better.

---

Maybe you could tie the knot, but then do a breadth-first search instead of a depth-first search. That way, both `r <- rule $ (pure 1 <|> ((+1) <$> r))` and `r <- rule $ (((+1) <$> r) <|> pure 1)` would return infinite lists. In fact, it would work sensibly with any binary tree with infinitely many leaves (although a binary tree with infinitely many nodes but only finitely many leaves would diverge after returning all of its leaves).
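A toy sketch of that breadth-first enumeration over a lazy tree of alternatives (hypothetical `Tree`/`bfsLeaves` names, not the library's internals):

```haskell
-- Toy model: the result space as a lazy binary tree of alternatives.
data Tree a = Leaf a | Node (Tree a) (Tree a)

-- Breadth-first enumeration of leaves: process the tree level by level,
-- so a productive branch is reached no matter which side it is on.
bfsLeaves :: Tree a -> [a]
bfsLeaves t = go [t]
  where
    go []    = []
    go level = [x | Leaf x <- level]
            ++ go [c | Node l r <- level, c <- [l, r]]
```

Both `left = Node (Leaf 1) left` and `right = Node right (Leaf 1)` then enumerate to `[1, 1, 1, ...]`, regardless of which side the recursion is on.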

---

Yeah, that might work actually!

---

Also, an idea on how to do that efficiently: first create a value representing a lazy binary tree, and then search that tree using iterative deepening search. If the compiler fuses the binary tree, this is equivalent to an imperative iterative deepening search. If it chooses to keep the binary tree in memory instead, it is equivalent to an imperative breadth-first search (except that you are scanning the tree itself in an iterative-deepening manner). An optimizing compiler (like GHC) can choose which one is better based on the cost of recalculating the values in the tree, I think.
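That idea can be sketched over a toy lazy binary tree (hypothetical `Tree`/`idLeaves` names; this is the unfused version):

```haskell
-- Toy model: lazy binary tree of alternatives.
data Tree a = Leaf a | Node (Tree a) (Tree a)

-- Iterative deepening: enumerate the leaves at depth d for d = 0, 1, 2, ...
-- Shallow levels are revisited, trading recomputation for not keeping a
-- whole frontier of the tree alive at once.
idLeaves :: Tree a -> [a]
idLeaves t = concatMap (atDepth t) [0 ..]
  where
    atDepth (Leaf x)   0 = [x]
    atDepth (Leaf _)   _ = []
    atDepth (Node _ _) 0 = []
    atDepth (Node l r) d = atDepth l (d - 1) ++ atDepth r (d - 1)
```

As with the breadth-first version, a tree like `left = Node (Leaf 1) left` enumerates to `[1, 1, 1, ...]`; note that, as the caveat above says, a tree with finitely many leaves diverges after yielding all of them, since the depth counter keeps growing.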

---
