
Comments (10)

ollef avatar ollef commented on June 26, 2024

Hey, and thanks for your report!

You're right: unless the current behaviour is documented, this should probably be considered a bug.

The reason for the current behaviour is that the epsilon derivations of a rule r (those that don't consume any input) are computed assuming that r has no epsilon derivations. That is why the part using r recursively in your grammars doesn't contribute to the result.

The first idea that might spring to mind then is to tie the knot when computing epsilon derivations and assume that r's epsilon derivations are the result of the computation while computing the epsilon derivations. The problem with this approach is that it makes left-recursive grammars loop.

So instead I'm experimenting with an idea of productivity: While traversing the production looking for epsilon derivations, keep track of whether we've produced any results yet. When encountering r again after we've already produced some results it's safe to tie the knot.

A problem with this is that it's potentially order-dependent:

r <- rule $ pure 1 <|> r

would return [1, 1, 1, ...] because r comes after pure 1, while

r <- rule $ r <|> pure 1

would return [1] because we haven't produced anything yet when encountering r.

Maybe we could reorder the alternatives to make this one also return an infinite list, but I'm worried that might not work in general, because you don't know what order to use if there are several productive alternatives.
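The order dependence can be seen directly by modelling a rule's epsilon derivations as knot-tied Haskell lists (a toy model of the idea above, not the library's internals):

```haskell
-- Toy model: a rule's epsilon derivations as a lazily tied list.
-- The order of the alternatives decides whether forcing is productive.

-- r <- rule $ pure 1 <|> r: a result is produced before the recursive
-- occurrence, so the knot is productive and yields [1, 1, 1, ...].
epsAfter :: [Int]
epsAfter = [1] ++ epsAfter

-- r <- rule $ r <|> pure 1: the recursive occurrence comes first, so
-- forcing even the head of this list loops.
epsBefore :: [Int]
epsBefore = epsBefore ++ [1]
```

Here take 5 epsAfter terminates, while take 1 epsBefore never does.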

Do you have any input or ideas? Do you have any more good test cases or important use cases?

Also, what do you mean by "I don't think it's possible to have this work and ... not produce an infinite list of results"? If the first one should return [1, 2, 3, ...], shouldn't the second return [1, 1, 1, ...]?

Cheers!

from earley.

acertain avatar acertain commented on June 26, 2024

If the first one should return [1, 2, 3, ...], shouldn't the second return [1, 1, 1, ...]?

That's what I mean.

It should be possible to expose multiple variants of rule:

  1. use a HashMap or StableNames, etc., to dedup results
  2. return infinite results
  3. the current behavior
  4. just loop
  5. something else??

I'm working on a rewrite of the parser where all but 2 should be easy to implement. Any ideas for other variants?

My rewrite doesn't compute epsilon derivations or interpret Prods and instead has instances directly on

newtype Parser s e i a = Parser {
  unParser ::
    ParseEnv s e i ->
    (a -> ParseEnv s e i -> ST s (ParseEnv s e i)) ->
    ST s (ParseEnv s e i)
}

including Monad and a fix :: (Parser s e i a -> Parser s e i a) -> Parser s e i a function

So far, computing epsilon derivations has been avoidable, but I think it might be unavoidable for infinite results.
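For reference, here is a guess at the shape those instances might take for the CPS type above (ParseEnv is stubbed out here; the real one presumably carries the parser state), plus a naive lazy fix:

```haskell
import Control.Monad.ST
import Data.STRef

-- Stub environment; the real one would carry positions, conts, etc.
data ParseEnv s e i = ParseEnv

newtype Parser s e i a = Parser
  { unParser
      :: ParseEnv s e i
      -> (a -> ParseEnv s e i -> ST s (ParseEnv s e i))
      -> ST s (ParseEnv s e i)
  }

instance Functor (Parser s e i) where
  fmap f (Parser p) = Parser $ \env k -> p env (k . f)

instance Applicative (Parser s e i) where
  pure a = Parser $ \env k -> k a env
  Parser pf <*> Parser pa = Parser $ \env k ->
    pf env $ \f env' -> pa env' $ \a env'' -> k (f a) env''

instance Monad (Parser s e i) where
  Parser p >>= f = Parser $ \env k ->
    p env $ \a env' -> unParser (f a) env' k

-- Naive knot-tying fix; a real one would presumably memoize per rule.
fixParser :: (Parser s e i a -> Parser s e i a) -> Parser s e i a
fixParser f = let p = f p in p

-- Tiny demo: thread a pure computation through the CPS machinery,
-- smuggling the result out via an STRef.
demo :: Int
demo = runST $ do
  ref <- newSTRef 0
  _ <- unParser (pure 20 >>= \x -> pure (x + 1)) ParseEnv
         (\a env -> writeSTRef ref a *> pure env)
  readSTRef ref
```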


ollef avatar ollef commented on June 26, 2024

Hmm, I'm not too fond of the idea of exposing several variants of rule, unless I can be convinced that there are important use cases for such variants. I think I'd prefer this library just to do the Right Thing, provided we can figure out what that should be. Of course, that doesn't stop you from doing these things in your projects. :)

I think I've come up with something better than what I outlined in my previous post. What do you think of the following?

  1. Compute epsilon derivations for r assuming that r's epsilons are [] (i.e. the current behaviour).
  2. If step 1 produces any results, compute the epsilon derivations again but tie the knot.

This scheme supports left-recursion without looping provided the rule doesn't have any epsilon derivations, and additionally makes your two grammars produce infinite results. It makes r <- rule $ r <|> pure 1 loop if you try to force the results, but you can still get a report, which I think is reasonable.
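In list form, the two-step scheme looks roughly like this (a toy model of a single rule's epsilon results, not the actual implementation):

```haskell
-- f maps the rule's assumed epsilon results to the epsilon results of
-- its production (a toy stand-in for evaluating r's production).
epsilonsOf :: ([a] -> [a]) -> [a]
epsilonsOf f
  | null (f [])  = []                      -- step 1: assume no epsilons
  | otherwise    = let eps = f eps in eps  -- step 2: tie the knot

-- r <- rule $ pure 1 <|> r:
--   epsilonsOf (\eps -> 1 : eps)     yields [1, 1, 1, ...]
-- left recursion with no epsilons stays empty without looping:
--   epsilonsOf (\eps -> eps)         yields []
-- r <- rule $ r <|> pure 1:
--   epsilonsOf (\eps -> eps ++ [1])  loops when forced
```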

So to me it seems promising --- I just have to convince myself that this is the right thing to do in general. I have an implementation here, and the commit before has an implementation of the scheme from my previous message.

Your rewrite sounds intriguing. I'd be interested to hear how you're handling things like left-recursion in it! :)


acertain avatar acertain commented on June 26, 2024

A wip version of my rewrite is at https://gist.github.com/fread2281/256e47aff8903d7da98d9ea6b4cff63f. A couple things:

  • Currently it doesn't compute epsilons and instead builds up a Results structure differently. I'd like to compute epsilons and expose a bindResults :: Parser s e i a -> ([a] -> Parser s e i b) -> Parser s e i b so the user can capture ambiguity errors in subexpressions and give nicer error messages (e.g. a language with infix operators, that prints the possible parses when an operator expression is ambiguous). I think epsilons should be possible to do final-encoded, but I haven't worked it out yet
  • It provides instances directly instead of building and interpreting a GADT. GHC should be able to optimize it more, but it's a different interface.
  • I haven't yet looked at the order in which it returns results


acertain avatar acertain commented on June 26, 2024

Here's a problem(?) with my implementation that I'm trying to figure out how to fix. I think Earley has the same problem? If not, it'd be very helpful if you could try to explain how Earley avoids it. I've benchmarked my implementation and Earley, and they both have the same order of magnitude of function calls (of the most commonly called functions; it's still within n^3, I think) in the prof files, so I think it still has this problem.

We can merge results at the same rule with different start positions and the same end position! (If a calls b at positions i and j, and both i-k and j-k return results, we only need to return once, with the merged results.) However, with fully binarized/memoized grammars, ignoring alts, we can assume a = b <*> c, where c must be a rule, which does that merging for us.

But this still adds too many callbacks to c (if b returns at i-k and j-k, then c gets two instances of a in its conts), and callbacks in c cost per position where c succeeds starting from k.

Practical, General Parser Combinators (Meerkat) avoids this by using a HashSet of Conts in rules.

However, this might not cause worst-case complexity to go over cubic with fully binarized grammars. According to the Meerkat paper:

In Appendix A.3 we show that the execution of memoized CPS recognizers of Figure 2 and 6 can require O(n^(m+1)) operations, where m is the length of the longest rule in the grammar. The reason for such unbounded polynomial behaviour is that the same continuation can be called multiple times at the same input position. As illustrated in Appendix A.3, this happens when the same continuation is added to different continuation lists that are associated with calls made to the same recognizer but at different input positions. If the recognizer produces the same input position starting from different input positions, duplicate calls are made. The duplicate calls further result in adding the same continuations to the same continuation lists multiple times.

I think it still affects non-worst-case complexity though :(

I don't know if this is avoidable without using StableNames or similar to compare Conts
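For reference, this is roughly what StableNames buy you: an observational identity test on closures (pointer equality, more or less), which is enough to deduplicate Conts in a set without an Eq instance on functions. A minimal sketch using only base:

```haskell
import System.Mem.StableName

-- Two stable names compare equal only if they were made from the same
-- heap object, so they can serve as keys when deduplicating
-- continuations, even though functions have no Eq instance.
sameCont :: a -> a -> IO Bool
sameCont f g = do
  nf <- makeStableName f
  ng <- makeStableName g
  pure (nf == ng)
```

Calling sameCont on the same let-bound closure twice returns True; distinct closures get distinct stable names, so they are never spuriously merged.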


ollef avatar ollef commented on June 26, 2024

If I understand you correctly, the question is if we can merge the continuations of invocations of the same rule that start at different positions and end at the same position. Is this what you're asking?

This library doesn't do that, but it doesn't seem to me to be a sound optimisation. Maybe I misunderstood what you meant, though.


acertain avatar acertain commented on June 26, 2024

There are two ways to deal with the issue in the example:

  1. Merge the continuations of invocations of the same rule that start at different positions and end at the same position and come from the same position in their parent rule: this requires extra bookkeeping per rule, and some way to compare where in the grammar conts return to.
  2. Merge continuations of a rule that return to the same position in the grammar & start from the same position: this still needs some way to compare where in the grammar conts return to.

In the example, 2 would be merging the continuations after they get added to c, so when c returns results in the future it costs less, while 1 would be merging the continuations as b returns (before c gets called).

The Meerkat paper does 2, but I think 1 would work too.

I think they should have the same asymptotics, except with a Monad instance, where 1 can do better.


ChristopherKing42 avatar ChristopherKing42 commented on June 26, 2024

Maybe you could tie the knot, but then do a breadth-first search instead of a depth-first search. That way, both r <- rule $ (pure 1 <|> ((+1) <$> r)) and r <- rule $ (((+1) <$> r) <|> pure 1) would return infinite lists. In fact, it would work sensibly with any binary tree with infinitely many leaves (although a binary tree with infinitely many nodes but only finitely many leaves would diverge after returning all of its leaves).
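A quick sketch of that search strategy, modelling a rule's alternatives as a lazy binary tree and enumerating its leaves breadth-first with an explicit queue (a toy model using only base, not the library's machinery):

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- A rule's alternatives as a lazy binary tree of <|> nodes.
data Alt a = Leaf a | Alt a :|: Alt a deriving Functor
infixr 5 :|:

-- Breadth-first enumeration of leaves: a productive leaf is reached
-- no matter which side of the knot the recursion sits on.
bfsLeaves :: Alt a -> [a]
bfsLeaves t = go [t]
  where
    go []               = []
    go (Leaf a : q)     = a : go q
    go ((l :|: r) : q)  = go (q ++ [l, r])

-- r <- rule $ pure 1 <|> ((+1) <$> r)
rAfter :: Alt Int
rAfter = Leaf 1 :|: fmap (+ 1) rAfter

-- r <- rule $ ((+1) <$> r) <|> pure 1
rBefore :: Alt Int
rBefore = fmap (+ 1) rBefore :|: Leaf 1
```

Both bfsLeaves rAfter and bfsLeaves rBefore produce the infinite list [1, 2, 3, ...], whereas a depth-first walk of rBefore would diverge before producing anything.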


ollef avatar ollef commented on June 26, 2024

Yeah, that might work actually!


ChristopherKing42 avatar ChristopherKing42 commented on June 26, 2024

Also, an idea on how to do that efficiently. First you create a value representing a lazy binary tree, and then you search that tree using iterative-deepening search. If the compiler fuses the binary tree away, this is equivalent to an imperative iterative-deepening search. If it chooses to keep the tree in memory instead, it is equivalent to an imperative breadth-first search (except that you scan the tree itself in an iterative-deepening manner). An optimizing compiler (like GHC) can choose whichever is better based on the cost of recalculating the values in the tree, I think.

