ollef / earley Goto Github PK

View Code? Open in Web Editor NEW

360.0 21.0 24.0 259 KB

Parsing all context-free grammars using Earley's algorithm in Haskell.

License: BSD 3-Clause "New" or "Revised" License

Haskell 100.00%

earley's Introduction

Earley

Go to the API documentation on Hackage.

This (Text.Earley) is a library consisting of a few main parts:

Text.Earley.Grammar

An embedded context-free grammar (CFG) domain-specific language (DSL) with semantic action specification in applicative style.

An example of a typical expression grammar working on an input tokenised into strings is the following:

   expr :: Grammar r (Prod r String String Expr)
   expr = mdo
     x1 <- rule $ Add <$> x1 <* namedToken "+" <*> x2
               <|> x2
               <?> "sum"
     x2 <- rule $ Mul <$> x2 <* namedToken "*" <*> x3
               <|> x3
               <?> "product"
     x3 <- rule $ Var <$> (satisfy ident <?> "identifier")
               <|> namedToken "(" *> x1 <* namedToken ")"
     return x1
     where
       ident (x:_) = isAlpha x
       ident _     = False

Text.Earley.Parser

An implementation of (a modification of) the Earley parsing algorithm.

To invoke the parser on the above grammar, run e.g. (here using words as a stupid tokeniser):

   fullParses (parser expr) $ words "a + b * ( c + d )"
   = ( [Add (Var "a") (Mul (Var "b") (Add (Var "c") (Var "d")))]
     , Report {...}
     )

Note that we get a list of all the possible parses (though in this case there is only one).

Another invocation, which shows the error reporting capabilities (giving the last position that the parser reached and what it expected at that point), is the following:

   fullParses (parser expr) $ words "a +"
   = ( []
     , Report { position   = 2
              , expected   = ["(","identifier","product"]
              , unconsumed = []
              }
     )

Text.Earley.Generator

Functionality to generate the members of the language that a grammar generates.

To get the language of a grammar given a list of allowed tokens, run e.g.:

   language (generator romanNumeralsGrammar "VIX")
   = [(0,""),(1,"I"),(5,"V"),(10,"X"),(20,"XX"),(11,"XI"),(15,"XV"),(6,"VI"),(9,"IX"),(4,"IV"),(2,"II"),(3,"III"),(19,"XIX"),(16,"XVI"),(14,"XIV"),(12,"XII"),(7,"VII"),(21,"XXI"),(25,"XXV"),(30,"XXX"),(31,"XXXI"),(35,"XXXV"),(8,"VIII"),(13,"XIII"),(17,"XVII"),(26,"XXVI"),(29,"XXIX"),(24,"XXIV"),(22,"XXII"),(18,"XVIII"),(36,"XXXVI"),(39,"XXXIX"),(34,"XXXIV"),(32,"XXXII"),(23,"XXIII"),(27,"XXVII"),(33,"XXXIII"),(28,"XXVIII"),(37,"XXXVII"),(38,"XXXVIII")]

The above example shows the language generated by a Roman numerals grammar limited to the tokens 'V', 'I', and 'X'.

Text.Earley.Mixfix

Helper functionality for creating parsers for expressions with mixfix identifiers in the style of Agda.

How do I write grammars?

As hinted at above, the grammars are written inside Grammar, which is a Monad and MonadFix. For the library to be able to tame the recursion in the grammars, we have to use the rule function whenever a production is recursive.

Whenever you would write e.g.

...
p = foo <|> bar <*> p
...

in a conventional combinator parser library, you instead write the following:

grammar = mdo
  ...
  p <- rule $ foo <|> bar <*> p
  ...

Apart from making it possible to do recursion (even left-recursion), rules have an additional benefit: they control where work is shared, by the rule that any rule is only ever expanded once per position in the input string. If a rule is encountered more than once at a position, the work is shared.

Compared to parser generators and combinator libraries

This library differs from the main methods that are used to write parsers in the Haskell ecosystem:

Compared to parser generators (YACC, Happy, etc.) it requires very little pre-processing of the grammar. It also allows you to stay in the host language for both grammar and parser, i.e. there is no use of a separate tool. This also means that you are free to use the abstraction facilities of Haskell when writing a grammar. Currently the library requires a linear traversal of the grammar's rules before use, which is usually fast enough to do at run time, but precludes infinite grammars.
The grammar language is similar to that of many parser combinators (Parsec, Attoparsec, parallel parsing processes, etc.), providing an applicative interface, but the parser gracefully handles all finite CFGs, including those with left-recursion. On the other hand, its productions are not monadic meaning that it does not support context-sensitive or infinite grammars, which are supported by many parser combinator libraries.

Note: The Grammar type is a Monad (used to provide observable sharing) but it lives a layer above productions. It cannot be used to decide what production to use depending on the result of a previous production, i.e. it does not give us monadic parsing.

The parsing algorithm

The parsing algorithm that this library uses is based on Earley's parsing algorithm. The algorithm has been modified to produce online parse results, to give good error messages, and to allow garbage collection of the item sets. Essentially, instead of storing a sequence of sets of items like in the original algorithm, the modified algorithm just stores pointers back to sets of reachable items.

The worst-case run time performance of the Earley parsing algorithm is cubic in the length of the input, but for large classes of grammars it is linear. It should however be noted that this library will likely be slower than most parser generators and parser combinator libraries.

The parser implements an optimisation similar to that presented in Joop M.I.M Leo's paper A general context-free parsing algorithm running in linear time on every LR(k) grammar without using lookahead, which removes indirections in sequences of non-ambiguous backpointers between item sets.

For more in-depth information about the internals of the library, there are implementation notes currently being written.

Contact

Olle Fredriksson - https://github.com/ollef

earley's People

Contributors

Stargazers

Watchers

Forkers

phadej axure timjb gabriella439 sid-kap sboosali dagit kenkku bzhkl yws coenvk serokell hgldj1966 treeowl qrilka iomeone leifmetcalf strake rowhit ocharles ppkfs amirlb expipiplus1

earley's Issues

Generate all members of a grammar?

Intuitively, it feels like it should be possible for your library to generate all strings (as a potentially infinite list) that a grammar would parse, as well as all parsed values that those strings would generate. Is that something you've considered or would consider adding?

test-suite compilation failure when building from tarball

From building stackage nightly. I'll add this test-suite to expected failures for now.

Some modules seem to be missing from other-modules:

reprocessing library Earley-0.11.0.0...
[1 of 6] Compiling Text.Earley.Grammar ( Text/Earley/Grammar.hs, dist/build/Text/Earley/Grammar.o )
[2 of 6] Compiling Text.Earley.Derived ( Text/Earley/Derived.hs, dist/build/Text/Earley/Derived.o )
[3 of 6] Compiling Text.Earley.Internal ( Text/Earley/Internal.hs, dist/build/Text/Earley/Internal.o )
[4 of 6] Compiling Text.Earley.Parser ( Text/Earley/Parser.hs, dist/build/Text/Earley/Parser.o )
[5 of 6] Compiling Text.Earley      ( Text/Earley.hs, dist/build/Text/Earley.o )
[6 of 6] Compiling Text.Earley.Mixfix ( Text/Earley/Mixfix.hs, dist/build/Text/Earley/Mixfix.o )
In-place registering Earley-0.11.0.0...
Preprocessing test suite 'tests' for Earley-0.11.0.0...

tests/Main.hs:4:18:
    Could not find module ‘Empty’
    Use -v to see a list of the files searched for.

tests/Main.hs:5:18:
    Could not find module ‘Expr’
    Use -v to see a list of the files searched for.

tests/Main.hs:6:18:
    Could not find module ‘InlineAlts’
    Use -v to see a list of the files searched for.

tests/Main.hs:7:18:
    Could not find module ‘Issue11’
    Use -v to see a list of the files searched for.

tests/Main.hs:8:18:
    Could not find module ‘Issue14’
    Use -v to see a list of the files searched for.

tests/Main.hs:9:18:
    Could not find module ‘Mixfix’
    Use -v to see a list of the files searched for.

tests/Main.hs:10:18:
    Could not find module ‘Optional’
    Perhaps you meant
      Options (needs flag -package-key options-1.2.1.1@optio_CtVRCDNbaSzCLFqvgui815)
    Use -v to see a list of the files searched for.

tests/Main.hs:11:18:
    Could not find module ‘ReversedWords’
    Use -v to see a list of the files searched for.

tests/Main.hs:12:18:
    Could not find module ‘VeryAmbiguous’
    Use -v to see a list of the files searched for.

INDENT and DEDENT

Is there a way to match INDENT/DEDENT or make parsers take an indentation level?

func foo(a: A, b: B): C =
  theresSomeIndentRequiredHere(a, b)

func noMoreIndentMeansNoMoreFoo: D = ???

Add a `eof` terminal

Is it possible to have

eof :: Prod r e t ()

that matches the end of input?

I have a grammar that accepts expressions in parentheses:

    exprInBrackets <- rule $
      ExprInBrackets
        <$> leftBrack
        <*> ntExpr
        <*> rightBrack
        <?> "parenthesized expression"

And, given ([2]) it will produce the following value:

let inner = ExprInBrackets "[" (ExprNum 2) "]"
in ExprInBrackets "(" inner ")"

I'd like to modify it like this:

    exprInBrackets <- rule $
      ExprInBrackets
        <$> leftParen
        <*> ntExpr
        <*> (rightParen <|> ("" <$ eof))
        <?> "parenthesized expression"

so that expressions with unbalanced parentheses can be parsed. Given ([2 it would produce

let inner = ExprInBrackets "[" (ExprNum 2) ""
in ExprInBrackets "(" inner ""

cc @artemohanjanyan

Control.Applicative.optional not working

Hei Olle!

It drove me crazy to remove left recursion in Parsec. Therefore a big thank you for writing Earley.
Maybe it will become a thing in the future. ;)

I am trying to parse a toy programming language but I had some problems with "optional" from Control.Applicative.

As an example, I want to parse an optional "a" before the letter "b".

{-# LANGUAGE RecursiveDo #-}
import Text.Earley
import Control.Applicative

grammar = mdo
    test <- rule $ (,) <$> (optional $ namedSymbol 'a') <*> namedSymbol 'b'
    return test

parse xs = fullParses $ parser grammar xs

Now parse "b" results in a Report with expected "a", unconsumed "b". But my desired result would be (Nothing, 'b').

A test suite is needed to prevent breaking changes

Extensible languages?

Hi,

I know this might be impossible with this library directly, but I was wondering how to allow the user to declare their own mixfix identifiers in the document being parsed. In parsec, or something, because the parsers are monadic I can simply keep updating my table as I parse more information, but as this library is strictly only for context-free languages, it makes this approach not feasible.

I guess I'd have to partially parse the file and grab all the declarations first, and then parse the document properly with earley? Is there an easier way?

Capture result as well as matched tokens

It would be nice to be able to capture the result of a production as well as the text matched by it, for example:

match :: Prod r e t a -> Prod r e t ([t], a)

Behavior of Semigroup/Monoid instances for Prod?

I noticed that the Monoid and Semigroup instances for Prod duplicate the Alternative instance. I found that unexpected. I was expecting them to lift the element instances over the Applicative instance, a la (<>) = liftA2 (<>). That's the behavior of those instances in many other parsing libraries.

Changing the instances now would be awkward, of course. People might not see the change and think that just because their grammars still compile, the code still is correct. I'm not really asking for a change. Just wondering why this design was chosen, and suggesting that a change is something you might consider if you ever make large-scale changes that would break things in other ways.

Suggestion: Make it so that `<$` rules return at most one result.

In my grammars, sometimes I'll having something along the lines of many (whitespace) *> optional (stuff) <* many (whitespace). The issue then is that if I am parsing, say, 4 spaces, many identical Nothing results are returned, based on the different ways the whitespace on the left and right could be matched. One solution is to rewrite it as many (whitespace) *> optional (stuff <* many (whitespace)), but when you have complicated grammar its hard to coordinate this.

I had an idea for how to modify the library to accommodate this problem. Basically, a new kind of production rule that returns at most one result. Which result? Well, the idea is that it would only work on rules such that all their results are identical. That is, they would only work on rules of the form a <$ r. That way, it does not matter what result is returned. Here are some examples of equations they would follow:

pure a = a <$ pure ()
fmap f (a <$ r) = (f a) <$ r
(f <$ r1) <*> (a <$ r2) = f a <$ (r1 *> r2)
r1 <* r2 = r1 <* (() <$ r2)
etc...

Then, in our many (whitespace) *> optional (stuff) <* many (whitespace) example, many (whitespace) *> pure Nothing <* many (whitespace) would reduce to Nothing <$ (many (whitespace) *> pure () <* many (whitespace)). Since this is what the 4 spaces are matching against, at most one Nothing result would be returned.

I was thinking about digging into the code and creating a pull request, but before that I wanted to see if it was seen as worthwhile, or if I was missing something obvious.

feature: constraint validation of production parse result

Hello.

I did not find how to implement some sort of constraints on production. For example if I wrote a rule to parse a series of digits and wanted to validate that if falls into a specific range, what would I need to do? Of course I can check the values after the parser ends, but the context is lost in such case like position where error occurs. I see no functions in current interface I can use to do that and also I have not find something similar in the examples.
It probably could be something function like:

validate :: (a -> Bool) -> Prod r e t a -> Prod r e t a

validate :: (a -> Maybe b) -> Prod r e t a -> Prod r e t b

so when the predicate fails or returns Nothing, the parser fails for that production. And we can use it like:

num <- rule $ validate (<= 1234) $ read <$> some digit

Or may be some sort of extended rule function with validation:

ruleValidate :: (a -> Bool) -> Prod r e t a -> Grammar r (Prod r e t a)
...
num <- ruleValidate (<= 1234) $ read <$> some digit

rebuild the documentation with --hyperlinked-source

the feature is only available in haddock's master branch. for example, you can run:

cabal haddock --with-haddock="$HOME/haddock/.cabal-sandbox/bin/haddock" --haddock-options="--hyperlinked-source"

after building haddock from source, in a sandbox (https://github.com/haskell/haddock):

git clone https://github.com/haskell/haddock.git
cd haddock                                       # e.g. ~/haddock
cabal sandbox init
cabal sandbox add-source haddock-library
cabal sandbox add-source haddock-api
cabal install haddock-library haddock-api
cabal install -j4 --dependencies-only --enable-tests
cabal configure --enable-tests
cabal build -j4

or for now, you could just reupload the package with the dist/doc/ I built. I can send it to you if you send me an e-mail at redacted

(you don't really have to, but I've been using it to explore the code locally, and others might find it helpful too)

Report doesn't include enough information

I was working on AoC problem (https://adventofcode.com/2021/day/10) recently and saw that recognizing whether an input matches a grammar was part of it. (I ended up going a different direction to solve it, but...)

Since I thought I only needed to identify whether something parsed or not, I wanted to use report to test acceptance without bothering with parse results. But.. that actually isn't sufficient. At least, not unless you add names to the productions, and in sufficient granularity. If you have skipped that portion for your grammar, the Report value doesn't give sufficient information to determine whether something parsed in its entirety - only whether it ran into an error or was able to consume the entire input without errors. The only way to determine if any parse ran to completion is checking whether fullParses returned any parses.

It would be really nice if report returned some way to determine whether there was a full parse without requiring sufficiently granular naming of productions.

Change type of `satisfy`

Would it be possible to change the type of satisfy to something like

satisfy :: (t -> Maybe a) -> Prod r e t a

This would make it easier to have a type like

module Token (Token, mkToken, unToken) where

newtype Token = Token { unToken :: Char }

mkToken :: Char -> Maybe Token
mkToken x
  | x >= 'a' || x <= 'z' = Just (Token x)
  | otherwise = Nothing

and have satisfy work as expected. (You can do this now, but with partial function nastiness.) Currently this is what Parsec's tokenPrim looks like.

list = listLike?

Wouldn't it be nicer to have list = listLike?

https://github.com/ollef/Earley/blob/master/Text/Earley/Derived.hs#L29

Please add tags "test ambiguity" etc

So far I was able to find one library on Hackage for testing ambiguity of arbitrary CFGs using brute force: "Earley". If you know any other library, please, let me know.

Unfortunately I discovered this library too late, so I had to roll out my own ugly solution for ambiguity checking. So, please, make this library discoverable. I want this library to show up if I type "test ambiguity" and similar queries in http://hackage.haskell.org/packages/search . As well as I understand you need to add tags to do this. So, please, add following tags: "test ambiguity of CFG", "test ambiguity of context-free grammar", "check ambiguity of CFG" etc. And mention this feature in README

add IsString instance

like:

instance (IsString t, Eq t, a ~ t) => IsString (Prod r e t a) where
 fromString s = Terminal (== t) (pure id)

and/or even:

instance (IsString t) => IsList (Prod r e t a) where
 type Item (Prod r e t a) = Prod r e t a
 fromList ps = Alts ps (pure id)
 toList = \case
  Alts ps f -> fmap (f <*>) ps
  p -> [p]

(away from my computer, haven't type checked)

The IsString instance increases the readability of the DSL, has a single simple implemention (and thus avoids forcing orphans on users), and doesn't confuse user when they don't enable OverloadStrings.

Something like Ruby Slippers parsing

Terminology from the Marpa parser.

I have no idea what Marpa's implementation is, but perhaps this would work for this package:

In the same way that names are specified for tokens at specific positions, we could maintain a set of slippers, which are utilized when no terminals match in order to rescue the parse.

Adding another choice with <|> causes parse to fail

I'm writing a parser to parse either 0) a sequence of digits with no grouping characters, or 1) a sequence of digits with grouping characters like commas.

I'm getting some behavior I wouldn't expect, but I'm too ignorant to know if this is a bug. I have tried to boil this down to a simple example, as follows:

{-# LANGUAGE RecursiveDo, RankNTypes #-}
module Main where

import Control.Applicative
import Text.Earley
import System.Environment

data Zero = Zero deriving (Eq, Show)
data Comma = Comma deriving (Eq, Show)

list :: Prod r e t a -> Grammar r (Prod r e t ([a]))
list p = mdo
  r <- rule $ pure [] <|> (:) <$> p <*> r
  return r

comma :: Prod r String Char Comma
comma = Comma <$ symbol ',' <?> ","

zero :: Prod r String Char Zero
zero = Zero <$ symbol '0' <?> "0"

data Grouped = Grouped Zero [Zero] Comma Zero [(Comma, Zero, [Zero])]
  deriving (Eq, Show)

data Ungrouped = Ungrouped Zero [Zero] deriving (Eq, Show)

groupedOrUngrouped :: Grammar r (Prod r String Char (Either Grouped Ungrouped))
groupedOrUngrouped = mdo
  seqZero <- list zero
  seqGroups <- list $ (,,) <$> comma <*> zero <*> seqZero
  let l = Grouped <$> zero <*> seqZero <*> comma <*> zero <*> seqGroups
      r = Ungrouped <$> zero <*> seqZero
  rule $ Left <$> l <|> Right <$> r

grouped :: Grammar r (Prod r String Char Grouped)
grouped = mdo
  seqZero <- list zero
  seqGroups <- list $ (,,) <$> comma <*> zero <*> seqZero
  rule $ Grouped <$> zero <*> seqZero <*> comma <*> zero <*> seqGroups

main :: IO ()
main = do
  arg:[] <- getArgs
  print (allParses (parser groupedOrUngrouped) arg)
  putStrLn ""
  print (allParses (parser grouped) arg)

Here is some test output:

$ ./earleytest.hs 0,0
([(Right (Ungrouped Zero []),1)],Report {position = 1, expected = ["0"], unconsumed = ",0"})

([(Grouped Zero [] Comma Zero [],3)],Report {position = 3, expected = [","], unconsumed = ""})

I would expect that a Grouped value would appear when using the groupedOrUngrouped parser. I don't get a Grouped value there, but I do get one when using the grouped parser alone. Why? Any help is much appreciated. stack enabled source is also at

https://gist.github.com/massysett/b132c660c34955533795

Wrong(?) result on some grammars with infinite results

grammar = mdo
  r <- rule $ (pure 1 <|> ((+1) <$> r))
  pure r

produces just 1, not [1,2,3...]

I don't think it's possible to have this work and

grammar = mdo
  r <- rule $ (pure 1 <|> r)
  pure r

not produce an infinite list of results

I think this is a sensible choice because it currently can't deal with infinite results, but it should be documented somewhere. Theoretically it should be possible to return an infinite lazy list.

Odd mixfix behavior

Just curious what you think about this behavior:

When defining an operator if _ _ with right associativity, then

if if x y z

will be successfully parsed as

if (if x y) z

but

if x if y z

will fail to parse, requiring explicit parentheses.

Doesn't this seem backwards? I would expect the above behavior if the operator had left associativity.

So, does associativity even make sense for operators that are not of the form _a_b_c_d_? Are two Nothings in a row when defining a mixfix operator supposed to be supported?

Parsing hangs even for simple ambiguous grammars

Consider the following toy example:

import Control.Applicative
import Text.Earley

data S = S [Char] Char [T] deriving Show
data T = T Char S deriving Show

a, b, c :: Prod r String Char Char
a = satisfy (=='a')
b = satisfy (=='b')
c = satisfy (=='c')

-- "abcb" -> S [a] b [T c (S [] b [])]
-- "abc" -> last 'c' should be unused
-- "abcc" -> parsing all of it without leaving unconsumed pieces should not be possible
-- "abcbcab" -> S [a] b [T c (S [] b [T c (S [a] b)])] OR S [a] b [T c (S [] b []), T c (S [a] b [])]
--s = S <$> many a <*> b <*> many t -- TODO
s = S <$> many a <*> b <*> pure []
t = T <$> c <*> s

main = print (report p "abc") >> print (fullParses p "abc")
    where p = parser $ rule s

As far as I can tell, this grammar should be parsable. Every string only has a finite number of possible parse trees.
Yet parsing fails if I enable the commented line (marked with a TODO).
Am I doing something wrong? Or is this a bug/limitation in Earley?

Kind error trying examples/expr.hs at GHCi (7.10.2 with Stack)

Trying out the mentioned example, I encountered this error in stack lts-3.15 GHC 7.10.2 using GHCi:

Expecting one more argument to ‘Grammar r (Prod r String String Expr)’
Expected a type,
  but ‘Grammar r (Prod r String String Expr)’ has kind ‘* -> *’
In the type signature for ‘expr’:
  expr :: Grammar r (Prod r String String Expr)

Sorry to ask here, I'm assuming it's something I've done wrong, but I'm not sure what it is.

I made sure to copy the example verbatim from the github raw representation. Any help would be greatly appreciated!

Missing case for Constraint for fmap and <*>

Choice with preference

It would be nice to be able to say a otherwise b such that only if a fails, b appears in the output.

Early seems to do an ad-hoc version of this for (Just <$> a) <|> pure Nothing, which returns [Just a] if a succeeds, not the expected [Just a, Nothing].

Generalize `word`?

Should word be generalized to

word :: (Eq t, ListLike i t) => i -> Prod r e t i

instead of

word :: Eq t => [t] -> Prod r e t [t]

?
This would make it easier to use values of type Text in particular.

Is it okey to use "many" and heavily use Applicative parsing techniques?

Is it normal to use Alternative's many? For example, x1 <- rule $ many x2? Is it normal to heavily use techniques known from Applicative parsing world, for example, to write something like the following code?

{-# LANGUAGE RecursiveDo #-}
import Control.Applicative
import Data.Char
import System.Environment
import Text.Earley

data Expr
  = Add Expr Expr
  | Mul Expr Expr
  | Var String
  deriving (Eq, Ord, Show)

sepBy :: Alternative f => f a -> f b -> f [b]
sepBy op x = (:) <$> x <*> many (op *> x)

expr :: Grammar r (Prod r String String Expr)
expr = mdo
  x1 <- rule $ foldl1 Add <$> sepBy (namedToken "+") (foldl1 Mul <$> sepBy (namedToken "*") ((Var <$> satisfy ident) <|> namedToken "(" *> x1 <* namedToken ")"))
  return x1
  where
    ident (x:_) = isAlpha x
    ident _     = False

main :: IO ()
main = do
  x:_ <- getArgs
  print $ fullParses (parser expr) $ words x

The code above is modified version of https://github.com/ollef/Earley/blob/e7de8b77b63e1ee889d98cb6dfc2994178f23e6d/examples/Expr.hs and I hope they are semantically equivalent. I tested my version and it works.

My question is not about performance. It is about semantics. I. e. is code above correct or it works simply by chance?

Crashes if grammar is infinitely ambiguous on given input

Consider this code:

{-# LANGUAGE Haskell2010 #-}
{-# LANGUAGE RecursiveDo #-}

import Text.Earley
import Control.Applicative

data L = A | B | C deriving (Eq, Show)

lang :: Grammar r (Prod r () L ())
lang = mdo {
  r1 <- rule $ (pure () <* token A) <|> r1;
  return r1;
}

main :: IO ()
main = do {
  putStrLn $ show $ fullParses (parser lang) [A];
}

This program crashes with stack overflow if I run it in ghci. And freezes if I compile and run it. I cannot even test whether list of parses is null, i. e. null $ fst $ fullParses (parser lang) [A] crashes in ghci.

I want some function, which can test whether given token list has multiple parses. I. e. I don't want list of parses or even count of parses. I simply want one of 3 results: 0 parses, 1 parse, more than 1 parse. And this function should terminate in case of infinitely ambiguous grammar.

I use Earley 0.13.0.1

Q: How to do error recovery?

Are there any recommendations on how I could do error recovery to report multiple errors?

Disambiguate/Filter productions

Once the parser finishes parsing some ambiguity, it would be really nice to expose this ambiguity to the user, rather than just multiplying the output by this ambiguity size.

For example disambiguate :: ([a] -> b) -> Prod a -> Prod b or more generally ([a] -> [b]) -> Prod a -> Prod b.

I had a look at naively changing some things in parse and following the types but was unable to get this to work without better understanding the algorithm.

The discussion here makes it seem as though this is actually non-trivial! #18

Perhaps some important reading: https://dinhe.net/~aredridel/.notmine/PDFs/Parsing/SCOTT%2C%20Elizabeth%20-%20SPPF-Style%20Parsing%20From%20Earley%20Recognizers.pdf

Use Data.Loc rather than Int

Would it be possible to tag parsed items with a Data.Loc.L rather than an Int? Having both the column and row number is very useful.

More specifically,

allParses :: (forall s. ST s (Result s e i a)) -> ([(a, Int)], Report e i)

would become

allParses :: (forall s. ST s (Result s e i a)) -> ([L a], Report e i)

Currently, I'm lexing with https://hackage.haskell.org/package/lexer-applicative, which returns a stream of L-tagged lexemes anyway, and then I'm awkwardly unpacking and re-packing the locations into the parsed items. Here's a small example:

-- Token
data Token 
    = TkInt Int
    | ...

-- Lexer
lexer :: String -> [L Token]

-- Syntax
data Expr 
    = EInt Int
    ...

-- Parser
expr :: Prod r String (L Token) (L Expr)
expr = (\L loc n -> L loc (EInt n)) <$> intLit
   <|> ...

intLit :: Prod r String (L Token) (L Int)
intLit = (\L loc (TkInt n) -> L loc n) <$> satisfy isIntLit <?> "int"
  where
    isIntLit (L _ (TkInt _)) = True
    isIntLit _ = False

(apologies if the above doesn't typecheck, I just wrote it by hand)

If Earley annotated parsed items with L then I could simply forget all the work that the lexer did by mapping unLoc over the stream of tokens, so I'd have productions like Prod r String Token Expr rather than Prod r String (L Token) (L Expr). I wonder if there's a better way to integrate these two libraries.

Parsing into a list delimited by a token

I'm a very novice Haskell programmer and I wanted to write something with parsers. My current solution for parsing abc.def.ghi uses the split package like so:

dparse :: (String -> Bool) -> Prod r String String [String]
dparse f = splitOn "." <$> satisfy f

I feel like, from my Scala+cats knowledge, that this is a very poor way to do this. How should I properly implement it?

Generate ABNF/EBNF from Earley grammar

Any idea how hard this would be to do? If Earley could generate EBNF then users could pass it to tools like GrammKit to make nice diagrams for their grammar.

Diverges with sepEndBy from parser-combinator

Hello. Thank you for this great library.

I think I found a bug. The grammar below diverges if I use default sepEndBy from parser-combinators:

-- Does not work, but sepEndBy only requires an Applicative instance and we have it
grammar' = rule $ sepEndBy (satisfy (== OpenBracket)) (satisfy (== Dot))

-- but with this one everything is ok
sepEndBy :: Prod r e t a -> Prod r e t b -> Prod r e t [a]
-- btw sepBy can come from parser-combinators, and still no problem here
sepEndBy p sep = sepBy p sep <* optional sep

Unusable {-# UNPACK #-} pragmas.

[2 of 4] Compiling Text.Earley.Parser ( Text/Earley/Parser.hs, dist/build/Text/Earley/Parser.o )

Text/Earley/Parser.hs:83:3: Warning:
    Ignoring unusable UNPACK pragma on the third argument of ‘State’
    In the definition of data constructor ‘State’
    In the data declaration for ‘State’

Text/Earley/Parser.hs:92:3: Warning:
    Ignoring unusable UNPACK pragma on the second argument of ‘Cont’
    In the definition of data constructor ‘Cont’
    In the data declaration for ‘Cont’

Text/Earley/Parser.hs:92:3: Warning:
    Ignoring unusable UNPACK pragma on the fourth argument of ‘Cont’
    In the definition of data constructor ‘Cont’
    In the data declaration for ‘Cont’

In all cases, those arguments are Arg values, and Arg is a type alias for a function type. Function types can't be unpacked, since they aren't thin wrappers around primitives.

Is Earley’s implementation related to Marpa?

The author of Marpa claims it to be an improvement over Joop Leo’s ’91 paper.

Has Marpa influenced this library’s implementation?

out-of-order results

below, edit has nullable rules, and insertion has a wildcard. the parse results are all present, but some alternatives that appear before other alternatives ("lexically" in the grammar), appear later in the results.

e.g. given (full script is below):

          Editing <$> edit
    <|> Insertion <$> insertion

we get:

["whole","line"]
Insertion ["whole","line"]
Editing (Edit Copy Whole Line)

rather than:

["whole","line"]
Editing (Edit Copy Whole Line)
Insertion ["whole","line"]

I don't know if early parsers specify the order of the results for an ambiguous grammar, but it's how other parser libraries behave. and either way, would be nice to have: my hack is to postprocess the set of results, and rank them. this is messier (as ranking can be "nonlocal"), and maybe slower, than the first result being the expected one.

to reproduce:

import Control.Applicative
import Text.Earley

data Edit   = Edit Action Slice Region     deriving (Show)
data Action = Copy | Delete | Cut          deriving (Show)
data Slice  = Whole | Backwards | Forwards deriving (Show)
data Region = Char | Word | Line           deriving (Show)

type Text = String

edit :: Grammar r (Prod r String Text Edit) 
edit = do

 action <- rule $ "action"
   <=> Copy        <$ symbol "copy"
   <|> Delete      <$ symbol "delete"
   <|> Cut         <$ symbol "cut"

 slice <- rule $ "slice"
   <=> Whole     <$ symbol "whole"
   <|> Backwards <$ symbol "backwards"
   <|> Forwards  <$ symbol "forwards"

 region <- rule $ "region"
   <=> Char <$ symbol "char"
   <|> Word <$ symbol "word"
   <|> Line <$ symbol "line"

 edit' <- rule $ "edit"
   <=> Edit <$> action            <*> (slice -?- Whole) <*> (region -?- Line)
   <|> Edit <$> (action -?- Copy) <*> slice             <*> region
   -- <=> Edit <$> action <*> slice <*> region

 return edit'

data Command
 = Editing Edit
 | Insertion [Text]
 deriving (Show)

command :: Grammar r (Prod r String Text Command) 
command = do

  edit' <- edit

  insertion' <- rule $ some anyWord

  command' <- rule $ "command"
    <=> Editing <$> edit'
    <|> Insertion <$> insertion'

  return command'

anyWord :: Prod z String Text Text
anyWord = satisfy (const True) 

parseCommand :: [Text] -> ([Command], Report String [Text])
parseCommand s = fullParses (parser command) s 

(<=>) :: e -> Prod r e t a -> Prod r e t a 
(<=>) = flip (<?>)
infix 2 <=> 

(-?-) :: Prod r e t a -> a -> Prod r e t a 
(-?-) p x = maybe x id <$> optional p 

printParses :: (Show a, Show e, Show ts) => (ts -> ([a], Report e ts)) -> ts -> IO ()
printParses theParser theSentence = do 
 let (theResults, theReport) = theParser theSentence 
 putStrLn"" 
 print theSentence 
 _ <- traverse print theResults 
 print theReport 

main :: IO () 
main = do

 putStrLn"" 
 printParses parseCommand (words "line")
 printParses parseCommand (words "copy")
 printParses parseCommand (words "copy whole")
 printParses parseCommand (words "whole line")
 printParses parseCommand (words "copy whole line")
 printParses parseCommand (words "not an edit")

{- outputs: 

["line"]
Insertion ["line"]
Report {position = 1, expected = [], unconsumed = []}

["copy"]
Editing (Edit Copy Whole Line)
Insertion ["copy"]
Report {position = 1, expected = ["region","slice"], unconsumed = []}

["copy","whole"]
Insertion ["copy","whole"]
Editing (Edit Copy Whole Line)
Report {position = 2, expected = ["region"], unconsumed = []}

["whole","line"]
Insertion ["whole","line"]
Editing (Edit Copy Whole Line)
Report {position = 2, expected = [], unconsumed = []}

["copy","whole","line"]
Editing (Edit Copy Whole Line)
Editing (Edit Copy Whole Line)
Insertion ["copy","whole","line"]
Report {position = 3, expected = [], unconsumed = []}

["not","an","edit"]
Insertion ["not","an","edit"]
Report {position = 3, expected = [], unconsumed = []}

-}

Make the Mixfix example part of the library

Similar to Parsec's makeTokenParser.

There might also be other utilities in existing parsing libraries that could be added to Earley.

Is there any possible optimization for getting non-exhaustive outputs for ambiguous grammars

I'm parsing an ambiguous grammar and I'd like to distinguish the following cases (more quickly than dealing with a list of all results):

Failure to parse
Exactly one parse with result
Multiple parses (potentially with some small number of valid parses returned)

The use case is being able to tell the user: "Hey, you gave me ambiguous input, here are some places to check for ambiguity". Giving them 20000 results isn't helpful, especially as any large number is bound to be the result of some combinatorial explosion, and solving one ambiguity makes good progress.

I had a quick go at modifying fullParses in a few ways to prompt an earlier return but there wasn't a quick "no-understanding" fix.

document implementation

I want to contribute a few ideas I have for new features. but after reading the source, I still can't understand how parse works.

in http://www.nltk.org/book/ch08-extras.html (2 Chart Parsing and 2.7 The Earley Algorithm), the concepts seem to be:

edges between tokens, complete or non-complete
rules, which add new edges given the current edges
a chart, a structure that holds edges
strategies, this set of rules that differentiate chart parsers from each other

but I don't see these in the types (State/Cont/Conts/Result). if they're not really there, how should I think about the types? is there a blog post or paper I could read that more closely reflects the core ideas behind this package?

also, it helps to know about:

any patterns this package uses, or tricks. e.g. why the nested reference in Conts:

     contsArgs :: !(STRef s (Maybe (STRef s (Results s a))))

is it for performance, for correctness? Everything is there for a reason, but it's hard to read for a noob like me.

thanks :-)

expose internals

a user might want to observe sharing themselves (e.g. manual annotations, http://hackage.haskell.org/package/data-reify, their own monad, etc.), and pass it down to the parser.

We can put it in a .Internal module.