Comments (10)
I'm aware that the internals aren't exactly easy to follow, and I've been meaning to write something up about it. This issue just might be the push I need to actually do it. :)
Some quick remarks about the things you mentioned:
Both `State` and `Cont` represent Earley states. They differ in the types, since we're in a typed setting and are constructing the results as we parse. `Cont` needs an argument, i.e. a parse result, to become a `State`.
The double references look a bit peculiar in Haskell, but are there for a reason. I'll start with the one in `Rule`:
We only want a rule to be expanded once per position in the input. In traditional Earley parsing this is implicit in the usage of sets of states, since adding duplicates is then a no-op. We can't hold sets of states because states contain things of unknown type and are thus not in `Eq` or `Ord`.
So we instead make sure to only expand each rule once per position, which has the same effect. When a rule is expanded, we also have to keep track of a continuation, i.e. the productions to parse after a successful parse of the rule. But since we only expand a rule once per position, what happens when we encounter a rule a second time at the same position? This is where these pointers come in. At this point we look up the rule's continuation pointer (i.e. dereference the outer pointer layer), and add ourselves to the list of continuations (i.e. update the inner pointer). Now any states (e.g. from the first time the rule was expanded) that refer to this continuation (the inner pointer) are automatically updated!
At each position we have to make sure that the outer reference points to a fresh continuation pointer. This is what the `reset` computation does.
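The ref-of-ref trick can be sketched in a few lines. This is my own illustration using plain `IORef`s, with strings standing in for continuations; it is not the library's actual types:

```haskell
import Data.IORef

-- Two states capture the same inner reference. Adding a continuation through
-- the outer reference is then visible to both, even though they were created
-- before the addition.
demo :: IO ([String], [String])
demo = do
  inner <- newIORef ["cont1"]    -- the shared list of continuations
  outer <- newIORef inner        -- reset would repoint this at a fresh inner ref
  let stateA = inner             -- two states that captured the inner pointer,
      stateB = inner             -- e.g. from the first expansion of the rule
  shared <- readIORef outer      -- second encounter: dereference the outer layer...
  modifyIORef shared ("cont2" :) -- ...and add ourselves to the continuations
  (,) <$> readIORef stateA <*> readIORef stateB

main :: IO ()
main = demo >>= print -- both states see ["cont2","cont1"]
```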
We have a similar situation when there are alternatives, i.e. one state that branches into two states. In this case both branches will have the same continuation (e.g. `c` in `(a <|> b) <*> c`). If both branches produce a result at the same position we want to merge the branches --- otherwise we would suddenly have several states at the same position (at `c` in the example) and parsing would become exponential. This is what the `contsArgs` field that you referred to is for, and it works in roughly the same way as the pointers for rules. The inner reference gives us a way to inject more results into the continuation state even when that state has already been processed at a position.
oh yeah that helps.
feel free to link me to any drafts. sooner or later I'll learn how it works
(it's just a few lines too). until then, I can provide "fresh eyes".
(this message was composed with dictation: charitably interpret typos)
Sam Boosalis
I've started doing this. https://github.com/ollef/Earley/blob/master/docs/implementation.md
Please let me know if anything needs clarification or if there are mistakes, typos, or whatever.
Most of it is background so far, but there is already some meaty information in there.
this is amazing!
The implementation.md file helps a bunch. However, is there any explanation of how to write the grammars themselves? You have the examples, but don't really explain what the different symbols are used for or how to think about writing a grammar from scratch, or how to handle whitespace effectively, etc. I'd really like to use this! (thank you, by the way, for your work so far!)
(this implementation file explains the internals only.)
have you used other parser combinator libraries? the API is similar. also,
what's a grammar you'd like to define? maybe I can help, and we can then
address these issues in the documentation.
I've done parsing in Ruby, Java, and C -- I'm pretty new to Haskell (sorry!
I know that complicates things)
Your offer to help is very kind. I'd be happy to help out in terms of
documentation if you wish once I understand things better, so that I can
give back.
Here's an example of a line I'd like to parse, and the resulting structure I'd like to produce (ultimately as JSON):

```
input = " .. [next x] some note #other #for:bob, mary and (betty or larry) /6m/12/40h"

var expected = {
  numberOfInitialDots: 2,
  boxed_word: 'next',
  boxed_letter: 'x',
  text: 'some note',
  tags: {
    other: '',
    for: 'bob, mary and (betty or larry)'
  },
  numbers: [
    { value: 6, letter: 'm' },
    { value: 12, letter: '' },
    { value: 40, letter: 'h' }
  ]
}
```
cool.
here's a tutorial for parsing log files with more explanation, using a different parser combinator library:
(I think we should link to general parser combinator tutorials from the readme).
for your problem, the "Haskell way" is first to define a custom type. we construct this type as soon as possible whenever consuming data, from a parser or a database or whatever, and then print/serialize that data back again at the very end. the benefits being that it's much easier to manipulate structured data than raw text, and also easier to manipulate typed data than json.
```haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}
import GHC.Generics (Generic)
import Data.Aeson (ToJSON) -- from the `aeson` package

data GregLine = GregLine
  { numberOfInitialDots :: Int
  , boxed_word          :: String
  , boxed_letter        :: Char
  , text                :: String
  , tags                :: [Tag]
  , numbers             :: [Number]
  } deriving (Generic, ToJSON) -- this one line lets us serialize our datatype into json with `encode`.

type Tag    = (String, Maybe String) -- the Maybe provides explicit optionality...
type Number = (Int, Maybe Char)      -- ...and will also simplify the parsers.
```
Haskell has many "combinator libraries", in particular "parser combinator libraries" like this one. the basic idea is that the library exports a type (call it `Combinator`), some "primitives" that construct values of that type (like `aPrimitive :: String -> Combinator`), and "combinators" that transform values of that type (like `aCombinator :: Integer -> Combinator -> Combinator`). if we wanted to build a simple "arithmetic combinator library" (sounds cooler than it is), the primitives might be `0` and `1`, and the combinators might be `+` and `*`.
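that toy library might look like this (my own illustration; none of these names come from the Earley package):

```haskell
-- An abstract type, primitives that construct it, combinators that combine it.
newtype Arith = Arith { runArith :: Integer }

zero, one :: Arith                       -- the primitives
zero = Arith 0
one  = Arith 1

plus, times :: Arith -> Arith -> Arith   -- the combinators
plus  (Arith x) (Arith y) = Arith (x + y)
times (Arith x) (Arith y) = Arith (x * y)

main :: IO ()
main = print (runArith ((one `plus` one) `times` (one `plus` one `plus` one))) -- 2 * 3 = 6
```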
so first, let's import the primitives from `Text.Earley` and the combinators from `Control.Applicative` (the latter is in the standard library, yay code reuse!)

```haskell
import Text.Earley
import Control.Applicative (many, some, optional, (<$>), (<*>), (<*), (*>))
```
if we specialize the type signatures to `Prod`, we get:

```haskell
optional :: Prod r e t a -> Prod r e t (Maybe a)
many     :: Prod r e t a -> Prod r e t [a]
some     :: Prod r e t a -> Prod r e t [a]
(<*)     :: Prod r e t a -> Prod r e t b -> Prod r e t a
(*>)     :: Prod r e t b -> Prod r e t a -> Prod r e t a
```

we can read them as:

```haskell
optional -- zero or one, or regex "?"
many     -- zero or more, or regex "*"
some     -- one or more, or regex "+"
(<*)     -- parse the left and keep its result, then parse the right and ignore its result
(*>)     -- parse the left and ignore its result, then parse the right and keep its result
```
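these readings can be exercised concretely. the sketch below uses base's `ReadP` rather than Earley's `Prod` (so it runs with no extra packages), but the combinators come from `Control.Applicative` either way:

```haskell
import Control.Applicative (many, optional, some)
import Text.ParserCombinators.ReadP (ReadP, char, eof, readP_to_S)

-- Run a parser to completion and take its (here unambiguous) result.
run :: ReadP a -> String -> a
run p s = fst (head (readP_to_S (p <* eof) s))

main :: IO ()
main = do
  print (run (many (char 'a')) "")        -- zero or more: matches the empty input
  print (run (some (char 'a')) "aaa")     -- one or more: "aaa"
  print (run (optional (char 'a')) "a")   -- zero or one: Just 'a'
  print (run (char 'a' <* char 'b') "ab") -- keep the left result, drop the right: 'a'
```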
we can read the type constructor `Prod r e t` as "consumes some tokens and then returns", so an expression of type `pInt :: Prod r e t Int` means that the expression "consumes some tokens and then returns an int". thus, if we wanted to increment the result of a parse, we couldn't write `(+1) pInt`, since that would be incrementing the parser, not the number parsed. we would write `(+1) <$> pInt`.
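to see that difference run, here is a sketch with base's `ReadP` standing in for `Prod` (the `pInt` here is my own stand-in, not from the Earley package):

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP (ReadP, eof, munch1, readP_to_S)

-- A parser that consumes digits and returns an Int.
pInt :: ReadP Int
pInt = read <$> munch1 isDigit

run :: ReadP a -> String -> a
run p s = fst (head (readP_to_S (p <* eof) s))

main :: IO ()
main = do
  print (run pInt "41")            -- 41: the parsed number
  print (run ((+1) <$> pInt) "41") -- 42: <$> increments the *result*, not the parser
```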
anyways, following is the basic scaffold for your problem, using the library. it reads like BNF, hopefully. or more like yacc, where you can insert "interpretations" around productions.
(the `g` stands for "grammar" and the `p` stands for "parser". I haven't tried type checking this)
```haskell
-- | can parse @" .. [next x] some note #other #for:bob, mary and (betty or larry) /6m/12/40h"@
gGregLine :: Grammar r (Prod r e Char GregLine) -- the tokens are characters
gGregLine = do
  pWhitespace          <- rule $ some (symbol ' ')
  pNumberOfInitialDots <- rule $ length <$> some (symbol '.') -- length "interprets" the parse result
                                                              -- of "one or more" dots
  pBoxed_word          <- rule $ _ -- uses <* and *>
  pBoxed_letter        <- rule $ _
  pText                <- rule $ _
  pTags                <- rule $ _ -- uses optional
  pNumbers             <- rule $ _
  pGregLine            <- rule $ GregLine <$> (pWhitespace *> pNumberOfInitialDots) -- skip initial whitespace
                                          <*> pBoxed_word
                                          <*> pBoxed_letter
                                          <*> pText
                                          <*> pTags
                                          <*> pNumbers
  return pGregLine
```
let me know if you're unfamiliar with "Applicative"s. if so, for now, you can think of `f <$> x <*> y` as similar to normal pure function application `f x y`, only "with effects". here, the "effects" are "matching tokens and parsing them into some value".
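the analogy can be checked with `Maybe`, which is an Applicative just like `Prod` (the effect for `Maybe` is "may fail"):

```haskell
f :: Int -> Int -> Int
f x y = x * 10 + y

main :: IO ()
main = do
  print (f 4 2)                   -- plain application: 42
  print (f <$> Just 4 <*> Just 2) -- "application with effects": Just 42
  print (f <$> Just 4 <*> Nothing)-- one failing effect fails the whole: Nothing
```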
and you can ignore `rule` for now; it just says its argument is a grammatical production with a left-hand side, rather than just a grammatical expression with only a right-hand side.
```
pInt <- rule $ read <$> satisfy isDigit
                        |_____________|
                         :: Prod r e Char Char
               |________________________|
                :: Prod r e Char Int
|______________________________________|
 :: Grammar r (Prod r e Char Int)
```
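putting the diagrammed rule into a complete program (this assumes the `Earley` package is installed; also note that `read <$> satisfy isDigit` doesn't type check as written, since `read` wants a `String` while `satisfy isDigit` yields a single `Char`, so `digitToInt` is substituted here):

```haskell
import Data.Char (isDigit, digitToInt)
import Text.Earley

-- The rule from the diagram, as a self-contained grammar over Char tokens.
gInt :: Grammar r (Prod r e Char Int)
gInt = rule $ digitToInt <$> satisfy isDigit

main :: IO ()
main = print (fst (fullParses (parser gInt) "7"))
```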
by the way, how did you come across this package?
and let me know if you need more info.
@sboosali: I've added you as a collaborator to the repository so you can add this to the documentation if you're up for it. I suggest adding it to the `docs` directory and linking it somewhere appropriate in the README. I can help flesh it out when I find the time. A minor issue from a quick look is that the types of a few of the combinators are incorrect: the list and maybe types should be "pushed into" the return types.
thanks, will do (the type typo is noted).
also, I'll try to clearly separate "new to parser combinators" from "new to this package".