Comments (10)
I'm aware that the internals aren't exactly easy to follow, and I've been meaning to write something up about it. This issue just might be the push I need to actually do it. :)
Some quick remarks about the things you mentioned:
Both `State` and `Cont` represent Earley states. They differ in the types, since we're in a typed setting and are constructing the results as we parse. `Cont` needs an argument, i.e. a parse result, to become a `State`.
The double references look a bit peculiar in Haskell, but are there for a reason. I'll start with the one in `Rule`:
We only want a rule to be expanded once per position in the input. In traditional Earley parsing this is implicit in the usage of sets of states, since adding duplicates is then a no-op. We can't hold sets of states because states contain things of unknown type and are thus not in `Eq` or `Ord`.
So we instead make sure to only expand each rule once per position, which has the same effect. When a rule is expanded, we also have to keep track of a continuation, i.e. the productions to parse after a successful parse of the rule. But since we only expand a rule once per position, what happens when we encounter a rule a second time at the same position? This is where these pointers come in. At this point we look up the rule's continuation pointer (i.e. dereference the outer pointer layer), and add ourselves to the list of continuations (i.e. update the inner pointer). Now any states (e.g. from the first time the rule was expanded) that refer to this continuation (the inner pointer) are automatically updated!
At each position we have to make sure that the outer reference points to a fresh continuation pointer. This is what the `reset` computation does.
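The ref-of-ref trick can be sketched in a few lines. This is my own illustration using plain `IORef`s, with strings standing in for continuations; it is not the library's actual types:

```haskell
import Data.IORef

-- Two states capture the same inner reference. Adding a continuation through
-- the outer reference is then visible to both, even though they were created
-- before the addition.
demo :: IO ([String], [String])
demo = do
  inner <- newIORef ["cont1"]    -- the shared list of continuations
  outer <- newIORef inner        -- reset would repoint this at a fresh inner ref
  let stateA = inner             -- two states that captured the inner pointer,
      stateB = inner             -- e.g. from the first expansion of the rule
  shared <- readIORef outer      -- second encounter: dereference the outer layer...
  modifyIORef shared ("cont2" :) -- ...and add ourselves to the continuations
  (,) <$> readIORef stateA <*> readIORef stateB

main :: IO ()
main = demo >>= print -- both states see ["cont2","cont1"]
```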
We have a similar situation when there are alternatives, i.e. one state that branches into two states. In this case both branches will have the same continuation (e.g. `c` in `(a <|> b) <*> c`). If both branches produce a result at the same position we want to merge the branches --- otherwise we would suddenly have several states at the same position (at `c` in the example) and parsing would become exponential. This is what the `contsArgs` field that you referred to is for, and it works in roughly the same way as the pointers for rules. The inner reference gives us a way to inject more results into the continuation state even when that state has already been processed at a position.
oh yeah that helps.
feel free to link me to any drafts. sooner or later I'll learn how it works
(it's just a few lines too). until then, I can provide "fresh eyes".
(this message was composed with dictation: charitably interpret typos)
Sam Boosalis
I've started doing this. https://github.com/ollef/Earley/blob/master/docs/implementation.md
Please let me know if anything needs clarification or if there are mistakes, typos, or whatever.
Most of it is background so far, but there is already some meaty information in there.
this is amazing!
The implementation.md file helps a bunch. However, is there any explanation of how to write the grammars themselves? You have the examples, but don't really explain what the different symbols are used for or how to think about writing a grammar from scratch, or how to handle whitespace effectively, etc. I'd really like to use this! (thank you, by the way, for your work so far!)
(this implementation file explains the internals only.)
have you used other parser combinator libraries? the API is similar. also,
what's a grammar you'd like to define? maybe I can help, and we can then
address these issues in the documentation.
I've done parsing in Ruby, Java, and C -- I'm pretty new to Haskell (sorry!
I know that complicates things)
Your offer to help is very kind. I'd be happy to help out in terms of
documentation if you wish once I understand things better, so that I can
give back.
Here's an example of a line I'd like to parse, and the resulting structure I'd like to produce (ultimately as JSON):

```
input = " .. [next x] some note #other #for:bob, mary and (betty or larry) /6m/12/40h"

var expected = {
  numberOfInitialDots: 2,
  boxed_word: 'next',
  boxed_letter: 'x',
  text: 'some note',
  tags: {
    other: '',
    for: 'bob, mary and (betty or larry)'
  },
  numbers: [
    { value: 6, letter: 'm' },
    { value: 12, letter: '' },
    { value: 40, letter: 'h' }
  ]
}
```
cool.
here's a tutorial for parsing log files with more explanation, using a different parser combinator library:
(I think we should link to general parser combinator tutorials from the readme).
for your problem, the "Haskell way" is first to define a custom type. we construct this type as soon as possible whenever consuming data, from a parser or a database or whatever, and then print/serialize that data back again at the very end. the benefits being that it's much easier to manipulate structured data than raw text, and also easier to manipulate typed data than json.
```haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}
import GHC.Generics (Generic)
import Data.Aeson (ToJSON) -- from the `aeson` package

data GregLine = GregLine
  { numberOfInitialDots :: Int
  , boxed_word          :: String
  , boxed_letter        :: Char
  , text                :: String
  , tags                :: [Tag]
  , numbers             :: [Number]
  } deriving (Generic, ToJSON) -- this one line lets us serialize our datatype into json with `encode`.

type Tag    = (String, Maybe String) -- the Maybe provides explicit optionality...
type Number = (Int, Maybe Char)      -- ...and will also simplify the parsers.
```
Haskell has many "combinator libraries", in particular "parser combinator libraries" like this one. the basic idea is that the library exports a type (call it `Combinator`), some "primitives" that construct values of that type (like `aPrimitive :: String -> Combinator`), and "combinators" that transform values of that type (like `aCombinator :: Integer -> Combinator -> Combinator`). if we wanted to build a simple "arithmetic combinator library" (sounds cooler than it is), the primitives might be `0` and `1`, and the combinators might be `+` and `*`.
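that toy library might look like this (my own illustration; none of these names come from the Earley package):

```haskell
-- An abstract type, primitives that construct it, combinators that combine it.
newtype Arith = Arith { runArith :: Integer }

zero, one :: Arith                       -- the primitives
zero = Arith 0
one  = Arith 1

plus, times :: Arith -> Arith -> Arith   -- the combinators
plus  (Arith x) (Arith y) = Arith (x + y)
times (Arith x) (Arith y) = Arith (x * y)

main :: IO ()
main = print (runArith ((one `plus` one) `times` (one `plus` one `plus` one))) -- 2 * 3 = 6
```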
so first, let's import the primitives from `Text.Earley` and the combinators from `Control.Applicative` (the latter is in the standard library, yay code reuse!)

```haskell
import Text.Earley
import Control.Applicative (many, some, optional, (<$>), (<*>), (<*), (*>))
```
if we specialize the type signatures to `Prod`, we get:

```haskell
optional :: Prod r e t a -> Prod r e t (Maybe a)
many     :: Prod r e t a -> Prod r e t [a]
some     :: Prod r e t a -> Prod r e t [a]
(<*)     :: Prod r e t a -> Prod r e t b -> Prod r e t a
(*>)     :: Prod r e t b -> Prod r e t a -> Prod r e t a
```

we can read them as:

```haskell
optional -- zero or one, or regex "?"
many     -- zero or more, or regex "*"
some     -- one or more, or regex "+"
(<*)     -- parse the left and keep its result, then parse the right and ignore its result
(*>)     -- parse the left and ignore its result, then parse the right and keep its result
```
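these readings can be exercised concretely. the sketch below uses base's `ReadP` rather than Earley's `Prod` (so it runs with no extra packages), but the combinators come from `Control.Applicative` either way:

```haskell
import Control.Applicative (many, optional, some)
import Text.ParserCombinators.ReadP (ReadP, char, eof, readP_to_S)

-- Run a parser to completion and take its (here unambiguous) result.
run :: ReadP a -> String -> a
run p s = fst (head (readP_to_S (p <* eof) s))

main :: IO ()
main = do
  print (run (many (char 'a')) "")        -- zero or more: matches the empty input
  print (run (some (char 'a')) "aaa")     -- one or more: "aaa"
  print (run (optional (char 'a')) "a")   -- zero or one: Just 'a'
  print (run (char 'a' <* char 'b') "ab") -- keep the left result, drop the right: 'a'
```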
we can read the type constructor `Prod r e t` as "consumes some tokens and then returns", so an expression of type `pInt :: Prod r e t Int` means that the expression "consumes some tokens and then returns an int". thus, if we wanted to increment the result of a parse, we couldn't write `(+1) pInt`, since that would be incrementing the parser, not the number parsed. we would write `(+1) <$> pInt`.
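to see that difference run, here is a sketch with base's `ReadP` standing in for `Prod` (the `pInt` here is my own stand-in, not from the Earley package):

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP (ReadP, eof, munch1, readP_to_S)

-- A parser that consumes digits and returns an Int.
pInt :: ReadP Int
pInt = read <$> munch1 isDigit

run :: ReadP a -> String -> a
run p s = fst (head (readP_to_S (p <* eof) s))

main :: IO ()
main = do
  print (run pInt "41")            -- 41: the parsed number
  print (run ((+1) <$> pInt) "41") -- 42: <$> increments the *result*, not the parser
```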
anyways, following is the basic scaffold for your problem, using the library. it reads like BNF, hopefully. or more like yacc, where you can insert "interpretations" around productions.
(the `g` stands for "grammar" and the `p` stands for "parser". I haven't tried type checking this)
```haskell
-- | can parse @" .. [next x] some note #other #for:bob, mary and (betty or larry) /6m/12/40h"@
gGregLine :: Grammar r (Prod r e Char GregLine) -- the tokens are characters
gGregLine = do
  pWhitespace          <- rule $ some (symbol ' ')
  pNumberOfInitialDots <- rule $ length <$> some (symbol '.') -- length "interprets" the parse result
                                                              -- of "one or more" dots
  pBoxed_word          <- rule $ _ -- uses <* and *>
  pBoxed_letter        <- rule $ _
  pText                <- rule $ _
  pTags                <- rule $ _ -- uses optional
  pNumbers             <- rule $ _
  pGregLine            <- rule $ GregLine <$> (pWhitespace *> pNumberOfInitialDots) -- skip initial whitespace
                                          <*> pBoxed_word
                                          <*> pBoxed_letter
                                          <*> pText
                                          <*> pTags
                                          <*> pNumbers
  return pGregLine
```
let me know if you're unfamiliar with "Applicative"s. if so, for now, you can think of `f <$> x <*> y` as similar to normal pure function application `f x y`, only "with effects". here, the "effects" are "matching tokens and parsing them into some value".
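the analogy can be checked with `Maybe`, which is an Applicative just like `Prod` (the effect for `Maybe` is "may fail"):

```haskell
f :: Int -> Int -> Int
f x y = x * 10 + y

main :: IO ()
main = do
  print (f 4 2)                   -- plain application: 42
  print (f <$> Just 4 <*> Just 2) -- "application with effects": Just 42
  print (f <$> Just 4 <*> Nothing)-- one failing effect fails the whole: Nothing
```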
and you can ignore `rule` for now; it just says its argument is a grammatical production with a left-hand side, rather than just a grammatical expression with only a right-hand side.
```
pInt <- rule $ read <$> satisfy isDigit
                        |_____________|
                         :: Prod r e Char Char
               |________________________|
                :: Prod r e Char Int
|______________________________________|
 :: Grammar r (Prod r e Char Int)
```
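putting the diagrammed rule into a complete program (this assumes the `Earley` package is installed; also note that `read <$> satisfy isDigit` doesn't type check as written, since `read` wants a `String` while `satisfy isDigit` yields a single `Char`, so `digitToInt` is substituted here):

```haskell
import Data.Char (isDigit, digitToInt)
import Text.Earley

-- The rule from the diagram, as a self-contained grammar over Char tokens.
gInt :: Grammar r (Prod r e Char Int)
gInt = rule $ digitToInt <$> satisfy isDigit

main :: IO ()
main = print (fst (fullParses (parser gInt) "7"))
```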
by the way, how did you come across this package?
and let me know if you need more info.
@sboosali: I've added you as a collaborator to the repository so you can add this to the documentation if you're up for it. I suggest adding it to the `docs` directory and linking it somewhere appropriate in the README. I can help flesh it out when I find the time. A minor issue from a quick look is that the types of a few of the combinators are incorrect: the list and maybe types should be "pushed into" the return types.
thanks, will do (the type typo is noted).
also, I'll try to clearly separate "new to parser combinators" from "new to this package".