rust-bakery / nom
Rust parser combinator framework
License: MIT License
Although there are comments describing the variants ConsumerState::Await and ConsumerState::Seek in https://github.com/Geal/nom/blob/master/src/consumer.rs#L76, they are not documentation comments, so the generated documentation gives no indication of what these variants do.
Fixing this would be as simple as adding /// (amount_consumed, buffer_needed) before Await and /// (consumed, new_position, buffer_needed) before Seek. That wouldn't be very much documentation, but it would be hugely useful for those trying to figure out ConsumerState from the documentation alone.
The example consumer implementation in the README doesn't implement the failed method, so it doesn't compile.
As a quick example, a rewritten alt!()
#[macro_export]
macro_rules! alt (
($name:ident<$i:ty,$o:ty>, $($rest:tt)*) => (
fn $name(i:$i) -> IResult<$i,$o>{
alt_parser!(i | $($rest)*)
}
);
($($rest:tt)*) => ( | i | { alt_parser!(i | $($rest)*) } );
);
Just one additional line, and now omitting the name allows it to be used in value position.
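The two-arm idea can be tried in isolation with a simplified stand-in. The sketch below is not nom's implementation: PResult, tag_a, tag_b, and the cut-down alt_parser! are hypothetical names invented for illustration, and parsers are plain functions over &str.

```rust
// Simplified result type for this sketch (not nom's IResult):
// Some((remaining, matched)) on success, None on failure.
type PResult<'a> = Option<(&'a str, &'a str)>;

// Hypothetical leaf parsers that each match one fixed character.
fn tag_a(i: &str) -> PResult<'_> {
    i.strip_prefix('a').map(|rest| (rest, &i[..1]))
}

fn tag_b(i: &str) -> PResult<'_> {
    i.strip_prefix('b').map(|rest| (rest, &i[..1]))
}

// Cut-down stand-in for alt_parser!: try each parser in order.
macro_rules! alt_parser (
    ($i:expr, $first:ident $(| $rest:ident)*) => (
        $first($i)$(.or_else(|| $rest($i)))*
    );
);

macro_rules! alt (
    // Named form: expands to a function item.
    ($name:ident, $($rest:tt)*) => (
        fn $name(i: &str) -> PResult<'_> {
            alt_parser!(i, $($rest)*)
        }
    );
    // Anonymous form: expands to a closure usable in value position.
    ($($rest:tt)*) => ( |i| alt_parser!(i, $($rest)*) );
);

// Named use, in item position.
alt!(a_or_b, tag_a | tag_b);
```

With the name omitted, `let p = alt!(tag_a | tag_b);` yields a closure that can be called like any other parser.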
Since nom is mostly generic, it should be possible to apply parsers on bit slices. The naive way is to do it like this:
let bv = BitVec::from_bytes(&data[..]);
let bits: Vec<bool> = bv.iter().collect();
This is not very efficient, and a lot of bit level manipulations are calculations, so working with booleans is not the right way.
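One alternative to a Vec<bool> is to keep the byte slice and track a bit offset next to it. The following is only a sketch under that assumption; take_bits is a hypothetical helper, not an existing nom parser.

```rust
/// Reads `count` bits (at most 32), most significant bit first, starting at
/// bit `offset` of `input`. Returns the new bit offset and the value read,
/// or None if the request does not fit in the input.
fn take_bits(input: &[u8], offset: usize, count: usize) -> Option<(usize, u32)> {
    if count > 32 || offset + count > input.len() * 8 {
        return None;
    }
    let mut acc = 0u32;
    for bit in offset..offset + count {
        let byte = input[bit / 8];
        let shift = 7 - (bit % 8);
        // Shift the accumulator left and append the next bit.
        acc = (acc << 1) | u32::from((byte >> shift) & 1);
    }
    Some((offset + count, acc))
}
```

The returned offset can be fed into the next call, so values are computed as integers directly instead of being rebuilt from booleans.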
On line 470.
named!(xxx, take_until_either!("!."));
#[test]
fn test_take_until_either() {
assert_eq!(
xxx(&b"123"[..]),
nom::IResult::Incomplete(nom::Needed::Unknown)
);
}
//
//thread 'test_take_until_either' panicked at 'assertion failed: `(left == right)` (left: `Error(Position(14, [49, 50, 51]))`, right: `Incomplete(Unknown)`)'
I'd expect the result of the above to be Incomplete(Unknown) because it's unknown how much more must be read until any of the bytes are found. Is the error result correct?
Also compare this to take_until...
named!(xxx2, take_until!("end"));
#[test]
fn test_take_until() {
assert_eq!(
xxx2(&b"123"[..]),
nom::IResult::Incomplete(nom::Needed::Unknown)
);
}
//
//thread 'test_take_until' panicked at 'assertion failed: `(left == right)` (left: `Incomplete(Size(4))`, right: `Incomplete(Unknown)`)'
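For reference, the behavior expected in this report can be written down without nom. This is a sketch of the expectation only, with its own simplified IResult and Needed types; it does not describe what nom actually does.

```rust
// Simplified stand-ins for nom's types, defined here for the sketch only.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Needed {
    Unknown,
    Size(usize),
}

#[derive(Debug, PartialEq)]
enum IResult<'a> {
    Done(&'a [u8], &'a [u8]),
    Incomplete(Needed),
}

/// Takes bytes until one of `stops` is found. If none is present, more input
/// might still contain one, so the result is Incomplete(Unknown), not an error.
fn take_until_either<'a>(input: &'a [u8], stops: &[u8]) -> IResult<'a> {
    match input.iter().position(|b| stops.contains(b)) {
        Some(idx) => IResult::Done(&input[idx..], &input[..idx]),
        None => IResult::Incomplete(Needed::Unknown),
    }
}
```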
Simple enough; it'd really just be the following + the closure-based stuff:
#[macro_export]
macro_rules! named (
($name:ident<$i:ty,$o:ty>, $rule:tt) => (
fn $name( i: $i ) -> IResult<$i,$o> {
($rule)( i )
}
);
($name:ident, $rule:tt) => (
named!( $name<&[u8], &[u8]>, $rule )
);
);
If you wanted to be really fancy, you could abstract the _parser! stuff like so:
#[macro_export]
macro_rules! parser (
(alt, $i:ident, $($rest:tt)*) => ( alt_parser!( $i | $($rest)* ) );
...
);
and then replace $rule:tt and ($rule)( i ) with $rule:ident, $($rest:tt)* and parser!( $rule, i, $($rest)* ) in named.
Then you'd be able to say named!( foo, take, 3 ) and it wouldn't even have any closure overhead.
Currently, we have length_value, which takes the first byte as a length and then consumes a buffer of that size, and length_value!, which takes the result of the first parser as a count and then applies the second parser that many times.
There should be a combinator whose argument is a parser returning a number, then returning a buffer of that size (to be able to use be_u16, be_u32 and others as size parser).
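A rough sketch of such a combinator, written against a simplified result type rather than nom's IResult. The names length_bytes and PResult are assumptions for illustration, and be_u16 is reimplemented locally so the block is self-contained.

```rust
use std::convert::TryInto;

// Simplified result type for this sketch: Some((remaining, output)) or None.
type PResult<'a, O> = Option<(&'a [u8], O)>;

// Local stand-in for a big-endian u16 parser.
fn be_u16(i: &[u8]) -> PResult<'_, u16> {
    if i.len() < 2 {
        return None;
    }
    Some((&i[2..], u16::from_be_bytes([i[0], i[1]])))
}

/// Runs `size_parser`, converts its numeric result to a length,
/// then returns a buffer of exactly that many bytes.
fn length_bytes<'a, N, F>(input: &'a [u8], size_parser: F) -> PResult<'a, &'a [u8]>
where
    F: Fn(&'a [u8]) -> PResult<'a, N>,
    N: TryInto<usize>,
{
    let (rest, n) = size_parser(input)?;
    let len: usize = n.try_into().ok()?;
    if rest.len() < len {
        return None;
    }
    let (buffer, remaining) = rest.split_at(len);
    Some((remaining, buffer))
}
```

Because the size parser is a parameter, the same function would work with a u8 or u32 length prefix, provided the parsed number converts to usize.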
nom should be able to get data from the network and parse it as soon as it is available
Steps to reproduce: try reading the Readme file.
Expected result: the readme file explains what this is about.
Actual result: there is no readme file.
An error code might not be the best way to represent that something went wrong. Returning an accumulation of sub parser errors could indicate what parsing path failed, instead of a global parsing error.
I came across something interesting when writing a parser to match a string literal with escape sequences.
Consider this code:
// ~~~ String literal parser and auxiliary parsers ~~~
named!(not_escaped_seq<&[u8], &[u8]>, take_until_either!(&b"\\\""[..]));
named!(escaped_seq, alt!(tag!("\\r") | tag!("\\n") | tag!("\\t") | tag!("\\\"") | tag!("\\\\")));
named!(string_literal<&[u8], String>,
chain!(
tag!("\"") ~
s: many0!(map_res!(alt!(escaped_seq | not_escaped_seq), from_utf8)) ~
tag!("\""),
|| {
syntax::parse::str_lit(&s.into_iter().fold(String::new(),
|mut accum, slice| {
accum.push_str(slice);
accum
})[..])}));
It matches string literals that can contain any of the escaped sequences listed in escaped_seq. This parser works as expected; however, switching the order of the options in alt!(escaped_seq | not_escaped_seq) makes the parser unable to recognize any string literal that contains at least one escape sequence.
That is, replacing this line:
s: many0!(map_res!(alt!(escaped_seq | not_escaped_seq), from_utf8)) ~
With:
s: many0!(map_res!(alt!(not_escaped_seq | escaped_seq), from_utf8)) ~
Breaks the parser. Here are 2 test cases:
#[test]
fn single_str_scalar_value() {
let input = &b"\"a string literal\""[..];
let res = str_scalar_value(input);
assert_eq!(res, Done(&b""[..], "a string literal".to_string()));
}
#[test]
fn single_str_scalar_value2() {
let input = &b"\"A backslash in quotes: \\\"\\\\\\\"\""[..];
let res = str_scalar_value(input);
assert_eq!(res, Done(&b""[..], "A backslash in quotes: \"\\\"".to_string()));
}
The former passes with both versions; the latter fails with the 2nd version of the parser (the parser returns an Error). In general, any string with an escaped sequence is not recognized by the 2nd version of the parser.
Shouldn't alt be commutative?
nom 0.3.10 has lots of large files in the cargo package source: a 5 MB mp4 and more files under olddoc and oldsrc (see at the end for file listing). (removed)
This is a reminder that cargo includes all non-ignored files in your working directory when you publish, so look at git status before you publish.
(I downloaded all crates.io crates and I started to grep for junk)
Right now I'm doing this:
named!(verbatim(&'a [u8]) -> &'a str, map_res!(
alt!(take_until_either!("|^*$#") | rest),
str::from_utf8));
fn rest(i: &[u8]) -> IResult<&[u8], &[u8]> {
IResult::Done(&i[i.len()..], i)
}
I need to take until one of those characters happens, or read up to the end and finish parsing. Is there a better way to do it currently?
Hi!
I used the pusher!() macro with FileProducer, but the generated code doesn't return the parser results.
For example, I have
pub struct Test {...}
...
named!(parse_test<&[u8],Test>, ...)
I want to get a [Test] or an iterable sequence of Test values.
Does nom have an appropriate macro or example?
Steps to fix:
Should make it more ergonomic to use it to parse data
Parsing expressions such as 1 + 2 * 3 is a common parsing example; there should be some code showing how to do it in nom.
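To show the shape of the grammar such an example would need, here is a hand-written recursive-descent sketch in plain Rust. It does not use nom's macros; the function names and the Option-based result type are assumptions for illustration. Precedence comes from the layering: expr handles +, term handles *, factor handles numbers and parentheses.

```rust
// Each parser returns Some((remaining_input, value)) or None on failure.

fn number(i: &str) -> Option<(&str, i64)> {
    let i = i.trim_start();
    let end = i.find(|c: char| !c.is_ascii_digit()).unwrap_or(i.len());
    if end == 0 {
        return None;
    }
    let value = i[..end].parse::<i64>().ok()?;
    Some((&i[end..], value))
}

// factor := number | '(' expr ')'
fn factor(i: &str) -> Option<(&str, i64)> {
    let i = i.trim_start();
    if let Some(inner) = i.strip_prefix('(') {
        let (rest, value) = expr(inner)?;
        let rest = rest.trim_start().strip_prefix(')')?;
        Some((rest, value))
    } else {
        number(i)
    }
}

// term := factor ('*' factor)*
fn term(i: &str) -> Option<(&str, i64)> {
    let (mut rest, mut acc) = factor(i)?;
    while let Some(next) = rest.trim_start().strip_prefix('*') {
        let (r, v) = factor(next)?;
        acc *= v;
        rest = r;
    }
    Some((rest, acc))
}

// expr := term ('+' term)*
fn expr(i: &str) -> Option<(&str, i64)> {
    let (mut rest, mut acc) = term(i)?;
    while let Some(next) = rest.trim_start().strip_prefix('+') {
        let (r, v) = term(next)?;
        acc += v;
        rest = r;
    }
    Some((rest, acc))
}
```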
Most of the Incomplete(usize) instances return 0 right now. Here are the possible fixes:
Unknown|Size(usize). This adds more code in pattern matches, but it handles both the case where we do not know how much data is needed and the case where we do know, and the calling parser can augment it.
There is still the problem that parsers are just functions, and do not have a value attached for the minimal data size they could need.
Still, returning a needed size is useful in cases where you need to seek, or load a large part of a file in memory, instead of chunking.
The run() method currently doesn't check correctly whether the producer has given it enough data to meet what the consume() method requested. You'll drop out of the data collection loop because you are at eof (the same thing could happen because of a ProducerError as well), with needed > acc.len(), and then try to get a slice that extends beyond the end of the buffer.
This test case demonstrates the problem:
struct TestConsumer {
done : bool
}
impl Consumer for TestConsumer {
fn end(&mut self) {
}
fn consume(&mut self, input: &[u8]) -> ConsumerState {
if self.done {
ConsumerState::ConsumerDone
} else if input.len() < 2 {
ConsumerState::Await(0,2)
} else {
self.done = true;
ConsumerState::ConsumerDone
}
}
fn failed(&mut self, error_code: u32) {
println!("failed with error code: {}", error_code);
}
}
#[test]
fn overrun() {
let mut p = MemProducer::new(&b"a"[..], 1);
let mut c = TestConsumer{ done: false };
c.run(&mut p);
assert_eq!(c.done, false);
}
The right thing would probably be to call failed(), but that usually takes error codes produced by the consumer as an argument, so I'm not sure what to do here.
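Whatever run() ends up reporting, the out-of-bounds slice itself can be avoided with a checked access. A minimal sketch, assuming a hypothetical RunError type and take_needed helper that are not part of nom:

```rust
// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum RunError {
    NotEnoughData { needed: usize, available: usize },
}

/// Returns the first `needed` bytes of the accumulator, or an error
/// instead of panicking when the producer delivered less than requested.
fn take_needed(acc: &[u8], needed: usize) -> Result<&[u8], RunError> {
    acc.get(..needed).ok_or(RunError::NotEnoughData {
        needed,
        available: acc.len(),
    })
}
```

The caller can then decide how to map the error, for example to a dedicated code passed to failed().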
alt!, map!, and call! behave strangely when combined.
This code works. Note the closure in map!:
named!(range<&[u8], Range>,
alt!(
chain!(
start: take_char ~
tag!("-") ~
end: take_char,
|| {
debug!("range: (start, end): ({:?}, {:?})", start, end);
Range {
start: start,
end: end,
}
}
) |
map!(
take_char,
|c| {
debug!("range: c: {:?}", c);
Range {
start: c,
end: c,
}
}
)
)
);
If we try to wrap the closure in map! with call!, it does not work:
...
map!(
take_char,
call!(|c| {
debug!("range: c: {:?}", c);
Range {
start: c,
end: c,
}
})
)
...
Error:
src/parser.rs:212:13: 212:14 error: unexpected end of macro invocation
src/parser.rs:212 })
^
But if we use map! without alt!, it behaves differently. The following code works:
named!(literal<&[u8], Expr>,
map!(
many1!(take_char),
call!(|cs| {
debug!("literal: cs: {:?}", cs);
Expr::Literal {
chars: cs,
}
})
)
);
Without call!, it does not work:
named!(literal<&[u8], Expr>,
map!(
many1!(take_char),
|cs| {
debug!("literal: cs: {:?}", cs);
Expr::Literal {
chars: cs,
}
}
)
);
Error:
src/parser.rs:220:9: 220:10 error: expected ident, found |
src/parser.rs:220 |cs| {
^
Parsers using regular expressions would be useful.
See Scala's parser combinators for an example.
Currently errors are only a u32 code, which is not very explicit to interpret.
It would be more user-friendly to use an enum to store error codes, like std::io::Error, which gives easy access to the ErrorKind enum.
It is probably just a matter of integrating this enum https://github.com/Geal/nom/blob/master/src/util.rs#L466-L493 and adding a Custom(_) variant to it.
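As an illustration of the shape being proposed, here is a small sketch. The variant names Tag, Alt, and Many0 are placeholders chosen for this example and are not taken from the linked enum; only the Custom(_) idea comes from the text above.

```rust
/// Hypothetical error kind enum with an escape hatch for user codes.
#[derive(Debug, Clone, PartialEq)]
pub enum ErrorKind {
    Tag,
    Alt,
    Many0,
    /// User-defined error code, keeping existing u32 codes usable.
    Custom(u32),
}

impl ErrorKind {
    /// Human-readable description, which a bare u32 cannot provide.
    pub fn description(&self) -> &'static str {
        match self {
            ErrorKind::Tag => "tag did not match",
            ErrorKind::Alt => "no alternative matched",
            ErrorKind::Many0 => "repetition failed",
            ErrorKind::Custom(_) => "custom error",
        }
    }
}
```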
When using the delimited! macro, it throws the following error:
<nom macros>:8:32: 8:42 error: macro undefined: 'delimited1!'
<nom macros>:8 IResult:: Done ( i1 , _ ) => { delimited1 ! ( i1 , $ ( $ rest ) * ) } } } ) ;
^~~~~~~~~~
Reason: the delimited1! and delimited2! macros don't have #[macro_export] attributes.
Some of Parsec's functions are available in nom, but not all of them, and not always with the same name:
peek!
many0!
count_fixed!
delimited!(open, p, end)
opt!
There should be a combinator that wraps a function returning an Option or a Result directly, without mapping over the result of a first parser.
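In function form, such a combinator might look like the sketch below. The names expr_opt and expr_res and the two-variant IResult are assumptions made for this example; the input is passed through untouched because no parser runs first.

```rust
// Simplified stand-in for nom's IResult, for this sketch only.
#[derive(Debug, PartialEq)]
enum IResult<I, O> {
    Done(I, O),
    Error(u32),
}

/// Lifts an Option into a parser result without consuming any input.
fn expr_opt<I, O>(input: I, value: Option<O>, code: u32) -> IResult<I, O> {
    match value {
        Some(o) => IResult::Done(input, o),
        None => IResult::Error(code),
    }
}

/// Same for a Result; the original error is replaced by `code`.
fn expr_res<I, O, E>(input: I, value: Result<O, E>, code: u32) -> IResult<I, O> {
    expr_opt(input, value.ok(), code)
}
```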
We currently have a few example parsers. In order to test the project and make it useful, other formats can be implemented. Here is a list, if anyone wants to try it:
Right now, the consume() method has to calculate every time how much data has been consumed. This complicates the code and makes it error prone.
One solution could be returning the remaining input and letting the run() method calculate how much data has been consumed.
It should be possible to build a producer from an incoming channel, to parse data sent from another thread
When using chain!(), using space works, but not nom::space: error: no rules expected the token `::`.
Would it be possible to support this, or is this impossible/too complicated with the current macro system and the way chain!() is built? It would be nice to not have to `use` all the functions used in chain.
This is mostly a question of whether this is currently possible, it isn't really needed if it isn't possible.
What is the correct way to use tag! with a fixed byte array? tag!([42u8, 42u8]) fails, as AsBytes is not implemented for [u8; 2].
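One way a trait like this could cover arrays of every length is a const-generic impl, which is available in current Rust. The sketch below defines its own AsBytes trait and a function-style tag for illustration; it is not nom's trait or macro.

```rust
// Local stand-in trait for this sketch; not nom's AsBytes.
trait AsBytes {
    fn as_bytes(&self) -> &[u8];
}

impl AsBytes for str {
    fn as_bytes(&self) -> &[u8] {
        // Delegates to the inherent str method.
        str::as_bytes(self)
    }
}

// One impl for every array length, using const generics.
impl<const N: usize> AsBytes for [u8; N] {
    fn as_bytes(&self) -> &[u8] {
        &self[..]
    }
}

/// Matches `expected` at the start of `input`;
/// returns Some((remaining, matched)) on success.
fn tag<'a, T: AsBytes + ?Sized>(input: &'a [u8], expected: &T) -> Option<(&'a [u8], &'a [u8])> {
    let bytes = expected.as_bytes();
    if input.starts_with(bytes) {
        let (matched, rest) = input.split_at(bytes.len());
        Some((rest, matched))
    } else {
        None
    }
}
```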
As mentioned by @rrichardson in UpstandingHackers/hammer#64 (comment), parsers that transform from bit positions to a tuple of fields would be useful:
pub fn be_bits_1<'a, A>(i: &[u8], u8) -> IResult<'a,&[u8], (A)>
pub fn be_bits_2<'a, A, B>(i: &[u8], u8, u8) -> IResult<'a,&[u8],( A, B )>
pub fn be_bits_3<'a, A, B, C>(i: &[u8], u8, u8, u8) -> IResult<'a,&[u8],( A, B, C )>
//.. up to 8 or so and also for le
// + a small bit of macro magic to streamline the chaining
// So to parse something like a TCP header, one would do something like:
chain!(
src_prt: be_u16 ~
dst_prt : be_u16 ~
seq_num: be_u32 ~
ack_num: be_u32 ~
(offs, _, flags) : be_bits_3( 4, 3, 9 ) ~
blah: blah ~
)
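A possible concrete version of the three-field case, as a sketch only. It assumes the three widths add up to a whole number of bytes (at most 32 bits) and returns plain u32 fields rather than the generic A, B, C of the proposal.

```rust
/// Splits the leading bytes of `input` into three big-endian bit fields of
/// widths `a`, `b`, and `c`. Returns Some((remaining, fields)), or None if
/// the widths are not byte-aligned in total or the input is too short.
fn be_bits_3(input: &[u8], a: u32, b: u32, c: u32) -> Option<(&[u8], (u32, u32, u32))> {
    let total = a + b + c;
    if total == 0 || total > 32 || total % 8 != 0 {
        return None;
    }
    let nbytes = (total / 8) as usize;
    if input.len() < nbytes {
        return None;
    }
    // Assemble the bytes into one big-endian word.
    let word = input[..nbytes]
        .iter()
        .fold(0u64, |acc, &byte| (acc << 8) | u64::from(byte));
    let mask = |width: u32| (1u64 << width) - 1;
    let first = (word >> (b + c)) & mask(a);
    let second = (word >> c) & mask(b);
    let third = word & mask(c);
    Some((&input[nbytes..], (first as u32, second as u32, third as u32)))
}
```

For the TCP header case above, be_bits_3(input, 4, 3, 9) would split two bytes into the offset, reserved, and flags fields.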
Hi there,
I tried fixing this myself, but I don't understand why it's not working, sorry.
Essentially, this works (in a chain!):
e_res2: count_fixed!( call!(le_u16), u16, 10 ) ~
But this does not, despite this macro case:
e_res2: count_fixed!( le_u16, u16, 10 ) ~
Here's where I'm using it, if that's helpful.
not_line_ending should return the whole array if it did not find a line ending, and let the calling parser handle accumulating data.
If you return ConsumerError from Consumer::consume, Consumer::run will essentially run in an infinite loop. I think this is due to the empty match branch in https://github.com/Geal/nom/blob/master/src/consumer.rs#L162.
Consumer::run() should ideally stop execution (at least stop calling Consumer::consume with invalid data) when this happens. With the current behavior, it's impossible to recover from parsing invalid data using a Consumer.
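The requested behavior amounts to leaving the loop as soon as the consumer reports an error. A reduced sketch with a hypothetical State enum and run function (not nom's Consumer trait), where the producer is replaced by a slice of chunks:

```rust
// Hypothetical, reduced consumer state for this sketch.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum State {
    Continue,
    Done,
    Error(u32),
}

/// Feeds chunks to `consume` until it is done or reports an error.
/// Returns the number of calls made, or the error code that stopped the run.
fn run<F: FnMut(&[u8]) -> State>(chunks: &[&[u8]], mut consume: F) -> Result<usize, u32> {
    let mut calls = 0;
    for &chunk in chunks {
        calls += 1;
        match consume(chunk) {
            State::Continue => continue,
            State::Done => break,
            // Stop immediately instead of calling consume again.
            State::Error(code) => return Err(code),
        }
    }
    Ok(calls)
}
```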
I am exploring the possibility of switching to nom in a project I am working on. I am not fully familiar with nom yet, so please bear with me.
For starters, I was trying to come up with a parser that matches strings of the form [a-zA-Z][-a-zA-Z0-9_]*. I wrote this:
#[macro_use]
extern crate nom;
use std::str::from_utf8;
use nom::{alpha, alphanumeric};
use nom::{IResult, Needed};
use nom::IResult::*;
named!(identifier<&[u8], String>,
chain!(
h: map_res!(alpha, from_utf8) ~
t: many0!(alt!(alphanumeric | tag!("-") | tag!("_"))),
|| {
let s = h.to_string();
t.into_iter().fold(s, |mut accum, slice| {
accum.push_str(from_utf8(slice).unwrap()); accum })}));
And I tested it with:
#[test]
fn id_name() {
let a_setting = &b"miles"[..];
let res = setting_name(a_setting);
assert_eq!(res, Done(&b""[..], "miles".to_string()));
}
When I run cargo test my PC completely hangs. With top I can see that it starts consuming more and more CPU and memory until the entire system is completely unusable and I have to hard reset.
Am I doing something wrong? Is this the best way to make a parser to match this type of strings?
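One plausible cause, stated here as an assumption and not confirmed by this report, is a repeated parser that succeeds without consuming any input, so the repetition never ends and the result vector keeps growing. The sketch below shows the general guard against that, using hypothetical many0 and alphanumeric0 functions in place of nom's macros.

```rust
/// Applies `parser` repeatedly, stopping on failure or when a successful
/// application makes no progress (which would otherwise loop forever).
fn many0<'a, O, F>(mut input: &'a [u8], parser: F) -> (&'a [u8], Vec<O>)
where
    F: Fn(&'a [u8]) -> Option<(&'a [u8], O)>,
{
    let mut out = Vec::new();
    while let Some((rest, o)) = parser(input) {
        if rest.len() == input.len() {
            // Zero-length match: stop instead of spinning.
            break;
        }
        out.push(o);
        input = rest;
    }
    (input, out)
}

/// Hypothetical parser that always succeeds, possibly with an empty match.
fn alphanumeric0(i: &[u8]) -> Option<(&[u8], &[u8])> {
    let end = i
        .iter()
        .position(|b| !b.is_ascii_alphanumeric())
        .unwrap_or(i.len());
    Some((&i[end..], &i[..end]))
}
```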
It's quite a common pattern in binary protocols to use input[0] as the message type and the remaining bytes as the message body. I propose adding a macro for the following "switch" parser:
fn parse (i: &[u8]) -> IResult<&[u8], T> {
match takeN!(i) {
IResult::Done(i, o) => {
match o {
1 => parse_a(i),
2 => parse_b(i),
3 => parse_c(i),
...
_ => IResult::Error(Err::Code(1))
}
}
IResult::Error(e) => IResult::Error(e),
IResult::Incomplete(n) => IResult::Incomplete(n),
}
}
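Written as a plain function, the pattern the macro would generate looks roughly like this. Message, parse_a, and parse_b are invented for the sketch, and Option stands in for IResult.

```rust
// Hypothetical message type for this sketch.
#[derive(Debug, PartialEq)]
enum Message {
    A(u8),
    B(u16),
}

fn parse_a(i: &[u8]) -> Option<(&[u8], Message)> {
    let (&value, rest) = i.split_first()?;
    Some((rest, Message::A(value)))
}

fn parse_b(i: &[u8]) -> Option<(&[u8], Message)> {
    if i.len() < 2 {
        return None;
    }
    Some((&i[2..], Message::B(u16::from_be_bytes([i[0], i[1]]))))
}

/// Uses the first byte as the message type and dispatches
/// the rest of the input to the matching body parser.
fn parse_message(i: &[u8]) -> Option<(&[u8], Message)> {
    let (&kind, body) = i.split_first()?;
    match kind {
        1 => parse_a(body),
        2 => parse_b(body),
        _ => None,
    }
}
```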
I guess re-exporting would be an alternative to adding a public export option to named parsers. What do you think of adding something like named!(pub foo<&u8>, ...)?
I know the documentation says (for now the value is ignored, but it should indicate how much is needed), but it seems like most built-in parsers return reasonable values for this, and it would be super nice to be able to use it when returning a ConsumerState::Await value from a consumer.
I would think this would be as simple as:
nom::IResult::Incomplete(x) => {
let x = match x {
nom::internal::Needed::Size(x) => x,
nom::internal::Needed::Unknown => 1,
};
ConsumerState::Await(0, x)
},
But alas, this does not work, because Needed::Size and Needed::Unknown are both private variants (error: variant `Size` is private).
It would be nice to make at least Needed::Size public, or to add a possible_size(&self) method which would return Option<usize>.
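The suggested accessor is small. A sketch on a local copy of the enum follows; the await_size helper and its fallback of 1 mirror the snippet above and are otherwise assumptions.

```rust
// Local copy of the enum for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Needed {
    Unknown,
    Size(usize),
}

impl Needed {
    /// Returns the requested size if it is known.
    pub fn possible_size(&self) -> Option<usize> {
        match *self {
            Needed::Size(n) => Some(n),
            Needed::Unknown => None,
        }
    }
}

/// Hypothetical helper: how many bytes a consumer should wait for,
/// falling back to 1 when the parser could not say.
pub fn await_size(needed: Needed) -> usize {
    needed.possible_size().unwrap_or(1)
}
```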
There are cases where we do not know how much data we need at first, but after getting a header, we know what chunk size would be optimal.
Making the producers able to produce arbitrarily sized chunks could be useful
Why use:
pub enum IResult<I,O> {
Done(I,O),
Error(Err),
Incomplete(u32)
}
why not (for example):
pub enum Status<I, O> {
Done(I, O),
Incomplete(u32)
}
type Result<I, O> = std::result::Result<Status<I, O>, Error>
Because of this, there is no possibility of using map_err, try!, and so on.
What is the reason for this design?
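To make the trade-off concrete, here is a sketch of the alternative layout with two toy parsers. The Error struct, PResult alias, digit, and two_digits are invented for illustration; the ? operator used here is the current equivalent of try!.

```rust
#[derive(Debug, PartialEq)]
enum Status<I, O> {
    Done(I, O),
    Incomplete(u32),
}

// Hypothetical error type carrying a numeric code.
#[derive(Debug, PartialEq)]
struct Error(u32);

type PResult<I, O> = Result<Status<I, O>, Error>;

/// Parses one ASCII digit.
fn digit(i: &[u8]) -> PResult<&[u8], u8> {
    match i.split_first() {
        None => Ok(Status::Incomplete(1)),
        Some((&b, rest)) if b.is_ascii_digit() => Ok(Status::Done(rest, b - b'0')),
        Some(_) => Err(Error(1)),
    }
}

/// Parses two digits; errors from `digit` propagate through `?`,
/// while Incomplete still has to be matched explicitly.
fn two_digits(i: &[u8]) -> PResult<&[u8], u8> {
    let (rest, tens) = match digit(i)? {
        Status::Done(rest, d) => (rest, d),
        Status::Incomplete(n) => return Ok(Status::Incomplete(n + 1)),
    };
    match digit(rest)? {
        Status::Done(rest, ones) => Ok(Status::Done(rest, tens * 10 + ones)),
        Status::Incomplete(n) => Ok(Status::Incomplete(n)),
    }
}
```

Errors become shorter to propagate, but every caller still needs a match to separate Done from Incomplete.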
Most of the parsers work on byte arrays, but IResult is completely generic, so accepting BitVec as input should be possible.
The result is that if one wants to match an exact quantity, then one has to hand-roll the full implementation.
This can be a pain to do with parser combinators. Maybe a new combinator would make things easier.
Not sure if this was an accident or not: https://github.com/Geal/nom/blob/0f5ef87483ed45c0cc623a9f5d4403bc645b9aba/src/nom.rs#L375