rust-bakery / nom

Rust parser combinator framework

License: MIT License

Rust 100.00%

rust parse parser-combinators nom parser byte-array grammar

nom's People

ryman thehydroimpulse jaredly sujayakar skade filipegoncalves crimsonvoid brandonson lu-zero divarvel pietro keruspe tupshin andrew-d pwoolcoc tempbottle rillian frewsxcv ngrewe meh shengjuntu nox baig ahenry alan-andrade colinsurprenant bluss sourrust ccmtaylor soro minhnhdo meyermagic abbradar hoodie nelsonjchen danburkert steveklabnik luser euclio jsgf aphistic tfviv79 joelself zentner-kyle guillaumegomez dirk sarahhodne experquisite jansegre tstorch jeffbelgum cholcombe973 xion byteslayer7 geogi panicbit starblue danielkeep adamgreig cite-reader hywan xirdus jespino mspiegel bovee paul-english frgomes chrismacnaughton moreati phlosioneer jugglerchris weiznich sorccu fitzgen acatton jethrogb jdeeny kellerfuchs koivunej sjmackenzie overdrivenpotato foophoof 3rwww1 keeperofdakeys seeekr currymj jturner314 giodamelio bozaro sdn0 davidedmonds uniphil azerupi andyshiue jbaum98 fabricedesre kmizu nickbabcock darkdrek lucab

nom's Issues

Document ConsumerState variant values

Although there are comments describing the variants of ConsumerState::Await and ConsumerState::Seek in https://github.com/Geal/nom/blob/master/src/consumer.rs#L76, there are no documentation comments on these variants, and thus no indication at all as to what these variants do in the documentation.

Fixing this would be as simple as adding /// (amount_consumed, buffer_needed) before Await and /// (consumed, new_position, buffer_needed) before Seek. That wouldn't be very much documentation, but it would be hugely useful for those trying to figure out ConsumerState from just looking at documentation.
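As a rough sketch of what those doc comments could look like, here is a cut-down stand-in for the enum. The name `ConsumerState` and the tuple meanings come from the text above; the `usize` field types and the `buffer_needed` helper are assumptions made only for illustration, not nom's actual definition.

```rust
/// Hypothetical, reduced stand-in for nom's ConsumerState (field types are assumed).
#[derive(Debug, PartialEq)]
enum ConsumerState {
    /// (amount_consumed, buffer_needed)
    Await(usize, usize),
    /// (consumed, new_position, buffer_needed)
    Seek(usize, usize, usize),
}

/// Illustrative helper: the buffer size a state asks for.
fn buffer_needed(state: &ConsumerState) -> usize {
    match *state {
        ConsumerState::Await(_, needed) => needed,
        ConsumerState::Seek(_, _, needed) => needed,
    }
}
```

With comments in that position, rustdoc would show the tuple layout next to each variant.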

Outdated Consumer example in README

The example Consumer implementation in the README doesn't implement the failed method, so it doesn't compile 😞

Many macros could declare closures rather than just functions

As a quick example, here is a rewritten alt!():

#[macro_export]
macro_rules! alt (
    ($name:ident<$i:ty,$o:ty>, $($rest:tt)*) => (
        fn $name(i:$i) -> IResult<$i,$o>{
            alt_parser!(i | $($rest)*)
        }
    );
    ($($rest:tt)*) => ( | i | { alt_parser!(i | $($rest)*) } );
);

Just one additional line, and now omitting the name allows it to be used in value position.
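The same two-arm pattern can be tried in isolation. This is a minimal sketch using a made-up `scale!` macro (not part of nom): the first arm declares a named function, the second expands to a closure.

```rust
// Hypothetical macro, for illustration only.
macro_rules! scale {
    // Named form: expands to a function item.
    ($name:ident, $factor:expr) => (
        fn $name(i: u32) -> u32 { i * $factor }
    );
    // Anonymous form: expands to a closure usable in value position.
    ($factor:expr) => ( |i: u32| i * $factor );
}

// Named use, declares `double`.
scale!(double, 2);

// Value-position use, the closure is bound to a local.
fn triple_via_closure(x: u32) -> u32 {
    let triple = scale!(3);
    triple(x)
}
```

The arm order matters here: the named arm is tried first and only falls through when the first token is not an identifier followed by a comma.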

Bit level parsing

Since nom is mostly generic, it should be possible to apply parsers on bit slices. The naive way is to do it like this:

let bv = BitVec::from_bytes(&data[..]);
let bits: Vec<bool> = bv.iter().collect();

This is not very efficient, and a lot of bit level manipulations are calculations, so working with booleans is not the right way.
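One alternative to a Vec<bool> is to read bit ranges straight from the byte slice with shifts and masks. The function below is only a sketch of that idea. `take_bits` is a hypothetical name, not an existing nom parser, and it assumes most-significant-bit-first ordering.

```rust
/// Reads `count` bits (most significant bit first) starting at `bit_offset`.
/// Returns None if the request runs past the input or exceeds 64 bits.
fn take_bits(input: &[u8], bit_offset: usize, count: usize) -> Option<u64> {
    if count > 64 || bit_offset + count > input.len() * 8 {
        return None;
    }
    let mut acc = 0u64;
    for pos in bit_offset..bit_offset + count {
        let byte = input[pos / 8];
        // Select the bit at position `pos % 8`, counting from the high bit.
        let bit = (byte >> (7 - pos % 8)) & 1;
        acc = (acc << 1) | u64::from(bit);
    }
    Some(acc)
}
```

The result is an integer, so later calculations do not need to reassemble booleans.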

is_a!() contains a debugging println!()

On line 470.

take_until_either! returns error when bytes aren't found

named!(xxx, take_until_either!("!."));

#[test]
fn test_take_until_either() {
    assert_eq!(
        xxx(&b"123"[..]),
        nom::IResult::Incomplete(nom::Needed::Unknown)
    );
}
//
//thread 'test_take_until_either' panicked at 'assertion failed: `(left == right)` (left: `Error(Position(14, [49, 50, 51]))`, right: `Incomplete(Unknown)`)'

I'd expect the result of the above to be Incomplete(Unknown) because it's unknown how much more must be read until any of the bytes are found. Is the error result correct?

Also compare this to take_until...

named!(xxx2, take_until!("end"));

#[test]
fn test_take_until() {
    assert_eq!(
        xxx2(&b"123"[..]),
        nom::IResult::Incomplete(nom::Needed::Unknown)
    );
}
//
//thread 'test_take_until' panicked at 'assertion failed: `(left == right)` (left: `Incomplete(Size(4))`, right: `Incomplete(Unknown)`)'
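To make the expected behavior concrete, here is a hand-written sketch outside nom. The `Outcome` type and this `take_until_either` function are hypothetical stand-ins, not the macro's implementation: running out of input without seeing a stop byte is reported as incomplete rather than as an error.

```rust
/// Hypothetical result type for this sketch: (remaining, consumed) or "need more data".
#[derive(Debug, PartialEq)]
enum Outcome<'a> {
    Done(&'a [u8], &'a [u8]),
    Incomplete,
}

/// Consumes bytes until any byte of `stops` is seen.
fn take_until_either<'a>(input: &'a [u8], stops: &[u8]) -> Outcome<'a> {
    match input.iter().position(|b| stops.contains(b)) {
        Some(idx) => Outcome::Done(&input[idx..], &input[..idx]),
        // No stop byte yet: more input could still contain one.
        None => Outcome::Incomplete,
    }
}
```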

Separate naming of rules from definition of rules

Simple enough; it'd really just be the following + the closure-based stuff:

#[macro_export]
macro_rules! named (
    ($name:ident<$i:ty,$o:ty>, $rule:tt) => (
        fn $name( i: $i ) -> IResult<$i,$o> {
            ($rule)( i )
        }
    );
    ($name:ident, $rule:tt) => (
        named!( $name<&[u8], &[u8]>, $rule )
    );
);

If you wanted to be really fancy, you could abstract the _parser! stuff like so:

#[macro_export]
macro_rules! parser (
    (alt, $i:ident, $($rest:tt)*) => ( alt_parser!( $i | $($rest)* ) );
    ...
);

and then replace $rule:tt and ($rule)( i ) with $rule:ident, $($rest:tt)* and parser!( $rule, i, $($rest)* ) in named.

Then you'd be able to say named!( foo, take, 3 ) and it wouldn't even have any closure overhead 😃

Make a size_buffer combinator

Currently, we have length_value taking the first byte as length then absorbing a buffer of that size, and length_value! taking the result of the first parser as count, then applying the second parser that many times.

There should be a combinator whose argument is a parser returning a number, then returning a buffer of that size (to be able to use be_u16, be_u32 and others as size parser).
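A minimal sketch of such a combinator, written with plain functions and Option instead of nom's macros and IResult. `sized_buffer` is a hypothetical name, and the `be_u16` below is a local stand-in that returns a usize, not nom's parser.

```rust
/// Stand-in size parser: reads a big-endian 16-bit length.
fn be_u16(input: &[u8]) -> Option<(&[u8], usize)> {
    if input.len() < 2 {
        return None;
    }
    let n = ((input[0] as usize) << 8) | input[1] as usize;
    Some((&input[2..], n))
}

/// Runs `size_parser`, then returns (remaining, buffer) with a buffer of that size.
fn sized_buffer<'a, P>(input: &'a [u8], size_parser: P) -> Option<(&'a [u8], &'a [u8])>
where
    P: Fn(&'a [u8]) -> Option<(&'a [u8], usize)>,
{
    let (rest, len) = size_parser(input)?;
    if rest.len() < len {
        return None; // not enough data for the announced size
    }
    Some((&rest[len..], &rest[..len]))
}
```

Any parser with the same shape could be passed in place of `be_u16`.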

Network producer

nom should be able to get data from the network and parse it as soon as it is available

Missing Readme file.

Steps to reproduce: try reading the Readme file.

Expected result: the readme file explains what this is about.
Actual result: there is no readme file.

Error(u32) is not used currently

An error code might not be the best way to represent that something went wrong. Returning an accumulation of sub parser errors could indicate what parsing path failed, instead of a global parsing error.

alt!() is not commutative

I came across something interesting when writing a parser to match a string literal with escape sequences.

Consider this code:

// ~~~ String literal parser and auxiliary parsers ~~~
named!(not_escaped_seq<&[u8], &[u8]>, take_until_either!(&b"\\\""[..]));
named!(escaped_seq, alt!(tag!("\\r") | tag!("\\n") | tag!("\\t") | tag!("\\\"") | tag!("\\\\")));
named!(string_literal<&[u8], String>,
       chain!(
           tag!("\"") ~
           s: many0!(map_res!(alt!(escaped_seq | not_escaped_seq), from_utf8)) ~
           tag!("\""),
           || {
               syntax::parse::str_lit(&s.into_iter().fold(String::new(),
                                                          |mut accum, slice| {
                                                              accum.push_str(slice);
                                                              accum
                                                          })[..])}));

It matches string literals that can contain any of the escaped sequences listed in escaped_seq. This parser works as expected, however, switching the order of the options in alt!(escaped_seq | not_escaped_seq) makes the parser unable to recognize any string literal that contains at least an escape sequence.

That is, replacing this line:

           s: many0!(map_res!(alt!(escaped_seq | not_escaped_seq), from_utf8)) ~

With:

           s: many0!(map_res!(alt!(not_escaped_seq | escaped_seq), from_utf8)) ~

Breaks the parser. Here are 2 test cases:

    #[test]
    fn single_str_scalar_value() {
        let input = &b"\"a string literal\""[..];
        let res = str_scalar_value(input);
        assert_eq!(res, Done(&b""[..], "a string literal".to_string()));        
    }

    #[test]
    fn single_str_scalar_value2() {
        let input = &b"\"A backslash in quotes: \\\"\\\\\\\"\""[..];
        let res = str_scalar_value(input);
        assert_eq!(res, Done(&b""[..], "A backslash in quotes: \"\\\"".to_string()));       
    }

The former passes with both versions; the latter fails with the 2nd version of the parser (the parser returns an Error). In general, any string with an escaped sequence is not recognized by the 2nd version of the parser.

Shouldn't alt be commutative?
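For comparison, ordered choice in general is order-sensitive whenever an earlier alternative can succeed on input that a later one would also accept. The sketch below shows that effect with two hypothetical helpers (`tag` and `alt` here are plain functions written for this example, not nom's macros). Whether this is the exact cause in the parser above is not confirmed.

```rust
/// Matches the literal `t` at the start of `input`; returns (remaining, matched).
fn tag<'a>(input: &'a str, t: &str) -> Option<(&'a str, &'a str)> {
    if input.starts_with(t) {
        Some((&input[t.len()..], &input[..t.len()]))
    } else {
        None
    }
}

/// Ordered choice: tries `first`, and only on failure tries `second`.
fn alt<'a>(input: &'a str, first: &str, second: &str) -> Option<(&'a str, &'a str)> {
    tag(input, first).or_else(|| tag(input, second))
}
```

On the input "ab", `alt(input, "a", "ab")` stops at "a" and leaves "b", while `alt(input, "ab", "a")` consumes everything, so swapping the alternatives changes the result.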

Make count! return a fixed-size array

Junk in the cargo package source

nom 0.3.10 has lots of large files in the cargo package source -- a 5 MB mp4 and more files olddoc, oldsrc ~~(see at the end for file listing).~~ (removed)

This is a reminder that cargo includes all non-ignored files in your working directory when you publish — look at git status before you publish.

(I downloaded all crates.io crates and I started to grep for junk)

Take remaining bytes.

Right now I'm doing this:

named!(verbatim(&'a [u8]) -> &'a str, map_res!(
    alt!(take_until_either!("|^*$#") | rest),
    str::from_utf8));

fn rest(i: &[u8]) -> IResult<&[u8], &[u8]> {
    IResult::Done(&i[i.len()..], i)
}

I need to take until one of those characters happens, or read up to the end and finish parsing. Is there a better way to do it currently?

Document the be_* and le_* parsers

How to get values from the pusher! macro?

Hi!
I used the pusher!() macro with FileProducer, but the generated code doesn't return the parser results.
For example, I have

pub struct Test {...}
...
named!(parse_test<&[u8],Test>, ...)

I want to get a [Test] or an iterable over the sequence of Test values.
Does nom have an appropriate macro or example?

the is_a! macro has no tests

Steps to fix:

write some tests
get a cookie
???
omnomnomnom

Make Consumer::end and Consumer::run return an output value on success

Should make it more ergonomic to use it to parse data

Arithmetic expression example

Parsing expressions such as 1 + 2 * 3 is a common parsing example. There should be some code showing how to do it in nom.
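Until a nom version exists, the grammar such an example would need can be sketched as a hand-written recursive-descent evaluator. Everything below (`Parser`, `eval`) is hypothetical and uses only the standard library. It shows the usual expr/term/factor layering that gives * and / a higher precedence than + and -.

```rust
struct Parser<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn new(s: &'a str) -> Self {
        Parser { input: s.as_bytes(), pos: 0 }
    }

    /// Skips spaces and returns the next byte without consuming it.
    fn peek(&mut self) -> Option<u8> {
        while self.pos < self.input.len() && self.input[self.pos] == b' ' {
            self.pos += 1;
        }
        self.input.get(self.pos).copied()
    }

    /// expr := term (('+' | '-') term)*
    fn expr(&mut self) -> Option<i64> {
        let mut acc = self.term()?;
        while let Some(op) = self.peek() {
            match op {
                b'+' => { self.pos += 1; acc += self.term()?; }
                b'-' => { self.pos += 1; acc -= self.term()?; }
                _ => break,
            }
        }
        Some(acc)
    }

    /// term := factor (('*' | '/') factor)*
    fn term(&mut self) -> Option<i64> {
        let mut acc = self.factor()?;
        while let Some(op) = self.peek() {
            match op {
                b'*' => { self.pos += 1; acc *= self.factor()?; }
                b'/' => {
                    self.pos += 1;
                    let divisor = self.factor()?;
                    if divisor == 0 { return None; }
                    acc /= divisor;
                }
                _ => break,
            }
        }
        Some(acc)
    }

    /// factor := number | '(' expr ')'
    fn factor(&mut self) -> Option<i64> {
        match self.peek()? {
            b'(' => {
                self.pos += 1;
                let value = self.expr()?;
                if self.peek()? == b')' { self.pos += 1; Some(value) } else { None }
            }
            b'0'..=b'9' => {
                let start = self.pos;
                while self.pos < self.input.len() && self.input[self.pos].is_ascii_digit() {
                    self.pos += 1;
                }
                std::str::from_utf8(&self.input[start..self.pos]).ok()?.parse::<i64>().ok()
            }
            _ => None,
        }
    }
}

/// Evaluates a whole expression; trailing garbage is rejected.
fn eval(s: &str) -> Option<i64> {
    let mut parser = Parser::new(s);
    let value = parser.expr()?;
    if parser.peek().is_none() { Some(value) } else { None }
}
```

A nom example would presumably express the same three levels as three named parsers.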

The data in Incomplete is not used right now

Most of the Incomplete(usize) instances return 0 right now. Here are the possible fixes:

remove the field entirely, and let the calling code manage data aggregation automatically. This is easy (and corresponds to how the current code works).
return a sum type, something like Unknown|Size(usize). This adds more code in pattern matches, but it handles the case where we do not know how much data should be returned, and the case where we know, and the calling parser can augment it

There is still the problem that parsers are just functions, and do not have a value attached for the minimal data size they could need.

Still, returning a needed size is useful in cases where you need to seek, or load a large part of a file in memory, instead of chunking.

Consumer.run() panics when consume() requests more data than available

The run() method currently doesn't check correctly whether the producer has given it enough data to meet what the consume() method requested. You'll drop out of the data collection loop because you are at eof (same thing could happen because of a ProducerError as well), with needed > acc.len(), and then try to get a slice that extends beyond the end of the buffer.

This test case demonstrates the problem:

struct TestConsumer {
    done: bool
}

impl Consumer for TestConsumer {
    fn end(&mut self) {
    }

    fn consume(&mut self, input: &[u8]) -> ConsumerState {
        if self.done {
            ConsumerState::ConsumerDone
        } else if input.len() < 2 {
            ConsumerState::Await(0, 2)
        } else {
            self.done = true;
            ConsumerState::ConsumerDone
        }
    }

    fn failed(&mut self, error_code: u32) {
        println!("failed with error code: {}", error_code);
    }
}

#[test]
fn overrun() {
    let mut p = MemProducer::new(&b"a"[..], 1);
    let mut c = TestConsumer { done: false };
    c.run(&mut p);
    assert_eq!(c.done, false);
}

The right thing would probably be to call failed(), but that usually takes error codes produced by the consumer as an argument, so I'm not sure what to do here.
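Whatever run() ends up reporting, the panic itself can be avoided by making the slice access fallible. A small sketch of that check, where `requested_window` is a hypothetical helper and `acc` / `needed` are named after the variables mentioned above:

```rust
/// Returns the first `needed` bytes of the accumulator, or None if the
/// producer stopped before delivering that much.
fn requested_window(acc: &[u8], needed: usize) -> Option<&[u8]> {
    acc.get(..needed)
}
```

The None case is where run() would have to decide between calling failed() and stopping quietly.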

`alt!` + `map!` + `call!` strange behavior

alt! + map! + call! behave strangely together.

This code works. Note the closure in map!:

named!(range<&[u8], Range>,
    alt!(
        chain!(
            start: take_char ~
            tag!("-") ~
            end: take_char,
            || {
                debug!("range: (start, end): ({:?}, {:?})", start, end);
                Range {
                    start: start,
                    end: end,
                }
            }
        ) |
        map!(
            take_char,
            |c| {
                debug!("range: c: {:?}", c);
                Range {
                    start: c,
                    end: c,
                }
            }
        )
    )
);

If we wrap the closure in map! with call!, it does not work:

...
        map!(
            take_char,
            call!(|c| {
                debug!("range: c: {:?}", c);
                Range {
                    start: c,
                    end: c,
                }
            })
        )
...

Error:

src/parser.rs:212:13: 212:14 error: unexpected end of macro invocation
src/parser.rs:212             })
                              ^

But if we use map! without alt!, it behaves differently. The following code works:

named!(literal<&[u8], Expr>,
    map!(
        many1!(take_char),
        call!(|cs| {
            debug!("literal: cs: {:?}", cs);
            Expr::Literal {
                chars: cs,
            }
        })
    )
);

Without call!, it does not work:

named!(literal<&[u8], Expr>,
    map!(
        many1!(take_char),
        |cs| {
            debug!("literal: cs: {:?}", cs);
            Expr::Literal {
                chars: cs,
            }
        }
    )
);

Error:

src/parser.rs:220:9: 220:10 error: expected ident, found |
src/parser.rs:220         |cs| {
                          ^

Regular expression parser

Parsers using regular expressions would be useful.

cf. Scala's parser combinators for an example

Consider using typed errors

Currently errors are only a u32 code, which is not very explicit to interpret.

It would be more user-friendly to use an enum to store error codes, like std::io::Error which gives easy access to the ErrorKind enum.

It is probably just a matter of integrating this enum https://github.com/Geal/nom/blob/master/src/util.rs#L466-L493 and adding a Custom(_) variant to it

error: macro undefined: 'delimited1!'

Using the delimited! macro produces the following error:

<nom macros>:8:32: 8:42 error: macro undefined: 'delimited1!'
<nom macros>:8 IResult:: Done ( i1 , _ ) => { delimited1 ! ( i1 , $ ( $ rest ) * ) } } } ) ;
                                              ^~~~~~~~~~

Reason:
the delimited1! and delimited2! macros lack the #[macro_export] attribute

Difference with Parsec nomenclature

Some of Parsec's functions are available in nom, but not all of them, and not always with the same name:

Document alt! block variants

Make single argument version of map_opt! and map_res!

There should be a combinator that wraps a function returning an Option or a Result directly, without mapping over the result of a first parser.

Example parsers

We currently have a few example parsers. In order to test the project and make it useful, other formats can be implemented. Here is a list, if anyone wants to try it:

text file formats:
- INI
- FASTQ
- libconfig-like configuration file format
- torrc configuration file
- ISO 8601 dates
- Web archive
- TOML
- bencode
- CSV
- YAML
- CommonMark
audio, video and image formats:
- MP4 (partial implementation)
- GIF
- FLAC
- FLV
- MKV
- OGG
- MPEG TS
- AVI
- PNG
- JPEG
- EXIF
- MP3
document formats:
- torrent files
- TAR
- PDF
- MS-CFB (compound format, used in doc, xls, ppt, cab, msi files)
- GZ
- ZIP
- RAR
- binary PLIST
database formats:
- Redis database files
- Ceph crush maps
network protocol formats:
- IRC
- Pcap-NG
- IP
- Ethernet
- PCAP
- NTP
- SNMP
- TLS
- TCP
- UDP
- DNS
executable formats:
- Portable executables (PE)
- ELF
- GameBoy ROM
crypto related:
- ASN.1
- X.509 certificates
- DER public and private keys
- SSL/TLS packets
- OpenPGP
programming languages:
- Rust
- Lua
- Python
- C
interface definition formats:
- Thrift
- Protobuf
- AIDL

[src] hyperlink links to wrong section of source

Pressing [src] in http://rust.unhandledexpression.com/nom/fn.multispace.html

links to

http://rust.unhandledexpression.com/src/nom/nom.rs.html#139-151

The consumed field in Await is confusing

Right now, the consume() method has to calculate every time how much data has been consumed. This complicates the code and makes it error prone.

One solution could be returning the remaining input, and letting the run() method calculate how much data has been consumed.

Channel producer

It should be possible to build a producer from an incoming channel, to parse data sent from another thread

Permutation parser

Is it possible to use namespaced functions in chain!()?

When using chain!(), using space works, but nom::space does not: error: no rules expected the token `::`.

Would it be possible to support this, or is this impossible/too complicated with the current macro system and the way chain!() is built? It would be nice not to have to `use` every function that appears in chain.

This is mostly a question of whether this is currently possible, it isn't really needed if it isn't possible.

tag! and byte arrays

What is the correct way to use tag! with a fixed byte array? tag!([42u8, 42u8]) fails, as AsBytes is not implemented for [u8; 2].

Bit field parsers

As mentioned by @rrichardson in UpstandingHackers/hammer#64 (comment), parsers that transform from bit positions to a tuple of fields would be useful:

pub fn be_bits_1<'a, A>(i: &[u8], u8) -> IResult<'a,&[u8], (A)>
pub fn be_bits_2<'a, A, B>(i: &[u8], u8, u8) -> IResult<'a,&[u8],( A, B )>
pub fn be_bits_3<'a, A, B, C>(i: &[u8], u8, u8, u8) -> IResult<'a,&[u8],( A, B, C )>
//.. up to 8 or so and also for le

// + a small bit of  macro magic to streamline the chaining 

// So to parse something like a TCP header, one would do something like: 
chain!(
    src_prt: be_u16 ~
    dst_prt: be_u16 ~
    seq_num: be_u32 ~
    ack_num: be_u32 ~
    (offs, _, flags) : be_bits_3( 4, 3, 9 ) ~ 
    blah: blah ~
)
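As a sketch of what the three-field case could do for the 4/3/9 split above, here is a plain function over a single big-endian 16-bit word. This `be_bits_3` is a simplified, hypothetical version: fixed u16 outputs, an Option instead of IResult, and the widths passed as a tuple, none of which is the proposed signature.

```rust
/// Splits one big-endian 16-bit word into three fields of the given bit widths.
/// Returns (remaining input, fields), or None on short input or bad widths.
fn be_bits_3(input: &[u8], widths: (u32, u32, u32)) -> Option<(&[u8], (u16, u16, u16))> {
    let (a, b, c) = widths;
    // This sketch only handles widths that are non-zero and fill the word exactly.
    if a == 0 || b == 0 || c == 0 || a + b + c != 16 || input.len() < 2 {
        return None;
    }
    let word = u16::from_be_bytes([input[0], input[1]]);
    let mask = |w: u32| ((1u32 << w) - 1) as u16;
    let first = (word >> (b + c)) & mask(a);
    let second = (word >> c) & mask(b);
    let third = word & mask(c);
    Some((&input[2..], (first, second, third)))
}
```

For the bytes 0x50 0x12 this yields an offset field of 5, a reserved field of 0, and a flags field of 0x012.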

count_fixed! macro doesn't seem to expand properly

Hi there,

I tried fixing this myself, but I don't understand why it's not working - sorry 😞

Essentially, this works (in a chain!):

        e_res2:     count_fixed!( call!(le_u16), u16, 10 ) ~

But this does not, despite this macro case:

        e_res2:     count_fixed!( le_u16, u16, 10 ) ~

Here's where I'm using it, if that's helpful.

not_line_ending returns Incomplete when it does not find its terminating byte

not_line_ending should return the whole array if it did not find it, and let the calling parser handle accumulating data

Consumer::run doesn't stop after ConsumerState::ConsumerError is returned

If you return ConsumerError from Consumer::consume, Consumer::run will essentially run in an infinite loop. I think this is due to the empty match branch in https://github.com/Geal/nom/blob/master/src/consumer.rs#L162.

Consumer::run() should ideally stop execution (at least stop calling Consumer::consume with invalid data) when this happens. With the current behavior, it's impossible to recover from parsing invalid data using a Consumer.

nom consuming 100% cpu

I am exploring the possibility of switching to nom in a project I am working on. I am not fully familiar with nom yet, so please bear with me.

For starters, I was trying to come up with a parser that matches strings of the form [a-zA-Z][-a-zA-Z0-9_]*. I wrote this:

#[macro_use]
extern crate nom;

use std::str::from_utf8;

use nom::{alpha, alphanumeric};
use nom::{IResult, Needed};
use nom::IResult::*;

named!(identifier<&[u8], String>,
       chain!(
           h: map_res!(alpha, from_utf8) ~
           t: many0!(alt!(alphanumeric | tag!("-") | tag!("_"))),
           || {
               let  s = h.to_string();
               t.into_iter().fold(s, |mut accum, slice| {
                   accum.push_str(from_utf8(slice).unwrap()); accum })}));

And I tested it with:

    #[test]
    fn id_name() {
        let a_setting = &b"miles"[..];
        let res = setting_name(a_setting);
        assert_eq!(res, Done(&b""[..], "miles".to_string()));
    }

When I run cargo test my PC completely hangs. With top I can see that it starts consuming more and more CPU and memory until the entire system is completely unusable and I have to hard reset.

Am I doing something wrong? Is this the best way to write a parser that matches this kind of string?
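One possible cause, stated here as an assumption and not confirmed by the report, is a repetition over a parser that can succeed without consuming any input: the loop then never advances and keeps pushing empty matches. The sketch below uses hypothetical plain functions (`many0`, `alnum0`), not nom's macros, to show a repetition that stops when no progress is made.

```rust
/// Applies `parser` repeatedly, collecting matches, and stops on failure
/// or as soon as the parser succeeds without consuming input.
fn many0<'a, F>(mut input: &'a [u8], parser: F) -> Vec<&'a [u8]>
where
    F: Fn(&'a [u8]) -> Option<(&'a [u8], &'a [u8])>,
{
    let mut out = Vec::new();
    while let Some((rest, matched)) = parser(input) {
        if rest.len() == input.len() {
            break; // no progress: looping again would never terminate
        }
        out.push(matched);
        input = rest;
    }
    out
}

/// Matches zero or more ASCII alphanumerics, so it succeeds even on empty input.
fn alnum0(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let n = input.iter().take_while(|b| b.is_ascii_alphanumeric()).count();
    Some((&input[n..], &input[..n]))
}
```

Without the progress check, `many0(b"miles", alnum0)` would push empty slices forever, which would match the growing memory use described above.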

Add "switch" parser combinator macro

It's quite a common pattern in binary protocols to use input[0] as the message type and the remaining bytes as the message body. I propose adding a macro for the following "switch" parser:

fn parse (i: &[u8]) -> IResult<&[u8], T> {
    match takeN!(i) {
        IResult::Done(i, o) => {
            match o {
                1 => parse_a(i),
                2 => parse_b(i),
                3 => parse_c(i),
                ...
                _ => IResult::Error(Err::Code(1))
            }
        }
        IResult::Error(e) => IResult::Error(e),
        IResult::Incomplete(n) => IResult::Incomplete(n),
    }
}
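A self-contained version of the same dispatch, written without nom so that it compiles on its own. The `Message` and `ParseError` types and the type codes 1 and 2 are made up for this sketch.

```rust
#[derive(Debug, PartialEq)]
enum Message<'a> {
    Ping,
    Data(&'a [u8]),
}

#[derive(Debug, PartialEq)]
enum ParseError {
    Incomplete,
    UnknownType(u8),
}

/// Uses the first byte as the message type and hands the rest to the matching branch.
fn parse_message(input: &[u8]) -> Result<Message<'_>, ParseError> {
    let (&kind, body) = input.split_first().ok_or(ParseError::Incomplete)?;
    match kind {
        1 => Ok(Message::Ping),
        2 => Ok(Message::Data(body)),
        other => Err(ParseError::UnknownType(other)),
    }
}
```

A switch macro would generate the outer match and leave only the type-to-parser table to the user.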

Public named functions/parsers

I guess re-exporting would be an alternative to adding a public export option to named parsers. What do you think of adding something like named!(pub foo<&u8>, ...)?

Provide some way to access nom::Needed values

I know the documentation says (for now the value is ignored, but it should indicate how much is needed), but it seems like most built-in parsers return reasonable values for this, and it would be super nice to be able to use when returning a ConsumerState::Await value from a consumer.

I would think this would be as simple as:

nom::IResult::Incomplete(x) => {
    let x = match x {
        nom::internal::Needed::Size(x) => x,
        nom::internal::Needed::Unknown => 1,
    };
    ConsumerState::Await(0, x)
},

But alas, this does not work, due to Needed::Size and Needed::Unknown both being private variants. (error: variant `Size` is private)

It would be nice to make at least Needed::Size public, or to add a possible_size(&self) method which would return Option<usize>.
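The accessor idea could look roughly like this. The enum is a local stand-in with the two variants named above, and `await_size` is a hypothetical helper mirroring the fallback of 1 from the snippet.

```rust
/// Local stand-in for nom's Needed, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Needed {
    Unknown,
    Size(usize),
}

impl Needed {
    /// Some(n) when the parser knows how much it needs, None otherwise.
    fn possible_size(&self) -> Option<usize> {
        match *self {
            Needed::Size(n) => Some(n),
            Needed::Unknown => None,
        }
    }
}

/// The size to request in an Await state, falling back to 1 when unknown.
fn await_size(needed: Needed) -> usize {
    needed.possible_size().unwrap_or(1)
}
```

With such a method the variants could stay private while consumers still get at the value.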

Producers produce fixed size chunks

There are cases where we do not know how much data we need at first, but after getting a header, we know what chunk size would be optimal.
Making the producers able to produce arbitrarily sized chunks could be useful

Why use `IResult` instead of `std::result::Result<Status<I, O>, Error>`?

Why does nom use:

pub enum IResult<I,O> {
  Done(I,O),
  Error(Err),
  Incomplete(u32)
}

why not (for example):

pub enum Status<I, O> {
    Done(I, O),
    Incomplete(u32)
}

type Result<I, O> = std::result::Result<Status<I, O>, Error>;

Because of this, it is not possible to use map_err or try!.
What is the reason for this design?
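To show what the alternative buys, here is a small sketch of the Result-based shape. `PResult`, `ParseError`, `byte` and `byte_described` are hypothetical names, and the `?` operator stands in for try!.

```rust
#[derive(Debug, PartialEq)]
enum Status<I, O> {
    Done(I, O),
    Incomplete(u32),
}

#[derive(Debug, PartialEq)]
struct ParseError(u32);

type PResult<I, O> = Result<Status<I, O>, ParseError>;

/// Matches a single expected byte.
fn byte(input: &[u8], expected: u8) -> PResult<&[u8], u8> {
    match input.split_first() {
        None => Ok(Status::Incomplete(1)),
        Some((&b, rest)) if b == expected => Ok(Status::Done(rest, b)),
        Some(_) => Err(ParseError(1)),
    }
}

/// Shows map_err and `?` working directly on the parser's return value.
fn byte_described(input: &[u8], expected: u8) -> Result<Status<&[u8], u8>, String> {
    let status = byte(input, expected).map_err(|e| format!("parse error code {}", e.0))?;
    Ok(status)
}
```

The cost is that callers need two levels of matching to reach Done, which may be part of the trade-off behind the single enum.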