leopoldarkham / Molten
[WIP] Molten - Style-preserving TOML parser.
License: Apache License 2.0
Currently I think visibility might be a mess (`pub(crate)` scattered around). If someone can go through everything and make sure scoping and visibility are correct, that'd be great!
The fuzzer from #40 finds a panic triggered by a byte index that is out of bounds. This may be related to #41.
Adding this to the parser tests:
#[test]
fn issue42() {
let text = ::std::str::from_utf8(b"\'\nv\'f%\nb").unwrap();
let _ = Parser::new(text).parse();
}
fails with
---- parser::tests::issue42 stdout ----
thread 'parser::tests::issue42' panicked at 'byte index 9 is out of bounds of `'
v'f%
b`', src/libcore/str/mod.rs:2218:8
stack backtrace:
0: 0x1048c54fb - std::sys::unix::backtrace::tracing::imp::unwind_backtrace::h42b4ce0b98574016
1: 0x1048be1de - std::sys_common::backtrace::print::h293f72b3c32dfa5b
2: 0x1048c5fd0 - _ZN3std9panicking12default_hook28_$u7b$$u7b$closure$u7d$$u7d$17haa05421012ac6c20E.llvm.5B51A422
3: 0x1048c5cd4 - _ZN3std9panicking12default_hook17h2d0d0ff0f27771f9E.llvm.5B51A422
4: 0x1048c6436 - std::panicking::rust_panic_with_hook::h2a8b5b7a95208f5a
5: 0x1048c628e - _ZN3std9panicking11begin_panic17h1179f738121e3414E.llvm.5B51A422
6: 0x1048c61e3 - std::panicking::begin_panic_fmt::h4986ee369ba2ddac
7: 0x1048c6152 - rust_begin_unwind
8: 0x104908183 - core::panicking::panic_fmt::hd9b79e885de0143c
9: 0x10490ce04 - core::str::slice_error_fail::h606b34e3191fab74
10: 0x10482138e - core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::Range<usize>>::index::{{closure}}::h9b2035a79c64f1dc
11: 0x1048376e0 - <core::option::Option<T>>::unwrap_or_else::h9be067b5aa4c4510
12: 0x104831cab - core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::Range<usize>>::index::h195fcc6753118e5e
13: 0x10482133c - core::str::traits::<impl core::ops::index::Index<core::ops::range::Range<usize>> for str>::index::h779aae528849c22f
14: 0x10483f026 - Molten::parser::Parser::parse_val::hc4a6ac51f5c68f43
15: 0x10483e7ee - Molten::parser::Parser::parse_key_value::hb46cee7de7a0d92c
16: 0x10483deab - Molten::parser::Parser::parse_item::h9ead6e84efb9d9e8
17: 0x10483c629 - Molten::parser::Parser::parse::h70b405a15af60225
18: 0x1048339c8 - Molten::parser::tests::issue42::hea1b53b87d26c939
19: 0x104879871 - _ZN42_$LT$F$u20$as$u20$test..FnBox$LT$T$GT$$GT$8call_box17h07fcc9c8a31cf663E.llvm.B477B8A1
20: 0x1048d971e - __rust_maybe_catch_panic
21: 0x10486a110 - std::sys_common::backtrace::__rust_begin_short_backtrace::h727bf0093a90e1fc
22: 0x10486f907 - _ZN3std9panicking3try7do_call17hc2fcbcc92b537a0fE.llvm.1CF3EA36
23: 0x1048d971e - __rust_maybe_catch_panic
24: 0x104885ec1 - _ZN50_$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$8call_box17h765b683cac353d8aE.llvm.4B1D5347
25: 0x1048c0c07 - std::sys_common::thread::start_thread::h4ad86c0b3fefeff0
26: 0x1048cb908 - _ZN3std3sys4unix6thread6Thread3new12thread_start17h333ed09407a7bdcdE.llvm.9AF12AEA
27: 0x7fffa0c4493a - _pthread_body
28: 0x7fffa0c44886 - _pthread_start
Currently, everything is allocated into a String; it may not matter much at the sizes typical of TOML files, but I'd still like to move to `Cow` or similar where possible.
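A minimal sketch of the direction (the `raw_fragment` helper is hypothetical, not part of Molten): values that need no transformation borrow from the input, and an owned String is allocated only when the text must be normalized.

```rust
use std::borrow::Cow;

// Hypothetical helper: borrow the raw slice when it needs no processing,
// and allocate only when we must normalize (here: stripping underscores
// from a numeric literal).
fn raw_fragment(input: &str) -> Cow<'_, str> {
    if input.contains('_') {
        Cow::Owned(input.replace('_', ""))
    } else {
        Cow::Borrowed(input)
    }
}

fn main() {
    // No underscores: no allocation, the input is borrowed.
    assert!(matches!(raw_fragment("1979"), Cow::Borrowed(_)));
    // Underscores present: a normalized owned String is produced.
    assert_eq!(raw_fragment("1_000"), "1000");
}
```

The nice property is that callers see a `&str` either way via deref, so the change can be made incrementally.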
Self-explanatory, it's a better description of what it contains.
There should be a test suite for indexing all indexable values. Indexing was broken by #15 so it will need fixing to make the tests green!
Initially, we can just convert the current parser index into a (column, line) pair.
Later on, it would be nice to backtrack to the beginning of the item causing a problem.
Reporting a span consisting of two (col, line) pairs is a goal down the line but not a priority. (Good for highlighting errors in an editor, for instance)
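The first step (converting a parser byte index into a position) can be sketched as a plain scan over the source; `position` is a hypothetical helper, not Molten's actual code, and it returns a one-based `(line, column)` pair.

```rust
// Minimal sketch: convert a byte offset into a one-based (line, column)
// pair by walking the source up to that offset.
fn position(src: &str, byte_idx: usize) -> (usize, usize) {
    let mut line = 1;
    let mut col = 1;
    for (i, c) in src.char_indices() {
        if i >= byte_idx {
            break;
        }
        if c == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (line, col)
}

fn main() {
    let src = "a = 1\nb = ?";
    // Byte 10 is the '?' on the second line, fifth column.
    assert_eq!(position(src, 10), (2, 5));
}
```

An O(n) scan per error is fine since it only runs on the failure path; spans would just be two such pairs.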
Pretty printing the parser's result... isn't. To make debugging easier, we should override it so that it uses fewer superfluous newlines and less indentation, and shows a more compact, ad-hoc representation of containers.
The `floats` and `integers` tests are currently failing: the parser discards any whitespace that follows the last digit (see the failing tests). It seems to only be a problem at the end of a line, in a key/value pair.
`Parser::new()` should become `Parser::from()` and accept anything that `AsRef`s to `&str`.
`Parser::from_file()` should be added and accept anything that `AsRef`s to `Path`.
`TOMLDocument::save_to()` should be added and accept anything that `AsRef`s to `Path`.
`TOMLDocument::save()` should be added; it overwrites the source file if the parser was created through the `Parser::from_file()` method and returns an error otherwise.
The tests will need to be updated to use these new methods.
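Under the proposal above, the constructor signatures might look roughly like this. The `src` and `source_path` fields are stand-ins of my own, not Molten's real internals; the point is only the `AsRef` bounds and the remembered source path that `save()` would later rely on.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical sketch of the proposed constructors; internals are stand-ins.
struct Parser {
    src: String,
    // Remembered so a later `save()` knows where to write back to.
    source_path: Option<PathBuf>,
}

impl Parser {
    // Replaces Parser::new(): accepts anything that AsRefs to str.
    fn from<S: AsRef<str>>(input: S) -> Self {
        Parser {
            src: input.as_ref().to_owned(),
            source_path: None,
        }
    }

    // New constructor: accepts anything that AsRefs to Path.
    fn from_file<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        let src = fs::read_to_string(path.as_ref())?;
        Ok(Parser {
            src,
            source_path: Some(path.as_ref().to_owned()),
        })
    }
}

fn main() {
    let p = Parser::from("key = \"value\"");
    assert!(p.source_path.is_none());
    assert_eq!(p.src, "key = \"value\"");
}
```

`TOMLDocument::save()` would then return an error whenever `source_path` is `None`.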
See this comment
A bare key must be non-empty; the parser should enforce this and raise an appropriate error when needed.
The code assumes Windows-style line separators (`\r\n`), but on UNIX-based OSes the line separator is `\n`, so the generated strings differ from the test sources and the tests fail. The code should use a constant chosen based on the operating system so the comparisons work regardless of the host OS.
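A sketch of the suggested constant; `NL` is a hypothetical name, not something in Molten today.

```rust
// Pick the platform's native line separator at compile time.
#[cfg(windows)]
const NL: &str = "\r\n";
#[cfg(not(windows))]
const NL: &str = "\n";

fn main() {
    // Tests can build expected strings with NL instead of hard-coding "\r\n".
    let expected = format!("key = 1{}", NL);
    assert!(expected.ends_with(NL));
}
```

A separate issue in this tracker suggests capturing the newline style from the parsed input instead of the host platform; in that design this constant would mainly matter for comparing test fixtures.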
Currently, the objects in the AST use exactly the names given to them in the file. That's good, except that it includes things before a dot. So, for instance, the following TOML file
foo.bar = 1
[foo]
bar = 2
is parsed in such a way that it becomes equivalent to this JSON string:
{
"foo.bar": 1,
"foo": {"bar": 2}
}
We can fix this by storing only the last segment of any key, and following the paths to the `Container`s that hold them. And if a table doesn't exist, it should be created implicitly so that it can hold the container.
So, the process for parsing the TOML example above would be roughly the following:

1. Parse the key `foo` and see that it is followed by a dot.
2. Look in the active `Container` for a table called `foo`. Since it isn't there, make a new table, marking it as implicitly defined.
3. Add the key `bar` with a value of 1 to the new `Container`.
4. Parse `[foo]`. See that the active container already contains something called `foo`, but don't complain because we're explicitly defining a previously implicit table. Make `foo`'s `Container` the active one.
5. Parse `bar` and find that it is already in the active `Container`. Return a `DuplicateKey` error.

I'm currently working on fixing this.
This is to discuss how error reporting should work.
Thoughts: I've considered `error-chain`, but I've seen people raise concerns about it that have made me shy away from it. Thoughts on whether this is justified, and alternatives if it is?

The spec states:
"Similar to integers, you may use underscores to enhance readability. Each underscore must be surrounded by at least one digit."
I think we currently allow leading and trailing underscores; this should be fixed.
Auditing that part of the code to make sure there aren't any other issues with what we allow and disallow for ints and floats would be a good thing in general.
This should probably be looked at together with #2
The code for this is in the "int, float, datetime" arm of the `parse_val()` function in `parser.rs`.
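The spec rule can be checked with a small scan; `underscores_ok` is a hypothetical helper sketch, not the actual `parse_val()` code.

```rust
// Spec rule: every underscore in a numeric literal must have a digit on
// both sides, which rules out leading, trailing, and doubled underscores.
fn underscores_ok(digits: &str) -> bool {
    let bytes = digits.as_bytes();
    bytes.iter().enumerate().all(|(i, &b)| {
        b != b'_'
            || (i > 0
                && i + 1 < bytes.len()
                && bytes[i - 1].is_ascii_digit()
                && bytes[i + 1].is_ascii_digit())
    })
}

fn main() {
    assert!(underscores_ok("1_000"));
    assert!(!underscores_ok("_1000"));  // leading underscore
    assert!(!underscores_ok("1000_"));  // trailing underscore
    assert!(!underscores_ok("1__000")); // consecutive underscores
}
```

The same predicate works for the digit runs of floats, so it could be shared across the int and float paths.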
Hey @ehuss,
Thanks for your ping on Reddit and for offering your help!
As you can see, I've opened some issues here; of course docs are also welcome, as are any passes you want to do over the `@todo`s, `@cleanup`s, etc. that I have in the code.
Keep in mind though, all of this is prototype quality at best and the code has some glaring problems. Hilariously and to my shame, tables are currently parsed as many times as their position in the document, for instance. Will fix Soon.
Also, may I ask what you want this project for?
`tests/strings.toml` should have more cases added: basically, it should test everything the TOML standard specifies for strings.
I've noticed that, with the latest stable version of Rust (1.22.1), this entire project can be compiled. The only thing that forces us to use a nightly compiler is the dependence on the `test-case-derive` crate, and specifically its use of `#![feature(proc_macro)]`. (I confirmed it by getting rid of that dependency and successfully running the tests with the stable compiler.) It would be nice to rely only on stable features as much as possible. We can already use the stable compiler for release builds, but using two different versions might lead to some unexpected bugs.
That said, if there's a reason for us to use `test-case-derive`, that's fine. But if it's okay to get rid of it, I don't mind restructuring the test code for that purpose. The changes that would be needed at this point are pretty minor.
If a TOML document does not contain a newline at the end of the file, appending to it will cause the new KV to be on the same line as the last KV.
Add a check for this and add a newline to the last existing value if necessary.
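The check could be as simple as this; `ensure_trailing_newline` is a hypothetical helper name.

```rust
// Before appending a new key/value, make sure the document text ends in a
// newline so the appended KV lands on its own line.
fn ensure_trailing_newline(doc: &mut String) {
    if !doc.is_empty() && !doc.ends_with('\n') {
        doc.push('\n');
    }
}

fn main() {
    let mut doc = String::from("a = 1"); // no trailing newline
    ensure_trailing_newline(&mut doc);
    doc.push_str("b = 2\n");
    assert_eq!(doc, "a = 1\nb = 2\n");
}
```

If the document ends up tracking newline style (a separate issue here), the pushed newline should match that style rather than always being `\n`.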
Currently, the parser ends a whitespace item whenever a newline is reached. When appending a WS item, check whether it follows another WS item, and merge them together if it does.
Look for the `todo` in `parse_item()`, currently on line 135.
When serializing, it currently assumes that Windows is CRLF. However, I never use that style of newlines on Windows, and thus the tests are currently failing. It should instead capture the newline style while parsing, and ignore the current platform.
The fuzzer from #40 finds a panic triggered by a byte index that is outside a char boundary.
Adding this to the parser tests:
#[test]
fn issue41() {
let text = ::std::str::from_utf8(b"\'\'fb\'\xee\x9d\xbd").unwrap();
let _ = Parser::new(text).parse();
}
fails with
---- parser::tests::issue41 stdout ----
thread 'parser::tests::issue41' panicked at 'byte index 7 is not a char boundary; it is inside '\u{e77d}' (bytes 5..8) of `''fb'๎ฝ`', src/libcore/str/mod.rs:2235:4
stack backtrace:
0: 0x10ad515eb - std::sys::unix::backtrace::tracing::imp::unwind_backtrace::h42b4ce0b98574016
1: 0x10ad4a2ce - std::sys_common::backtrace::print::h293f72b3c32dfa5b
2: 0x10ad520c0 - _ZN3std9panicking12default_hook28_$u7b$$u7b$closure$u7d$$u7d$17haa05421012ac6c20E.llvm.5B51A422
3: 0x10ad51dc4 - _ZN3std9panicking12default_hook17h2d0d0ff0f27771f9E.llvm.5B51A422
4: 0x10ad52526 - std::panicking::rust_panic_with_hook::h2a8b5b7a95208f5a
5: 0x10ad5237e - _ZN3std9panicking11begin_panic17h1179f738121e3414E.llvm.5B51A422
6: 0x10ad522d3 - std::panicking::begin_panic_fmt::h4986ee369ba2ddac
7: 0x10ad52242 - rust_begin_unwind
8: 0x10ad94273 - core::panicking::panic_fmt::hd9b79e885de0143c
9: 0x10ad98ef4 - core::str::slice_error_fail::h606b34e3191fab74
10: 0x10acb1cbe - core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::Range<usize>>::index::{{closure}}::h9b2035a79c64f1dc
11: 0x10acc2f90 - <core::option::Option<T>>::unwrap_or_else::h9be067b5aa4c4510
12: 0x10acbdf6b - core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::Range<usize>>::index::h195fcc6753118e5e
13: 0x10acb1c6c - core::str::traits::<impl core::ops::index::Index<core::ops::range::Range<usize>> for str>::index::h779aae528849c22f
14: 0x10accb116 - Molten::parser::Parser::parse_val::hc4a6ac51f5c68f43
15: 0x10acca8de - Molten::parser::Parser::parse_key_value::hb46cee7de7a0d92c
16: 0x10acc9f9b - Molten::parser::Parser::parse_item::h9ead6e84efb9d9e8
17: 0x10acc8719 - Molten::parser::Parser::parse::h70b405a15af60225
18: 0x10acb0ba8 - Molten::parser::tests::issue41::h7425247ee675da6e
19: 0x10ad05961 - _ZN42_$LT$F$u20$as$u20$test..FnBox$LT$T$GT$$GT$8call_box17h07fcc9c8a31cf663E.llvm.B477B8A1
20: 0x10ad6580e - __rust_maybe_catch_panic
21: 0x10acf6200 - std::sys_common::backtrace::__rust_begin_short_backtrace::h727bf0093a90e1fc
22: 0x10acfb9f7 - _ZN3std9panicking3try7do_call17hc2fcbcc92b537a0fE.llvm.1CF3EA36
23: 0x10ad6580e - __rust_maybe_catch_panic
24: 0x10ad11fb1 - _ZN50_$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$8call_box17h765b683cac353d8aE.llvm.4B1D5347
25: 0x10ad4ccf7 - std::sys_common::thread::start_thread::h4ad86c0b3fefeff0
26: 0x10ad579f8 - _ZN3std3sys4unix6thread6Thread3new12thread_start17h333ed09407a7bdcdE.llvm.9AF12AEA
27: 0x7fffa0c4493a - _pthread_body
28: 0x7fffa0c44886 - _pthread_start
Discuss what features the API should have.
As far as implementing it, we should start with:
Any code that implements this needs to be rigorously tested, as it's obviously a fragile part of the library.
#7 should be looked into before this
The wishlist for later on includes:
MIT or dual MIT/Apache-2.0 would be my personal preference, and seems to be common in the Rust community.
As per this paragraph in the spec, table names can include whitespace, which should be stripped before comparing them.
These comparisons occur in 3(?) places:
Parser::is_child()
Parser::parse_table()
Parser::parse_aot()
The normalization is only for comparisons, the actual value must still include the raw name so it can be reproduced exactly.
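The comparison could be sketched like this (hypothetical helper; note that quoted keys, where inner whitespace is significant, would need extra care that this sketch skips).

```rust
// Normalize a dotted table header for lookup: split on dots and trim each
// segment, while the raw text is kept elsewhere for style-preserving output.
fn normalized_key(raw: &str) -> Vec<String> {
    raw.split('.').map(|seg| seg.trim().to_owned()).collect()
}

fn main() {
    // "[ a . b ]" and "[a.b]" should refer to the same table.
    assert_eq!(normalized_key(" a . b "), normalized_key("a.b"));
    // Different segments still compare unequal.
    assert_ne!(normalized_key("a.b"), normalized_key("a.c"));
}
```

Comparing the normalized `Vec<String>` in `is_child()`, `parse_table()`, and `parse_aot()` keeps the raw names untouched for display.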
The parser currently assumes that all sibling tables in an AoT have their names/keys formatted the same way. Since this may not be the case, we need to track the raw name of each of them, and use that in the display logic.
As of #33, errors can report the position where they occurred; add some tests for this using some malformed TOML files.
Hi!
I'd like to share a possible alternative design for the lossless AST. I think the current Molten design is great, but it's always better to have more options available :) I maintain the TOML plugin for the IntelliJ platform, which uses this AST representation; such a representation is pretty typical for IntelliJ and works great if you care about all trivia.
The idea is to create a two-layered tree structure. As the base, there's a homogeneous "untyped" tree which stores only spans into the original file, and the typed, "abstract" representation is layered on top of it.
In code, the first layer looks like this (super simplified):
// An "enum", representing all of the syntactic constructs of a particular language,
// both simple tokens and composite elements.
struct NodeType(u32);
// For TOML, there would be a bunch of constants like
// const WHITESPACE: NodeType = NodeType(0);
// const EQUALS_SIGN: NodeType = NodeType(1);
// ...
// const KEY_VALUE_PAIR: NodeType = NodeType(92);
// The *single* node type, which has a type, a text range, and a list of children.
struct Node {
type: NodeType,
text_range: (usize, usize),
children: Vec<Node>,
}
So, the parse function looks like `fn parse(input: &str) -> Node`. The `Node` representation is naturally lossless, because it only contains spans. Moreover, you can now easily write functions to traverse and edit the tree without the need to solve the expression problem. It's really nice to write code formatters against this representation, because you can make the rules declarative (e.g. `.around(NodeType).ensure_spaces(u32)`).
However, this representation is not really convenient to work with, because it is untyped. But you can add types! The idea is to create a separate struct for each important `NodeType`, such that the struct holds only a `Node` with this `NodeType`. So, something like this:
trait AstNode {
const TYPE: NodeType;
fn new(node: Node) -> Self;
fn node(&self) -> &Node;
}
struct TomlKeyValuePair(Node);
impl AstNode for TomlKeyValuePair {
fn new(node: Node) -> Self {
assert_eq!(node.type, KEY_VALUE_PAIR); // here's the transition between untyped and typed worlds
TomlKeyValuePair(node)
}
}
struct TomlKey(Node);
impl AstNode for TomlKey { ... }
struct TomlValue(Node);
impl AstNode for TomlValue { ... }
impl TomlKeyValuePair {
fn key(&self) -> TomlKey {
TomlKey(self.0.children().find(|n| n.type == KEY).unwrap())
}
fn value(&self) -> TomlValue {
TomlValue(self.0.children().find(|n| n.type == VALUE).unwrap())
}
}
That is, the abstract tree is a zero-cost wrapper around the bare syntax tree which adds a proof that a particular node indeed holds a particular construct.
Hope this was interesting for you! :)
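For reference, the sketch above condenses into this compilable miniature (toy names and hard-coded spans of my own, not the IntelliJ code):

```rust
// Layer 1: one homogeneous node type carrying only a tag and a span.
#[derive(Clone, Copy, PartialEq, Debug)]
struct NodeType(u32);

const KEY: NodeType = NodeType(0);
const VALUE: NodeType = NodeType(1);
const KEY_VALUE_PAIR: NodeType = NodeType(2);

#[derive(Clone, Debug)]
struct Node {
    ty: NodeType,
    text_range: (usize, usize),
    children: Vec<Node>,
}

// Layer 2: a zero-cost typed wrapper over the untyped node.
struct TomlKeyValuePair(Node);

impl TomlKeyValuePair {
    fn new(node: Node) -> Self {
        // The single checked transition between untyped and typed worlds.
        assert_eq!(node.ty, KEY_VALUE_PAIR);
        TomlKeyValuePair(node)
    }
    // Recover the key's text losslessly from its span into the source.
    fn key_text<'a>(&self, src: &'a str) -> &'a str {
        let n = self.0.children.iter().find(|n| n.ty == KEY).unwrap();
        &src[n.text_range.0..n.text_range.1]
    }
}

fn main() {
    let src = "a = 1";
    let pair = TomlKeyValuePair::new(Node {
        ty: KEY_VALUE_PAIR,
        text_range: (0, 5),
        children: vec![
            Node { ty: KEY, text_range: (0, 1), children: vec![] },
            Node { ty: VALUE, text_range: (4, 5), children: vec![] },
        ],
    });
    assert_eq!(pair.key_text(src), "a");
}
```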
Indexing a container currently takes whitespace and comments into account. It shouldn't.
Thought: Perhaps we could have a wrapper type around `usize` called `Exhaustive` or `Trivia` that does include them when used as the index? People can just use `nth()` on the exhaustive iterator otherwise.
All the iterator code (in `container.rs`) may need a glance over, actually; it's prototype quality and may be backwards.
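The wrapper idea might look like this (all names hypothetical): a plain `usize` index skips trivia, while a `Trivia(usize)` index counts everything.

```rust
#[derive(Debug, PartialEq)]
enum Item {
    Value(i64),
    Whitespace,
    Comment,
}

// Newtype index that opts in to counting whitespace and comments.
struct Trivia(usize);

struct Container(Vec<Item>);

impl Container {
    // Plain index: counts only "real" values, skipping trivia.
    fn get(&self, idx: usize) -> Option<&Item> {
        self.0
            .iter()
            .filter(|i| matches!(i, Item::Value(_)))
            .nth(idx)
    }
    // Exhaustive index: counts every item, trivia included.
    fn get_exhaustive(&self, idx: Trivia) -> Option<&Item> {
        self.0.get(idx.0)
    }
}

fn main() {
    let c = Container(vec![
        Item::Whitespace,
        Item::Value(1),
        Item::Comment,
        Item::Value(2),
    ]);
    assert_eq!(c.get(1), Some(&Item::Value(2)));
    assert_eq!(c.get_exhaustive(Trivia(0)), Some(&Item::Whitespace));
}
```

The two entry points make the intent explicit at the call site, so nobody indexes into trivia by accident.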