leopoldarkham / Molten
[WIP] Molten - Style-preserving TOML parser.
License: Apache License 2.0
Currently I think visibility might be a mess (`pub(crate)` scattered around). If someone can go through everything and make sure scoping and visibility are correct, that'd be great!
The fuzzer from #40 finds a panic triggered by a byte index that is out of bounds. This may be related to #41.
Adding this to the parser tests:
#[test]
fn issue42() {
let text = ::std::str::from_utf8(b"\'\nv\'f%\nb").unwrap();
let _ = Parser::new(text).parse();
}
fails with
---- parser::tests::issue42 stdout ----
thread 'parser::tests::issue42' panicked at 'byte index 9 is out of bounds of `'
v'f%
b`', src/libcore/str/mod.rs:2218:8
stack backtrace:
0: 0x1048c54fb - std::sys::unix::backtrace::tracing::imp::unwind_backtrace::h42b4ce0b98574016
1: 0x1048be1de - std::sys_common::backtrace::print::h293f72b3c32dfa5b
2: 0x1048c5fd0 - _ZN3std9panicking12default_hook28_$u7b$$u7b$closure$u7d$$u7d$17haa05421012ac6c20E.llvm.5B51A422
3: 0x1048c5cd4 - _ZN3std9panicking12default_hook17h2d0d0ff0f27771f9E.llvm.5B51A422
4: 0x1048c6436 - std::panicking::rust_panic_with_hook::h2a8b5b7a95208f5a
5: 0x1048c628e - _ZN3std9panicking11begin_panic17h1179f738121e3414E.llvm.5B51A422
6: 0x1048c61e3 - std::panicking::begin_panic_fmt::h4986ee369ba2ddac
7: 0x1048c6152 - rust_begin_unwind
8: 0x104908183 - core::panicking::panic_fmt::hd9b79e885de0143c
9: 0x10490ce04 - core::str::slice_error_fail::h606b34e3191fab74
10: 0x10482138e - core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::Range<usize>>::index::{{closure}}::h9b2035a79c64f1dc
11: 0x1048376e0 - <core::option::Option<T>>::unwrap_or_else::h9be067b5aa4c4510
12: 0x104831cab - core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::Range<usize>>::index::h195fcc6753118e5e
13: 0x10482133c - core::str::traits::<impl core::ops::index::Index<core::ops::range::Range<usize>> for str>::index::h779aae528849c22f
14: 0x10483f026 - Molten::parser::Parser::parse_val::hc4a6ac51f5c68f43
15: 0x10483e7ee - Molten::parser::Parser::parse_key_value::hb46cee7de7a0d92c
16: 0x10483deab - Molten::parser::Parser::parse_item::h9ead6e84efb9d9e8
17: 0x10483c629 - Molten::parser::Parser::parse::h70b405a15af60225
18: 0x1048339c8 - Molten::parser::tests::issue42::hea1b53b87d26c939
19: 0x104879871 - _ZN42_$LT$F$u20$as$u20$test..FnBox$LT$T$GT$$GT$8call_box17h07fcc9c8a31cf663E.llvm.B477B8A1
20: 0x1048d971e - __rust_maybe_catch_panic
21: 0x10486a110 - std::sys_common::backtrace::__rust_begin_short_backtrace::h727bf0093a90e1fc
22: 0x10486f907 - _ZN3std9panicking3try7do_call17hc2fcbcc92b537a0fE.llvm.1CF3EA36
23: 0x1048d971e - __rust_maybe_catch_panic
24: 0x104885ec1 - _ZN50_$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$8call_box17h765b683cac353d8aE.llvm.4B1D5347
25: 0x1048c0c07 - std::sys_common::thread::start_thread::h4ad86c0b3fefeff0
26: 0x1048cb908 - _ZN3std3sys4unix6thread6Thread3new12thread_start17h333ed09407a7bdcdE.llvm.9AF12AEA
27: 0x7fffa0c4493a - _pthread_body
28: 0x7fffa0c44886 - _pthread_start
Currently, everything is allocated into a String; it may not matter much at the sizes typical of TOML files, but I'd still like to move to `Cow` or similar where possible.
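A minimal sketch of the direction (the `raw_fragment` helper is hypothetical, not part of Molten): values that need no transformation borrow from the input, and an owned String is allocated only when the text must be normalized.

```rust
use std::borrow::Cow;

// Hypothetical helper: borrow the raw slice when it needs no processing,
// and allocate only when we must normalize (here: stripping underscores
// from a numeric literal).
fn raw_fragment(input: &str) -> Cow<'_, str> {
    if input.contains('_') {
        Cow::Owned(input.replace('_', ""))
    } else {
        Cow::Borrowed(input)
    }
}

fn main() {
    // No underscores: no allocation, the input is borrowed.
    assert!(matches!(raw_fragment("1979"), Cow::Borrowed(_)));
    // Underscores present: a normalized owned String is produced.
    assert_eq!(raw_fragment("1_000"), "1000");
}
```

The nice property is that callers see a `&str` either way via deref, so the change can be made incrementally.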
Self-explanatory, it's a better description of what it contains.
There should be a test suite for indexing all indexable values. Indexing was broken by #15 so it will need fixing to make the tests green!
Initially, we can just convert the current parser index into a (column, line) pair.
Later on, it would be nice to backtrack to the beginning of the item causing a problem.
Reporting a span consisting of two (col, line) pairs is a goal down the line but not a priority. (Good for highlighting errors in an editor, for instance)
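The first step (converting a parser byte index into a position) can be sketched as a plain scan over the source; `position` is a hypothetical helper, not Molten's actual code, and it returns a one-based `(line, column)` pair.

```rust
// Minimal sketch: convert a byte offset into a one-based (line, column)
// pair by walking the source up to that offset.
fn position(src: &str, byte_idx: usize) -> (usize, usize) {
    let mut line = 1;
    let mut col = 1;
    for (i, c) in src.char_indices() {
        if i >= byte_idx {
            break;
        }
        if c == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (line, col)
}

fn main() {
    let src = "a = 1\nb = ?";
    // Byte 10 is the '?' on the second line, fifth column.
    assert_eq!(position(src, 10), (2, 5));
}
```

An O(n) scan per error is fine since it only runs on the failure path; spans would just be two such pairs.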
Pretty printing the parser's result... isn't. To make debugging easier, we should override it so that it uses fewer superfluous newlines and less indentation, and shows a more compact, ad-hoc representation of containers.
The `floats` and `integers` tests are currently failing: the parser discards any whitespace that follows the last digit (see the failing tests). It seems to only be a problem at the end of a line, in a key/value pair.
`Parser::new()` should become `Parser::from()` and accept anything that `AsRef`s to `&str`.
`Parser::from_file()` should be added and accept anything that `AsRef`s to `Path`.
`TOMLDocument::save_to()` should be added and accept anything that `AsRef`s to `Path`.
`TOMLDocument::save()` should be added; it overwrites the source file if the parser was created through the `Parser::from_file()` method and returns an error otherwise.
The tests will need to be updated to use these new methods.
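Under the proposal above, the constructor signatures might look roughly like this. The `src` and `source_path` fields are stand-ins of my own, not Molten's real internals; the point is only the `AsRef` bounds and the remembered source path that `save()` would later rely on.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical sketch of the proposed constructors; internals are stand-ins.
struct Parser {
    src: String,
    // Remembered so a later `save()` knows where to write back to.
    source_path: Option<PathBuf>,
}

impl Parser {
    // Replaces Parser::new(): accepts anything that AsRefs to str.
    fn from<S: AsRef<str>>(input: S) -> Self {
        Parser {
            src: input.as_ref().to_owned(),
            source_path: None,
        }
    }

    // New constructor: accepts anything that AsRefs to Path.
    fn from_file<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        let src = fs::read_to_string(path.as_ref())?;
        Ok(Parser {
            src,
            source_path: Some(path.as_ref().to_owned()),
        })
    }
}

fn main() {
    let p = Parser::from("key = \"value\"");
    assert!(p.source_path.is_none());
    assert_eq!(p.src, "key = \"value\"");
}
```

`TOMLDocument::save()` would then return an error whenever `source_path` is `None`.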
See this comment
A bare key must be non-empty; the parser should enforce this and raise an appropriate error when needed.
The code assumes Windows-style line separators (`\r\n`), but on UNIX-based OSes the line separator is `\n`, so the generated strings differ from the test sources and the tests fail. The code should use a constant chosen based on the operating system so the comparisons work regardless of the host OS.
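A sketch of the suggested constant; `NL` is a hypothetical name, not something in Molten today.

```rust
// Pick the platform's native line separator at compile time.
#[cfg(windows)]
const NL: &str = "\r\n";
#[cfg(not(windows))]
const NL: &str = "\n";

fn main() {
    // Tests can build expected strings with NL instead of hard-coding "\r\n".
    let expected = format!("key = 1{}", NL);
    assert!(expected.ends_with(NL));
}
```

A separate issue in this tracker suggests capturing the newline style from the parsed input instead of the host platform; in that design this constant would mainly matter for comparing test fixtures.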
Currently, the objects in the AST use exactly the names given to them in the file. That's good, except that it includes things before a dot. So, for instance, the following TOML file
foo.bar = 1
[foo]
bar = 2
is parsed in such a way that it becomes equivalent to this JSON string:
{
"foo.bar": 1,
"foo": {"bar": 2}
}
We can fix this by storing only the last segment of any key, and following the paths to the `Container`s that hold them. And if a table doesn't exist, it should be created implicitly so that it can hold the container.
So, the process for parsing the TOML example above would be roughly the following:

1. Parse the key `foo` and see that it is followed by a dot.
2. Look in the active `Container` for a table called `foo`. Since it isn't there, make a new table, marking it as implicitly defined.
3. Add the key `bar` with a value of 1 to the new `Container`.
4. Parse `[foo]`. See that the active container already contains something called `foo`, but don't complain because we're explicitly defining a previously implicit table. Make `foo`'s `Container` the active one.
5. Parse `bar` and find that it is already in the active `Container`. Return a `DuplicateKey` error.

I'm currently working on fixing this.
This is to discuss how error reporting should work.
Thoughts: I've considered `error-chain`, but I've seen people raise concerns about it that have made me shy away from it. Thoughts on whether this is justified, and alternatives if it is?

The spec states:
"Similar to integers, you may use underscores to enhance readability. Each underscore must be surrounded by at least one digit."
I think we currently allow leading and trailing underscores; this should be fixed.
Auditing that part of the code to make sure there aren't any other issues with what we allow and disallow for ints and floats would be a good thing in general.
This should probably be looked at together with #2
The code for this is in the "int, float, datetime" arm of the `parse_val()` function in `parser.rs`.
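The spec rule can be checked with a small scan; `underscores_ok` is a hypothetical helper sketch, not the actual `parse_val()` code.

```rust
// Spec rule: every underscore in a numeric literal must have a digit on
// both sides, which rules out leading, trailing, and doubled underscores.
fn underscores_ok(digits: &str) -> bool {
    let bytes = digits.as_bytes();
    bytes.iter().enumerate().all(|(i, &b)| {
        b != b'_'
            || (i > 0
                && i + 1 < bytes.len()
                && bytes[i - 1].is_ascii_digit()
                && bytes[i + 1].is_ascii_digit())
    })
}

fn main() {
    assert!(underscores_ok("1_000"));
    assert!(!underscores_ok("_1000"));  // leading underscore
    assert!(!underscores_ok("1000_"));  // trailing underscore
    assert!(!underscores_ok("1__000")); // consecutive underscores
}
```

The same predicate works for the digit runs of floats, so it could be shared across the int and float paths.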
Hey @ehuss,
Thanks for your ping on Reddit and for offering your help!
As you can see, I've opened some issues here; of course docs are also welcome, as are any passes you want to do over the `@todo`s, `@cleanup`s, etc. that I have in the code.
Keep in mind though, all of this is prototype quality at best and the code has some glaring problems. Hilariously and to my shame, tables are currently parsed as many times as their position in the document, for instance. Will fix Soon.
Also, may I ask what you want this project for?
`tests/strings.toml` should have more cases added: basically, it should test everything the TOML standard specifies for strings.
I've noticed that, with the latest stable version of Rust (1.22.1), this entire project can be compiled. The only thing that forces us to use a nightly compiler is the dependence on the `test-case-derive` crate, and specifically its use of `#![feature(proc_macro)]`. (I confirmed it by getting rid of that dependency and successfully running the tests with the stable compiler.) It would be nice to rely only on stable features as much as possible. We can already use the stable compiler for release builds, but using two different versions might lead to some unexpected bugs.
That said, if there's a reason for us to use `test-case-derive`, that's fine. But if it's okay to get rid of it, I don't mind restructuring the test code for that purpose. The changes that would be needed at this point are pretty minor.
If a TOML document does not contain a newline at the end of the file, appending to it will cause the new KV to be on the same line as the last KV.
Add a check for this and add a newline to the last existing value if necessary.
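The check could be as simple as this; `ensure_trailing_newline` is a hypothetical helper name.

```rust
// Before appending a new key/value, make sure the document text ends in a
// newline so the appended KV lands on its own line.
fn ensure_trailing_newline(doc: &mut String) {
    if !doc.is_empty() && !doc.ends_with('\n') {
        doc.push('\n');
    }
}

fn main() {
    let mut doc = String::from("a = 1"); // no trailing newline
    ensure_trailing_newline(&mut doc);
    doc.push_str("b = 2\n");
    assert_eq!(doc, "a = 1\nb = 2\n");
}
```

If the document ends up tracking newline style (a separate issue here), the pushed newline should match that style rather than always being `\n`.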
Currently, the parser ends a whitespace item whenever a newline is reached. When appending a WS item, check whether it follows another WS item, and merge them together if it does.
Look for the `todo` in `parse_item()`, currently on line 135.
When serializing, it currently assumes that Windows is CRLF. However, I never use that style of newlines on Windows, and thus the tests are currently failing. It should instead capture the newline style while parsing, and ignore the current platform.
The fuzzer from #40 finds a panic triggered by a byte index that is outside a char boundary.
Adding this to the parser tests:
#[test]
fn issue41() {
let text = ::std::str::from_utf8(b"\'\'fb\'\xee\x9d\xbd").unwrap();
let _ = Parser::new(text).parse();
}
fails with
---- parser::tests::issue41 stdout ----
thread 'parser::tests::issue41' panicked at 'byte index 7 is not a char boundary; it is inside '\u{e77d}' (bytes 5..8) of `''fb'๎ฝ`', src/libcore/str/mod.rs:2235:4
stack backtrace:
0: 0x10ad515eb - std::sys::unix::backtrace::tracing::imp::unwind_backtrace::h42b4ce0b98574016
1: 0x10ad4a2ce - std::sys_common::backtrace::print::h293f72b3c32dfa5b
2: 0x10ad520c0 - _ZN3std9panicking12default_hook28_$u7b$$u7b$closure$u7d$$u7d$17haa05421012ac6c20E.llvm.5B51A422
3: 0x10ad51dc4 - _ZN3std9panicking12default_hook17h2d0d0ff0f27771f9E.llvm.5B51A422
4: 0x10ad52526 - std::panicking::rust_panic_with_hook::h2a8b5b7a95208f5a
5: 0x10ad5237e - _ZN3std9panicking11begin_panic17h1179f738121e3414E.llvm.5B51A422
6: 0x10ad522d3 - std::panicking::begin_panic_fmt::h4986ee369ba2ddac
7: 0x10ad52242 - rust_begin_unwind
8: 0x10ad94273 - core::panicking::panic_fmt::hd9b79e885de0143c
9: 0x10ad98ef4 - core::str::slice_error_fail::h606b34e3191fab74
10: 0x10acb1cbe - core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::Range<usize>>::index::{{closure}}::h9b2035a79c64f1dc
11: 0x10acc2f90 - <core::option::Option<T>>::unwrap_or_else::h9be067b5aa4c4510
12: 0x10acbdf6b - core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::Range<usize>>::index::h195fcc6753118e5e
13: 0x10acb1c6c - core::str::traits::<impl core::ops::index::Index<core::ops::range::Range<usize>> for str>::index::h779aae528849c22f
14: 0x10accb116 - Molten::parser::Parser::parse_val::hc4a6ac51f5c68f43
15: 0x10acca8de - Molten::parser::Parser::parse_key_value::hb46cee7de7a0d92c
16: 0x10acc9f9b - Molten::parser::Parser::parse_item::h9ead6e84efb9d9e8
17: 0x10acc8719 - Molten::parser::Parser::parse::h70b405a15af60225
18: 0x10acb0ba8 - Molten::parser::tests::issue41::h7425247ee675da6e
19: 0x10ad05961 - _ZN42_$LT$F$u20$as$u20$test..FnBox$LT$T$GT$$GT$8call_box17h07fcc9c8a31cf663E.llvm.B477B8A1
20: 0x10ad6580e - __rust_maybe_catch_panic
21: 0x10acf6200 - std::sys_common::backtrace::__rust_begin_short_backtrace::h727bf0093a90e1fc
22: 0x10acfb9f7 - _ZN3std9panicking3try7do_call17hc2fcbcc92b537a0fE.llvm.1CF3EA36
23: 0x10ad6580e - __rust_maybe_catch_panic
24: 0x10ad11fb1 - _ZN50_$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$8call_box17h765b683cac353d8aE.llvm.4B1D5347
25: 0x10ad4ccf7 - std::sys_common::thread::start_thread::h4ad86c0b3fefeff0
26: 0x10ad579f8 - _ZN3std3sys4unix6thread6Thread3new12thread_start17h333ed09407a7bdcdE.llvm.9AF12AEA
27: 0x7fffa0c4493a - _pthread_body
28: 0x7fffa0c44886 - _pthread_start
Discuss what features the API should have.
As far as implementing it, we should start with:
Any code that implements this needs to be rigorously tested, as it's obviously a fragile part of the library.
#7 should be looked into before this
The wishlist for later on includes:
MIT or dual MIT/Apache-2.0 would be my personal preference, and seems to be common in the Rust community.
As per this paragraph in the spec, table names can include whitespace, which should be stripped before comparing them.
These comparisons occur in 3(?) places:
Parser::is_child()
Parser::parse_table()
Parser::parse_aot()
The normalization is only for comparisons, the actual value must still include the raw name so it can be reproduced exactly.
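The comparison could be sketched like this (hypothetical helper; note that quoted keys, where inner whitespace is significant, would need extra care that this sketch skips).

```rust
// Normalize a dotted table header for lookup: split on dots and trim each
// segment, while the raw text is kept elsewhere for style-preserving output.
fn normalized_key(raw: &str) -> Vec<String> {
    raw.split('.').map(|seg| seg.trim().to_owned()).collect()
}

fn main() {
    // "[ a . b ]" and "[a.b]" should refer to the same table.
    assert_eq!(normalized_key(" a . b "), normalized_key("a.b"));
    // Different segments still compare unequal.
    assert_ne!(normalized_key("a.b"), normalized_key("a.c"));
}
```

Comparing the normalized `Vec<String>` in `is_child()`, `parse_table()`, and `parse_aot()` keeps the raw names untouched for display.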
The parser currently assumes that all sibling tables in an AoT have their names/keys formatted the same way. Since this may not be the case, we need to track the raw name of each of them, and use that in the display logic.
As of #33, errors can report the position where they occurred; add some tests for this using some malformed TOML files.
Hi!
I'd like to share a possible alternative design for the lossless AST. I think the current Molten design is great, but it's always better to have more options available :) I maintain the TOML plugin for the IntelliJ platform, which uses this AST representation; such a representation is pretty typical for IntelliJ and works great if you care about all trivia.
The idea is to create a two-layered tree structure. As the base, there's a homogeneous "untyped" tree which stores only spans into the original file, and the typed, "abstract" representation is layered on top of it.
In code, the first layer looks like this (super simplified):
// An "enum", representing all of the syntactic constructs of a particular language,
// both simple tokens and composite elements.
struct NodeType(u32);
// For TOML, there would be a bunch of constants like
// const WHITESPACE: NodeType = NodeType(0);
// const EQUALS_SIGN: NodeType = NodeType(1);
// ...
// const KEY_VALUE_PAIR: NodeType = NodeType(92);
// The *single* node type, which has a type, a text range, and a list of children.
struct Node {
type: NodeType,
text_range: (usize, usize),
children: Vec<Node>,
}
So, the parse function looks like `fn parse(input: &str) -> Node`. The `Node` representation is naturally lossless, because it only contains spans. Moreover, you can now easily write functions to traverse and edit the tree without the need to solve the expression problem. It's really nice to write code formatters against this representation, because you can make the rules declarative (e.g. `.around(NodeType).ensure_spaces(u32)`).
However, this representation is not really convenient to work with, because it is untyped. But you can add types! The idea is to create a separate struct for each important `NodeType`, such that the struct holds only a `Node` with this `NodeType`. So, something like this:
trait AstNode {
const TYPE: NodeType;
fn new(node: Node) -> Self;
fn node(&self) -> &Node;
}
struct TomlKeyValuePair(Node);
impl AstNode for TomlKeyValuePair {
fn new(node: Node) -> Self {
assert_eq!(node.type, KEY_VALUE_PAIR); // here's the transition between untyped and typed worlds
TomlKeyValuePair(node)
}
}
struct TomlKey(Node);
impl AstNode for TomlKey { ... }
struct TomlValue(Node);
impl AstNode for TomlValue { ... }
impl TomlKeyValuePair {
fn key(&self) -> TomlKey {
TomlKey(self.0.children().find(|n| n.type == KEY).unwrap())
}
fn value(&self) -> TomlValue {
TomlValue(self.0.children().find(|n| n.type == VALUE).unwrap())
}
}
That is, the abstract tree is a zero-cost wrapper around the bare syntax tree which adds a proof that a particular node indeed holds a particular construct.
Hope this was interesting for you! :)
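For reference, the sketch above condenses into this compilable miniature (toy names and hard-coded spans of my own, not the IntelliJ code):

```rust
// Layer 1: one homogeneous node type carrying only a tag and a span.
#[derive(Clone, Copy, PartialEq, Debug)]
struct NodeType(u32);

const KEY: NodeType = NodeType(0);
const VALUE: NodeType = NodeType(1);
const KEY_VALUE_PAIR: NodeType = NodeType(2);

#[derive(Clone, Debug)]
struct Node {
    ty: NodeType,
    text_range: (usize, usize),
    children: Vec<Node>,
}

// Layer 2: a zero-cost typed wrapper over the untyped node.
struct TomlKeyValuePair(Node);

impl TomlKeyValuePair {
    fn new(node: Node) -> Self {
        // The single checked transition between untyped and typed worlds.
        assert_eq!(node.ty, KEY_VALUE_PAIR);
        TomlKeyValuePair(node)
    }
    // Recover the key's text losslessly from its span into the source.
    fn key_text<'a>(&self, src: &'a str) -> &'a str {
        let n = self.0.children.iter().find(|n| n.ty == KEY).unwrap();
        &src[n.text_range.0..n.text_range.1]
    }
}

fn main() {
    let src = "a = 1";
    let pair = TomlKeyValuePair::new(Node {
        ty: KEY_VALUE_PAIR,
        text_range: (0, 5),
        children: vec![
            Node { ty: KEY, text_range: (0, 1), children: vec![] },
            Node { ty: VALUE, text_range: (4, 5), children: vec![] },
        ],
    });
    assert_eq!(pair.key_text(src), "a");
}
```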
Indexing a container currently takes whitespace and comments into account. It shouldn't.
Thought: Perhaps we could have a wrapper type around `usize` called `Exhaustive` or `Trivia` that does include them when used as the index? People can just use `nth()` on the exhaustive iterator otherwise.
All the iterator code (in `container.rs`) may need a glance over, actually; it's prototype quality and may be backwards.
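The wrapper idea might look like this (all names hypothetical): a plain `usize` index skips trivia, while a `Trivia(usize)` index counts everything.

```rust
#[derive(Debug, PartialEq)]
enum Item {
    Value(i64),
    Whitespace,
    Comment,
}

// Newtype index that opts in to counting whitespace and comments.
struct Trivia(usize);

struct Container(Vec<Item>);

impl Container {
    // Plain index: counts only "real" values, skipping trivia.
    fn get(&self, idx: usize) -> Option<&Item> {
        self.0
            .iter()
            .filter(|i| matches!(i, Item::Value(_)))
            .nth(idx)
    }
    // Exhaustive index: counts every item, trivia included.
    fn get_exhaustive(&self, idx: Trivia) -> Option<&Item> {
        self.0.get(idx.0)
    }
}

fn main() {
    let c = Container(vec![
        Item::Whitespace,
        Item::Value(1),
        Item::Comment,
        Item::Value(2),
    ]);
    assert_eq!(c.get(1), Some(&Item::Value(2)));
    assert_eq!(c.get_exhaustive(Trivia(0)), Some(&Item::Whitespace));
}
```

The two entry points make the intent explicit at the call site, so nobody indexes into trivia by accident.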