µON - a compact and simple binary object notation
License: Apache License 2.0
Just for debugging purposes, it is sometimes handy to print a muon document to the console. So it would be useful to have a cheap toString method (on the fly, without fully decoding to a separate object, e.g. so it can run on a microcontroller).
JSON is an option, but I'm not sure whether it can be done cheaply (in terms of memory).
And as far as I understand, not every muon document can be represented as JSON.
Posit is a number format that is similar to the IEEE 754 floating-point format.
The Posit Standard (2022) has been ratified by the Posit Working Group: https://posithub.org
The idea is to add a new posit tag that can apply to u8, u16 and u32 typed values and arrays, which are then treated as the corresponding Posits.
Also, this could be a good opportunity to add support for arbitrary-precision floats, but is there a rationale?
If I read the Python code correctly, the LEB128 lengths in strings and typed arrays are unsigned LEB128, and the 0xBB typed values use signed LEB128, but this isn't specified anywhere else. Maybe that should be clarified in the images and presentation.
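A small sketch of the two flavors, in case it helps pin the spec down (standard LEB128 as in DWARF/WebAssembly; these helpers are illustrative and not from the muon codebase):

```javascript
// Unsigned LEB128: 7 bits per byte, high bit set on all but the last byte.
function encodeUnsignedLEB128(n) {
  const out = [];
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n !== 0) byte |= 0x80;
    out.push(byte);
  } while (n !== 0);
  return out;
}

// Signed LEB128: stop once the remaining value is pure sign extension
// and the sign bit (0x40) of the last emitted byte agrees with it.
function encodeSignedLEB128(n) {
  const out = [];
  let more = true;
  while (more) {
    let byte = n & 0x7f;
    n >>= 7;
    if ((n === 0 && (byte & 0x40) === 0) || (n === -1 && (byte & 0x40) !== 0)) {
      more = false;
    } else {
      byte |= 0x80;
    }
    out.push(byte);
  }
  return out;
}
```

Note the divergence: unsigned encodes 64 as a single byte `0x40`, while signed needs two bytes (`0xC0 0x00`) to keep the sign unambiguous, which is exactly why the spec should state which flavor each field uses.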
Given #5, it is possible for strings to contain \u0000 values. With that in mind, should I assume all 8C tags are length-encoded instead of zero-terminated? Would it make sense to have separate tags for adding zero-terminated strings and length-encoded strings to the LRU?
Apologies for the wall of text, it's because I really like the idea of Muon :)
TL;DR: key questions are bolded in the text below
So, a bit of context: I'm trying to write a JavaScript implementation, since the format is so elegantly simple that I feel I can achieve a basic version of it. My first goal is to have "feature parity" with how JavaScript handles JSON. That is: being able to roundtrip any object that you could also send through JSON.stringify and get back from JSON.parse, ignoring the things that it can't handle. After that I'll worry about the extra things Muon supports.
Having said that, the fact that Muon has more ways to encode numbers was too tempting not to play around with for size savings. The way I have handled numbers so far is to assume that everything is a double unless explicitly made a BigInt (basically how JavaScript handles numbers), reserving i64, u64 and LEB128 for those BigInt values. This lets me use all the other number types in the AX and BX rows to always pick the minimum number of bytes necessary to encode a number, e.g.:
[8, 16, 1/16, 0.1] => 90 A8 B4 10 B9 00 00 80 3D BA 9A 99 99 99 99 99 B9 3F 91

90                            List start
A8                            direct encoding of 8
B4 10                         u8 encoding of 16
B9 00 00 80 3D                f32 encoding of 1/16
BA 9A 99 99 99 99 99 B9 3F    f64 approximation of 0.1
91                            List end
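The picking logic behind this example could be sketched roughly like this (a simplified illustration: the tag bytes A0-A9, B4, B9 and BA are taken from the example above; the remaining AX/BX rows and negative integers are omitted, and encodeNumber is a hypothetical helper, not part of any muon library):

```javascript
// Pick the smallest of the encodings used in the example above.
function encodeNumber(x) {
  if (Number.isInteger(x)) {
    if (x >= 0 && x <= 9) return [0xA0 + x];       // direct small-int encoding
    if (x >= 0 && x <= 255) return [0xB4, x];      // u8
  }
  // Use f32 only when the value survives the narrowing round-trip exactly.
  if (Math.fround(x) === x) {
    const b = new Uint8Array(4);
    new DataView(b.buffer).setFloat32(0, x, true); // little-endian
    return [0xB9, ...b];
  }
  const b = new Uint8Array(8);
  new DataView(b.buffer).setFloat64(0, x, true);
  return [0xBA, ...b];
}
```

The `Math.fround` check is the key trick: 1/16 is exactly representable in f32 and shrinks to 5 bytes, while 0.1 is not and stays a full f64.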
This is not a problem when just roundtripping JS-to-JS, since all values simply get promoted back to doubles in the end.
But now imagine we're sending data between Python and JS code through Muon. Python uses variable-sized numbers under the hood, right? One could say all integers are "BigInt" and all floats are doubles (I think), unless one is working with NumPy. The example Python encoder from the slides either directly encodes 0-9, or uses LEB128 for all other integers.
Imagine we have a list of integers between 0 and 1000 [some value bigger than Number.MAX_SAFE_INTEGER] in Python that we encode this way, then decode in my JS implementation. We would end up with an array of mixed doubles and BigInt values. So one number type gets converted into two different ones.
One way to handle this would be to say that a JS implementation of Muon has to convert LEB128 back to doubles if it safely fits in a double, but that also potentially leads to issues: say that I start in JavaScript with a list that contains BigInts, some of which could also be safely converted to doubles. First we serialize this list. Let's assume this will use LEB128 encoding because of the BigInt type, like I have so far. Now we deserialize this list in JS. Because of the rule we just established some of the BigInt values will turn into doubles - we change the number types again!
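To make the type-flip concrete, the narrowing rule in question could look like this (narrowIfSafe is a hypothetical helper, shown only for illustration):

```javascript
// Proposed (and problematic) decode rule: a LEB128-decoded integer becomes
// a plain number when it fits a double exactly, and stays a BigInt otherwise.
function narrowIfSafe(big) {
  if (big >= BigInt(Number.MIN_SAFE_INTEGER) &&
      big <= BigInt(Number.MAX_SAFE_INTEGER)) {
    return Number(big); // BigInt in, double out: the type changes silently
  }
  return big;
}
```

With this rule, a JS list of BigInts that happens to contain small values no longer roundtrips to the same types, which is exactly the conflict described above.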
So we basically have two needs that are a little bit at odds:
1. Roundtripping within one language should preserve that language's number types exactly.
2. Data exchanged between languages should decode to consistent, predictable number types.
I think the best summary of this question is: how should Muon handle the different ways languages handle number types when transmitting data between these languages?
For now, for my own implementation I will prioritize 1 over 2 (because it's a toy implementation and I'm not planning to interact with Python in my own use-cases).
PS: I'm sure this question has come up with other encodings that have support for more than just doubles, so maybe it's worth looking up what the arguments + conclusions were in those situations?
I have several questions relating to dicts:
... or should we remove them for the sake of simplification?
Alternative: move Attributes to the Muon document level, so they can only appear at the beginning of the file or stream.
Sometimes the key of a dictionary can be an integer (like an id or timestamp), and it could be useful not to stringify them.
As far as I understand, it is enough to allow Ax, Bx for the keys to implement this ... it does not require any major specification change.
This project is at an early stage of development, and should generally be treated as an RFC (Request for Comments). If you have any ideas/comments, please feel free to post them here.
Not sure whether this feature is a goal, but something like what protobuf provides is useful.
In the end, you build and parse messages from/to some in-memory objects/structs. It would be useful to be able to generate the serialization/deserialization code instead of writing it by hand.
With possible backward-compatible schema changes in mind, a protobuf-like generator looks promising: you write a schema, and the generator builds your classes/structs consistently, for example for a microcontroller, a backend and a mobile app.
On the other hand, the way JSON serialization is usually done is also a way to go.
... instead of the dedicated marker.
Fixed-size strings will remain non-zero-terminated.
When I try to round-trip the following JSON, it doesn't work properly.
["stuff", "things", "zero\u0000things", "multi\u0000\u0000zero\u0000things"]
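A minimal sketch of why zero-termination cannot survive this input (Node-flavored JavaScript; these helpers are illustrative, not muon's actual string framing):

```javascript
// With zero-terminated framing, an embedded NUL is indistinguishable from
// the terminator, so the decoder stops at the first 0x00 byte.
function encodeZeroTerminated(s) {
  return Buffer.concat([Buffer.from(s, 'utf8'), Buffer.from([0])]);
}

function decodeZeroTerminated(buf) {
  return buf.subarray(0, buf.indexOf(0)).toString('utf8');
}
```

Round-tripping "zero\u0000things" through these helpers yields just "zero", silently dropping the rest, which is why length-prefixed framing (or escaping) is needed for such strings.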
The encoding diagram has sections for both attr and tags, but doesn't explain what they are or how they are encoded. Am I missing something really obvious?
I don't understand what kv is supposed to be. Is it the same as the dict kv? What values are valid for tag? And for val? Several tag encodings are mentioned in the table of values, but appear to only be allowed outside of the main object?
It's not particularly clear that the encoding diagram is recursive, but does make sense once you get to the choices for lists and dicts.
muon is awesome, but the one thing it is lacking versus line-by-line JSON is the ability to seek around randomly in the stream.
The jsonl separator \n is hard to beat, but support for raw data makes that impossible.
I have found the AVRO strategy of picking a random 16-byte delimiter to be nice.
I think some byte (maybe 0xf0) should indicate that a delimiter follows. The first occurrence defines the delimiter (which is chosen randomly); subsequent occurrences should just be validated.
Would be good to have a list in the README of known muon libraries in various programming languages.
Thanks for publishing the format, it's brilliantly simple!
Thanks for this work. You mention a compact form in the presentation; it would be very good to define that further as a minimal deterministic encoding.
https://datatracker.ietf.org/doc/html/rfc8949#section-4.2 may be interesting if you haven't come across it.
The README lacks a section explaining why to use µon: in which cases it should be preferred over other serialization formats, and why.
Say I have a nested object that for some reason contains the strings "parallelepiped" and "therizinosaurus" many times, and I know this. Which of the two following ways of encoding an 8C-tagged list is correct?
8C 90 <utf8 "parallelepiped"> 00 <utf8 "therizinosaurus"> 00 91
90 8C <utf8 "parallelepiped"> 00 8C <utf8 "therizinosaurus"> 00 91
(I'm pretending #12 doesn't exist and that 8C-tagged strings are zero-terminated, will update example once that issue is resolved)
My guess is that they are both correct, but the former only adds the strings to the LRU without encoding a list into the object, while the latter encodes a list of strings that also get added to the LRU. Whatever the case, it might be good to be explicit about it in the docs somewhere.
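For reference, the LRU itself can be sketched as a plain move-to-front list (the capacity and the exact reference encoding are assumptions here, made only to illustrate why insertion order matters for later index references):

```javascript
// Minimal string LRU: adding a string moves it to index 0, so the byte
// order in which 8C-tagged strings appear determines later reference indices.
class StringLRU {
  constructor(capacity = 32) {
    this.items = [];
    this.capacity = capacity;
  }
  add(s) {
    const i = this.items.indexOf(s);
    if (i !== -1) this.items.splice(i, 1); // re-adding moves to the front
    this.items.unshift(s);
    if (this.items.length > this.capacity) this.items.pop();
  }
  get(index) {
    return this.items[index];
  }
}
```

Under either reading of the two encodings above, both strings end up in the table; the question is only whether a list value is also emitted, which is why the spec should state it explicitly.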
Could we plainly chain (concatenate, as cat does) muon on-the-wire data and feed it into a muon reader/parser?
(In the past I had fun thinking about a somewhat similar idea, but with a not-so-convincing result 😉)
Originally posted by @dumblob in #5 (comment)
Hello @vshymanskyy,
Thanks for developing the muon and TinyGSM libraries.
I'm a Maker and I use Thinger.io in my projects.
They developed PSON (A Serialization Format for IoT Sensor Networks) and the IOTMP protocol.
I hope the information about PSON and IOTMP can help in muon development.
Thank you for your work.
PSON Article: https://www.mdpi.com/1424-8220/21/13/4559/htm
PSON Github: https://github.com/thinger-io/Protoson
IOTMP Article: https://www.mdpi.com/1424-8220/19/5/1044/htm
IOTMP Github: https://github.com/thinger-io/IOTMP
Integration of PSON, IOTMP, ARDUINO and TINYGSM with Thinger.io:
Arduino-Library: https://github.com/thinger-io/Arduino-Library