µON - a compact and simple binary object notation
License: Apache License 2.0
Just for debugging purposes, it is sometimes handy to print a muon document to the console. So it would be useful to have a cheap toString method (on the fly, without fully decoding to a separate object, e.g. so it can run on a microcontroller).
JSON is an option, but I'm not sure whether it can be done cheaply (in terms of memory).
And as far as I understand, not every muon document can be represented as JSON.
Posit is a number format that is similar to the IEEE 754 floating-point format.
The Posit Standard (2022) has been ratified by the Posit Working Group: https://posithub.org
The idea is to add a new posit tag that can apply to u8, u16 and u32 typed values and arrays, which are then treated as the corresponding Posits.
Also, this could be a good opportunity to add support for arbitrary-precision floats, but is there a rationale?
If I read the Python code correctly, the LEB128 lengths in strings and typed arrays are unsigned LEB128, and the 0xBB typed values use signed LEB128, but this isn't specified anywhere else. Maybe that should be clarified in the images and presentation.
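A small sketch of the two flavors, in case it helps pin the spec down (standard LEB128 as in DWARF/WebAssembly; these helpers are illustrative and not from the muon codebase):

```javascript
// Unsigned LEB128: 7 bits per byte, high bit set on all but the last byte.
function encodeUnsignedLEB128(n) {
  const out = [];
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n !== 0) byte |= 0x80;
    out.push(byte);
  } while (n !== 0);
  return out;
}

// Signed LEB128: stop once the remaining value is pure sign extension
// and the sign bit (0x40) of the last emitted byte agrees with it.
function encodeSignedLEB128(n) {
  const out = [];
  let more = true;
  while (more) {
    let byte = n & 0x7f;
    n >>= 7;
    if ((n === 0 && (byte & 0x40) === 0) || (n === -1 && (byte & 0x40) !== 0)) {
      more = false;
    } else {
      byte |= 0x80;
    }
    out.push(byte);
  }
  return out;
}
```

Note the divergence: unsigned encodes 64 as a single byte `0x40`, while signed needs two bytes (`0xC0 0x00`) to keep the sign unambiguous, which is exactly why the spec should state which flavor each field uses.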
Given #5, it is possible for strings to contain \u0000 values. With that in mind, should I assume all 8C tags are length-encoded instead of zero-terminated? Would it make sense to have separate tags for adding zero-terminated strings and length-encoded strings to the LRU?
Apologies for the wall of text, it's because I really like the idea of Muon :)
TL;DR: key questions are bolded in the text below
So, a bit of context: I'm trying to write a JavaScript implementation, since the format is so elegantly simple that I feel I can achieve a basic version of it. My first goal is to have "feature parity" with how JavaScript handles JSON. That is: being able to roundtrip any object that you could also send through JSON.stringify and get back from JSON.parse, ignoring the things that it can't handle. After that I'll worry about the extra things Muon supports.
Having said that, the fact that Muon has more ways to encode numbers was too tempting not to play around with for size savings. The way I have handled numbers so far is to assume that everything is a double unless explicitly made a BigInt (basically how JavaScript handles numbers), reserving i64, u64 and LEB128 for those BigInt values. This lets me use all the other number types in the AX and BX rows to always pick the minimum number of bytes necessary to encode a number, e.g.:
[8, 16, 1/16, 0.1] => 90 A8 B4 10 B9 00 00 80 3D BA 9A 99 99 99 99 99 B9 3F 91

90                            List start
A8                            direct encoding of 8
B4 10                         u8 encoding of 16
B9 00 00 80 3D                f32 encoding of 1/16
BA 9A 99 99 99 99 99 B9 3F    f64 approximation of 0.1
91                            List end
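The picking logic behind this example could be sketched roughly like this (a simplified illustration: the tag bytes A0-A9, B4, B9 and BA are taken from the example above; the remaining AX/BX rows and negative integers are omitted, and encodeNumber is a hypothetical helper, not part of any muon library):

```javascript
// Pick the smallest of the encodings used in the example above.
function encodeNumber(x) {
  if (Number.isInteger(x)) {
    if (x >= 0 && x <= 9) return [0xA0 + x];       // direct small-int encoding
    if (x >= 0 && x <= 255) return [0xB4, x];      // u8
  }
  // Use f32 only when the value survives the narrowing round-trip exactly.
  if (Math.fround(x) === x) {
    const b = new Uint8Array(4);
    new DataView(b.buffer).setFloat32(0, x, true); // little-endian
    return [0xB9, ...b];
  }
  const b = new Uint8Array(8);
  new DataView(b.buffer).setFloat64(0, x, true);
  return [0xBA, ...b];
}
```

The `Math.fround` check is the key trick: 1/16 is exactly representable in f32 and shrinks to 5 bytes, while 0.1 is not and stays a full f64.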
This is not a problem when just roundtripping JS-to-JS, since all values simply get promoted back to doubles in the end.
But now imagine we're sending data between Python and JS code through Muon. Python uses variable-sized numbers under the hood, right? One could say all integers are "BigInt" and all floats are doubles (I think), unless one is working with NumPy. The example Python encoder from the slides either directly encodes 0-9, or uses LEB128 for all other integers.
Imagine we have a list of integers between 0 and 1000 [some value bigger than Number.MAX_SAFE_INTEGER] in Python that we encode this way, then decode in my JS implementation. We would end up with an array of mixed doubles and BigInt values. So one number type gets converted into two different ones.
One way to handle this would be to say that a JS implementation of Muon has to convert LEB128 back to doubles if it safely fits in a double, but that also potentially leads to issues: say that I start in JavaScript with a list that contains BigInts, some of which could also be safely converted to doubles. First we serialize this list. Let's assume this will use LEB128 encoding because of the BigInt type, like I have so far. Now we deserialize this list in JS. Because of the rule we just established some of the BigInt values will turn into doubles - we change the number types again!
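To make the type-flip concrete, the narrowing rule in question could look like this (narrowIfSafe is a hypothetical helper, shown only for illustration):

```javascript
// Proposed (and problematic) decode rule: a LEB128-decoded integer becomes
// a plain number when it fits a double exactly, and stays a BigInt otherwise.
function narrowIfSafe(big) {
  if (big >= BigInt(Number.MIN_SAFE_INTEGER) &&
      big <= BigInt(Number.MAX_SAFE_INTEGER)) {
    return Number(big); // BigInt in, double out: the type changes silently
  }
  return big;
}
```

With this rule, a JS list of BigInts that happens to contain small values no longer roundtrips to the same types, which is exactly the conflict described above.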
So we basically have two needs that are a little bit at odds:
1. Roundtripping within one language should preserve that language's number types exactly.
2. Data exchanged between languages should decode to consistent, predictable number types.
I think the best summary of this question is: how should Muon handle the different ways languages handle number types when transmitting data between these languages?
For now, for my own implementation I will prioritize 1 over 2 (because it's a toy implementation and I'm not planning to interact with Python in my own use-cases).
PS: I'm sure this question has come up with other encodings that have support for more than just doubles, so maybe it's worth looking up what the arguments + conclusions were in those situations?
I have several questions relating to dicts:
... or should we remove them for the sake of simplification?
Alternative: move Attributes to the Muon document level, so they can only appear at the beginning of the file or stream.
Sometimes the key of a dictionary can be an integer (like an id or timestamp), and it could be useful not to stringify them.
As far as I understand, it is enough to allow Ax, Bx for the keys to implement this ... it does not require any major specification change.
This project is at an early stage of development, and should generally be treated as an RFC (Request for Comments). If you have any ideas/comments, please feel free to post them here.
Not sure whether this feature is a goal, but something like what protobuf provides is useful.
In the end, you build and parse messages from/to some in-memory objects/structs. It would be useful to be able to generate the serialization/deserialization code instead of writing it by hand.
With possible backward-compatible schema changes in mind, a protobuf-like generator looks promising: you write a schema, and the generator builds your classes/structs consistently, for example for a microcontroller, a backend and a mobile app.
On the other hand, the way JSON serialization is usually done is also a way to go.
... instead of the dedicated marker.
Fixed-size strings will remain non-zero-terminated.
When I try to round-trip the following JSON, it doesn't work properly.
["stuff", "things", "zero\u0000things", "multi\u0000\u0000zero\u0000things"]
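A minimal sketch of why zero-termination cannot survive this input (Node-flavored JavaScript; these helpers are illustrative, not muon's actual string framing):

```javascript
// With zero-terminated framing, an embedded NUL is indistinguishable from
// the terminator, so the decoder stops at the first 0x00 byte.
function encodeZeroTerminated(s) {
  return Buffer.concat([Buffer.from(s, 'utf8'), Buffer.from([0])]);
}

function decodeZeroTerminated(buf) {
  return buf.subarray(0, buf.indexOf(0)).toString('utf8');
}
```

Round-tripping "zero\u0000things" through these helpers yields just "zero", silently dropping the rest, which is why length-prefixed framing (or escaping) is needed for such strings.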
The encoding diagram has sections for both attr and tags, but doesn't explain what they are or how they are encoded. Am I missing something really obvious?
I don't understand what kv is supposed to be. Is it the same as the dict kv? What values are valid for tag? And for val? Several tag encodings are mentioned in the table of values, but appear to only be allowed outside of the main object?
It's not particularly clear that the encoding diagram is recursive, but does make sense once you get to the choices for lists and dicts.
muon is awesome, but the one thing it is lacking versus line-by-line JSON is the ability to seek around randomly in the stream.
The jsonl separator \n is hard to beat, but support for raw data makes that impossible.
I have found the AVRO strategy of picking a random 16-byte delimiter to be nice.
I think some byte (maybe 0xf0) should indicate that a delimiter follows. The first occurrence defines the delimiter (which is chosen randomly); subsequent occurrences should just be validated.
Would be good to have a list in the README of known muon libraries in various programming languages.
Thanks for publishing the format, it's brilliantly simple!
Thanks for this work. You mention a compact form in the presentation; it would be very good to define that further as a minimal deterministic encoding.
https://datatracker.ietf.org/doc/html/rfc8949#section-4.2 may be interesting if you haven't come across it.
The README lacks a section explaining why to use µon: in which cases it should be preferred over other serialization formats, and why.
Say I have a nested object that for some reason contains the strings "parallelepiped" and "therizinosaurus" many times, and I know this. Which of the two following ways of encoding an 8C-tagged list is correct?
8C 90 <utf8 "parallelepiped"> 00 <utf8 "therizinosaurus"> 00 91
90 8C <utf8 "parallelepiped"> 00 8C <utf8 "therizinosaurus"> 00 91
(I'm pretending #12 doesn't exist and that 8C-tagged strings are zero-terminated, will update example once that issue is resolved)
My guess is that they are both correct, but the former only adds the strings to the LRU without encoding a list into the object, while the latter encodes a list of strings that also get added to the LRU. Whatever the case, it might be good to be explicit about it in the docs somewhere.
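For reference, the LRU itself can be sketched as a plain move-to-front list (the capacity and the exact reference encoding are assumptions here, made only to illustrate why insertion order matters for later index references):

```javascript
// Minimal string LRU: adding a string moves it to index 0, so the byte
// order in which 8C-tagged strings appear determines later reference indices.
class StringLRU {
  constructor(capacity = 32) {
    this.items = [];
    this.capacity = capacity;
  }
  add(s) {
    const i = this.items.indexOf(s);
    if (i !== -1) this.items.splice(i, 1); // re-adding moves to the front
    this.items.unshift(s);
    if (this.items.length > this.capacity) this.items.pop();
  }
  get(index) {
    return this.items[index];
  }
}
```

Under either reading of the two encodings above, both strings end up in the table; the question is only whether a list value is also emitted, which is why the spec should state it explicitly.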
Could we plainly chain (concatenate, as cat does) muon on-the-wire data and feed it into a muon reader/parser?
(In the past I had fun thinking about a somewhat similar idea, but with a not-so-convincing result 😉)
Originally posted by @dumblob in #5 (comment)
Hello @vshymanskyy,
Thanks for developing the muon and TinyGSM libraries.
I'm a Maker and I use Thinger.io in my projects.
They developed PSON (A Serialization Format for IoT Sensor Networks) and the IOTMP protocol.
I hope the information about PSON and IOTMP can help in muon development.
Thank you for your work.
PSON Article: https://www.mdpi.com/1424-8220/21/13/4559/htm
PSON Github: https://github.com/thinger-io/Protoson
IOTMP Article: https://www.mdpi.com/1424-8220/19/5/1044/htm
IOTMP Github: https://github.com/thinger-io/IOTMP
Integration of PSON, IOTMP, ARDUINO and TINYGSM with Thinger.io:
Arduino-Library: https://github.com/thinger-io/Arduino-Library