Comments (8)
Here is an idea for supporting random delimiters without actually having to change the current format. Not sure if that is actually something to worry about (it's not like there are a ton of muon parsers out there that would break, nor that muon is considered a stable format yet), but it might help to keep the core format itself smaller.
Let's assume that we're using the chaining approach suggested in the docs for concatenating muon objects.
The main idea would be to encode the delimiter in one or more valid muon objects. Then one can inject this (sequence of) object(s) as the delimiter between the muon objects we actually care about during concatenation. A parser that knows we are using this approach could avoid allocating the delimiter objects, improving performance. Otherwise they would have to be removed after parsing. Since we're creating a list, that would mean removing every other entry.
To signal that we're using delimiter objects we could do something similar to JavaScript's `"use strict";` approach. If the first item of a chained list is a magic string, e.g. `"seekable"` or `"resynchronizable"` (eight or sixteen bytes + zero terminator, respectively), it implies we are using delimiters. If we want the delimiters to stay user-definable, we could say the second object then determines what the delimiter is (but that would be incompatible with @vshymansky's idea of adding a counter to it).
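To make the "remove every other entry" fallback concrete, here is a rough sketch of what a delimiter-oblivious consumer could do after parsing. It assumes the chained list has already been decoded into a Python list, and that the layout is magic string first, then the delimiter object(s) before each real object. That layout, the function name `strip_delimiters` and the `delimiter_len` parameter are my own assumptions, not anything the Muon docs define.

```python
# Magic strings floated in this comment; neither is part of the Muon format.
MAGIC_STRINGS = {"seekable", "resynchronizable"}


def strip_delimiters(items, delimiter_len=1):
    """Return only the "real" objects from an already-parsed chained list.

    Assumed (hypothetical) layout:
        [magic, delim..., obj1, delim..., obj2, ...]
    where each delimiter consists of `delimiter_len` consecutive objects.
    """
    if not items:
        return []
    first = items[0]
    if not (isinstance(first, str) and first in MAGIC_STRINGS):
        # No magic string: plain chaining, nothing to strip.
        return list(items)
    body = items[1:]                 # drop the magic string itself
    stride = delimiter_len + 1       # delimiter objects + one payload object
    return body[delimiter_len::stride]


if __name__ == "__main__":
    parsed = ["seekable", 0xDEADBEEF, {"a": 1}, 0xDEADBEEF, [1, 2, 3]]
    print(strip_delimiters(parsed))
```

With a 16-byte delimiter stored as two U64 values, the same call would use `delimiter_len=2`.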
Next question: which object would be most suitable for this?
Hidden: boring analysis of typed arrays, base-64 strings and LEB128 encoded values, before I realized all of them are utterly inferior to using one or two U64 values
First, we could try a typed array. Doing so would add a storage overhead of three bytes: `84`, `B4`, `XX`, where the last byte is the length of the array (eight, sixteen, whatever we settle on). A significant downside would be that typed array allocation is both really, really slow in JavaScript and has a relatively high memory overhead (on the order of 200 bytes), so using this object type would likely have a negative impact on performance on the parsing end when not discarding delimiters. Based on that I would advise against using them for this purpose.
One could store the separator as a base64-encoded string. That would increase the size by 1/3, since each character encodes only six bits, plus a final zero-terminator byte. Rounding up, encoding 16 bytes would take up 23 bytes this way, and 8 bytes would take 12. A bit bigger, but on the other hand string creation and memory overhead are both a lot more optimized in JS engines.
Another option is LEB128 encoding (or whatever varint encoding Muon will settle on - see #8). That would imply an overhead of `0xBB` for the start, followed by seven bits per byte. Encoding 16 bytes this way would take up 20 bytes, 8 bytes would take up 11 - almost as good as typed arrays! In JavaScript this would imply creating `BigInt`s, for which I do not know where they stand in terms of memory overhead/performance. Given that they're supposed to be used for calculations (meaning they change a lot) one would hope that at least some effort has been put into optimizing them though.
The most efficient option is to use `0xB7` followed by eight bytes. It's the cheapest in terms of added storage overhead (one tag byte per U64, so from 8 bytes to 9 bytes, or from 16 bytes to 18 bytes), and `BigInt`s in JavaScript should be relatively cheap to create, so that shouldn't add too much overhead when dealing with parsers oblivious to this protocol. The only slightly tricky thing is that in that case, when filtering out the delimiters afterwards, the 16-byte option would require skipping two delimiter objects for each "real" object in our list.
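For anyone who wants to double-check the size arithmetic for the four options, a quick sketch follows. The function names are made up for this comparison, and the byte counts just follow the assumptions stated in this comment: three header bytes for a typed array, unpadded base64 plus a zero terminator, one `0xBB` byte plus seven payload bits per byte for LEB128 (worst case, i.e. the top bits of the delimiter are set), and one `0xB7` tag byte per eight-byte U64.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def typed_array_size(n):
    # 0x84, 0xB4, one length byte, then n raw bytes
    return 3 + n


def base64_string_size(n):
    # six bits per character (no padding) plus a zero terminator
    return ceil_div(n * 8, 6) + 1


def leb128_size(n):
    # 0xBB marker plus seven payload bits per byte (worst case)
    return 1 + ceil_div(n * 8, 7)


def u64_size(n):
    # one 0xB7 tag byte in front of every eight-byte chunk
    if n % 8 != 0:
        raise ValueError("delimiter length must be a multiple of 8 bytes")
    return (n // 8) * 9


if __name__ == "__main__":
    for n in (8, 16):
        print(n, typed_array_size(n), base64_string_size(n),
              leb128_size(n), u64_size(n))
```

Typed arrays and U64 values end up within a byte of each other on size, so the argument for U64 rests mostly on allocation cost in JavaScript rather than on bytes saved.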
Again, just some thoughts on how to support this with relatively low complexity without actually having to change the Muon format.
from muon.
i'm not opposed to adding a specialized tag for this. we have more than enough unallocated markers. just need to understand the rationale
from muon.
@davebenson how about using a list as a root object (similar to muon chaining), and applying a `0x8B` (size) tag to every element? This will make it impossible to run into clashes with a randomly selected delimiter.
from muon.
Having a random delimiter could be a good idea; however, I believe a 16-byte delimiter is overkill.
Maybe something like: `1 tag byte` + `6x payload bytes` + `1 byte checksum`?
And of course, the parser will have to check that a valid muon object follows.
from muon.
It might be a good idea to also allocate 1 byte for the counter (increases with each delimiter). In this case:
`tag (1B)` + `random (5B)` + `counter (1B)` + `checksum (1B)` = 8 bytes total
5 random bytes give us 2^40 = `1,099,511,627,776` possible values.
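A rough sketch of how such an 8-byte delimiter could be built and checked is below. Nothing here is specified in this thread: the tag value `0xC0` is an arbitrary placeholder (no delimiter tag has been allocated), the checksum is a simple sum modulo 256 chosen purely for illustration, and the function names are hypothetical.

```python
import os

DELIM_TAG = 0xC0  # placeholder value only; no delimiter tag is allocated


def checksum(data: bytes) -> int:
    """Illustrative checksum: sum of all bytes modulo 256."""
    return sum(data) & 0xFF


def make_delimiter(random_part: bytes, counter: int) -> bytes:
    """Build tag (1B) + random (5B) + counter (1B) + checksum (1B)."""
    if len(random_part) != 5:
        raise ValueError("random part must be exactly 5 bytes")
    head = bytes([DELIM_TAG]) + random_part + bytes([counter & 0xFF])
    return head + bytes([checksum(head)])


def parse_delimiter(buf: bytes, random_part: bytes):
    """Return the counter if buf starts with a valid delimiter, else None."""
    if len(buf) < 8 or buf[0] != DELIM_TAG:
        return None
    if buf[1:6] != random_part:
        return None
    if checksum(buf[:7]) != buf[7]:
        return None
    return buf[6]


if __name__ == "__main__":
    stream_id = os.urandom(5)  # picked once per stream
    delim = make_delimiter(stream_id, counter=3)
    print(delim.hex(), parse_delimiter(delim, stream_id))
```

The one-byte counter wraps around after 256 delimiters, so it can only detect short gaps. As noted above, a parser would still have to check that a valid muon object follows the delimiter.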
from muon.
@davebenson please provide some real-world use cases or elaborate on the motivation of this request.
Currently, muon objects can be concatenated as-is (JSON needs the JSONL extension for this, but Muon doesn't need any additional separators).
Also, you can reuse the Muon `magic` or `padding` tags to have some kind of separator.
from muon.
@davebenson waiting for your inputs here
from muon.