muon is awesome, but the one thing it is lacking versus line-by-line json is ability t


Resynchronization about muon HOT 8 CLOSED

vshymanskyy commented on September 27, 2024

Resynchronization

from muon.

Comments (8)

JobLeonard commented on September 27, 2024

Here is an idea for supporting random delimiters without actually having to change the current format. Not sure if that is actually something to worry about (it's not like there are a ton of muon parsers out there that would break, nor that muon is considered a stable format yet), but it might help to keep the core format itself smaller.

Let's assume that we're using the chaining approach suggested in the docs for concatenating muon objects.

The main idea would be to encode the delimiter in one or more valid muon objects. Then one can inject this (sequence of) object(s) as the delimiter between the muon objects we actually care about during concatenation. A parser that knows we are using this approach could avoid allocating the delimiter objects, improving performance. Otherwise they would have to be removed after parsing. Since we're creating a list, that would mean removing every other entry.

To signal that we're using delimiter objects, we could do something similar to JavaScript's "use strict"; directive. If the first item of a chained list is a magic string, e.g. "seekable" or "resynchronizable" (eight or sixteen bytes plus a zero terminator, respectively), it implies that we are using delimiters. If we want the delimiters to stay user-definable, we could say that the second object then determines what the delimiter is (but that would be incompatible with @vshymanskyy's idea of adding a counter to it).
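The interleave-then-filter idea can be sketched on already-parsed values. This is only an illustration in Python, not part of any muon library: `MAGIC`, `interleave` and `strip_delimiters` are hypothetical names, and the layout (magic string, then the delimiter, then records separated by the delimiter) is one possible reading of the proposal.

```python
MAGIC = "resynchronizable"  # hypothetical marker string taken from the comment

def interleave(records, delimiter):
    """Build [MAGIC, delimiter, rec0, delimiter, rec1, ...] for chaining."""
    out = [MAGIC, delimiter]
    for i, rec in enumerate(records):
        if i > 0:
            out.append(delimiter)
        out.append(rec)
    return out

def strip_delimiters(items):
    """What a delimiter-oblivious parser's caller would do after parsing."""
    if not items or items[0] != MAGIC:
        return list(items)          # plain chained list, nothing to remove
    if len(items) < 2:
        raise ValueError("magic string present but no delimiter object")
    delimiter, body = items[1], items[2:]
    # Every other entry of the body must be the announced delimiter.
    if any(d != delimiter for d in body[1::2]):
        raise ValueError("unexpected object in a delimiter slot")
    return body[0::2]

records = [{"a": 1}, [2, 3], "x"]
chained = interleave(records, 0x1122334455667788)
print(strip_delimiters(chained))
```

A parser that knows about the convention could skip the delimiter slots while decoding instead of building the full list first.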

Next question: which object would be most suitable for this?

Hidden: boring analysis of typed arrays, base-64 strings and LEB128-encoded values, before I realized that all of them are utterly inferior to using one or two U64 values

First, we could try a typed array. Doing so would add a storage overhead of three bytes: 0x84, 0xB4, XX, where the last byte is the length of the array (eight, sixteen, whatever we settle on). A significant downside is that typed array allocation is really slow in JavaScript and has a relatively high memory overhead (on the order of 200 bytes), so using this object type would likely hurt performance on the parsing end when delimiters are not discarded. Based on that, I would advise against using them for this purpose.

One could store the separator as a base64-encoded string. That would increase the size by 1/3, since base64 encodes only six bits per byte, plus a final zero-terminator byte. Rounding up, encoding 16 bytes would take 23 bytes this way, and 8 bytes would take 12. A bit bigger, but on the other hand string creation and memory overhead are both a lot more optimized in JS engines.

Another option is LEB128 encoding (or whatever varint encoding Muon will settle on - see #8). That would imply an overhead of 0xBB for the start, followed by seven bits per byte - encoding 16 bytes this way would take up 20 bytes, 8 bytes would take up 11 - almost as good as typed arrays! In JavaScript this would imply creating BigInts, for which I do not know where they stand in terms of memory overhead/performance. Given that they're supposed to be used for calculations (meaning they change a lot) one would hope that at least some effort has been put into optimizing them though.

The most efficient option is to use 0xB7 followed by eight bytes. It's the cheapest in terms of added storage overhead (from 8 bytes to 9 bytes, or from 16 bytes to 18 bytes when using two such values), and BigInts in JavaScript should be relatively cheap to create, so that shouldn't add too much overhead when dealing with parsers oblivious to this protocol. The only slightly tricky thing is that, when filtering out the delimiters afterwards, the 16-byte option would require skipping two delimiter objects for each "real" object in our list.
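The size comparison above can be reproduced with a few lines of arithmetic. This Python sketch only encodes the assumptions stated in the analysis (a three-byte typed-array header, unpadded base64 plus a zero terminator, one 0xBB tag plus seven payload bits per byte for a full-width LEB128 value, and one 0xB7 tag per eight-byte chunk); the function names are made up for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)

def typed_array_size(n):
    # 0x84, 0xB4 and a one-byte length, then the raw payload
    return 3 + n

def base64_string_size(n):
    # six payload bits per character, unpadded, plus the zero terminator
    return ceil_div(n * 8, 6) + 1

def leb128_size(n):
    # 0xBB tag, then seven payload bits per byte (value uses all n*8 bits)
    return 1 + ceil_div(n * 8, 7)

def u64_size(n):
    # one 0xB7 tag in front of every eight-byte chunk
    return n + ceil_div(n, 8)

for name, fn in [("typed array", typed_array_size),
                 ("base64 string", base64_string_size),
                 ("LEB128", leb128_size),
                 ("u64", u64_size)]:
    print(f"{name:14s} 8 bytes -> {fn(8):2d}, 16 bytes -> {fn(16):2d}")
```

Under these assumptions the totals for an 8-byte and a 16-byte delimiter come to 11 and 19 (typed array), 12 and 23 (base64 string), 11 and 20 (LEB128), and 9 and 18 (one or two u64 values).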

Again, just some thoughts on how to support this with relatively low complexity without actually having to change the Muon format.

vshymanskyy commented on September 27, 2024

I'm not opposed to adding a specialized tag for this. We have more than enough unallocated markers. I just need to understand the rationale.

vshymanskyy commented on September 27, 2024

@davebenson how about using a list as a root object (similar to muon chaining), and applying a 0x8B (size) tag to every element? This will make it impossible to run into clashes with a randomly selected delimiter.
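A rough Python sketch of what size-prefixed elements buy a reader: each element can be skipped without decoding it. It assumes that the 0x8B tag is followed by an unsigned LEB128 byte length of the element that comes next; check the muon spec before relying on that. The list start and end markers are left out, payloads are treated as opaque bytes, and `frame` and `split_frames` are hypothetical helper names.

```python
SIZE_TAG = 0x8B  # "size" tag as named in the comment; exact layout is assumed

def uleb128_encode(n):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def uleb128_decode(buf, pos):
    """Return (value, position after the encoded integer)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def frame(payloads):
    """Prefix every already-encoded element with SIZE_TAG + its length."""
    out = bytearray()
    for p in payloads:
        out.append(SIZE_TAG)
        out += uleb128_encode(len(p))
        out += p
    return bytes(out)

def split_frames(buf):
    """Walk the buffer element by element without parsing the payloads."""
    pos, frames = 0, []
    while pos < len(buf):
        if buf[pos] != SIZE_TAG:
            raise ValueError(f"expected size tag at offset {pos}")
        size, pos = uleb128_decode(buf, pos + 1)
        frames.append(buf[pos:pos + size])
        pos += size
    return frames
```

Because the reader jumps by the declared length, a payload that happens to contain the byte 0x8B is never mistaken for a boundary.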

vshymanskyy commented on September 27, 2024

Having a random delimiter could be a good idea; however, I believe a 16-byte delimiter is overkill.
Maybe something like: 1 tag byte + 6 payload bytes + 1 checksum byte?
And of course, the parser will have to check that a valid muon object follows.

vshymanskyy commented on September 27, 2024

It might be a good idea to also allocate 1 byte for the counter (increases with each delimiter). In this case:
tag (1B) + random (5B) + counter (1B) + checksum (1B) = 8 bytes total

5 random bytes give us 2^40 = 1,099,511,627,776 possible values.
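To make the 8-byte layout concrete, here is a minimal Python sketch of building such a delimiter and scanning a byte stream for the next one. Nothing here is defined by muon: the tag value 0xFE and the sum-modulo-256 checksum are placeholders chosen for illustration, and the check that a valid muon object follows the delimiter is omitted.

```python
DELIM_TAG = 0xFE  # hypothetical tag byte; no such marker has been allocated

def checksum(data):
    """Placeholder checksum (sum of bytes modulo 256), not a spec choice."""
    return sum(data) & 0xFF

def make_delimiter(random5, counter):
    """tag (1B) + random (5B) + counter (1B) + checksum (1B) = 8 bytes."""
    if len(random5) != 5:
        raise ValueError("expected exactly 5 random bytes")
    body = bytes([DELIM_TAG]) + random5 + bytes([counter & 0xFF])
    return body + bytes([checksum(body)])

def find_next_delimiter(buf, random5, start=0):
    """Return (offset, counter) of the next valid delimiter, or None."""
    prefix = bytes([DELIM_TAG]) + random5
    pos = buf.find(prefix, start)
    while pos != -1:
        cand = buf[pos:pos + 8]
        # Accept only complete candidates whose checksum byte matches.
        if len(cand) == 8 and checksum(cand[:7]) == cand[7]:
            return pos, cand[6]
        pos = buf.find(prefix, pos + 1)
    return None

rand = bytes([1, 2, 3, 4, 5])  # fixed here; a writer would pick these randomly
stream = (b"junk" + make_delimiter(rand, 0) + b"payload0"
          + make_delimiter(rand, 1) + b"payload1")
print(find_next_delimiter(stream, rand, start=5))
```

A reader that lands at an arbitrary offset would call `find_next_delimiter` from there and resume parsing right after the returned position; the counter tells it which record boundary it found.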

vshymanskyy commented on September 27, 2024

@davebenson please provide some real-world use cases or elaborate on the motivation of this request.
Currently, muon objects can be concatenated as-is (i.e. JSONL is an extension of JSON, but Muon doesn't need any additional separators).
Also, you can reuse Muon magic or padding tags to have some kind of separator.

vshymanskyy commented on September 27, 2024

@davebenson waiting for your inputs here

davebenson commented on September 27, 2024

I think it's probably fine to not implement this stuff in core muon. Searching, indexing, and resynchronization are probably beyond the core standard. (JSONL isn't really an extension of JSON; it's a way of framing (aka packetizing) JSON that provides certain conveniences that raw concatenated JSON doesn't have. Assuming that your stream consists of objects and arrays, JSON doesn't need end-of-record delimiters either.) But JSONL is binary-searchable, assuming the lines are sorted. Files can also be broken apart without reading the whole file, which is important if you want parallelism when processing a single file. And it is resynchronizable as well, meaning that you can seek midway into a huge file, and also that corrupt entries can be skipped.
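For reference, this is roughly what resynchronization looks like with JSONL, sketched in Python with only the standard library. `records_from_offset` is a hypothetical helper name; it seeks to an arbitrary byte offset, discards the partial line it lands in, and skips lines that fail to parse.

```python
import io
import json

def records_from_offset(f, offset):
    """Yield JSON records from a binary, seekable JSONL file.

    Starts at the first line that begins at or after `offset`.
    """
    if offset > 0:
        # Step back one byte so that an offset sitting exactly on a line
        # start is not skipped: readline() then consumes only the "\n".
        f.seek(offset - 1)
        f.readline()
    else:
        f.seek(0)
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except ValueError:
            continue  # corrupt entry: skip it and carry on

data = b'{"id": 1}\n{"id": 2}\nnot json\n{"id": 4}\n'
print(list(records_from_offset(io.BytesIO(data), 13)))
```

Splitting a file for parallel processing works the same way: each worker gets a byte range, starts at the first line boundary at or after its range start, and stops once it passes its range end.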

