Git Product home page Git Product logo

Comments (8)

JobLeonard avatar JobLeonard commented on September 27, 2024 1

Here is an idea for supporting random delimiters without actually having to change the current format. Not sure if that is actually something to worry about (it's not like there are a ton of muon parsers out there that would break, nor that muon is considered a stable format yet), but it might help to keep the core format itself smaller.

Let's assume that we're using the chaining approach suggested in the docs for concatenating muon objects.

The main idea would be to encode the delimiter in one or more valid muon objects. Then one can inject this (sequence of) object(s) as the delimiter between the muon objects we actually care about during concatenation. A parser that knows we are using this approach could avoid allocating the delimiter objects, improving performance. Otherwise they would have to be removed after parsing. Since we're creating a list, that would mean removing every other entry.

To signal that we're using delimiter objects we could do something similar to JavaScript's "use strict"; approach. If the first item of a chained list is a magic string - e.g. "seekable"or "resynchronizable" (eight or sixteen bytes + zero terminator, respectively), it implies we are using delimiters. If we want the delimiters to stay user-definable we could say the second object then determines what the object is (but that would be incompatible with @vshymansky's idea of adding a counter to it).

Next question: which object would be most suitable for this?

Hidden: boring analysis of typed arrays, base-64 strings and LEB128 encoded values, before I realized all of which them are utterly inferior to using one or two U64 values

First, we could try a typed array. Doing so would add a storage overhead of three bytes - 84, B4, XX where the last byte is the length of the array (eight, sixteen, whatever we settle on). A significant downside would be that Typed Array allocation is both really, really slow in JavaScript, and has a relatively high memory overhead (in the order of 200 bytes), so using this object type would likely have a negative impact on performance on the parsing end when not discarding delimiters. Based on that I would advise against using them for this purpose.

One could store the separator as a base64-encoded string. That would increase the size by 1/3 since it encodes only six bits per byte, plus a final zero-terminator byte. Rounding up encoding 16 bytes would take up 23 bytes this way, and 8 bytes would take 12. A bit bigger, but on the other hand string creation and memory overhead are both a lot more optimized in JS engines.

Another option is LEB128 encoding (or whatever varint encoding Muon will settle on - see #8). That would imply an overhead of 0xBB for the start, followed by seven bits per byte - encoding 16 bytes this way would take up 20 bytes, 8 bytes would take up 11 - almost as good as typed arrays! In JavaScript this would imply creating BigInts, for which I do not know where they stand in terms of memory overhead/performance. Given that they're supposed to be used for calculations (meaning they change a lot) one would hope that at least some effort has been put into optimizing them though.

The most efficient option is to use 0xB7 followed by eight bytes. It's the cheapest in terms of added storage overhead (from 8 bytes to 10 bytes, from 16 bytes to 18 bytes), and BigInts in JavaScript should be relatively cheap to create, so that shouldn't add too much overhead when dealing with parsers oblivious to this protocol. The only slightly tricky thing is that in that case, when filtering out the delimiters afterwards, the 16-bytes option would require skipping two delimiter objects for each "real" object in our list.

Again, just some thoughts on how to support this with relatively low complexity without actually having to change the Muon format.

from muon.

vshymanskyy avatar vshymanskyy commented on September 27, 2024 1

i'm not opposed to adding a specialized tag for this. we have more than enough unallocated markers. just need to understand the rationale

from muon.

vshymanskyy avatar vshymanskyy commented on September 27, 2024

@davebenson how about using a list as a root object (similar to muon chaining), and applying a 0x8B (size) tag to every element? This will make it impossible to run into clashes with a randomly selected delimiter.

from muon.

vshymanskyy avatar vshymanskyy commented on September 27, 2024

Having a random delimiter could be a good idea, however, I believe a 16-byte delimiter is an overkill.
Maybe something like: 1 tag byte + 6x payload bytes + 1 byte checksum?
And of course, the parser will have to check that a valid muon object follows.

from muon.

vshymanskyy avatar vshymanskyy commented on September 27, 2024

It might be a good idea to also allocate 1 byte for the counter (increases with each delimiter). In this case:
tag (1B) + random (5B) + counter (1B) + checksum (1B) = 8 bytes total

5 random bytes gives us 1,099,511,627,775 permutations.

from muon.

vshymanskyy avatar vshymanskyy commented on September 27, 2024

@davebenson please provide some real-world use cases or elaborate on the motivation of this request.
Currently, muon objects can be concatenated as-is (i.e. JSONL is an extension of JSON, but Muon doesn't need any additional separators).
Also, you can reuse Muon magic or padding tags to have some kind of separator.

from muon.

vshymanskyy avatar vshymanskyy commented on September 27, 2024

@davebenson waiting for your inputs here

from muon.

davebenson avatar davebenson commented on September 27, 2024

from muon.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.