
Comments (3)

siara-cc commented on May 26, 2024

Hi, Thanks for the feedback and questions.
Your first paragraph is not clear to me. I do handle binary data, but not efficiently. The implementation uses delta coding and applies it to whatever UTF-8 sequences it can find in the binary stream, and falls back to a less efficient method for the stray 128 bytes whenever it encounters an invalid UTF-8 sequence. Unishox is designed for compressing text data, so trying it on binary content will not give good results.
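To make the idea of "UTF-8 runs plus stray bytes" concrete, here is a minimal Python sketch that only splits a byte stream into decodable runs and the bytes that break them. It is not how Unishox is implemented internally and performs no compression; `split_utf8_runs` and the `"text"`/`"raw"` labels are hypothetical names used for illustration.

```python
def split_utf8_runs(data: bytes):
    """Split data into ("text", run) and ("raw", stray_bytes) segments.

    Illustration only: a real compressor would encode each kind of
    segment differently; this sketch just locates the boundaries.
    """
    segments = []
    pos = 0
    while pos < len(data):
        try:
            # If the remainder decodes cleanly, it is one final text run.
            data[pos:].decode("utf-8")
            segments.append(("text", data[pos:]))
            break
        except UnicodeDecodeError as e:
            # e.start and e.end are offsets relative to the slice we decoded.
            if e.start > 0:
                segments.append(("text", data[pos:pos + e.start]))
            segments.append(("raw", data[pos + e.start:pos + e.end]))
            pos += e.end
    return segments


if __name__ == "__main__":
    for kind, chunk in split_utf8_runs(b"hi\xffyo"):
        print(kind, chunk)
```

Joining the segments back together in order always reproduces the input, which is the property that matters for lossless round-tripping of text containing invalid UTF-8.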

Yes, lengths are expected to be stored separately. If not, there is an option to append a terminator, which would add extra bytes to the compressed output.
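As a rough sketch of what "storing lengths separately" can look like on the caller's side, the Python below frames opaque byte blobs with a length prefix. The 4-byte little-endian prefix and the names `frame`/`unframe` are assumptions made for this example, not part of Unishox; the blobs stand in for already-compressed output.

```python
import struct


def frame(blobs):
    """Concatenate blobs, each preceded by a 4-byte little-endian length."""
    out = bytearray()
    for blob in blobs:
        out += struct.pack("<I", len(blob))
        out += blob
    return bytes(out)


def unframe(buf):
    """Recover the list of blobs written by frame()."""
    blobs = []
    pos = 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if pos + n > len(buf):
            raise ValueError("truncated frame")
        blobs.append(buf[pos:pos + n])
        pos += n
    return blobs


if __name__ == "__main__":
    packed = frame([b"\x01\x02", b"", b"\xff"])
    print(unframe(packed))
```

Because the length travels with each blob, the payload can contain any byte value, including whatever a terminator would otherwise have to reserve.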

Good to know that you are thinking of an alternate solution that will supplement this. You are right, I spent a lot of time thinking about this, but it did not become as popular as I expected.

And finally, thanks for pointing out the typos!! I went through the docs a hundred times before publishing it with JOSS and did not find them. I think I need to see a doctor :-)

from unishox2.

PallHaraldsson commented on May 26, 2024

Thanks, that clears everything up. I wasn't concerned with arbitrary bytestrings such as executables or zip files; I know they wouldn't compress well and might even get larger. I was only, or mostly, concerned that all text, including illegal UTF-8, would decompress back as is. And even such text is likely to compress well. You might put a note about that in your docs (your call whether you discuss why and how; just a footnote that it works is fine by me).

Another option would be to validate the text and fail (throw an exception in languages like Julia, or whatever the equivalent is for you in C), or to put in a Unicode substitution character (I forget which codepoint). I wouldn't want that as the default, even though the Unicode standard demands it. I would want it as a non-default option in your code at best. You could say it's the responsibility of your user to validate first (and there's some extremely fast software out there for that). I would at least add some discussion of that to your docs.
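The three behaviours discussed here (pass bytes through untouched, fail on invalid input, or substitute a replacement character) could be offered as a pre-processing step in front of any compressor. A minimal Python sketch follows; `prepare_text` and its `mode` values are hypothetical names, not an existing Unishox option, and the substitution uses Python's built-in `errors="replace"` handler, which emits U+FFFD REPLACEMENT CHARACTER.

```python
def prepare_text(data: bytes, mode: str = "passthrough") -> bytes:
    """Return the bytes that would be handed to the compressor.

    passthrough: no validation, input is returned unchanged (the default).
    strict:      raise UnicodeDecodeError if data is not valid UTF-8.
    replace:     substitute U+FFFD for each undecodable byte sequence.
    """
    if mode == "passthrough":
        return data
    if mode == "strict":
        data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
        return data
    if mode == "replace":
        return data.decode("utf-8", errors="replace").encode("utf-8")
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    sample = b"abc\xff"
    print(prepare_text(sample))
    print(prepare_text(sample, "replace"))
```

Only `passthrough` guarantees that decompression gives back the original bytes; `replace` is lossy by design, which is one reason to keep it opt-in.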

Julia supports UTF-8 (previously also UTF-16 and UTF-32 in the standard library, since moved to a package because they were not that useful), strictly supports all bytestrings (there's no distinction between opening text and binary files), and doesn't validate by default. That was somewhat controversial, but I support that default; there are pros and cons to both. You have a validate function, and it would be a breaking change to run it by default. I foresee that potentially in Julia 2.0, or rather I would ask for it to happen globally, as opt-in. Since that isn't done, it's important to know that your library doesn't require legal UTF-8 to be used from Julia.

Your code is actually already wrapped by a package. It's not in the standard library. I have an idea for a better string package that I hope would end up as the default and be adopted into the standard library. Compression is a small part of it; it would also support better (localized) sorting. UTF-8 already gives ASCIIbetical/codepoint order, which is actually awful for most languages and not even ideal for English, but your (or anyone's) compressed strings would have a problem there. With my compressed strings (or likely compression used just for prefixes), I would get localized sorting. So I wouldn't be using your library, except that it might be an excellent fallback.

I wouldn't worry that your code isn't getting popular; these things can take time. It seems recent, at least version 2 (which is incompatible, right, because of e.g. some Japanese stuff you implemented before?). Does version 2 get decoded by version 1? The Julia wrapper is in the process of upgrading, or has already done so. I'm not sure it's used very much yet, but if anyone did use it and stored data on disk, that seems like a problem. Would most text, say English, compressed with the older v1 get decompressed correctly with v2?


siara-cc commented on May 26, 2024

No, versions 1 and 2 are not compatible with each other.

