Comments (3)
Hi, Thanks for the feedback and questions.
Your first para is not clear to me. I am handling binary data, but not very efficiently. The implementation applies delta coding to whatever UTF-8 sequences it can find in the binary stream, and falls back to the inefficient method for the stray 128 bytes whenever it encounters an invalid UTF-8 sequence. Unishox is designed for compressing text, so trying it on binary content will not give good results.
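To make the idea of a stream that mixes valid and invalid UTF-8 concrete, here is a minimal Python sketch that splits a byte string into runs of valid UTF-8 and runs of stray bytes that do not decode. It only illustrates the splitting step. The name `split_utf8_runs` is hypothetical, and this is not a description of how Unishox is implemented internally.

```python
def split_utf8_runs(data: bytes) -> list:
    """Split data into ("text", bytes) runs of valid UTF-8 and ("raw", bytes) runs of stray bytes."""
    segments = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that is left.
            data[pos:].decode("utf-8")
            good_end = len(data)
        except UnicodeDecodeError as err:
            # err.start is relative to the slice we tried to decode.
            good_end = pos + err.start
        if good_end > pos:
            # A non-empty run of valid UTF-8.
            segments.append(("text", data[pos:good_end]))
            pos = good_end
        else:
            # The byte at pos cannot start a valid sequence: treat it as raw.
            byte = data[pos:pos + 1]
            if segments and segments[-1][0] == "raw":
                segments[-1] = ("raw", segments[-1][1] + byte)
            else:
                segments.append(("raw", byte))
            pos += 1
    return segments


print(split_utf8_runs(b"abc\xffdef"))
```

Because every input byte lands in exactly one segment, concatenating the segments gives back the original bytes, which is the round-trip property discussed in this thread.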
Yes, lengths are expected to be stored separately. If not, there is an option to append a terminator, which adds extra bytes to the compressed output.
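As a rough illustration of what storing lengths separately can look like on the caller's side, the sketch below packs several opaque compressed blobs into one buffer with a 2-byte length prefix each. The helpers `pack_blobs` and `unpack_blobs` are hypothetical and not part of Unishox; the blobs are treated as arbitrary bytes, so no compression call is shown.

```python
import struct


def pack_blobs(blobs):
    """Concatenate blobs, each preceded by its length as a big-endian 16-bit integer."""
    out = bytearray()
    for blob in blobs:
        # struct.pack raises struct.error if a blob is longer than 65535 bytes.
        out += struct.pack(">H", len(blob))
        out += blob
    return bytes(out)


def unpack_blobs(packed):
    """Recover the original blobs from a buffer produced by pack_blobs."""
    blobs = []
    pos = 0
    while pos < len(packed):
        (length,) = struct.unpack_from(">H", packed, pos)
        pos += 2
        blobs.append(packed[pos:pos + length])
        pos += length
    return blobs


blobs = [b"\x01\x02\x00\x03", b"", b"\xff\x00"]
assert unpack_blobs(pack_blobs(blobs)) == blobs
```

With an explicit length, a blob may contain any byte value, including zero bytes, without being confused with an end marker.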
Good to know that you are thinking of an alternative solution that will complement this. You are right, I spent a lot of time thinking about this, but it did not become as popular as I expected.
And finally, thanks for pointing out the typos!! I went through the docs a hundred times for the JOSS publication and did not find them. I think I need to see a doctor :-)
from unishox2.
Thanks, that clears everything up. I wasn't concerned with arbitrary bytestrings such as executables or zip files; I know they wouldn't compress well and might even get larger. My concern was only, or mostly, that all text, including illegal UTF-8, would decompress back as is. Even such text is likely to compress well. You might put a note about that in your docs (your call whether you discuss why and how; just a footnote saying that it works is fine by me).
Another option would be to validate the text and fail (throw an exception in languages like Julia, or whatever the equivalent is for you in C), or to put in some Unicode codepoint, I forget which, as a substitution character. I wouldn't want that as the default, even though the Unicode standard demands it. I would want it as a non-default option in your code at best. You could say it's the responsibility of your user to validate first (and there's some extremely fast software out there for that). I would at least add some discussion of that to your docs.
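The three policies mentioned above (pass bytes through untouched, validate and fail, or substitute) can be sketched in a few lines of Python. This is a caller-side illustration under assumptions: `prepare_input` and its mode names are hypothetical and not part of Unishox. The substitution case relies on Python's built-in `errors="replace"` handler, which inserts U+FFFD REPLACEMENT CHARACTER.

```python
def prepare_input(data: bytes, mode: str = "passthrough") -> bytes:
    """Apply a validation policy to bytes before handing them to a compressor."""
    if mode == "passthrough":
        # Default: no validation, the bytes are kept exactly as they are.
        return data
    if mode == "strict":
        # Raises UnicodeDecodeError if data is not valid UTF-8.
        data.decode("utf-8")
        return data
    if mode == "replace":
        # Invalid sequences become U+FFFD; this changes the data and is lossy.
        return data.decode("utf-8", errors="replace").encode("utf-8")
    raise ValueError(f"unknown mode: {mode!r}")


bad = b"abc\xff"
assert prepare_input(bad) == bad
assert prepare_input(bad, "replace") == "abc\ufffd".encode("utf-8")
```

Only the passthrough policy preserves illegal UTF-8 byte for byte, which is why it is the one that keeps decompression an exact round trip.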
Julia supports UTF-8 (UTF-16 and UTF-32 used to be in the standard library too, but were moved to a package since they weren't useful). It strictly supports all bytestrings (there's no distinction between opening text and binary files) and doesn't validate by default. That was somewhat controversial, but I support that default; there are pros and cons to both. There is a validate function, and running it by default would be a breaking change. I can foresee that potentially happening in Julia 2.0, or rather being something you ask for globally, as an opt-in. Since that isn't done, it's important to know that your library doesn't require legal UTF-8 in order to be used from Julia.
Your code is actually already wrapped by a package. It's not in the standard library. I have an idea for a better string package that I hope would end up as the default and be adopted into the standard library. Compression is a small part of it; it is also meant to support better (localized) sorting. UTF-8 already gives ASCIIbetical/codepoint order, which is actually awful for most languages and not even ideal for English, but your (or anyone's) compressed strings would have a problem there. With my compressed strings (or likely compression used just for prefixes), I would get localized sorting. So I wouldn't be using your library, except that it might be an excellent fallback.
I wouldn't worry that your code isn't getting popular; these things can take time. It seems recent, at least version 2 (which is incompatible, right, because of e.g. some Japanese stuff you implemented before?). Does version 2 output get decoded by version 1? The Julia wrapper is in the process of upgrading, or has already done so. I'm not sure it is used very much yet, but if anyone did use it and stored data on disk, that seems like a problem. Would most text, say English, compressed with the older v1 get decompressed correctly with v2?
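For readers unfamiliar with the term, a small Python sketch of the ASCIIbetical/codepoint ordering mentioned above: sorting strings by code point gives the same order as sorting their UTF-8 bytes, and that order puts all uppercase ASCII letters before lowercase ones and accented letters last. The word list is made up for illustration.

```python
# "\u00e9clair" is "eclair" with a precomposed e-acute as its first letter.
words = ["zebra", "Apple", "\u00e9clair", "apple", "Zoo"]

# Python compares str values code point by code point.
by_codepoint = sorted(words)

# Sorting the UTF-8 encodings as raw bytes yields the same order.
by_utf8_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]

assert by_codepoint == by_utf8_bytes
print(by_codepoint)  # ['Apple', 'Zoo', 'apple', 'zebra', 'éclair']
```

A dictionary or locale-aware order would instead keep "Apple" and "apple" together and place "éclair" near the other words starting with "e".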
from unishox2.
No, versions 1 and 2 are not compatible with each other.
from unishox2.