Comments (6)
That's a great idea!!
Particularly the part of terminating with a \0 :-). It would be quite useful if the length is not stored.
However, I think it is sufficient to use only 6 bits for termination (00 00 00), by switching back to lower case immediately.
It is also possible to include "." and "," and a few more characters for preset 1 by utilising the fact that there is no upper case letter defined for space.
Are you using preset 1 for some purpose? As far as I can see, there are not many users for such a limited character set. Also, there is not much difference in compression ratio compared to the default character set.
from unishox2.
I think 8 zero bits are enough:
- suppose the current state is lower case, 00 00 00 00 A means switching to sticky UPPER case, switching back to lower case, then switching to once-only Upper case, then reading a character A. In the current compression algorithm, it could be encoded as 00 00 A or A 00.
- suppose the current state is sticky UPPER case, 00 00 00 00 a means switching to lower case, to sticky UPPER again, then back to lower case again, and reading a character a. In the current compression algorithm, it could be encoded as 00 a.
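The redundancy argument above can be checked with a small sketch. It is a toy model, not the unishox2 decoder: it assumes every 00 code simply advances a three-step cycle (lower, once-only upper, sticky upper), and the names `CYCLE` and `apply_switches` are made up for illustration.

```python
# Toy model (an assumption, not unishox2 code): each "00" code moves the
# case state one step around this cycle.
CYCLE = ["lower", "upper-once", "upper-sticky"]

def apply_switches(state, count):
    """Return the case state after `count` consecutive 00 codes."""
    start = CYCLE.index(state)
    return CYCLE[(start + count) % len(CYCLE)]

# Four switch codes land in the same state as a single one, so an encoder
# never needs to emit four in a row before a letter.
print(apply_switches("lower", 4), apply_switches("lower", 1))
print(apply_switches("upper-sticky", 4), apply_switches("upper-sticky", 1))
```

Under this model, four zeros from lower case behave like 00 A, and four zeros from sticky upper case behave like 00 a, which matches the two cases listed in the comment.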
from unishox2.
Since we are only talking about preset 1, I think only 6 bits are sufficient:
- suppose current state is lower case, 00 means next letter is upper case and 00 00 means sticky upper case, and we can use 00 00 00 for termination because we would not want to change to lower case immediately after changing to sticky upper case.
- suppose current state is sticky upper case, 00 means change back to lower case. Sticky lower case is not needed because that is the default. In this case just 4 bits (00 00) are sufficient for termination because we would not want to change to upper case or sticky upper case immediately after changing to lower case.
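A minimal sketch of the 6-bit / 4-bit rule described above, written as a toy decoder over a token list rather than a real bit stream. The token format ("00" for the switch code, a single letter otherwise) and the function name `decode` are assumptions made for illustration; this is not the unishox2 implementation.

```python
def decode(tokens):
    """Toy decoder: returns (text, terminated).

    "00" is the case-switch code; any other token is a letter.
    Three switches in a row from lower case, or two in a row from
    sticky upper case, are treated as the terminator.
    """
    sticky = False  # True while in sticky upper case
    run = 0         # consecutive "00" codes since the last letter
    out = []
    for tok in tokens:
        if tok == "00":
            run += 1
            if run == (2 if sticky else 3):
                return "".join(out), True
            continue
        if sticky:
            if run == 1:          # one switch: back to lower case
                sticky = False
                out.append(tok.lower())
            else:
                out.append(tok.upper())
        elif run == 0:
            out.append(tok.lower())
        else:                     # run is 1 (once upper) or 2 (sticky upper)
            sticky = run == 2
            out.append(tok.upper())
        run = 0
    return "".join(out), False

print(decode(["h", "i", "00", "00", "00", "x"]))
print(decode(["00", "00", "o", "k", "00", "00"]))
```

The decoder stops at the terminator and ignores whatever follows, so no length needs to be passed in this model.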
We can add extra zeros to make the last byte all zeros for strlen() to work. However, strlen() is not strictly necessary because the decoder will end upon finding the terminator. We can just pass 0 as length and implement the decoder to end on terminator.
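The zero padding could look roughly like the sketch below. It works on a string of "0"/"1" characters for readability, and the helper name `pad_with_zero_byte` is hypothetical. Note that it only guarantees an all-zero final byte; a zero byte occurring earlier in the compressed data would still cut strlen() short, which this sketch does not address.

```python
def pad_with_zero_byte(bits):
    """Pad a "0"/"1" string so the packed output ends in a 0x00 byte."""
    # Fill up to the next byte boundary with zero bits.
    bits += "0" * (-len(bits) % 8)
    # If the final byte still carries data bits, append one more zero byte.
    if not bits.endswith("0" * 8):
        bits += "0" * 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# Five payload bits followed by the 6-bit terminator from the text.
packed = pad_with_zero_byte("10110" + "000000")
print(packed.hex())
```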
All this is good. But do we have much use for preset 1 in real world?
from unishox2.
#20 contains a working implementation of the preset 1 terminator (8-bit version, but it can easily be changed to a 6-bit or 4/6-bit version).
As for strlen(), sure, it is not the main target; what we need is a bit stream which can stop the decompressor.
For real-world usage, I am also wondering; maybe things such as language tokens, mnemonic codes, some category lists, etc. But for completeness of the encoding, all presets should share the same major features, or else why does preset 1 still exist?
For consistency, preset 1 is a little weird. From the v2 paper, a more reasonable result is that the leanest preset should be preset 2, and ALPHA and NUM are two must-have hcodes.
from unishox2.
Hi, thank you for the PR. It will take me some time because I work on these only on weekends.
I initially thought presets were a great idea. But after implementing them I did not find much difference between them and the default. The default has no limitations and almost gives the same ratio.
I am also thinking of starting with a default preset and dynamically switching to a less limited one on encountering different symbols.
from unishox2.
"The default has no limitations and almost gives the same ratio."
Yes, for most cases the default is the best choice. It is better not to leave runtime parameters for switching the H-Codes, just selecting Freq Set should be enough.
from unishox2.