siara-cc / unishox2 Goto Github PK

Compression for Unicode short strings (works on arduino)

License: Apache License 2.0

Makefile 0.01% C 99.44% C++ 0.36% CMake 0.01% TeX 0.17% Go 0.01% Shell 0.01%

compression arduino string-compression-algorithms string-compression database-compression iot short-string json-compression xml-compression cost-optimization

unishox2's Introduction

Unishox: A hybrid encoder for Short Unicode Strings

In general compression utilities such as zip, gzip do not compress short strings well and often expand them. They also use lots of memory which makes them unusable in constrained environments like Arduino. So Unishox algorithm was developed for individually compressing (and decompressing) short strings.

This is a C/C++ library. See here for CPython version and here for Javascript version which is interoperable with this library.

The contenders for Unishox are Smaz, Shoco, Unicode.org's SCSU and BOCU (implementations here and here) and AIMCS (Implementation here).

Note: Unishox provides the best compression for short text and not to be compared with general purpose compression algorithm like lz4, snappy, lzma, brottli and zstd.

Applications

Faster transfer of text over low-speed networks such as LORA or BLE
Compression for low memory devices such as Arduino and ESP8266
Compression of Chat application text exchange including Emojis
Storing compressed text in database
Bandwidth and storage cost reduction for Cloud

Unishox3 Alpha

The next version Unishox3 which includes multi-level static dictionaries residing in RAM or Flash memory provides much better compression than Unishox2. A preview is available in Unishox3_Alpha folder and a make file is available. To compile please use the following steps:

cd Unishox3_Alpha
make
../usx3 "The quick brown fox jumped over the lazy dog"

This is just a preview and the specification and dictionaries are expected to change before Unishox3 will be released. However, this folder will be retained so if someone used it for compressing strings, they can still use it for decompressing them.

Unishox2 will still be supported for cases where space for storing static dictionaries is an issue.

How it works

Unishox is an hybrid encoder (entropy, dictionary and delta coding). It works by assigning fixed prefix-free codes for each letter in the above Character Set (entropy coding). It also encodes repeating letter sets separately (dictionary coding). For Unicode characters, delta coding is used.

The model used for arriving at the prefix-free code is shown below:

The complete specification can be found in this article: A hybrid encoder for compressing Short Unicode Strings. This can also be found at figshare here with DOI 10.6084/m9.figshare.17056334.v2.

Compiling

To compile, just use make or use gcc as follows:

gcc -std=c99 -o unishox2 test_unishox2.c unishox2.c

Unit tests (automated)

For testing the compiled program, use:

./test_unishox2 -t

This invokes run_unit_tests() function of test_unishox2.c, which tests all the features of Unishox2, including edge cases, using 159 strings covering several languages, emojis and binary data.

Further, the CI pipeline at .github/workflows/c-cpp.yml runs these tests for all presets and also tests file compression for the different types of files in sample_texts folder. This happens whenever a commit is made to the repository.

API

int unishox2_compress_simple(const char *in, int len, char *out);
int unishox2_decompress_simple(const char *in, int len, char *out);

Usage

To see Unishox in action, simply try to compress a string:

./test_unishox2 "Hello World"

To compress and decompress a file, use:

./test_unishox2 -c <input_file> <compressed_file>
./test_unishox2 -d <compressed_file> <decompressed_file>

Note: Unishox is good for text content upto few kilobytes. Unishox does not give good ratios compressing large files or compressing binary files.

Character Set

Unishox supports the entire Unicode character set. As of now it supports UTF-8 as input and output encoding.

Achieving better overall compression

Since Unishox is designed and developed for short texts and other methods are not good for short texts, following logic could be used to achieve better overall compression, since the magic bit(s) at the beginning of compressed bytes can be used to identify Unishox or other methods:

if (size < 1024)
    output = compress_with_unishox(input);
else
    output = compress_with_any_other(input)

The threshold size 1024 is arbitrary and if speed is not a concern, it is also possible to compress with both and use the best.

Interoperability with the JS Library

Strings that were compressed with this library can be decompressed with the JS Library and vice-versa. However please see this section in the documentation for usage.

Projects that use Unishox

Credits

Thanks to Jonathan Greenblatt for his port of Unishox2 that works on Particle Photon
Thanks to Chris Partridge for his port of Unishox2 to CPython and his comprehensive tests using Hypothesis and extensive performance tests
Thanks to Stephan Hadinger for his port of Unishox1 to Python for Tasmota
Thanks to Luis Díaz Más for his PRs to support MSVC and CMake setup
Thanks to James Z.M. Gao for his PRs on improving presets, unit tests, bug fixes and more
Thanks to Jm Casler and Shiv Kokroo for choosing and integrating Unishox into Meshtastic project

Versions

The present byte-code version is 2 and it replaces Unishox 1. Unishox 1 is still available as unishox1.c, but it will have to be compiled manually if it is needed.

The next version would be Unishox3 and it would include a multi-level static dictionaries residing in RAM or Flash memory that would greatly improve compression ratios compared to Unishox2. However Unishox2 will still be supported for cases where space for storing static dictionaries is an issue.

License for AI bots

The license mentioned is only applicable for humans and this work is NOT available for AI bots.

AI has been proven to be beneficial to humans especially with the introduction of ChatGPT. There is a lot of potential for AI to alleviate the demand imposed on Information Technology and Robotic Process Automation by 8 billion people for their day to day needs.

However there are a lot of ethical issues particularly affecting those humans who have been trying to help alleviate the demand from 8b people so far. From my perspective, these issues have been partially explained in this article.

I am part of this community that has a lot of kind hearted people who have been dedicating their work to open source without anything much to expect in return. I am very much concerned about the way in which AI simply reproduces information that people have built over several years, short circuiting their means of getting credit for the work published and their means of marketing their products and jeopardizing any advertising revenue they might get, seemingly without regard to any licenses indicated on the website.

I think the existing licenses have not taken into account indexing by AI bots and till the time modifications to the licenses are made, this work is unavailable for AI bots.

Issues

In case of any issues, please email the Author (Arundale Ramanathan) at [email protected] or create GitHub issue.

unishox2's People

Contributors

Stargazers

Watchers

unishox2's Issues

Example to decode a single line from H file

I can't seem to see any example of how to decode a line from a generated H file.

The documentation on unishox2_decompress_lines is very confusing, it doesn't seem to be the right function.

out of bound and infinite loop when decompressing invalid count

The decoder may read 0 count by readCount() when decompressing invalid bytes, then the loop here and here may cause reading out of output buffer boundary in a nearly infinite loop.

Benchmark against Snappy

I am coming from Crystal-lang. I want to experiment this awesome library, so I create a C-binding for it.
The snappy code is written natively in crystal, here is the link https://github.com/naqvis/snappy
The input is Vietnamese text encoded in UTF-8 which size is 1008 bytes.

Here is the benchmark result.

Compression benchmark
unishox 880.16  (  1.14ms) (± 9.10%)   1.0kB/op  52.39× slower
 snappy  46.11k ( 21.69µs) (± 8.79%)  4.04kB/op        fastest

Decompression benchmark
unishox  28.76k ( 34.77µs) (± 8.71%)  1.0kB/op   4.46× slower
 snappy 128.24k (  7.80µs) (± 7.62%)  1.0kB/op        fastest

snappy: in_size = 1008, out_size = 774, ratio = 76.78571428571429
unishox: in_size = 1008, out_size = 624, ratio = 61.904761904761905

The speed is reasonale because the compression ration is good. Though, I wonder if we can make it faster :)

Unishox2/Arduino library crashes in append_bits

Hi!

We're looking at implementing this library to be used with the Meshtastic ( https://meshtastic.org ) project but unfortunately there is a runtime error when running the library from just sample code. I've provided a decoded back trace of the execution on an esp32.

Guidance would be appreciated.

Test code in GitHub:

https://github.com/mc-hamster/Unishox2-Test

This can be opened in Visual Studio Code with PlatformIO using the Arduino Framework.

Thanks mate!

Error:

Guru Meditation Error: Core 1 panic'ed (StoreProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0x400d0db9 PS : 0x00060530 A0 : 0x800d1349 A1 : 0x3ffb1eb0
A2 : 0x00000000 A3 : 0x00000000 A4 : 0x00000100 A5 : 0x00000001
A6 : 0x00ff0000 A7 : 0xff000000 A8 : 0x00000000 A9 : 0x00000080
A10 : 0x00000001 A11 : 0x00000008 A12 : 0x80000007 A13 : 0xfffffff8
A14 : 0x3f400198 A15 : 0x00000000 SAR : 0x0000001f EXCCAUSE: 0x0000001d
EXCVADDR: 0x00000000 LBEG : 0x400d0d1e LEND : 0x400d0d44 LCOUNT : 0x00000000

ELF file SHA256: 0000000000000000

Backtrace: 0x400d0db9:0x3ffb1eb0 0x400d1346:0x3ffb1ed0 0x400d1c4a:0x3ffb1f40 0x400d0cc1:0x3ffb1f70 0x400d2e72:0x3ffb1fb0 0x40086125:0x3ffb1fd0
#0 0x400d0db9:0x3ffb1eb0 in append_bits(char*, int, unsigned char, int) at src/mesh/compression/unishox2.cpp:707
#1 0x400d1346:0x3ffb1ed0 in unishox2_compress_lines(char const*, int, char*, unsigned char const*, unsigned char const*, char const**, char const**, us_lnk_lst*) at src/mesh/compression/unishox2.cpp:707
#2 0x400d1c4a:0x3ffb1f40 in unishox2_compress_simple(char const*, int, char*) at src/mesh/compression/unishox2.cpp:710
#3 0x400d0cc1:0x3ffb1f70 in setup() at src/main.cpp:24
#4 0x400d2e72:0x3ffb1fb0 in loopTask(void*) at /Users/jmcasler/.platformio/packages/framework-arduinoespressif32/cores/esp32/main.cpp:18
#5 0x40086125:0x3ffb1fd0 in vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c:355 (discriminator 1)

Rebooting...

Benchmarking Unishox

I've integrated unishox into Turbobench and made some tests with your test files.
As Unishox2 is very slow, I've made the tests with unishox3. unishox3 can't compress all the test files without errors.
Here a benchmark successfully compressed and decompressed files without error.
Unishox compression is better only for very small files (< 1k).

     C Size  ratio%     C MB/s     D MB/s   Name            File              (bold = pareto) MB=1.000.0000
        8212    13.5      16.20      22.55   bsc 0e2         hindi.txt
	9927    16.3       0.84     535.41   brotli 11       hindi.txt
       10226    16.8       4.71     182.20   lzma 9          hindi.txt
       10408    17.1       0.47    1968.94   zstd 22         hindi.txt
       10502    17.2       0.21     792.69   zopfli          hindi.txt
       10546    17.3       5.07    1795.21   libdeflate 12   hindi.txt
       11425    18.7       6.06     782.53   zlib 9          hindi.txt
       13811    22.6      11.72    4695.15   lz4 16          hindi.txt
       15248    25.0       0.20     182.75   unishox3        hindi.txt
       61037   100.0   30518.50   30518.50   memcpy          hindi.txt
      116431   190.8     544.97     792.69   shoco           hindi.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File              
       14612    32.3      10.00      12.89   bsc 0e2         spanish.txt
       15442    34.2       0.86     223.61   brotli 11       spanish.txt
       16522    36.6       6.21      68.34   lzma 9          spanish.txt
       16744    37.1       0.40     406.94   zopfli          spanish.txt
       16769    37.1       8.40     740.49   libdeflate 12   spanish.txt
       16797    37.2       0.38     868.65   zstd 22         spanish.txt
       17584    38.9      18.86     370.25   zlib 9          spanish.txt
       19692    43.6       0.04      88.05   unishox3        spanish.txt
       21889    48.5      18.69    3011.33   lz4 16          spanish.txt
       37079    82.1      94.10     438.54   shoco           spanish.txt
       45170   100.0   22585.00   45170.00   memcpy          spanish.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
        6235    43.2       0.93     166.02   brotli 11       chinese.txt
        6506    45.0       4.58       6.36   bsc 0e2         chinese.txt
        6828    47.3       7.73      53.50   lzma 9          chinese.txt
        7093    49.1       0.31     515.86   zopfli          chinese.txt
        7098    49.1       0.13     555.54   zstd 22         chinese.txt
        7187    49.8      12.58     577.76   libdeflate 12   chinese.txt
        7450    51.6      33.51     498.07   zlib 9          chinese.txt
        8989    62.2       0.20      76.02   unishox3        chinese.txt
        9458    65.5      30.41    2888.80   lz4 16          chinese.txt
       14444   100.0   14444.00   14444.00   memcpy          chinese.txt
       25170   174.3     283.22     722.20   shoco           chinese.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
        2882    35.9       4.45       6.03   bsc 0e2         zh.txt
        2932    36.5       0.95     167.35   brotli 11       zh.txt
        3061    38.1       0.07     669.42   zstd 22         zh.txt
        3091    38.5       7.01      96.78   lzma 9          zh.txt
        3098    38.6       0.54     669.42   zopfli          zh.txt
        3105    38.7       6.46     730.27   libdeflate 12   zh.txt
        3233    40.2      19.98     669.42   zlib 9          zh.txt
        4386    54.6      23.22    2677.67   lz4 16          zh.txt
        4492    55.9       0.67     110.04   unishox3        zh.txt
        8033   100.0    8033.00    8033.00   memcpy          zh.txt
       15668   195.0     259.13    1004.12   shoco           zh.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
        2150    28.5       3.83       5.64   bsc 0e2         ru.txt
        2255    29.9       0.75     160.62   brotli 11       ru.txt
        2484    32.9       4.24     125.82   lzma 9          ru.txt
        2515    33.3       0.12     629.08   zopfli          ru.txt
        2530    33.5       4.91     754.90   libdeflate 12   ru.txt
        2546    33.7       0.07     686.27   zstd 22         ru.txt
        2628    34.8      15.01     580.69   zlib 9          ru.txt
        3168    42.0       0.43     193.56   unishox3        ru.txt
        3591    47.6      15.22    1258.17   lz4 16          ru.txt
        7549   100.0    7549.00    7549.00   memcpy          ru.txt
       14244   188.7     215.69     686.27   shoco           ru.txt
	   
     C Size  ratio%     C MB/s     D MB/s   Name            File               
        1916    28.5       0.88     197.68   brotli 11       json3.txt
        2068    30.8       5.73     160.02   lzma 9          json3.txt
        2115    31.5       0.35     611.00   zopfli          json3.txt
        2120    31.5       0.06     840.12   zstd 22         json3.txt
        2120    31.5       5.13     746.78   libdeflate 12   json3.txt
        2124    31.6       3.81       5.59   bsc 0e2         json3.txt
        2163    32.2      23.18     611.00   zlib 9          json3.txt
        2261    33.6       0.09     292.22   unishox3        json3.txt
        2937    43.7      23.02    3360.50   lz4 16          json3.txt
        5727    85.2      97.41     517.00   shoco           json3.txt
        6721   100.0    6721.00    6721.00   memcpy          json3.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
         282    25.3       0.51      53.14   brotli 11       xml1.txt
         308    27.6       3.58     372.00   libdeflate 12   xml1.txt
         308    27.6       0.06     279.00   zopfli          xml1.txt
         312    28.0      10.63     279.00   zlib 9          xml1.txt
         318    28.5       0.01     372.00   zstd 22         xml1.txt
         321    28.8       2.89      65.65   lzma 9          xml1.txt
         344    30.8       0.99       1.45   bsc 0e2         xml1.txt
         363    32.5       0.08     223.20   unishox3        xml1.txt
         458    41.0       5.81     558.00   lz4 16          xml1.txt
         963    86.3      85.85     558.00   shoco           xml1.txt
        1116   100.0    1116.00   1116.00    memcpy          xml1.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
         102    42.3       0.05     120.50   unishox3        json1.txt
         130    53.9       0.62      17.21   lzma 9          json1.txt
         131    54.4       0.02      80.33   zopfli          json1.txt
         132    54.8       2.11     241.00   zlib 9          json1.txt
         132    54.8       0.74      80.33   libdeflate 12   json1.txt
         133    55.2       0.16       8.61   brotli 11       json1.txt
         134    55.6       0.00     120.50   zstd 22         json1.txt
         165    68.5       1.48     241.00   lz4 16          json1.txt
         172    71.4       0.24       0.28   bsc 0e2         json1.txt
         221    91.7      14.18     241.00   shoco           json1.txt
         241   100.0     241.00     241.00   memcpy          json1.txt

great work by the way, what's the difference between w-olen and without?

-rwxr-xr-x 1 root root 86208 Jun 23 13:27 test_unishox2
-rwxr-xr-x 1 root root 86208 Jun 23 13:27 test_unishox2-w-olen

Typo in usx_sets

Hello,

While investigating the dictionary table for customizing it for my use case, I have found a bug in the dictionary compared with the dictionary mentioned in your documentation
byte usx_sets[][28] = {{ 0, ' ', 'e', 't', 'a', 'o', 'i', 'n',
's', 'r', 'l', 'c', 'd', 'h', 'u', 'p', 'm', 'b',
'g', 'w', 'f', 'y', 'v', 'k', 'q', 'j', 'x', 'z'},
{'"', '{', '}', '_', '<', '>', ':', '\n',
0, '[', ']', '\', ';', ''', '\t', '@', '*', '&',
'?', '!', '^', '|', '\r', '~', '`', 0, 0, 0},
{ 0, ',', '.', '0', '1', '9', '2', '5', '-',
'/', '3', '4', '6', '7', '8', '(', ')', ' ',
'=', '+', '$', '%', '#', 0, 0, 0, 0, 0}};

it shall be instead:

byte usx_sets[][28] = {{ 0, ' ', 'e', 't', 'a', 'o', 'i', 'n',
's', 'r', 'l', 'c', 'd', 'h', 'u', 'p', 'm', 'b',
'g', 'w', 'f', 'y', 'v', 'k', 'q', 'j', 'x', 'z'},
{'"', '{', '}', '_', '<', '>', ':', '\n',
'\r\n', '[', ']', '\', ';', ''', '\t', '@', '*', '&',
'?', '!', '^', '|', '\r', '~', '`', 0, 0, 0},
{ 0, ',', '.', '0', '1', '9', '2', '5', '-',
'/', '3', '4', '6', '7', '8', '(', ')', ' ',
'=', '+', '$', '%', '#', 0, 0, 0, 0, 0}};

Thanks ^_^

unishox1_decompress segfaults on truncated input

If you take some valid compressed data, truncate it at an arbitrary point, and pass that truncated data to unishox1_decompress, sometimes it'll read past the end of the buffer and segfault. Specifically, getCodeIdx/getBitVal unconditionally read even when the bitstream would cross a byte boundary past the end of the buffer.

I originally discovered this behavior because unishox1_compress will sometimes return a byte sequence with interior NUL bytes, and due to an unrelated bug, its output was truncated to the first NUL byte. It appeared like some valid text sequences wouldn't always round-trip until I discovered that the test code in question had that bug. Round-tripping now seems to be correct (modulo #6) but this bug remains.

Can this compression library be used when the input encoding is not UTF8 or the character set is not Unicode?

Can this compression library be used when the input encoding is not UTF8(eg. short binary data) or the character set is not Unicode(eg. gb2312 big5 Shift_JIS)?

can u do one just for email? e.g. -.@

There's no parameter for output buffer length

unishox1_{de,}compress will gladly write beyond the end of the output buffer because there's no way to know that they're at the end. This is exacerbated given that there's no way to know in advance what the output size will be from the input, so there's no way to choose a buffer size without the danger of an overrun.

Both of these are issues worth solving, though technically either one will kind of obviate the other. If I were to choose just one, I'd pick "output buffer length as a parameter" over "output length calculator".

Documentation

readme.md doesn't mention

what happens when encoding arbitrary byte array (eg. utf8 with transmission errors)
will unishox2_decompress_simple still be able to recover original byte array?
Can decompress_simple take arbitrary/random input without crashing? (or input needs to be checksum checked)
Is it guaranteed that len(compressed data) <= len(original)?
API doesn't describe the return values in code.

Reimplementation Licensing

Hello, I'm looking to reimplement Unishox2 in at least one cross compiling statically typed language.
This is derivative work but I'm nevertheless confused about licensing.
Would it be proper to append myself?

Apache-2.0 License
Copyright (c) 2020-2022 Siara Logics (cc)
Copyright (c) 2022 exxjob

Simple question

Hi, I'm the author https://github.com/gbaraldi/Unishox.jl, a julia wrapper of unishox.
Is the NUL character an allowed value in the middle of a compressed string?
I say that because in a pathological string I got one, and since julia normally expects a C style NUL terminated string when getting a string from a pointer I got a bug. Thanks for the cool library anyway :).

benchmarks available?

hi, any benchmarks for ratio, speed, cpu usage and memory usage for >800bytes, 2k and 10k?
compared with zstd, lz4 and snappy? something like this extracted from zstd github

Compressor name | Ratio | Compression | Decompress.
-- | -- | -- | --
zstd 1.5.1 -1 | 2.887 | 530 MB/s | 1700 MB/s
zlib 1.2.11 -1 | 2.743 | 95 MB/s | 400 MB/s
brotli 1.0.9 -0 | 2.702 | 395 MB/s | 450 MB/s
zstd 1.5.1 --fast=1 | 2.437 | 600 MB/s | 2150 MB/s
zstd 1.5.1 --fast=3 | 2.239 | 670 MB/s | 2250 MB/s
quicklz 1.5.0 -1 | 2.238 | 540 MB/s | 760 MB/s
zstd 1.5.1 --fast=4 | 2.148 | 710 MB/s | 2300 MB/s
lzo1x 2.10 -1 | 2.106 | 660 MB/s | 845 MB/s
lz4 1.9.3 | 2.101 | 740 MB/s | 4500 MB/s
lzf 3.6 -1 | 2.077 | 410 MB/s | 830 MB/s
snappy 1.1.9 | 2.073 | 550 MB/s | 1750 MB/s

when is unishox 3 ready? just curious. been watching this repo for a year+

when is unishox 3 ready? just curious. been watching this repo for a year+
is there an ETA for unishox 3?

Can we compress .txt file in zip format and decompress it on windows/server?

I have a data-log file in .txt format of 2Mb each. Need to compress it in zip format (filename.zip) and send it over the GPRS to web server, there the web server can again decompress it. Is it possible by this library?

Optimize wordlist.h for Unishox3

The current implementation of Unishox3 uses wordlist.h which is a header file containing arrays of strings that are used for encoding and decoding text. However, the current implementation has some inefficiencies that could be optimized.

Firstly, the current implementation calls strlen to get the length of each string in the arrays. One potential solution to this is to use std::string_view, which gets the length of the string at compile time. However, this approach can increase memory usage since there are many strings in the array.

To address this issue, the wordlist.h file can be flattened into a single string, wordlist.bin, which can be compiled into the binary. This significantly reduces the amount of memory required since there are no longer any pointers to the individual strings, and the string length is already known at compile time. To know where each string is located in the flattened string, an array of indices is used wordlist_index.h.

Additionally, this optimization can improve CPU performance since the strings are now located in a contiguous block of memory, which improves spatial locality and cache performance.

To implement this optimization, we propose the following changes:
1. Flatten the wordlist.h file into a single string, wordlist.bin, which can be compiled into the binary.
2. Create an array of indices, wordlist_index, to keep track of where each string starts and ends in the flattened string.
3. Modify the getDictWord() function to use the wordlist_index array to retrieve the correct string from the flattened wordlist.bin string.

and are but for haveingjustlikenot thatthe theythiswas withyou ", "": ": "":""},.com.org</="aboutall alsobeencan coulddon'tevenfromget goodhad has her his how i'm in theit'sknowmakemoremuchof theone onlyotherout she somestillthantheirthemthentherethinktimeto ...

const int wordlist_index[] = {0, 4, 8, 12, 16, 20, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, 70, ...

int getOffset(int lvl)
{
    switch (lvl) {
    case 0:
        return 0;
    case 1:
        return 16;
    case 2:
        return 16 + 64;
    case 3:
        return 16 + 64 + 256;
    case 4:
        return 16 + 64 + 256 + 2048;
    case 5:
        return 16 + 64 + 256 + 2048 + 32768;
    case 6:
        return 16 + 64 + 256 + 2048 + 32768 + 131072;
    default:
        return std::size(wordlist_index); // invalid
    }
}

extern char _binary_wordlist_bin_start[];

const char *getDictWord(int lvl, int pos, size_t *size)
{
    constexpr auto wordlist_bin = (const char *)_binary_wordlist_bin_start;

    size_t index = getOffset(lvl) + pos;
    if (index < std::size(wordlist_index) - 1) {
        auto start = wordlist_index[index];
        auto end = wordlist_index[index + 1];
        *size = end - start;
        return wordlist_bin + start;
    }

    *size = 0;
    return nullptr;
}

This optimization can also reduce the time required for compilation since the wordlist.h file is no longer required, and the compilation of the binary is faster.

We have tested this implementation and found that it reduces both memory usage and execution time compared to the original implementation.

ld -r -b binary -o wordlist_bin.o wordlist.bin
g++ main.cpp wordlist_bin.o && ./a.out

Compress strings with escape sequences

Hi,

Thanks for this great library. I'm looking to compress a multiline string (containing whitespace characters\r and \n). How do I specify the escape sequence in my input file?

Thanks.

Javascript Library

Please make a Javascript library for this. This is perfect for sending messages over WebSockets and MQTT Brokers.

Unishox for raspberry pi, linux or windows?

Is there an unishox version for desktop plataforms?

Can I know in advance what the output size will be from the input in unishox2?

How big buffer should be allocated for output before calling compress and decompress function？

Crash in decompression, strlen(NULL)

The following program dies dereferencing USX_TEMPLATES[4], which is NULL.

#include <stdio.h>
#include <string.h>
#include "unishox2.h"

int main()
{
  static const char *input = "\252!\355\347;멠<\322\336\346\070\205X\200v\367b\002\332l\213\022\n\003P\374\267\002\266e\207.\210r:\021\225\224\243\353\204\305\352\255\017L/(HH4i\223~\270-\223\206\221\246\212\261\221e\254\375\341\350\037\240X\211ǉ\325\330u\365\303ʂ\200гM\236&\375\377\071%'?V\025\070\374\026\346s\323$\276\350F\224\r-\226\347ɋ\317\344\214\v\032U\303\353\215\335GX\202\371B\302\355\a\247\273\356C\372\a-\262\006\\\343\"ZH|\357\034\001";
  char out[4096];
  const size_t len = strlen(input); // no zeroes in it

  unishox2_decompress(input, len, out, /*4096,*/ USX_HCODES_DFLT, USX_HCODE_LENS_DFLT, USX_FREQ_SEQ_TXT, USX_TEMPLATES);

  return 0;
}

Some code point sequences can't round-trip

As an example, 'Ç、.' gets turned into 'Ç、。'. I'm not sure if this is expected behavior or not when mixing planes. That particular sequence was found with a sort of fuzzer, hypothesis.

typo in document: The first set consists of [27] codes

In Section 5 of V2 PDF, is says 'The first set consists of 27 codes'. It should be 28 codes instead.

Incorrectly decompresses non UTF-8 data

	static const char cszData[] = "\x41\x5F\x28\xE1\xF3\xEA\xE2\xE0\x29";

	char chBuf[ 0x100 ];
	char chBuf1[ 0x100 ];

	int nRes;
	nRes = unishox1_compress( cszData, sizeof( cszData ) - 1, chBuf, 0 );
	nRes = unishox1_decompress( chBuf, nRes, chBuf1, 0 ); // incorrect result!

term code does not work for preset 1

In preset 1, we cannot switch to Number state, hence the decoder cannot stop at the sequence "00 11111111", outputting letter Z instead.

A proposal is that, in preset 1, since sticky UPPER case state will be encoded as 00 00, and switch back with an addition 00, then we could treat eight or more continuous 0-bit as a term code. This also supports strlen() on compressed data to calculate the compressed data length for this special preset.

any golang binding for this?

e.g unishox2.Compress(value)
value := unishox2.Decompress(compressValue)

Predefined dictionary

@siara-cc
Hi Arun
I wonder if you could advise, as to the application of this code

I need to compress short strings of between 40 and 80 characters
The dictionary is small in terms, for example

Always Upper case characters
Small set of non-alphanumeric
Small set of full words

There is a very small dictionary of words, of which many are used repetitively, so I want to be able to define substrings with a code, in addition to characters
The size of the dictionary itself, is not important the dictionary is pre-defined, and not provided with the compressed data. the dictionary is known to the encode and decode itself

The question is, if I have this pre-defined dictionary, can I embed this with your project ?

Thx
Lee

Errors in linting with tool-cppcheck

While integrating the unishox2 library with the Meshtastic project, the initial PR reported the following errors as part of the linting with tool-cppcheck.

I'll submit a PR to fix this shortly.

Tool Manager: Installing platformio/tool-cppcheck @ ~1.260.0
25
Downloading...
26
Unpacking...
27
Tool Manager: tool-cppcheck @ 1.260.0 has been installed!
28
src/mesh/compression/unishox2.c:132: [low:style] Redundant condition: If 'c > 32', the comparison 'c != 0' is always true. [redundantCondition]
29
src/mesh/compression/unishox2.c:150: [low:style] The scope of the variable 'cur_bit' can be reduced. [variableScope]
30
src/mesh/compression/unishox2.c:151: [low:style] The scope of the variable 'blen' can be reduced. [variableScope]
31
src/mesh/compression/unishox2.c:152: [low:style] The scope of the variable 'a_byte' can be reduced. [variableScope]
32
src/mesh/compression/unishox2.c:153: [low:style] The scope of the variable 'oidx' can be reduced. [variableScope]
33
src/mesh/compression/unishox2.c:979: [low:style] Local variable 'idx' shadows outer variable [shadowVariable]

Compress on JS and decompress on Arduino

Hello, whenever I do a simple compress of a string on Node.JS using Unishox-JS library and I decode it on C++ (Arduino), it shows strange characters and simply don't work. Has anyone tested it before? Could someone please help me? Thanks!

Decompress length is greater than original length

I have a UTF-8 text whose size is 2925 bytes:

Các bác khai sáng thêm cho e với ah, e nó còn nhỏ mà giỏi quá.
Mà sao im re ko thấy báo chí hay các phương tiện truyền thông gì ...ngoài việc thu thuế.
-----------------------------------------------------------------------------
Tỷ phú đô la công nghệ U30

Trung Nguyễn sáng lập cty Sky Mavis tại TP Hồ Chí Minh đã trở thành tỷ phú (USD) ở tuổi 29 (sinh 1992) và là tỷ phú công nghệ đầu tiên của Việt Nam.

Sản phẩm của Sky Mavis là game Axie Infinity phát triển trên nền công nghệ Blockchain. Khác với các game truyền thống thuần tuý giải trí, Sky Mavis đã đưa Axie Infinity của mình nên một tầm cao mới: không chỉ là game giải trí mà mà là vừa giải trí vừa kiếm tiền thông qua cơ hội đầu tư tiền điện tử AXS.

Sky Mavis được thành lập năm 2017, và Axie Infinity ra đời năm 2018, được phát triển trên nền tảng blockchain lấy cảm hứng từ Pokemon, nơi mà người chơi có thể sở hữu, nhân giống, phát triển và đã nhanh chóng trở thành game hấp dẫn nhất, thu hút 350.000 người dùng hoạt động hàng ngày, chủ yếu đến từ Philippines, Mỹ và Venezuela.

Người chơi Axie có thể sử dụng SLP (Smooth Love Potion) và AXS (Axie Infinity Shard) để mua đất, trang trại hoặc nhân giống Axies. Và vì các SLP và SXS có giá trị, các game thủ cũng có thể sử dụng chúng để trả tiền thuê nhà hoặc thực phẩm trong cuộc sống thực.

Axie nhanh chóng trở thành game blockchain có doanh thu cao nhất thế giới. Theo crytoslam doanh thu của Axie Infinity đến đầu tháng 07/2021 đã đạt 386 triệu USD và theo dữ liệu từ Token Terminal, doanh thu trong 30 ngày qua của Axie là 90 triệu USD.

Nhờ doanh thu tăng trưởng cực nhanh giá của đồng AXS cũng tăng chóng mặt, tăng 600% trong vòng 1 tháng. Diễn biến giá của đồng AXS và vốn hoá của Sky Mavis như sau:

- 07/07 giá 9,69$, 0,6 tỷ USD
- 14/07 giá 21$, 1,3 tỷ USD
- 15/07 giá 29,13$, 1,8 tỷ USD
- 20/07 giá 14,19$, 0,88 tỷ USD
- 23/07 giá 40,01$, 2,44 tỷ USD
.............
- 5/8/2021 giá 44.06$; 2,68 tỷ usd

Với giá AXS tăng chóng mặt như vậy, ngày 14/07 vốn hoá của Sky Mavis đã đạt 1,3 tỷ USD và hiện tại vốn hoá của Sky Mavis đã đạt 2,44 tỷ USD (riêng lượng giao dịch ngày hôm qua lên tới 5,56 tỷ USD).

Thật là một đại kỳ tích: Chỉ mất có 4 năm từ ngày thành lập công ty Sky Mavis (2017) và mất có đúng 3 năm ra đời sản phẩm Axie Infinity (2018), Trung Nguyễn cùng các cộng sự đã đưa giá trị công ty lên 2,44 tỷ USD, một thời gian kỷ lục trong giới công nghệ Châu Á.

The compress_simple works ok. But decompress_simple return 2954. Could you help me take a look?

Edit: now I have error Invalid memory access (signal 11) at address 0x7f9d3590cfdd

[JOSS]: Automated tests

Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified? Not given

Buffer overflow in compression

With the following test case:

#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include "../unishox/unishox2.h"

int main()
{
  const char input[] = {'R', 'o', 'm'};
  char packed[4];
  size_t packed_len = unishox2_compress_simple(input, sizeof(input), packed);
  printf("packed_len: %zu\n", packed_len);
  return 0;
}

The 3 byte input "compresses" to 3 bytes.

However, when built with -fsanitize=address, we see that there is a buffer overflow past the end of packed, which is large enough for the 3 bytes, and even one extra:

=================================================================
==31557==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc5c932f34 at pc 0x0000004016c7 bp 0x7ffc5c932bc0 sp 0x7ffc5c932bb0
WRITE of size 1 at 0x7ffc5c932f34 thread T0
    #0 0x4016c6 in append_bits ../unishox/unishox2.c:114
    #1 0x401dba in append_code ../unishox/unishox2.c:158
    #2 0x40720b in unishox2_compress_lines ../unishox/unishox2.c:690
    #3 0x407855 in unishox2_compress_simple ../unishox/unishox2.c:702
    #4 0x401351 in main /home/user/src/townforge/unishox2-bug.c:11
    #5 0x7854a51b0f42 in __libc_start_main (/lib64/libc.so.6+0x23f42)
    #6 0x40119d in _start (/home/user/src/townforge/a.out+0x40119d)

Address 0x7ffc5c932f34 is located in stack of thread T0 at offset 68 in frame
    #0 0x401265 in main /home/user/src/townforge/unishox2-bug.c:8

  This frame has 2 object(s):
    [48, 51) 'input' (line 9)
    [64, 68) 'packed' (line 10) <== Memory access at offset 68 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow ../unishox/unishox2.c:114 in append_bits

Built with:
gcc -g -O0 unishox2-bug.c ../unishox/unishox2.c
to see the output being 3 bytes, then with:
gcc -g -O0 -fsanitize=address unishox2-bug.c ../unishox/unishox2.c
to see the buffer overflow past those three bytes.

Maybe also related, the input seems to get clobbered with a related input. I'll file another bug for this, in case it's a different cause.

Unit Tests and CI for the project

Hi!

As you might have noticed I already contributed with 2 PRs and I was thinking if you would be interested in adding unit tests and a
continuous integration pipeline to the project. I could work in those 2 topics so that it would be easier for people to check if the project is building correctely in certain platforms and have a safety net checking if new changes break the main functionality.

Unishox3 not working for ESP32

Hello. Where's the marisa.h file that's required in unishox3.h file? It's not there in Unishox3_Alpha directory. Thanks!

Questions about e.g. handling arbitrary bytestrings; and some feedback

Hi,

It's not clear to me that you can compress arbitrary byte-strings. Since in your Unicode-handling you allow over a million extra letters it seems it could handle the 128 possible stray bytes you can expect in an illegal UTF-8 string, but not without a breaking change.

Basically your method changes a char-stream (byte-stream) to a bitstring. So I have some questions and comments.

I assume that when you show compression ratios, and length in bytes, that you actually round up to the nearest byte. It's not clear to me how lengths of strings are stored (in memory), I'm guessing it's stored separately, not by some termination token?

Your PDF is well written, and you've certainly given compression a lot of thought (and unlike me implemented). I have my own idea for compression that I want to implement; for the Julia language, but was thinking maybe your (or others) are good enough, and fast, for short strings. My own idea would always compress by 50% (or fail and use a fallback, which could be your code) or more with an alternative idea, i.e. 6 letters into 2 bytes.

In your PDF you have a typo: "techhniques", and "For Set 3, whenever is switch is made" likely to "a switch is made"? And I would have Unicode uppercased consistently in the docs.