antirez / smaz

Small strings compression library

License: BSD 3-Clause "New" or "Revised" License

C 100.00%

smaz's Issues

Win32 Issues.

Hello, I modified the source slightly to remove the random generation, preset strings, and output data, and to accept input parameters.

The output prints argv[1], which is equivalent to a strings[j].
Then the hex output of the decompressed data.
Then the hex output of the compressed data.

Example:
quick brown fox jumps over the lazy dog.
0x717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f672e
0xfe712683fe6b5e734129dcfa

However, changing the input slightly causes this output:
The quick brown fox jumps over the lazy dog.
0x54686520717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f672e
0x48

Aiding Smaz in further compressing repeating characters

Ciao Salvatore,

I'm crossposting this here as I think it's better suited because you're the creator of this project.

Smaz is wonderful because it can compress short strings (< 100 bytes) where other compression tools fail. But it has one problem: it doesn't optimize runs of repeating characters by itself.

For example, the string "this is a short string" compresses fine:

\x9b8\xac>\xbb\xf2>\xc3F

It is 9 bytes long. But a short string with repeating characters is a problem. For example, the string "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's" compresses into this:

\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\xfe'\n

It is still smaller, but the many "\x04"s look like a waste of space.

I've been thinking about counting each run of a repeated letter and replacing it with a sort of "bookmark". For example, "aaaaaaaaaa" with ten "a" occurrences becomes "a//10".

This is a test Python snippet I wrote off the top of my head; it is very ugly as of now:

import re

b = "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's"

# Replace every run of three or more identical characters
# with "<char>//<run length>", e.g. "aaaa" -> "a//4".
c = re.sub(r"(.)\1{2,}", lambda m: m.group(1) + "//" + str(len(m.group(0))), b)

print(c)

It then becomes

this is a string with many a//22's

Smaz compressed

\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\xc5\xc5\xff\x0222'\n

My worry is: what if the string contains a URL? Is it safe to use "//" as the marker? And then there are regex strings. How can the marker be escaped in that case?

Finally, my clear and concise question is: How do you safely shorten repeating characters that Smaz doesn't compress by itself?
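
One way to approach the escaping worry, as a minimal sketch rather than anything Smaz itself provides: reserve a single marker character, write every literal occurrence of it doubled, and never run-encode the marker itself. The names `MARK`, `rle_encode`, and `rle_decode` are hypothetical, and the choice of "~" as the marker and of the `~<char><count>;` layout are assumptions made only for this illustration.

```python
MARK = "~"  # hypothetical escape character, not part of Smaz


def rle_encode(text, min_run=4):
    """Replace runs of min_run or more identical characters with MARK + char + count + ';'."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        j = i
        while j < len(text) and text[j] == ch:
            j += 1
        run = j - i
        if ch == MARK:
            # Literal markers are always doubled and never run-encoded,
            # so MARK followed by MARK is unambiguous when decoding.
            out.append(MARK * 2 * run)
        elif run >= min_run:
            out.append(f"{MARK}{ch}{run};")
        else:
            out.append(ch * run)
        i = j
    return "".join(out)


def rle_decode(data):
    """Invert rle_encode."""
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch != MARK:
            out.append(ch)
            i += 1
        elif data[i + 1] == MARK:
            out.append(MARK)  # doubled marker is one literal marker
            i += 2
        else:
            # The repeated character is data[i + 1]; the count runs up to the next ';'.
            end = data.index(";", i + 2)
            out.append(data[i + 1] * int(data[i + 2:end]))
            i = end + 1
    return "".join(out)


sample = "this is a string with many " + "a" * 22 + "'s"
packed = rle_encode(sample)
print(packed)
print(rle_decode(packed) == sample)
```

Because the character right after the marker is always taken literally, a URL containing "//" or a regex containing ";" or digits still round-trips. Whether the marked-up text then compresses better under Smaz depends on whether the marker and the digits are cheap in its codebook, which this sketch does not measure.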

Starting with space

When the text to be compressed starts with a space, the result is an empty file. How can this be solved?

how to use it

How can I use this from other languages? (I need a PHP function to compress strings.)

Any special case with 0% compression rate?

Hello,
Is there any special case in which this library cannot compress the text, so that the compression rate would be 0%?

Text is cut off when compressing

Hello, when compressing a tweet with the algorithm, the compression is cut off at certain points. For example, the combination " M" is detected as the end of the string and nothing after it is compressed. If the "M" is changed to "m", the algorithm keeps working.
This also happens with the symbol sequence " C".

Tweet string: "@ChidubemLatest More to the point. Revenge porn is illegal in California, (senate bill 255). By posting the indecen… https://t.co/LTtwbW75Po"

Regards

Ruby script

Hello, can you release the Ruby script used to build new specialized dictionaries?
Even if it's for reference only.
Thanks!

"warning: implicit declaration of function ‘random’" message when compiling

When I compile, I get the following warning:

$ make clean ; make
rm -rf smaz_test
gcc -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
smaz_test.c: In function ‘main’:
smaz_test.c:58:18: warning: implicit declaration of function ‘random’ [-Wimplicit-function-declaration]
         ranlen = random() % 512;

The program compiles and the tests pass, so it looks like this warning is non-critical.

From looking around, I found an SO question/answer that offers a solution.

Adding the -D_XOPEN_SOURCE=600 option to gcc in the Makefile fixes the issue for me:

$ git diff
diff --git a/Makefile b/Makefile
index 62e8ccb..eecbac7 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 all: smaz_test
 
 smaz_test: smaz_test.c smaz.c
-       gcc -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
+       gcc -D_XOPEN_SOURCE=600 -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
 
 clean:
        rm -rf smaz_test

My system information:

$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

codebook with the most frequent ngrams in language/s

I know this guy..;) (from Redis)
Did you hand-pick the codebook dictionary? How?
Have you thought about using the most frequent n-grams in the language(s)?
E.g. the top (say 32) n-grams from Norvig's ngrams2,3,4,5,6,7,8,9.csv?
How do you optimally pick them for minimum overlap and better compression rates? I.e.
"tion" and "ation" are the most common 4- and 5-letter n-grams respectively, and "tio" is the 6th most common 3-letter n-gram.
I think you'd get much higher compression rates.

I want to test it, but couldn't find any docs.
So what are these characters?

static char *Smaz_cb[241] = {
"\002s,\266", "\003had\232\002leW", "\003on \216", "", "\001yS",
"\002ma\255\002li\227", "\003or \260", "", "\002ll\230\003s t\277",

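A hedged reading of those strings, based on how the lookup loop in smaz.c appears to walk a slot (treat the layout as an assumption, not documented behaviour): each array element looks like one hash bucket holding zero or more entries back to back, and each entry is a length byte, then that many n-gram bytes, then one byte giving the code emitted for that n-gram. The helper `parse_slot` below is hypothetical and only decodes the slots quoted above.

```python
def parse_slot(slot: bytes):
    """Split one codebook slot into (ngram, code) pairs.

    Assumed layout per entry: <length byte><ngram bytes><code byte>.
    """
    entries = []
    i = 0
    while i < len(slot):
        n = slot[i]                       # length prefix
        ngram = slot[i + 1:i + 1 + n]     # the n-gram itself
        code = slot[i + 1 + n]            # code byte that follows the n-gram
        entries.append((ngram.decode("latin-1"), code))
        i += n + 2
    return entries


# The slots quoted in the issue, written as Python bytes literals.
SLOTS = [
    b"\002s,\266", b"\003had\232\002leW", b"\003on \216", b"", b"\001yS",
    b"\002ma\255\002li\227", b"\003or \260", b"", b"\002ll\230\003s t\277",
]

for slot in SLOTS:
    print(parse_slot(slot))
```

Under that reading, "\003had\232\002leW" holds two entries: "had" with code 154 (octal 232) and "le" with code 87 (the byte value of "W"). Empty strings would be buckets with no entries.
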
GitHub repository description claims this is an “encryption library”

The GitHub repository description is…

Small strings encryption library

…when, going by the README, smaz is a compression library, not an encryption library.
