antirez / smaz
Small strings compression library
License: BSD 3-Clause "New" or "Revised" License
Hello, I modified the source slightly: I removed the random generation, the preset strings, and the existing output, and made the program accept its input as a command-line parameter.
The program first prints argv[1], which takes the place of strings[j].
Then it prints the hex output of the decompressed data.
Then it prints the hex output of the compressed data.
Example:
quick brown fox jumps over the lazy dog.
0x717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f672e
0xfe712683fe6b5e734129dcfa
However, changing the input slightly produces this output:
The quick brown fox jumps over the lazy dog.
0x54686520717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f672e
0x48
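A possible explanation, offered as an assumption rather than a confirmed diagnosis: smaz output is binary and may contain zero bytes, so printing it with code that stops at the first NUL (for example a strlen()-based loop) would show only the bytes before it. The Python sketch below only illustrates that truncation effect. Both helper names are hypothetical, and apart from the leading 0x48 taken from the report above, the sample bytes are made up and are not real smaz output.

```python
def hex_until_nul(buf: bytes) -> str:
    """Hex-dump buf the way a strlen()-style loop would: stop at the first zero byte."""
    end = buf.find(b"\x00")
    if end == -1:
        end = len(buf)
    return "0x" + buf[:end].hex()


def hex_with_length(buf: bytes, length: int) -> str:
    """Hex-dump exactly `length` bytes, zero bytes included."""
    return "0x" + buf[:length].hex()


# Hypothetical compressed buffer: 0x48 followed by a zero byte and more data.
sample = bytes([0x48, 0x00, 0xFE, 0x71])

truncated = hex_until_nul(sample)             # only the bytes before the zero byte
full = hex_with_length(sample, len(sample))   # every byte of the buffer
```

If this is what happens in the modified program, printing the compressed buffer using the length returned by smaz_compress() instead of a NUL-terminated string length should show the complete output.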
Ciao Salvatore,
I'm crossposting this here as I think it's better suited because you're the creator of this project.
Smaz is wonderful, as it is able to compress a short string (< 100 bytes) where other compression tools fail. But it has a problem: it does not optimize runs of repeated characters by itself.
For example, the string "this is a short string" compresses fine:
\x9b8\xac>\xbb\xf2>\xc3F
It is 9 bytes long. But if you have a short string with repeating characters, you have a problem. For example, the string "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's" compresses into this:
\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\xfe'\n
It is still smaller, but the many "\x04"s look like a waste of space.
I've been thinking about counting how many times a letter repeats in a row and replacing the run with a sort of "bookmark". For example, "aaaaaaaaaa" (ten "a"s) becomes "a//10".
This is a test Python snippet I wrote off the top of my head; it is very ugly as of now:
import re

b = "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's"
# Replace each run of three or more identical characters with "<char>//<run length>".
c = re.sub(r"(.)\1{2,}", lambda m: m.group(1) + "//" + str(len(m.group(0))), b)
print(c)
It then becomes
this is a string with many a//22's
Smaz compressed
\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\xc5\xc5\xff\x0222'\n
My worry is: what if the string contains a URL? Is it safe to use "//" as the marker? And then there are regex strings. How can the marker be escaped in those cases?
Finally, my clear and concise question is: how do you safely shorten repeating characters that Smaz doesn't compress by itself?
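One common way to avoid collisions with markers such as "//" is a run-length pre-pass that reserves a single escape byte and always escapes that byte when it appears literally. The sketch below is an illustration under stated assumptions, not part of smaz: the escape value 0x01, the minimum run length of 4, and the function names are all hypothetical choices. Whether smaz compresses the resulting bytes well has not been checked here.

```python
ESC = 0x01      # assumed escape byte; any value works because literals are escaped too
MIN_RUN = 4     # shorter runs are copied as-is (a 3-byte token would not save space)
MAX_RUN = 255   # the run length must fit in one byte


def rle_encode(data: bytes) -> bytes:
    """Replace long runs with the 3-byte token ESC, byte, count."""
    out = bytearray()
    i = 0
    while i < len(data):
        ch = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == ch and run < MAX_RUN:
            run += 1
        if run >= MIN_RUN or ch == ESC:
            # A literal ESC is always written as a token, so decoding is unambiguous.
            out += bytes((ESC, ch, run))
        else:
            out += bytes([ch]) * run
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    """Inverse of rle_encode; assumes well-formed input."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:
            ch, run = data[i + 1], data[i + 2]
            out += bytes([ch]) * run
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

With this scheme a URL containing "//" passes through unchanged, because only the escape byte has special meaning, and the 22-character run of "a" shrinks to three bytes before smaz sees it.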
When the text to be compressed starts with a space, the result is an empty file. How can this be solved?
How can I use this from other languages? (I need a PHP function to compress strings.)
Hello,
Is there any special case in which this library cannot compress the text, so that the compression rate would be 0%?
Hello, when compressing a tweet with the algorithm, the compression is cut off at certain points. For example, the combination " M" is detected as the end of the string and nothing after it is compressed. If the "M" is changed to "m", the algorithm keeps working.
This also happens with the character sequence " C".
Tweet string: "@ChidubemLatest More to the point. Revenge porn is illegal in California, (senate bill 255). By posting the indecen… https://t.co/LTtwbW75Po"
Regards
Hello, can you release the Ruby script used to build new specialized dictionaries?
Even if it's for reference only.
Thanks!
When I compile, I get the following warning:
$ make clean ; make
rm -rf smaz_test
gcc -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
smaz_test.c: In function ‘main’:
smaz_test.c:58:18: warning: implicit declaration of function ‘random’ [-Wimplicit-function-declaration]
ranlen = random() % 512;
The program compiles and the tests pass, so it looks like this warning is non-critical.
From looking around, I found an SO question/answer that offers a solution.
Adding the -D_XOPEN_SOURCE=600 option to gcc in the Makefile fixes the issue for me:
$ git diff
diff --git a/Makefile b/Makefile
index 62e8ccb..eecbac7 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
all: smaz_test
smaz_test: smaz_test.c smaz.c
- gcc -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
+ gcc -D_XOPEN_SOURCE=600 -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
clean:
rm -rf smaz_test
My system information:
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I know this guy..;) (from Redis)
Did you hand-pick the codebook dictionary? How?
Have you thought about using the most frequent n-grams in the target language(s)?
E.g. the top (e.g. 32) n-grams from Norvig's ngrams2,3,4,5,6,7,8,9.csv?
How do you pick them optimally for minimum overlap and better compression rates? For instance, ation and tion are the most common 5- and 4-letter n-grams respectively, and tio is the 6th most common 3-letter n-gram.
I think you'd get much better compression rates.
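To make the overlap question concrete, here is a minimal greedy sketch in Python. It is not how the smaz codebook was built (the thread does not say); the function name, the scoring rule (occurrences times bytes saved if each occurrence became one code byte), and the separator trick are all assumptions chosen for illustration.

```python
from collections import Counter

SEP = "\x00"  # assumed marker for text already covered by a chosen entry


def greedy_codebook(corpus, size, min_n=2, max_n=5):
    """Pick up to `size` n-grams greedily, re-counting after each pick."""
    texts = list(corpus)
    book = []
    for _ in range(size):
        counts = Counter()
        for text in texts:
            for n in range(min_n, max_n + 1):
                for i in range(len(text) - n + 1):
                    gram = text[i:i + n]
                    if SEP not in gram:
                        counts[gram] += 1
        if not counts:
            break
        # Score: occurrences * (length - 1); ties are broken alphabetically.
        best = min(counts, key=lambda g: (-counts[g] * (len(g) - 1), g))
        book.append(best)
        # Blank out covered text so overlapping n-grams (e.g. "tio" inside
        # "tion") stop being counted in later rounds.
        texts = [text.replace(best, SEP) for text in texts]
    return book
```

Re-counting after every pick is what keeps "tion" and "tio" from both being selected for the same occurrences; a plain top-N frequency list would not account for that.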
I want to test it, but couldn't find any docs.
So what are these characters?
static char *Smaz_cb[241] = {
"\002s,\266", "\003had\232\002leW", "\003on \216", "", "\001yS",
"\002ma\255\002li\227", "\003or \260", "", "\002ll\230\003s t\277",
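One reading of these strings, stated as an assumption based on how the compression loop in smaz.c appears to walk each slot: every bucket is a sequence of entries, each made of a length byte, that many text bytes, and one code byte. Under that reading the odd characters are octal escapes for the length and code bytes. The Python helper below (a hypothetical name, not part of smaz) decodes a bucket on that assumption.

```python
def parse_bucket(bucket: bytes):
    """Split one codebook bucket into (text, code) pairs.

    Assumed layout per entry: <length byte> <text bytes> <code byte>.
    """
    entries = []
    i = 0
    while i < len(bucket):
        n = bucket[i]
        if n == 0:  # a zero length is treated as the end of the bucket
            break
        text = bucket[i + 1:i + 1 + n].decode("latin-1")
        code = bucket[i + 1 + n]
        entries.append((text, code))
        i += n + 2
    return entries


# The second bucket shown above, written with the same octal escapes.
second_bucket = parse_bucket(b"\003had\232\002leW")
```

Read this way, "\003had\232\002leW" holds two entries: "had" with code 154 (octal 232) and "le" with code 87 (the character W).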
The GitHub repository description is…
Small strings encryption library
…when, going by the README, smaz is a compression library, not an encryption library.