cwida / fsst Goto Github PK

View Code? Open in Web Editor NEW

346.0 346.0 31.0 37.54 MB

Fast Static Symbol Table (FSST): efficient random-access string compression

License: MIT License

Makefile 0.58% C++ 61.03% C 35.43% Shell 1.48% R 0.38% TeX 0.79% CMake 0.30%

fsst's People

Contributors

Stargazers

Watchers

fsst's Issues

Token parsing

Your parsing is reliying on new line and a terminator which is allmost always space for text data. However the Re-Pair algorithm has not such a requirement and can be applied to arbitrary data format (text or binary).
Is there any reason, why you're not simply parsing the data using the terminators (space and other punctuations symbols) as separators ?

Sanitizer reporting out-of-bounds

There seems the be an out-of-bounds error happening in https://github.com/cwida/fsst/blob/master/libfsst.cpp#L394. It triggers the sanitizer when compressing strings over 512 characters. In this case the unaligned load will read past the end of buf[512]. It still works fine, but does throw a sanitizer error in DuckDB CI.

For null-terminated strings I'm not sure I understand how this code would prevent out-of-bounds access to user provided string buffers?

For DuckDB we could work around it by simply increasing the buf size by 8 to stop the sanitizer from complaining. Because we don't use zero terminated strings, fsst would always copy here anyway and place the termination byte at buf[511].

FYI: Making java bindings for fsst

Dear @peterboncz, all,

Just wanted to say thank you for creating and documenting FSST so clearly. I am implementing some java bindings for FSST at FSST4j.

I had to reimplement the decoder in Java (easy enough once I finally understood how the fsst_export data structure is encoded, took me longer than it should have). This is because the fsst_decompress is not findable by jextract as it is an inline function.

Still lots of work to be done before this fsst4j is production ready. The reason I wanted to inform you about this work early is that I thought it was correct to have the automatic generated binding code to be placed into a java package nl.cwi.da.fsst to give proper credit. However, you might prefer that is done differently.

Regards,
Jerven

Symbol tables are non-deterministic

I'd appreciate if you could take a look at pull request #6.

I've noticed that symbol tables built from the exact same input data are not deterministic and I believe it is due to a bug in the makeSample function. I think the PR fixes this, but I'm not entirely sure it's the correct fix.

reserved identifier violation

I would like to point out that an identifier like “_FSST_H_” does eventually not fit to the expected naming convention of the C++ language standard.
Would you like to adjust your selection for unique names?

compression ratio

Hi Peter,

I am working on a columnar database project, and this paper/library is very interesting and I am evaluating how good the fsst compression ratio is comparing to lz4 and zstd.

Here is how I did it:
Test env: macOS 11.6, Intel MacBook Pro 2019

I built fsst using the default CMake setup, and got the fsst binary.
I installed zstd binary (zstd 1.5.0) and lz4 (lz4 1.9.3) using homebrew
I tried several sample text files in the paper/dbtext folder (I tried city/email/credentials/faust), compressing them using both fsst and zstd without any additional parameter (zstd defaults to level 3 compression according to its documentation). I found fsst compressed files are around 10% larger than zstd compressed files for these sample files. This result is close to my expectation.
Then I tried compressing some web server's access log files (a typical kind of data in my project), more specifically, this access log file (121MB, http://www.almhuette-raith.at/apache-log/access.log). For this particular data set, fsst achieves 38% compression ratio (46MB after compression), but zstd achieves 3% compression ratio (3.8MB after compression). And then I gave lz4 a try (using its command line without any parameter), and it can achieve 6% compression ratio (7.4MB after compression). The gap with zstd/lz4 for this data set is non trivial here and is not expected.

I wonder if you can shed some light on this evaluation:

is the fsst compression ratio expected for this data set or am I doing it wrong here?
if this is expected, what kind of data set is more well suited for fsst compression?

Thanks so much.

UPDATE:
I read the paper again, and it did mention json/xml text are around 2~2.5x worse than lz4. But here my test data is a big access log file, and it looks like there is 6x gap for lz4. What could be the type of data that is not working as well as expected?

Undefined Behaviour Sanitizer Errors

Hello,
we are currently working at integrating the FSST Compression into Hyrise.
Our CI Sanitizers complain about undefined-behaviour errors that occur in libfsst.hpp. See here:

If we set the following cmake compile options we could reproduce those errors also in the fsst repo:

add_compile_options(-fsanitize=undefined -fno-sanitize-recover=all)
add_link_options(-fsanitize=undefined -fno-sanitize-recover=all)

We were curious if this is supposed to be like this and we can safely ignore this or if this could cause some issues.

In fsst, can the compressed string be used instead of the original string for comparison?

Such as title

random access example

hope you can add some example code about random access.

I haven't saw any code snippet about random access and its usage

makeSample buffer overflow in sampleLen

When playing around with the fsst library; i've discovered that I got a buffer overflow in the sampleLen-buffer. After further investigating the issue for that came to the conclusion that my data is a bit small (1290 lines) and the length of the individual strings is very varying (see txt in attachment).

Although the total length of the strings is greater than FSST_SAMPLETARGET; not all strings are added to the sampleBuf. This leads to shorter lines are added several times, which leads to more than (nlines + FSST_SAMPLEMAXSZ/FSST_SAMPLELINE) lines being added and results in a buffer overflow. My first approach to solving this was very dull (sorry for the PR (already removed it)). Therefore I have now; for a temporary fix, introduced an additional check that the sampleLen array does not overflow (https://github.com/seb711/fsst/tree/buffer-overflow).

My question would be whether a solution should be found for this at all, as this is probably a cornercase anyway.

physicians.txt

cwida / fsst Goto Github PK

fsst's People

Contributors

Stargazers

Watchers

Forkers

fsst's Issues

Token parsing

Sanitizer reporting out-of-bounds

FYI: Making java bindings for fsst

Symbol tables are non-deterministic

reserved identifier violation

compression ratio

Undefined Behaviour Sanitizer Errors

In fsst, can the compressed string be used instead of the original string for comparison?

random access example

makeSample buffer overflow in sampleLen

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent