cwida / fsst Goto Github PK
View Code? Open in Web Editor NEWFast Static Symbol Table (FSST): efficient random-access string compression
License: MIT License
Fast Static Symbol Table (FSST): efficient random-access string compression
License: MIT License
Your parsing is reliying on new line and a terminator which is allmost always space for text data. However the Re-Pair algorithm has not such a requirement and can be applied to arbitrary data format (text or binary).
Is there any reason, why you're not simply parsing the data using the terminators (space and other punctuations symbols) as separators ?
There seems the be an out-of-bounds error happening in https://github.com/cwida/fsst/blob/master/libfsst.cpp#L394. It triggers the sanitizer when compressing strings over 512 characters. In this case the unaligned load will read past the end of buf[512]. It still works fine, but does throw a sanitizer error in DuckDB CI.
For null-terminated strings I'm not sure I understand how this code would prevent out-of-bounds access to user provided string buffers?
For DuckDB we could work around it by simply increasing the buf size by 8 to stop the sanitizer from complaining. Because we don't use zero terminated strings, fsst would always copy here anyway and place the termination byte at buf[511].
Dear @peterboncz, all,
Just wanted to say thank you for creating and documenting FSST so clearly. I am implementing some java bindings for FSST at FSST4j.
I had to reimplement the decoder in Java (easy enough once I finally understood how the fsst_export data structure is encoded, took me longer than it should have). This is because the fsst_decompress is not findable by jextract as it is an inline function.
Still lots of work to be done before this fsst4j is production ready. The reason I wanted to inform you about this work early is that I thought it was correct to have the automatic generated binding code to be placed into a java package nl.cwi.da.fsst to give proper credit. However, you might prefer that is done differently.
Regards,
Jerven
I'd appreciate if you could take a look at pull request #6.
I've noticed that symbol tables built from the exact same input data are not deterministic and I believe it is due to a bug in the makeSample
function. I think the PR fixes this, but I'm not entirely sure it's the correct fix.
I would like to point out that an identifier like โ_FSST_H_
โ does eventually not fit to the expected naming convention of the C++ language standard.
Would you like to adjust your selection for unique names?
Hi Peter,
I am working on a columnar database project, and this paper/library is very interesting and I am evaluating how good the fsst compression ratio is comparing to lz4 and zstd.
Here is how I did it:
Test env: macOS 11.6, Intel MacBook Pro 2019
fsst
binary.zstd
binary (zstd 1.5.0) and lz4
(lz4 1.9.3) using homebrewpaper/dbtext
folder (I tried city/email/credentials/faust), compressing them using both fsst
and zstd
without any additional parameter (zstd defaults to level 3 compression according to its documentation). I found fsst compressed files are around 10% larger than zstd compressed files for these sample files. This result is close to my expectation.lz4
a try (using its command line without any parameter), and it can achieve 6% compression ratio (7.4MB after compression). The gap with zstd/lz4 for this data set is non trivial here and is not expected.I wonder if you can shed some light on this evaluation:
Thanks so much.
UPDATE:
I read the paper again, and it did mention json/xml text are around 2~2.5x worse than lz4. But here my test data is a big access log file, and it looks like there is 6x gap for lz4. What could be the type of data that is not working as well as expected?
Hello,
we are currently working at integrating the FSST Compression into Hyrise.
Our CI Sanitizers complain about undefined-behaviour errors that occur in libfsst.hpp. See here:
If we set the following cmake compile options we could reproduce those errors also in the fsst repo:
add_compile_options(-fsanitize=undefined -fno-sanitize-recover=all)
add_link_options(-fsanitize=undefined -fno-sanitize-recover=all)
We were curious if this is supposed to be like this and we can safely ignore this or if this could cause some issues.
Such as title
hope you can add some example code about random access.
I haven't saw any code snippet about random access and its usage
:)
When playing around with the fsst library; i've discovered that I got a buffer overflow in the sampleLen-buffer. After further investigating the issue for that came to the conclusion that my data is a bit small (1290 lines) and the length of the individual strings is very varying (see txt in attachment).
Although the total length of the strings is greater than FSST_SAMPLETARGET; not all strings are added to the sampleBuf. This leads to shorter lines are added several times, which leads to more than (nlines + FSST_SAMPLEMAXSZ/FSST_SAMPLELINE) lines being added and results in a buffer overflow. My first approach to solving this was very dull (sorry for the PR (already removed it)). Therefore I have now; for a temporary fix, introduced an additional check that the sampleLen array does not overflow (https://github.com/seb711/fsst/tree/buffer-overflow).
My question would be whether a solution should be found for this at all, as this is probably a cornercase anyway.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.