Git Product home page Git Product logo

fsst's People

Contributors

carlopi avatar joka921 avatar peterboncz avatar samansmink avatar seb711 avatar sfmqrb avatar ziyaza avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

fsst's Issues

Token parsing

Your parsing is reliying on new line and a terminator which is allmost always space for text data. However the Re-Pair algorithm has not such a requirement and can be applied to arbitrary data format (text or binary).
Is there any reason, why you're not simply parsing the data using the terminators (space and other punctuations symbols) as separators ?

Sanitizer reporting out-of-bounds

There seems the be an out-of-bounds error happening in https://github.com/cwida/fsst/blob/master/libfsst.cpp#L394. It triggers the sanitizer when compressing strings over 512 characters. In this case the unaligned load will read past the end of buf[512]. It still works fine, but does throw a sanitizer error in DuckDB CI.

For null-terminated strings I'm not sure I understand how this code would prevent out-of-bounds access to user provided string buffers?

For DuckDB we could work around it by simply increasing the buf size by 8 to stop the sanitizer from complaining. Because we don't use zero terminated strings, fsst would always copy here anyway and place the termination byte at buf[511].

FYI: Making java bindings for fsst

Dear @peterboncz, all,

Just wanted to say thank you for creating and documenting FSST so clearly. I am implementing some java bindings for FSST at FSST4j.

I had to reimplement the decoder in Java (easy enough once I finally understood how the fsst_export data structure is encoded, took me longer than it should have). This is because the fsst_decompress is not findable by jextract as it is an inline function.

Still lots of work to be done before this fsst4j is production ready. The reason I wanted to inform you about this work early is that I thought it was correct to have the automatic generated binding code to be placed into a java package nl.cwi.da.fsst to give proper credit. However, you might prefer that is done differently.

Regards,
Jerven

Symbol tables are non-deterministic

I'd appreciate if you could take a look at pull request #6.

I've noticed that symbol tables built from the exact same input data are not deterministic and I believe it is due to a bug in the makeSample function. I think the PR fixes this, but I'm not entirely sure it's the correct fix.

compression ratio

Hi Peter,

I am working on a columnar database project, and this paper/library is very interesting and I am evaluating how good the fsst compression ratio is comparing to lz4 and zstd.

Here is how I did it:
Test env: macOS 11.6, Intel MacBook Pro 2019

  1. I built fsst using the default CMake setup, and got the fsst binary.
  2. I installed zstd binary (zstd 1.5.0) and lz4 (lz4 1.9.3) using homebrew
  3. I tried several sample text files in the paper/dbtext folder (I tried city/email/credentials/faust), compressing them using both fsst and zstd without any additional parameter (zstd defaults to level 3 compression according to its documentation). I found fsst compressed files are around 10% larger than zstd compressed files for these sample files. This result is close to my expectation.
  4. Then I tried compressing some web server's access log files (a typical kind of data in my project), more specifically, this access log file (121MB, http://www.almhuette-raith.at/apache-log/access.log). For this particular data set, fsst achieves 38% compression ratio (46MB after compression), but zstd achieves 3% compression ratio (3.8MB after compression). And then I gave lz4 a try (using its command line without any parameter), and it can achieve 6% compression ratio (7.4MB after compression). The gap with zstd/lz4 for this data set is non trivial here and is not expected.

I wonder if you can shed some light on this evaluation:

  1. is the fsst compression ratio expected for this data set or am I doing it wrong here?
  2. if this is expected, what kind of data set is more well suited for fsst compression?

Thanks so much.

UPDATE:
I read the paper again, and it did mention json/xml text are around 2~2.5x worse than lz4. But here my test data is a big access log file, and it looks like there is 6x gap for lz4. What could be the type of data that is not working as well as expected?

Undefined Behaviour Sanitizer Errors

Hello,
we are currently working at integrating the FSST Compression into Hyrise.
Our CI Sanitizers complain about undefined-behaviour errors that occur in libfsst.hpp. See here:
image

If we set the following cmake compile options we could reproduce those errors also in the fsst repo:

add_compile_options(-fsanitize=undefined -fno-sanitize-recover=all)
add_link_options(-fsanitize=undefined -fno-sanitize-recover=all)

image

We were curious if this is supposed to be like this and we can safely ignore this or if this could cause some issues.

random access example

hope you can add some example code about random access.

I haven't saw any code snippet about random access and its usage

:)

makeSample buffer overflow in sampleLen

When playing around with the fsst library; i've discovered that I got a buffer overflow in the sampleLen-buffer. After further investigating the issue for that came to the conclusion that my data is a bit small (1290 lines) and the length of the individual strings is very varying (see txt in attachment).

Although the total length of the strings is greater than FSST_SAMPLETARGET; not all strings are added to the sampleBuf. This leads to shorter lines are added several times, which leads to more than (nlines + FSST_SAMPLEMAXSZ/FSST_SAMPLELINE) lines being added and results in a buffer overflow. My first approach to solving this was very dull (sorry for the PR (already removed it)). Therefore I have now; for a temporary fix, introduced an additional check that the sampleLen array does not overflow (https://github.com/seb711/fsst/tree/buffer-overflow).

My question would be whether a solution should be found for this at all, as this is probably a cornercase anyway.

physicians.txt

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.