
Comments (3)

MarcusKlik avatar MarcusKlik commented on June 19, 2024

Hi @petermuller71, thanks for the feature request! That would be very interesting indeed, and it would also add to the stability of the fst file format. For maximum performance, hashing could be done on each internal data block using a fast hashing algorithm like xxHash, which was created by Yann Collet (@Cyan4973), who also created the LZ4 and ZSTD compression algorithms used in fst.

There would be some overhead of course (in file size and performance), so the ability to calculate hashes should be optional I think, perhaps with a setting hash = TRUE or similar. When fst.read is used, the user can check hashes by specifying check_hashes = TRUE. Would that be an acceptable solution for your use case?

Interestingly, because the space occupied by the hash values in the fst format would be relatively small, there would be almost no penalty for reading an fst file with hashes when check_hashes = FALSE (the default). The only significant penalty would occur during an fst.write operation, and judging by the speeds claimed for the xxHash algorithm, that penalty would also be very small.

from fst.

MarcusKlik avatar MarcusKlik commented on June 19, 2024

Thinking about it, storing a hash value for each data block would be ideal, because then we can also verify the hash for random-access (row and column) reads. That would fit the framework nicely :-)
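To illustrate why per-block hashes suit random access, here is a minimal sketch (in Python, not the fst implementation). It assumes 16 kB blocks and uses BLAKE2b truncated to 8 bytes as a stand-in for a 64-bit xxHash; the names `block_hash`, `hash_blocks` and `read_range` are hypothetical. The point is that a partial read only needs to verify the blocks it touches, and that verification can be skipped entirely.

```python
import hashlib

BLOCK_SIZE = 16 * 1024  # assumed block size of 16 kB (16384 bytes)


def block_hash(block: bytes) -> bytes:
    # Stand-in 64-bit hash: BLAKE2b truncated to 8 bytes (NOT xxHash).
    return hashlib.blake2b(block, digest_size=8).digest()


def hash_blocks(data: bytes) -> list:
    """Return one 8-byte hash per BLOCK_SIZE chunk of data."""
    return [block_hash(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]


def read_range(data: bytes, hashes: list, start: int, stop: int,
               check_hashes: bool = False) -> bytes:
    """Return data[start:stop]; optionally verify only the blocks it touches."""
    if check_hashes and stop > start:
        first = start // BLOCK_SIZE
        last = (stop - 1) // BLOCK_SIZE
        for b in range(first, last + 1):
            block = data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
            if block_hash(block) != hashes[b]:
                raise ValueError(f"hash mismatch in block {b}")
    return data[start:stop]
```

With this layout, corruption in one block does not prevent verified reads from other blocks, and a read with `check_hashes=False` never computes a hash at all.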


MarcusKlik avatar MarcusKlik commented on June 19, 2024

Hi @petermuller71, I've prepared the format for storing hash values of data blocks. That means that we don't need a format change when the feature is added. The idea is that write_fst will get a parameter hash that can be used to make fst calculate hashes on (16 kB) data blocks (setting hash = TRUE). When reading data with read_fst, the hashes can be optionally used by specifying (again) hash = TRUE:

library(data.table)
library(fst)

dt <- data.table(X = 1:1000)

# write table using the XXH64 from the xxhash library on data blocks
write_fst(dt, "hashed.fst", compress = 50, hash = TRUE)

# read without checking hashes (slightly faster)
dt_read <- read_fst("hashed.fst")

# read and check hashes
dt_read <- read_fst("hashed.fst", hash = TRUE)

With that setup, even hashed tables can still be read without calculating hashes (for optimal performance), while the hashes remain available for checking data integrity when additional safety is wanted.
Storing hashes will require 8 bytes per 16 kB data block, so a hashed fst file will be only about 0.05 percent larger than an uncompressed, unhashed fst file.
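The size estimate can be checked with a few lines of arithmetic. This is a sketch in Python that assumes "16 kB" means 16384 bytes and that every block, including a partial last block, stores one 8-byte hash; the function names are hypothetical.

```python
HASH_BYTES = 8            # one 64-bit hash per block
BLOCK_BYTES = 16 * 1024   # assumed: 16 kB = 16384 bytes


def hash_overhead_percent(hash_bytes: int = HASH_BYTES,
                          block_bytes: int = BLOCK_BYTES) -> float:
    """Hash storage as a percentage of uncompressed block size."""
    return 100.0 * hash_bytes / block_bytes


def total_hash_bytes(data_bytes: int) -> int:
    """Bytes needed to store hashes for data_bytes of uncompressed data."""
    n_blocks = -(-data_bytes // BLOCK_BYTES)  # ceiling division
    return n_blocks * HASH_BYTES


print(hash_overhead_percent())     # 0.048828125, i.e. about 0.05 percent
print(total_hash_bytes(2 ** 30))   # 524288 bytes (512 KiB) for 1 GiB of data
```

For compressed files the relative overhead is larger, because the hashes stay 8 bytes per block while the stored blocks shrink.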

All meta information in the fst format is already hashed. If the hashing proves fast enough, we could also make hashing the data blocks the default option so that we don't need an additional parameter at all. In that case, every single byte in the fst file would be hashed and this would add tremendously to the stability of the package.

I plan to add block hashing in fst v0.9.0. I hope you can wait a little bit longer!

