Data integrity is pretty important in the organization I work for. Accountants ask

Feature request: data-integrity check by adding hashvalue about fst HOT 3 OPEN

fstpackage commented on June 19, 2024

Feature request: data-integrity check by adding hashvalue

from fst.

Comments (3)

MarcusKlik commented on June 19, 2024

Hi @petermuller71, thanks for the feature request! That would be very interesting indeed and would also add to the stability of the fst file format. For maximum performance, hashing could be done on each internal data block using a fast hashing algorithm like xxHash, which was created by Yann Collet (@Cyan4973), who also created the LZ4 and ZSTD compression algorithms used in fst.

There would be some overhead of course (in file size and performance), so the ability to calculate hashes should be optional I think, perhaps with a setting hash = TRUE or similar. When fst.read is used, the user can check hashes by specifying check_hashes = TRUE. Would that be an acceptable solution for your use case?

Interestingly, because the space occupied by the hash values in the fst format would be relatively small, there would be almost no penalty for reading an fst file with hashes when check_hashes = FALSE (the default). The only significant penalty would occur during a fst.write operation, and judging by the speeds claimed for the xxHash algorithm, that penalty would also be very small.


MarcusKlik commented on June 19, 2024

Thinking about it, storing a hash value for each data block would be ideal, because then we can also verify the hash for random-access (row and column) reads: only the blocks that a read actually touches need to be checked. That would fit the framework nicely :-)
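
To illustrate the per-block idea, here is a minimal sketch in Python. It is not fst's implementation: the function names are hypothetical, the 16 kB block size and 8-byte hash size are taken from the proposal later in this thread, and BLAKE2b truncated to 8 bytes stands in for XXH64 so that the example needs only the standard library.

```python
import hashlib

BLOCK_SIZE = 16 * 1024  # 16 kB data blocks (value taken from this thread)
HASH_SIZE = 8           # 8 bytes per block, the size of an XXH64 digest


def block_hash(block: bytes) -> bytes:
    # Stand-in for XXH64 (assumption): BLAKE2b truncated to 8 bytes.
    return hashlib.blake2b(block, digest_size=HASH_SIZE).digest()


def hash_blocks(data: bytes) -> list:
    """Write side: compute one hash per fixed-size block."""
    return [block_hash(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]


def read_range(data: bytes, hashes: list, start: int, stop: int,
               check_hashes: bool = False) -> bytes:
    """Read side: return data[start:stop], optionally verifying
    only the blocks that the requested range overlaps."""
    if check_hashes and stop > start:
        first = start // BLOCK_SIZE
        last = (stop - 1) // BLOCK_SIZE
        for b in range(first, last + 1):
            block = data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
            if block_hash(block) != hashes[b]:
                raise ValueError(f"hash mismatch in block {b}")
    return data[start:stop]
```

With check_hashes left at False the stored hashes are never touched, which is why reading a hashed file without verification costs essentially nothing extra. With check_hashes set to True, a partial read pays only for the blocks it overlaps, and corruption elsewhere in the data goes unnoticed until those blocks are read.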


MarcusKlik commented on June 19, 2024

Hi @petermuller71, I've prepared the format for storing hash values of data blocks. That means that we don't need a format change when the feature is added. The idea is that write_fst will get a parameter hash that can be used to make fst calculate hashes on (16 kB) data blocks (setting hash = TRUE). When reading data with read_fst, the hashes can be optionally used by specifying (again) hash = TRUE:

library(data.table)
library(fst)

dt <- data.table(X = 1:1000)

# write the table, hashing each data block with XXH64 from the xxHash library
# (hash is the proposed parameter described above)
write_fst(dt, "hashed.fst", compress = 50, hash = TRUE)

# read without checking hashes (slightly faster)
dt_read <- read_fst("hashed.fst")

# read and check hashes
dt_read <- read_fst("hashed.fst", hash = TRUE)

With that setup, even hashed tables can still be read without calculating hashes (for optimal performance), while the hashes remain available to check data integrity when additional safety is needed.
Storing hashes will require 8 bytes per 16 kB data block, so a hashed fst file will be only about 0.05 percent larger than an uncompressed, unhashed fst file.
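
The size figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is hypothetical, and the compression_ratio parameter assumes that blocks are 16 kB before compression, which this thread does not state explicitly.

```python
BLOCK_SIZE = 16 * 1024  # bytes per data block (16 kB)
HASH_SIZE = 8           # bytes per stored hash


def hash_overhead_percent(block_size: int = BLOCK_SIZE,
                          hash_size: int = HASH_SIZE,
                          compression_ratio: float = 1.0) -> float:
    """Relative file-size increase from storing one hash per block.

    compression_ratio is uncompressed size / compressed size
    (1.0 means uncompressed). Assumes block_size is measured
    before compression.
    """
    stored_block_size = block_size / compression_ratio
    return 100.0 * hash_size / stored_block_size


print(f"{hash_overhead_percent():.4f} %")  # 0.0488 %
```

Under that assumption the relative overhead grows with the compression ratio, for example to roughly 0.2 percent for a file compressed to a quarter of its size, which is still small.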

All meta information in the fst format is already hashed. If the hashing proves fast enough, we could also make hashing the data blocks the default option so that we don't need an additional parameter at all. In that case, every single byte in the fst file would be hashed and this would add tremendously to the stability of the package.

I plan to add block hashing in fst v0.9.0. I hope you can wait a little bit longer!
