Comments (5)

PeteHaitch avatar PeteHaitch commented on September 20, 2024

+1 to this feature request. It would be interesting to see how it compares to the HDF5 format, a popular choice for on-disk storage of multidimensional arrays. FYI, the rhdf5 Bioconductor package is a low-level package for accessing the HDF5 API, while the HDF5Array Bioconductor package is a higher-level package that abstracts away the need to access the HDF5 API directly.

Back when fst was first released, I experimented with creating a fstarray package (an experiment I abandoned). I think an important feature is to allow efficient, arbitrary subsetting of these "fstarrays"; i.e. a version of read.fst() that allows reading in multiple and possibly non-contiguous chunks of a .fst file.
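To illustrate the "multiple and possibly non-contiguous chunks" idea, here is a minimal sketch (in Python, purely as language-neutral pseudologic) of how a set of requested row indices could be collapsed into contiguous runs, so that each run needs only one ranged read. The function name `coalesce_rows` is hypothetical and not part of fst; how each run would then be read from a .fst file is left open.

```python
def coalesce_rows(rows):
    """Collapse row indices into sorted, inclusive (start, end) runs.

    Hypothetical helper: each returned run could be served by a single
    contiguous read instead of one read per row.
    """
    runs = []
    for r in sorted(set(rows)):
        if runs and r == runs[-1][1] + 1:
            # r directly follows the previous run: extend it
            runs[-1] = (runs[-1][0], r)
        else:
            # gap before r: start a new run
            runs.append((r, r))
    return runs


print(coalesce_rows([11, 1, 2, 3, 10, 20]))
# [(1, 3), (10, 11), (20, 20)]
```

Six requested rows become three ranged reads here; the benefit grows with how clustered the requested indices are.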

from fst.

MarcusKlik avatar MarcusKlik commented on September 20, 2024

Hi @PeteHaitch, thanks for the feature request, and it's interesting to see that you experimented with a fstarray package to address the storage of matrices! Indeed, random access on structures with many columns is not very efficient for a columnar file format (like fst or feather). I've been thinking a bit and it seems to me that we can solve that problem by storing the data not in columns but in blocks. Let's say, for example, that we need to store a 100000 x 100000 matrix (with 10^10 elements in total). If we want to read a selection mat[y:(y+1000), x:(x+9999)] using the current format, we need to perform 10000 seek operations to read these 10000 columns (so 1 seek per column). That would be a large performance hit on an SSD and probably a disaster with conventional drives. However, if the data were stored in blocks of 4096 x 4096 elements each, the same subset could require as few as 3 seek operations (one per block touched)!
To make that happen, code is required that can subset a matrix given these blocks, so a matrix should be registered in the format as a separate type (separate from a table), but we can still use the same compression algorithms and serialization code that are used for tables.
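The seek arithmetic above can be checked with a small sketch. It assumes one seek per selected column in the columnar layout and one seek per touched block in the blocked layout, with each 4096 x 4096 block stored contiguously. These are simplifying assumptions, and the function names are hypothetical rather than part of fst. Indices are 1-based and inclusive, as in R.

```python
def columnar_seeks(col_start, col_end):
    """Columnar layout: one seek per selected column."""
    return col_end - col_start + 1


def blocks_touched(row_start, row_end, col_start, col_end, block=4096):
    """Blocked layout: number of block x block tiles the selection overlaps."""
    row_blocks = (row_end - 1) // block - (row_start - 1) // block + 1
    col_blocks = (col_end - 1) // block - (col_start - 1) // block + 1
    return row_blocks * col_blocks


# mat[y:(y+1000), x:(x+9999)] with y = x = 1 (aligned to the block grid)
print(columnar_seeks(1, 10000))             # 10000
print(blocks_touched(1, 1001, 1, 10000))    # 3

# Same selection size with y = x = 4000 (straddles block boundaries)
print(blocks_touched(4000, 5000, 4000, 13999))  # 8
```

Under these assumptions the count of 3 is the best case. A selection that straddles block boundaries touches up to 8 blocks, which is still three orders of magnitude fewer seeks than the columnar layout.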

Interesting idea to compare this feature to the HDF5 format once it is implemented, I will check out the packages that you suggest for that!

MarcusKlik avatar MarcusKlik commented on September 20, 2024

Hi @PeteHaitch, you say multidimensional arrays; would it be interesting to allow storage of 3-dimensional (or higher) arrays? For that, I would need to write data cubes or higher-dimensional data blobs.

PeteHaitch avatar PeteHaitch commented on September 20, 2024

Thanks for your considered reply, Marcus. My initial use case was for long matrices (nrow = 10^6 - 10^9 >> ncol = 10^2 - 10^3) and accessing all columns in row-wise chunks (e.g., rows 1 - 10000, 10001 - 20000, etc.). But I definitely see your point about the number of seek operations scaling with the number of columns.
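As a rough sketch of this access pattern (Python used only for illustration; the helper names are hypothetical), the row-wise chunks can be generated as 1-based inclusive ranges. If each chunk costs one seek per column in a columnar layout, the total number of seeks is the number of chunks times the number of columns.

```python
def row_chunks(nrow, chunk_size):
    """Yield 1-based inclusive (first_row, last_row) ranges covering nrow rows."""
    for start in range(1, nrow + 1, chunk_size):
        yield start, min(start + chunk_size - 1, nrow)


def total_columnar_seeks(nrow, ncol, chunk_size):
    """Assumes one seek per column for every chunk read (columnar layout)."""
    n_chunks = sum(1 for _ in row_chunks(nrow, chunk_size))
    return n_chunks * ncol


print(list(row_chunks(25000, 10000)))
# [(1, 10000), (10001, 20000), (20001, 25000)]

print(total_columnar_seeks(nrow=10**6, ncol=100, chunk_size=10000))
# 10000
```

For a long, narrow matrix the per-chunk seek count stays small (here 100), which is why this use case is less affected by the columnar layout than a wide matrix.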

The HDF5 format is quite mature, I believe, and a fair bit of effort has gone into chunking blocks of data (https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html).

I would expect that the vast majority of usage would be for 2-dimensional arrays. I have a use case for 3-dimensional arrays but my guess is that the usage of higher dimensional arrays would decay rapidly with increasing dimensionality.

MarcusKlik avatar MarcusKlik commented on September 20, 2024

Nicely put :-), and the 'seek argument' is explained nicely in the link you specified. For an increasing number of dimensions, the number of seeks increases rapidly again. As you say, it's probably best to start with a maximum of 3 dimensions then (and work from there later on if necessary). Also, for consistency, users should be able to append rows and columns to the stored matrix, and for '>3'-dimensional arrays, that will also be much more complicated. Thanks again for the info and feature request!
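The growth in seeks with dimensionality can be sketched for an unchunked, contiguous layout. The example below assumes a column-major array (first dimension varies fastest, as in R) and a selection of consecutive indices along each dimension, and counts how many separate contiguous runs must be read. The function name is hypothetical, and the model ignores caching and read-ahead.

```python
from math import prod


def count_contiguous_runs(shape, extent):
    """Contiguous runs (roughly, seeks) needed to read a hyper-rectangle.

    shape  -- full array dimensions, fastest-varying dimension first
    extent -- number of consecutive indices selected along each dimension
    """
    if len(shape) != len(extent):
        raise ValueError("shape and extent must have the same length")
    for k, (full, selected) in enumerate(zip(shape, extent)):
        if selected < full:
            # Dimension k is only partly selected, so a run cannot continue
            # past it: every index combination of the slower dimensions
            # starts a new run.
            return prod(extent[k + 1:])
    return 1  # the whole array is selected: a single contiguous read


for ndim in (2, 3, 4):
    shape = (1000,) * ndim
    extent = (100,) * ndim
    print(ndim, count_contiguous_runs(shape, extent))
# 2 100
# 3 10000
# 4 1000000
```

Each extra dimension multiplies the number of runs by the selection's extent along it, which is the effect that chunked storage is meant to limit.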
