Comments (5)
+1 to this feature request. It would be interesting to see how it compares to the HDF5 format, a popular choice for on-disk storage of multidimensional arrays. FYI, the rhdf5 Bioconductor package is a low-level package for accessing the HDF5 API, while the HDF5Array Bioconductor package is a higher-level package that abstracts away the need to access the HDF5 API directly.
Back when fst was first released, I experimented with creating a fstarray package (an experiment I abandoned). I think an important feature is to allow efficient, arbitrary subsetting of these "fstarrays", i.e. a version of read.fst() that allows reading multiple and possibly non-contiguous chunks of a .fst file.
from fst.
Hi @PeteHaitch, thanks for the feature request; it is interesting to see that you experimented with a fstarray package to address the storage of matrices! Indeed, random access on structures with many columns is not very efficient for a column-based file format (like fst or feather). I've been thinking a bit, and it seems to me that we can solve that problem by storing the data not in columns but in blocks. Let's say, for example, that we need to store a 100000 x 100000 matrix (with 1e10 elements in total). If we want to read a selection mat[y:(y+1000), x:(x+9999)] using the current format, we need to perform 10000 seek operations to read these 10000 columns (so 1 seek per column). That would be a large performance hit on an SSD and probably a disaster with conventional drives. However, if the data were stored in blocks of 4096 x 4096 elements each, the same subset would require as few as 3 seek operations (a few more if the selection straddles block boundaries)!
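The seek counts in this comparison can be checked with a small sketch. It is written in Python purely for illustration and is not fst code: the function names are hypothetical, and it assumes one seek per selected column in the columnar layout and one seek per touched block in the blocked layout, with 0-based offsets.

```python
def columnar_seeks(n_cols):
    """Seeks for a columnar layout: one per selected column (assumption)."""
    return n_cols


def blocks_touched(row_start, n_rows, col_start, n_cols, block=4096):
    """Number of block x block tiles that a rectangular selection overlaps.

    Offsets are 0-based. Each touched tile is assumed to cost one seek.
    """
    row_tiles = (row_start + n_rows - 1) // block - row_start // block + 1
    col_tiles = (col_start + n_cols - 1) // block - col_start // block + 1
    return row_tiles * col_tiles


# Selection of 1001 rows by 10000 columns, as in mat[y:(y+1000), x:(x+9999)].
print(columnar_seeks(10000))                    # 10000
print(blocks_touched(0, 1001, 0, 10000))        # 3 (aligned with the block grid)
print(blocks_touched(4000, 1001, 4000, 10000))  # 8 (straddles block boundaries)
```

Under these assumptions the blocked layout needs between 3 and 8 seeks for this selection, depending on how it lines up with the block grid, compared with 10000 for the columnar layout.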
To make that happen, code is required that can subset a matrix given these blocks, so a matrix should be registered in the format as a separate type (separate from a table), but we can still use the same compression algorithms and serialization code that are used for tables.
It is an interesting idea to compare this feature to the HDF5 format once it is implemented; I will check out the packages that you suggest for that!
Hi @PeteHaitch, you say multidimensional arrays; would it be interesting to allow storage of 3-dimensional (or higher) arrays? For that, I would need to write data cubes or higher-dimensional data blobs.
Thanks for your considered reply, Marcus. My initial use case was for long matrices (nrow = 10^6 - 10^9 >> ncol = 10^2 - 10^3) and accessing all columns in row-wise chunks (e.g., rows 1 - 10000, 10001 - 20000, etc.). But I definitely see your point about the number of seek operations scaling with the number of columns.
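The row-wise access pattern described here can be sketched as follows. This is an illustrative Python sketch rather than fst code: `row_chunks` and `columnar_scan_seeks` are hypothetical helpers, row ranges are 1-based and inclusive as in the example, and the seek estimate assumes one seek per column for every chunk read.

```python
def row_chunks(nrow, chunk_size):
    """Yield 1-based inclusive (first, last) row ranges covering rows 1..nrow."""
    for first in range(1, nrow + 1, chunk_size):
        yield first, min(first + chunk_size - 1, nrow)


def columnar_scan_seeks(nrow, ncol, chunk_size):
    """Total seeks to scan the whole matrix chunk by chunk,
    assuming one seek per column per chunk (an assumption, not a measurement)."""
    n_chunks = sum(1 for _ in row_chunks(nrow, chunk_size))
    return n_chunks * ncol


print(list(row_chunks(25000, 10000)))
# [(1, 10000), (10001, 20000), (20001, 25000)]

print(columnar_scan_seeks(10**6, 100, 10000))  # 100 chunks * 100 columns = 10000
```

For a long, narrow matrix the per-chunk cost stays at ncol seeks, which is why this use case is less affected than the wide-matrix selection discussed earlier.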
The HDF5 format is quite mature, I believe, and a fair bit of effort has gone into chunking blocks of data (https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html).
I would expect that the vast majority of usage would be for 2-dimensional arrays. I have a use case for 3-dimensional arrays, but my guess is that the usage of higher-dimensional arrays would decay rapidly with increasing dimensionality.
Nicely put :-), and the 'seek argument' is explained nicely in the link you specified. For an increasing number of dimensions, the number of seeks increases rapidly again. As you say, it's probably best to start with a maximum of 3 dimensions then (and work from there later on if necessary). Also, for consistency, users should be able to append rows and columns to the stored matrix, and for '>3'-dimensional arrays, that will also be much more complicated. Thanks again for the info and feature request!
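One way to read the remark about seeks growing with dimensionality is sketched below, in Python for illustration only. The helper `chunks_touched` is hypothetical, offsets are 0-based, and each touched chunk is assumed to cost one seek. A selection that straddles one chunk boundary in every dimension touches 2 chunks per dimension, so the count doubles with each added dimension.

```python
from math import prod


def chunks_touched(starts, lengths, chunk_shape):
    """Number of chunks that a hyper-rectangular selection overlaps.

    starts, lengths and chunk_shape hold one entry per dimension (0-based starts).
    """
    per_dim = [
        (s + n - 1) // c - s // c + 1
        for s, n, c in zip(starts, lengths, chunk_shape)
    ]
    return prod(per_dim)


# A 10-element-wide selection starting at offset 60, with chunks of edge 64,
# crosses exactly one chunk boundary in each dimension.
for ndim in (2, 3, 4):
    print(ndim, chunks_touched([60] * ndim, [10] * ndim, [64] * ndim))
# 2 4
# 3 8
# 4 16
```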