privefl / bigstatsr
R package for statistical tools with big matrices stored on disk.
Home Page: https://privefl.github.io/bigstatsr/
Example:
library(bigstatsr)
X <- FBM(10, 10)
test <- big_apply(X, a.FUN = function(X, ind) {
colSums(X[, ind, drop = FALSE])
}, block.size = 2)
The example above returns an error.
In foreach, you can give no value to the .combine parameter, and it will return a list by default.
In bigstatsr, for e.g. big_apply, as the computation is done by blocks, the parameter a.combine isn't intended to be left blank.
If you really want to do this, you can wrap the results of a.FUN with list() or as.list(), depending on what you want and the class of objects you return, and use a.combine = 'c'.
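A minimal sketch of the workaround described above, reusing the 10 x 10 example. The init = 1 value is an assumption added only to make the result easy to inspect:

```r
library(bigstatsr)

X <- FBM(10, 10, init = 1)

# Wrap each block's result in list() so that c() concatenates
# one list element per block instead of flattening the vectors.
res <- big_apply(X, a.FUN = function(X, ind) {
  list(colSums(X[, ind, drop = FALSE]))
}, a.combine = 'c', block.size = 2)

str(res)  # a list with one numeric vector per block of columns
```

If a single flat vector is wanted instead, returning the colSums() result directly with a.combine = 'c' should be enough.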
library(bigstatsr)
set.seed(1)
# simulating some data
N <- 230
M <- 730
X <- FBM(N, M, init = rnorm(N * M, sd = 5))
y <- rowSums(X[, 1:10]) + rnorm(N)
covar <- matrix(rnorm(N * 3), N)
ind.train <- sort(sample(nrow(X), 150))
ind.test <- setdiff(rows_along(X), ind.train)
# fitting model for multiple lambdas and alphas
test <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
covar.train = covar[ind.train, ],
alphas = c(1, 0.5, 0.1, 0.01))
# peek at the models
plot(test)
summary(test)
str(attributes(test)) # OK
attributes(test[1]) # NULL
Prevent the HTML docs from being counted as lines of code, which makes this package appear to be mainly HTML.
Functions big_apply and big_parallelize can't find functions that I have defined. For instance,
my.sum <- function(M) {
colSums(M)
}
mat.fb <- big_attachExtdata()
big_parallelize(mat.fb, p.FUN = function(X, ind) my.sum(X[ , ind]), ncores=2)
results in the error message Error in { : task 1 failed - "could not find function "my.sum"". This can be worked around by defining my.sum in the anonymous function for p.FUN, but since the functions I want to do this for may be rather more complicated and used in other settings, it seems this should be unnecessary.
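One possible alternative to redefining the helper inside p.FUN is to hand it to the workers as an extra argument, since the extra arguments of big_parallelize are forwarded to p.FUN. This is only a sketch under that assumption; the p.combine = 'c' choice is also an assumption about the desired output:

```r
library(bigstatsr)

my.sum <- function(M) {
  colSums(M)
}

mat.fb <- big_attachExtdata()

# Pass the helper explicitly: it is serialized and sent to each worker
# together with the other extra arguments.
res <- big_parallelize(mat.fb, p.FUN = function(X, ind, my.sum) {
  my.sum(X[, ind, drop = FALSE])
}, p.combine = 'c', ncores = 2, my.sum = my.sum)
```

This keeps my.sum defined once at the top level, so it can still be used in other settings.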
For example, see this SO thread.
For now, FBM(10, 10, type = "integer", init = 1) gives a warning because 1 is in fact a double.
Using FBM(10, 10, type = "integer", init = 1L) won't give a warning.
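The distinction comes from base R itself, where a bare numeric literal is a double and the L suffix is needed to get an integer. A short illustration (the variable names are only for this example):

```r
x_dbl <- 1   # numeric literal without suffix
x_int <- 1L  # the L suffix requests an integer

typeof(x_dbl)  # "double"
typeof(x_int)  # "integer"

# as.integer() is another way to get an integer initial value.
identical(as.integer(x_dbl), x_int)  # TRUE
```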
Find if there is a fast near-optimal rule approximation for computing multivariate linear/logistic regression on biobank-scale datasets in a few hours (or minutes).
Any specific reasons? Or just a matter of choice?
Reprex:
iris <- datasets::iris
iris$Species <- as.character(iris$Species)
iris <- iris[rep(1:150, 100), rep(1:5, 100)]
bigstatsr::as_FBM(iris)
For example, for pre-filtering.
bigstatsr is super convenient if a large matrix is already on file, but this is often not the case for me. I typically want to use bigstatsr after I've created a larger-than-memory matrix in R, which is usually in the form of a very sparse matrix (>95% sparsity). Writing to disk is a bottleneck because I have to iterate through the sparse matrix by row-chunks (using lapply or equivalent), convert each chunk to a non-sparse matrix, and write it to file using append=TRUE. A faster method for doing this would be really helpful. Thanks!
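As a rough sketch of one way to skip the intermediate text file, blocks of columns can be densified and assigned directly into an FBM. This assumes the sparse matrix comes from the Matrix package; the toy dimensions and block.size are hypothetical and would need tuning to the available memory:

```r
library(bigstatsr)
library(Matrix)

set.seed(1)
sp <- rsparsematrix(1000, 200, density = 0.05)  # toy sparse matrix

# Zero-initialised FBM on disk with the same dimensions.
X <- FBM(nrow(sp), ncol(sp), init = 0)

# Densify and copy one block of columns at a time, so that only
# one dense block is held in memory at any point.
block.size <- 50
blocks <- split(seq_len(ncol(sp)), ceiling(seq_len(ncol(sp)) / block.size))
for (ind in blocks) {
  X[, ind] <- as.matrix(sp[, ind, drop = FALSE])
}
```

Column blocks are used here because both the FBM and the compressed sparse column format are stored column by column; the densification step itself is not avoided.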
With the possibility to be 0.
Hello,
You have been kind and answered my beginner questions on Stack Overflow regarding foreach and parallelisation; my alias there is Newbie80. I have been reading your intro to foreach and now I am trying to install bigstatsr with the following
devtools::install_github("privefl/bigstatsr")
However I get the following message when I try to install, do you have any idea why?
Regards
Dan
devtools::install_github("privefl/bigstatsr")
Downloading GitHub repo privefl/bigstatsr@master
from URL https://api.github.com/repos/privefl/bigstatsr/zipball/master
trying URL 'https://cran.rstudio.com/bin/windows/Rtools/Rtools34.exe'
Content type 'application/x-msdos-program' length 108085090 bytes (103.1 MB)
downloaded 103.1 MB
WARNING: Rtools is required to build R packages, but is not currently installed.
Please download and install Rtools 3.4 from http://cran.r-project.org/bin/windows/Rtools/.
Installation failed: Could not find build tools necessary to build bigstatsr
library(bigstatsr)
Error in library(bigstatsr) : there is no package called ‘bigstatsr’
mat3 <- FBM(5, 8)
Error in FBM(5, 8) : could not find function "FBM"
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
tmp3 <- foreach(j = 1:8, .combine = 'c') %:%
foreach(i = 1:5, .combine = 'c') %dopar% {
mat3[i, j] <- i + j
NULL
}
parallel::stopCluster(cl)
mat3[]
Error: object 'mat3' not found
mat3 <- FBM(5, 8)
Error in FBM(5, 8) : could not find function "FBM"
library(bigstatsr)
Error in library(bigstatsr) : there is no package called ‘bigstatsr’
install.packages("C:/Users/dasj/Desktop/bigstatsr-master.zip", repos = NULL, type = "win.binary")
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
library(bigstatsr)
Error in library(bigstatsr) : there is no package called ‘bigstatsr’
install.packages("C:/Users/dasj/Desktop/bigstatsr-master/bigstatsr-master/bigstatsr.Rproj", repos = NULL)
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
Error in install.packages : type == "both" cannot be used with 'repos = NULL'
devtools::install_github("privefl/bigstatsr")
Downloading GitHub repo privefl/bigstatsr@master
from URL https://api.github.com/repos/privefl/bigstatsr/zipball/master
Installing bigstatsr
Installing 1 package: BH
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/BH_1.66.0-1.zip'
Content type 'application/zip' length 17880018 bytes (17.1 MB)
downloaded 17.1 MB
package ‘BH’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Installing 1 package: cowplot
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependencies ‘colorspace’, ‘lazyeval’, ‘reshape2’, ‘munsell’, ‘ggplot2’, ‘plyr’, ‘scales’
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/colorspace_1.3-2.zip'
Content type 'application/zip' length 445654 bytes (435 KB)
downloaded 435 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/lazyeval_0.2.1.zip'
Content type 'application/zip' length 140066 bytes (136 KB)
downloaded 136 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/reshape2_1.4.3.zip'
Content type 'application/zip' length 611951 bytes (597 KB)
downloaded 597 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/munsell_0.5.0.zip'
Content type 'application/zip' length 220479 bytes (215 KB)
downloaded 215 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/ggplot2_3.0.0.zip'
Content type 'application/zip' length 3146953 bytes (3.0 MB)
downloaded 3.0 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/plyr_1.8.4.zip'
Content type 'application/zip' length 1219076 bytes (1.2 MB)
downloaded 1.2 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/scales_0.5.0.zip'
Content type 'application/zip' length 693959 bytes (677 KB)
downloaded 677 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/cowplot_0.9.2.zip'
Content type 'application/zip' length 2444938 bytes (2.3 MB)
downloaded 2.3 MB
package ‘colorspace’ successfully unpacked and MD5 sums checked
package ‘lazyeval’ successfully unpacked and MD5 sums checked
package ‘reshape2’ successfully unpacked and MD5 sums checked
package ‘munsell’ successfully unpacked and MD5 sums checked
package ‘ggplot2’ successfully unpacked and MD5 sums checked
package ‘plyr’ successfully unpacked and MD5 sums checked
package ‘scales’ successfully unpacked and MD5 sums checked
package ‘cowplot’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Installing 1 package: data.table
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/data.table_1.11.4.zip'
Content type 'application/zip' length 1813443 bytes (1.7 MB)
downloaded 1.7 MB
package ‘data.table’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Skipping install of 'ggplot2' from a cran remote, the SHA1 (3.0.0) has not changed since last install.
Use force = TRUE
to force installation
Installing 1 package: Rcpp
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/Rcpp_0.12.17.zip'
Content type 'application/zip' length 4374761 bytes (4.2 MB)
downloaded 4.2 MB
package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Installing 1 package: RcppArmadillo
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependency ‘Rcpp’
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/Rcpp_0.12.17.zip'
Content type 'application/zip' length 4374761 bytes (4.2 MB)
downloaded 4.2 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RcppArmadillo_0.8.600.0.0.zip'
Content type 'application/zip' length 2251246 bytes (2.1 MB)
downloaded 2.1 MB
package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
package ‘RcppArmadillo’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Installing 1 package: RSpectra
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependencies ‘Rcpp’, ‘RcppEigen’
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/Rcpp_0.12.17.zip'
Content type 'application/zip' length 4374761 bytes (4.2 MB)
downloaded 4.2 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RcppEigen_0.3.3.4.0.zip'
Content type 'application/zip' length 2663519 bytes (2.5 MB)
downloaded 2.5 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RSpectra_0.13-1.zip'
Content type 'application/zip' length 1234715 bytes (1.2 MB)
downloaded 1.2 MB
package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
package ‘RcppEigen’ successfully unpacked and MD5 sums checked
package ‘RSpectra’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
"C:/PROGRA1/R/R-341.3/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL
"C:/Users/dasj/AppData/Local/Temp/RtmpKIyHpJ/devtools2c483778d3/privefl-bigstatsr-c6410a9" --library="C:/Users/dasj/Documents/R/win-library/3.4" --install-tests
ERROR: dependency 'Rcpp' is not available for package 'bigstatsr'
@waldnerf: I moved your new question here.
Since I installed "bigstatsr", I cannot compile C++ code using sourceCpp anymore. I get the following error message:
Error 1 occurred building shared library.
In file included from /home/wal716/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/noncopyable.hpp(15),
from /home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigmemory/include/bigmemory/BigMatrix.h(11),
from euc_dist.cpp(14):
/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/core/noncopyable.hpp(42): error: defaulted default constructor cannot be constexpr because the corresponding implicitly declared default constructor would not be constexpr
BOOST_CONSTEXPR noncopyable() = default;
^
In file included from euc_dist.cpp(14):
/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigmemory/include/bigmemory/BigMatrix.h(148): warning #858: type qualifier on return type is meaningless
const bool read_only() const
^
In file included from euc_dist.cpp(14):
/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigmemory/include/bigmemory/BigMatrix.h(160): warning #858: type qualifier on return type is meaningless
const index_type allocation_size() const {return _allocationSize;}
^
compilation aborted for euc_dist.cpp (code 2)
make: *** [euc_dist.o] Error 2
icpc -std=gnu++11 -I/apps/R/3.4.0/lib64/R/include -DNDEBUG -I"/apps/R/3.4.0/lib64/R/library/Rcpp/include" -I"/apps/R/3.4.0/lib64/R/library/RcppArmadillo/include" -I"/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/BH/include" -I"/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigmemory/include" -I"/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigstatsr/include" -I"/OSM/CBR/AF_DIGI_RS/work/projects/opt_sampling/Jaredfiles" -I/usr/local/include -fpic -O3 -fopenmp -xHOST -fp-model precise -c euc_dist.cpp -o euc_dist.o
/apps/R/3.4.0/lib64/R/etc/Makeconf:168: recipe for target 'euc_dist.o' failed
Error in sourceCpp(file.path(root_dir, "Jaredfiles/euc_dist.cpp")) :
Error 1 occurred building shared library.
Any thoughts on how to solve this?
I have some memory issue when running:
prod=big_cprodMat(m1, m2, ind.row=rows, ind.col=cols)
m1 is FBM object with dim: 20000 * 10^6, ind.row has 3000 elements and ind.col include all columns.
m2 is matrix with dim: 3000 * 20
prod has dimension 10^6 * 20.
During execution of the method, the memory consumption keeps growing up to 100G, although the final result is quite small, and the memory won't free up even after "gc". I am wondering if there is potential memory leak, and whether memory consumption can be managed more effectively.
For univariate logistic regression (big_univLogReg), maybe it could be improved (faster? always converge?).
See this SO question.
Currently, you can use big_crossprodSelf(), big_prodMat(), etc.
Those functions are useful because they provide matrix products with scaling and subsetting.
Yet, if you don't care about scaling and subsetting, they are sub-optimal.
It would be good to warn users when this is the case.
How to detect that?
Any reason why FBMs don't do colnames? If no, interested in a PR?
Find a faster SVM algorithm (that may also use strong rules).
bigstatsr::AUC(c(0, 1, NA), c(1, 0, 1)) ## 0.5
Should not allow missing values in the first place.
library(bigstatsr)
X <- big_attachExtdata()
y <- sample(0:1, nrow(X), TRUE)
head(big_univLogReg(X, y, covar = X[, 1, drop = FALSE], ncores = 1))
Should be easy for a small number of variables (say < 5000).
I have a problem with the installation of the package. Here is the error message; do you have an idea of what failed?
Thank you
Fabienne
githubinstall("bigstatsr")
ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
ld: library not found for -lquadmath
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [bigstatsr.so] Error 1
ERROR: compilation failed for package ‘bigstatsr’
Do it in scaling functions?
This is a problem when using e.g. big_randomSVD() because it would either stop with error TridiagEigen: eigen decomposition failed or run for an infinite time.
Reprex:
library(bigstatsr)
X <- FBM(20, 20, init = rnorm(400))
svd <- big_randomSVD(X, big_scale())
X[, 1] <- 0
svd2 <- big_randomSVD(X, big_scale())
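Until this is handled inside the scaling functions, a possible workaround is to drop the constant columns before calling big_randomSVD(). The sketch below assumes that filtering on the variance reported by big_colstats() is acceptable; ind.keep is a name introduced only for this example:

```r
library(bigstatsr)

set.seed(1)
X <- FBM(20, 20, init = rnorm(400))
X[, 1] <- 0  # constant column, so its standard deviation is 0

# Keep only the columns whose variance is strictly positive.
ind.keep <- which(big_colstats(X)$var > 0)

svd2 <- big_randomSVD(X, big_scale(), ind.col = ind.keep)
```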
Is it possible to add a method remove_columns() to FBMs that does the inverse of add_columns(), i.e. removes n columns from the end of an FBM? Or are there technical reasons why this is not implemented?
Implement this:
library(bigstatsr)
A <- FBM(2, 2, init = 1)
B <- FBM(4, 2)
B[1:2, ] <- A
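As long as direct FBM-to-FBM assignment is not implemented, a workaround that should work when A fits in memory is to read A into a standard R matrix first. The init = 0 for B is an assumption added so that the untouched rows have a defined value:

```r
library(bigstatsr)

A <- FBM(2, 2, init = 1)
B <- FBM(4, 2, init = 0)

# A[] reads the whole FBM into an ordinary R matrix,
# which can then be assigned into a subset of B.
B[1:2, ] <- A[]
B[]
```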
In test-FBM-convert.R (branch replace-df)
I have an FBM with 33k rows and 200k columns. The entries are actually chi square values with different degrees of freedom for different rows. I am converting them to standard normal values by using big_parallelize.
My machine:
macOS 10.13.6
128GB RAM
nb_cores() -> 12
Using the code below, big_parallelize forks several (12) processes which all stay at a CPU usage of 0.1% to 0.2%. Each process initially gets allocated ~4GB RAM. Over the course of an hour or two, RAM usage climbs, particularly for kernel_task, eventually resulting in the processes' RAM being mostly compressed memory and all RAM on the machine being used. I'm forced to kill the R process at this point.
inter.fbm <- FBM(n.path.snps, n.host.snps, backingfile="../Data/interaction_matrix_new", create_bk=FALSE)
chi2z.par <- function(X, ind, df, Out.mat) {
chi2Z <- function(chi, df) {
qnorm(pchisq(chi, df=df))
}
chi2z.col <- function(M, df) {
apply(M, 2, chi2Z, df=df)
}
Out.mat[ , ind] <- chi2z.col(X[ , ind], df=df)
NULL
}
destfile <- "../Data/z_mat"
res.mat <- FBM(nrow(inter.fbm), ncol(inter.fbm), backingfile=destfile)
big_parallelize(inter.fbm, p.FUN=chi2z.par, ncores=nb_cores(), df=row.info$df, Out.mat=res.mat)
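One thing that may explain the ~4GB per process: with big_parallelize, each worker receives one twelfth of the columns in ind, so X[ , ind] reads roughly 33,000 x 16,700 doubles (about 4.4 GB) into memory at once. A sketch of a block-wise alternative with big_apply follows; the toy dimensions, chi.fbm, z.fbm and block.size = 20 are assumptions for illustration, not the reporter's data:

```r
library(bigstatsr)

set.seed(1)
n <- 50; m <- 200                     # toy dimensions
df <- sample(1:3, n, replace = TRUE)  # one degrees-of-freedom value per row
chi.fbm <- FBM(n, m, init = rchisq(n * m, df = df))
z.fbm <- FBM(n, m)

big_apply(chi.fbm, a.FUN = function(X, ind, df, Out.mat) {
  # df has one value per row, so it is recycled down each column
  # of the block; only block.size columns are in memory at a time.
  Out.mat[, ind] <- qnorm(pchisq(X[, ind, drop = FALSE], df = df))
  NULL
}, a.combine = 'c', block.size = 20, df = df, Out.mat = z.fbm)
```

Because pchisq() and qnorm() are vectorized, the inner apply() over columns is not needed in this version.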
I've been trying out the big_parallelize function on an FBM and my outputs were combined in a very strange way. The function passed to big_parallelize outputs a vector of the same column size as the input matrix. I would like the outputs to be combined using rbind so that my output matrix has the same size as the input matrix. Strangely enough, big_parallelize treats each individual row as a vector and combines them using c(). rbind combines only the outputs of the different CPUs, and if their sizes differ (which happens when nrow/ncpus is not a whole number) some outputs are removed.
Any chance you can get that fixed (I love your package otherwise)
Use {fpeek} + {data.table} by blocks (with big_apply).
I'd like to make a flexible linear model which takes into account interactions with external factors. Does your package provide a linear regression function like that (is big_univLinReg flexible with formulas)?
If not, can you recommend a flexible linear model function that can work with a bigSNP from the bigsnpr package?
If there is, how does it work?
Thanks a lot
When I run the preprocessing R code, it ends with: "Error: Failed to open data/POPRES_allchr_QC.bed.log Try changing the --out parameter." It might be that PLINK is not allowing data access. I could not access all the datasets used in the examples.
Merge all K model coefficients.
Get rid of warn and return.all by always returning everything.
The more info, the better, without having to rerun everything.
Also return the number of predictors for each step.
Am I right that there is no direct way of getting x %*% y when both x and y are FBMs? To put it another way, what is the best way of multiplying two FBMs? Apologies if I have missed something obvious in the documentation, but from what I can tell, %*% only works when you have an FBM and a vector/matrix. Any help would be much appreciated.
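One possible approach, sketched here rather than taken from the package documentation, is to read y by blocks of columns, multiply each block with big_prodMat(), and store the result in a third FBM. The dimensions, res and block.size are hypothetical:

```r
library(bigstatsr)

set.seed(1)
x <- FBM(30, 20, init = rnorm(600))
y <- FBM(20, 40, init = rnorm(800))

# The product is written to its own FBM, one block of columns at a time.
res <- FBM(nrow(x), ncol(y))
block.size <- 10
blocks <- split(seq_len(ncol(y)), ceiling(seq_len(ncol(y)) / block.size))
for (ind in blocks) {
  # y[, ind] is a standard R matrix, which big_prodMat() accepts.
  res[, ind] <- big_prodMat(x, y[, ind, drop = FALSE])
}
```

Only one block of y and the corresponding block of the result are held in memory at a time, so block.size controls the memory footprint.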
Dear Florian Privé,
Thanks for the great package!
I am attempting to write large dist objects to disk in parts and then read subsets of the distance matrix. Does bigstatsr support writing/reading symmetric matrices? Are there any packages/tools which already do it?
Regards,
Srikanth KS
In your vignette about bigstatsr & bigmemory, you mentioned that you felt the need to remove bigmemory and implement custom, bigmemory-like FBM. Could it be possible to explain why? While the package is named "bigstatsr" with implication that this belongs in the "big" family of packages, it doesn't use bigmemory as foundation -- this is a bit confusing. It would also be great for users to understand why one might use FBM vs bigmemory.
Is using read/write accessors slower than using read-only accessors (e.g. due to false sharing, ...)?
Maybe train models with decreasing alpha.
Stop when not improving.
Find a smart solution for performing a quick multi-traits GWAS, based on a modified version of big_univ[Lin/Log]Reg().
reprex:
library(bigstatsr)
fbm <- FBM(10, 10, init = 1)
ind.col <- integer()
A <- matrix(1, nrow = 0, ncol = 10)
fbm[, ind.col] %*% A # 10x10 matrix with zeros
big_prodMat(fbm, A, ind.col = ind.col) # obscure error
Make sure to use it in predict() if it was used in the fitting.
Would be good for the installed size. And no more WARNING on CRAN.
Is it possible to store FBM as float?
This might bring down the file size of double FBM by 2. (User may be warned about precision-loss when converted to R numeric/matrix during creation of FBM)
I once got this bug from here: https://github.com/privefl/bigstatsr/blob/master/tests/testthat/test-spLinReg.R#L71.
Need to be able to reproduce this.