
bigstatsr's People

Contributors

jeroen, katrinleinweber, privefl

bigstatsr's Issues

argument "a.combine" is missing, with no default

Example:

library(bigstatsr)

X <- FBM(10, 10)

test <- big_apply(X, a.FUN = function(X, ind) {
  colSums(X[, ind, drop = FALSE])
}, block.size = 2)

is returning the error.

In foreach, you can give no value to the .combine parameter, and it will return a list by default.
In bigstatsr, for e.g. big_apply, as the computation is done by blocks, the parameter a.combine isn't intended to be left blank.

If you really want to do this, you can wrap the results of a.FUN with list() or as.list() depending on what you want and the class of objects you return, and use a.combine = 'c'.
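
A minimal sketch of that workaround, assuming the goal is one list element per block. The `init = 1` value is only there to make the result easy to inspect; everything else follows the example above.

```r
library(bigstatsr)

X <- FBM(10, 10, init = 1)

# Wrap each block's result in list() so that 'c' concatenates lists
# instead of flattening the numeric vectors into one long vector.
test <- big_apply(X, a.FUN = function(X, ind) {
  list(colSums(X[, ind, drop = FALSE]))
}, a.combine = 'c', block.size = 2)

length(test)  # one list element per block of 2 columns
```

Dropping the list() wrapper while keeping a.combine = 'c' should instead give a single numeric vector with one entry per column.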

Keep attributes when subsetting big_sp_list

library(bigstatsr)
set.seed(1)

# simulating some data
N <- 230
M <- 730
X <- FBM(N, M, init = rnorm(N * M, sd = 5))
y <- rowSums(X[, 1:10]) + rnorm(N)
covar <- matrix(rnorm(N * 3), N)

ind.train <- sort(sample(nrow(X), 150))
ind.test <- setdiff(rows_along(X), ind.train)

# fitting model for multiple lambdas and alphas
test <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
                     covar.train = covar[ind.train, ],
                     alphas = c(1, 0.5, 0.1, 0.01))
# peek at the models
plot(test)
summary(test)
str(attributes(test))  # OK
attributes(test[1])  # NULL
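
The requested behaviour can be illustrated in base R with a stand-in object. This is only a sketch: subset_keep_attr() and the "my_list" class are hypothetical names, not part of bigstatsr, and the real big_sp_list objects carry more attributes than shown here.

```r
# Hypothetical helper (not part of bigstatsr): subset a list-like object
# while carrying over its attributes, including its class.
subset_keep_attr <- function(x, i) {
  out <- unclass(x)[i]          # plain list subsetting keeps only the names
  attrs <- attributes(x)
  attrs$names <- names(out)     # names must match the subset, not the original
  attributes(out) <- attrs
  out
}

# Stand-in for a fitted object with extra attributes
x <- structure(list(a = 1, b = 2, c = 3),
               class = "my_list", alphas = c(1, 0.5))

y <- subset_keep_attr(x, 1:2)
str(attributes(y))
```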

big_apply and big_parallelize can't find function

Functions big_apply and big_parallelize can't find functions that I have defined. For instance,

my.sum <- function(M) {
    colSums(M)
}
mat.fb <- big_attachExtdata()
big_parallelize(mat.fb, p.FUN = function(X, ind) my.sum(X[ , ind]), ncores=2)

results in the error message Error in { : task 1 failed - "could not find function "my.sum"". This can be worked around by defining my.sum in the anonymous function for p.FUN but since the functions I want to do this for may be rather more complicated and used in other settings, it seems this should be unnecessary.
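
One possible workaround, sketched under the assumption that extra arguments given to big_parallelize() are forwarded to p.FUN on each worker: pass my.sum explicitly as an argument instead of relying on the workers finding it in the global environment. The p.combine = 'c' choice is an addition for this example.

```r
library(bigstatsr)

my.sum <- function(M) {
  colSums(M)
}

mat.fb <- big_attachExtdata()

# my.sum is handed to the workers as an extra argument of p.FUN
res <- big_parallelize(
  mat.fb,
  p.FUN = function(X, ind, my.sum) my.sum(X[, ind]),
  p.combine = 'c',
  ncores = 2,
  my.sum = my.sum
)

length(res)  # one sum per column of mat.fb
```

The name of the extra argument should not clash with the formal arguments of big_parallelize() itself (such as X or ind).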

Fast approximation of biglasso?

Find if there is a fast near-optimal rule approximation for computing multivariate linear/logistic regression on biobank-scale datasets in a few hours (or minutes).

fast method to write sparse matrix to file

bigstatsr is super convenient if a large matrix is already on file, but this is often not the case for me. I typically want to use bigstatsr after I've created a larger-than-memory matrix in R, which is usually in the form of a very sparse matrix (>95% sparsity). Writing to disk is a bottleneck because I have to iterate through the sparse matrix by row chunks (using lapply or equivalent), convert each chunk to a non-sparse matrix, and write it to file using append=TRUE. A faster method for doing this would be really helpful. Thanks!
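
As a point of comparison, here is a sketch that skips the intermediate text file and fills an FBM directly, one block of columns at a time. It assumes the sparse matrix is a Matrix-package object; the sizes and the block width of 50 are arbitrary, and no claim is made about how fast this is on real data.

```r
library(bigstatsr)
library(Matrix)

set.seed(1)
sp <- rsparsematrix(1000, 200, density = 0.05)  # toy sparse matrix

X <- FBM(nrow(sp), ncol(sp), init = 0)

block <- 50
for (start in seq(1, ncol(sp), by = block)) {
  ind <- start:min(start + block - 1, ncol(sp))
  # Only one dense block of columns is held in memory at a time
  X[, ind] <- as.matrix(sp[, ind, drop = FALSE])
}
```

Column blocks are used here because a column-compressed sparse matrix is cheap to slice by column.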

Installation problem - Rtools is required on Windows

Hello,
You have been kind and answered my beginner questions on Stack Overflow regarding foreach and parallelisation; my alias there is Newbie80. I have been reading your intro to foreach and now I am trying to install bigstatsr with the following

For the current development version

devtools::install_github("privefl/bigstatsr")

However I get the following message when I try to install, do you have any idea why?

Regards
Dan

devtools::install_github("privefl/bigstatsr")
Downloading GitHub repo privefl/bigstatsr@master
from URL https://api.github.com/repos/privefl/bigstatsr/zipball/master
trying URL 'https://cran.rstudio.com/bin/windows/Rtools/Rtools34.exe'
Content type 'application/x-msdos-program' length 108085090 bytes (103.1 MB)
downloaded 103.1 MB

WARNING: Rtools is required to build R packages, but is not currently installed.

Please download and install Rtools 3.4 from http://cran.r-project.org/bin/windows/Rtools/.
Installation failed: Could not find build tools necessary to build bigstatsr

library(bigstatsr)
Error in library(bigstatsr) : there is no package called ‘bigstatsr’
mat3 <- FBM(5, 8)
Error in FBM(5, 8) : could not find function "FBM"
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
tmp3 <- foreach(j = 1:8, .combine = 'c') %:%
+   foreach(i = 1:5, .combine = 'c') %dopar% {
+     mat3[i, j] <- i + j
+     NULL
+   }
Error in { : task 1 failed - "object 'mat3' not found"

parallel::stopCluster(cl)
mat3[]
Error: object 'mat3' not found
mat3 <- FBM(5, 8)
Error in FBM(5, 8) : could not find function "FBM"
library(bigstatsr)
Error in library(bigstatsr) : there is no package called ‘bigstatsr’
install.packages("C:/Users/dasj/Desktop/bigstatsr-master.zip", repos = NULL, type = "win.binary")
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
library(bigstatsr)
Error in library(bigstatsr) : there is no package called ‘bigstatsr’
install.packages("C:/Users/dasj/Desktop/bigstatsr-master/bigstatsr-master/bigstatsr.Rproj", repos = NULL)
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
Error in install.packages : type == "both" cannot be used with 'repos = NULL'
devtools::install_github("privefl/bigstatsr")
Downloading GitHub repo privefl/bigstatsr@master
from URL https://api.github.com/repos/privefl/bigstatsr/zipball/master
Installing bigstatsr
Installing 1 package: BH
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/BH_1.66.0-1.zip'
Content type 'application/zip' length 17880018 bytes (17.1 MB)
downloaded 17.1 MB

package ‘BH’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Installing 1 package: cowplot
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependencies ‘colorspace’, ‘lazyeval’, ‘reshape2’, ‘munsell’, ‘ggplot2’, ‘plyr’, ‘scales’

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/colorspace_1.3-2.zip'
Content type 'application/zip' length 445654 bytes (435 KB)
downloaded 435 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/lazyeval_0.2.1.zip'
Content type 'application/zip' length 140066 bytes (136 KB)
downloaded 136 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/reshape2_1.4.3.zip'
Content type 'application/zip' length 611951 bytes (597 KB)
downloaded 597 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/munsell_0.5.0.zip'
Content type 'application/zip' length 220479 bytes (215 KB)
downloaded 215 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/ggplot2_3.0.0.zip'
Content type 'application/zip' length 3146953 bytes (3.0 MB)
downloaded 3.0 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/plyr_1.8.4.zip'
Content type 'application/zip' length 1219076 bytes (1.2 MB)
downloaded 1.2 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/scales_0.5.0.zip'
Content type 'application/zip' length 693959 bytes (677 KB)
downloaded 677 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/cowplot_0.9.2.zip'
Content type 'application/zip' length 2444938 bytes (2.3 MB)
downloaded 2.3 MB

package ‘colorspace’ successfully unpacked and MD5 sums checked
package ‘lazyeval’ successfully unpacked and MD5 sums checked
package ‘reshape2’ successfully unpacked and MD5 sums checked
package ‘munsell’ successfully unpacked and MD5 sums checked
package ‘ggplot2’ successfully unpacked and MD5 sums checked
package ‘plyr’ successfully unpacked and MD5 sums checked
package ‘scales’ successfully unpacked and MD5 sums checked
package ‘cowplot’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Installing 1 package: data.table
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/data.table_1.11.4.zip'
Content type 'application/zip' length 1813443 bytes (1.7 MB)
downloaded 1.7 MB

package ‘data.table’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Skipping install of 'ggplot2' from a cran remote, the SHA1 (3.0.0) has not changed since last install.
Use force = TRUE to force installation
Installing 1 package: Rcpp
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/Rcpp_0.12.17.zip'
Content type 'application/zip' length 4374761 bytes (4.2 MB)
downloaded 4.2 MB

package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’

The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Installing 1 package: RcppArmadillo
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependency ‘Rcpp’

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/Rcpp_0.12.17.zip'
Content type 'application/zip' length 4374761 bytes (4.2 MB)
downloaded 4.2 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RcppArmadillo_0.8.600.0.0.zip'
Content type 'application/zip' length 2251246 bytes (2.1 MB)
downloaded 2.1 MB

package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
package ‘RcppArmadillo’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
Installing 1 package: RSpectra
Installing package into ‘C:/Users/dasj/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependencies ‘Rcpp’, ‘RcppEigen’

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/Rcpp_0.12.17.zip'
Content type 'application/zip' length 4374761 bytes (4.2 MB)
downloaded 4.2 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RcppEigen_0.3.3.4.0.zip'
Content type 'application/zip' length 2663519 bytes (2.5 MB)
downloaded 2.5 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RSpectra_0.13-1.zip'
Content type 'application/zip' length 1234715 bytes (1.2 MB)
downloaded 1.2 MB

package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
package ‘RcppEigen’ successfully unpacked and MD5 sums checked
package ‘RSpectra’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\dasj\AppData\Local\Temp\RtmpKIyHpJ\downloaded_packages
"C:/PROGRA~1/R/R-34~1.3/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL
"C:/Users/dasj/AppData/Local/Temp/RtmpKIyHpJ/devtools2c483778d3/privefl-bigstatsr-c6410a9" --library="C:/Users/dasj/Documents/R/win-library/3.4" --install-tests

ERROR: dependency 'Rcpp' is not available for package 'bigstatsr'

* removing 'C:/Users/dasj/Documents/R/win-library/3.4/bigstatsr'
In R CMD INSTALL
Installation failed: Command failed (1)

Rcpp installation error since bigstatsr installation

@waldnerf: I moved your new question here.

Since I installed "bigstatsr" I cannot compile cpp code using sourceCpp anymore. I get the following error message:

 Error 1 occurred building shared library.

In file included from /home/wal716/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/noncopyable.hpp(15),
                 from /home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigmemory/include/bigmemory/BigMatrix.h(11),
                 from euc_dist.cpp(14):
/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/BH/include/boost/core/noncopyable.hpp(42): error: defaulted default constructor cannot be constexpr because the corresponding implicitly declared default constructor would not be constexpr
        BOOST_CONSTEXPR noncopyable() = default;
                        ^

In file included from euc_dist.cpp(14):
/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigmemory/include/bigmemory/BigMatrix.h(148): warning #858: type qualifier on return type is meaningless
      const bool read_only() const
      ^

In file included from euc_dist.cpp(14):
/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigmemory/include/bigmemory/BigMatrix.h(160): warning #858: type qualifier on return type is meaningless
      const index_type allocation_size() const {return _allocationSize;}
      ^


compilation aborted for euc_dist.cpp (code 2)
make: *** [euc_dist.o] Error 2
icpc -std=gnu++11 -I/apps/R/3.4.0/lib64/R/include -DNDEBUG   -I"/apps/R/3.4.0/lib64/R/library/Rcpp/include" -I"/apps/R/3.4.0/lib64/R/library/RcppArmadillo/include" -I"/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/BH/include" -I"/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigmemory/include" -I"/home/wal716/R/x86_64-pc-linux-gnu-library/3.4/bigstatsr/include" -I"/OSM/CBR/AF_DIGI_RS/work/projects/opt_sampling/Jaredfiles" -I/usr/local/include   -fpic  -O3 -fopenmp -xHOST -fp-model precise -c euc_dist.cpp -o euc_dist.o
/apps/R/3.4.0/lib64/R/etc/Makeconf:168: recipe for target 'euc_dist.o' failed
Error in sourceCpp(file.path(root_dir, "Jaredfiles/euc_dist.cpp")) :
  Error 1 occurred building shared library.

Any thoughts on how to solve this?

big_cprodMat memory issue

I run into a memory issue when running:
prod=big_cprodMat(m1, m2, ind.row=rows, ind.col=cols)

m1 is an FBM with dimensions 20000 x 10^6; ind.row has 3000 elements and ind.col includes all columns.
m2 is a matrix with dimensions 3000 x 20.
prod has dimensions 10^6 x 20.
During execution, memory consumption keeps growing up to 100 GB although the final result is quite small, and the memory is not freed even after calling gc(). I am wondering whether there is a memory leak, and whether memory consumption can be managed more effectively.

Make %*%, crossprod() and tcrossprod() available for FBMs

Currently, you can use big_crossprodSelf(), big_prodMat(), etc.
Those functions are useful because they provide matrix products with scaling and subsetting.

Yet, if you don't care about scaling and subsetting, they are sub-optimal.
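
For reference, a small sketch of how the existing functions map onto the base R products when no scaling or subsetting is requested. The matrices are toy-sized, and the X[] calls (which load the whole FBM into memory) are only there for comparison.

```r
library(bigstatsr)

set.seed(1)
X <- FBM(10, 5, init = rnorm(50))
A <- matrix(rnorm(15), 5, 3)
B <- matrix(rnorm(30), 10, 3)

XA  <- big_prodMat(X, A)    # same result as X[] %*% A
XtB <- big_cprodMat(X, B)   # same result as crossprod(X[], B)

all.equal(XA, X[] %*% A)
all.equal(XtB, crossprod(X[], B))
```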

colnames

Any reason why FBMs don't do colnames? If no, interested in a PR?

Problem to install: OS X "-lgfortran" and "-lquadmath" errors

I have a problem installing the package. Here is the error message. Do you have an idea of what failed?

Thank you
Fabienne

githubinstall("bigstatsr")

ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
ld: library not found for -lquadmath
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [bigstatsr.so] Error 1
ERROR: compilation failed for package ‘bigstatsr’

* removing ‘/Library/Frameworks/R.framework/Versions/3.3/Resources/library/bigstatsr’
Installation failed: Command failed (1)

Check for variables with no variation

Do it in scaling functions?

This is a problem when using e.g. big_randomSVD() because it would either stop with error TridiagEigen: eigen decomposition failed or run for an infinite time.

Reprex:

library(bigstatsr)
X <- FBM(20, 20, init = rnorm(400))
svd <- big_randomSVD(X, big_scale())
X[, 1] <- 0
svd2 <- big_randomSVD(X, big_scale())
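
Until such a check exists, one way to sidestep the problem is to detect constant columns beforehand and exclude them. This is a sketch that assumes big_colstats() reports a variance of exactly zero for a constant column; k = 5 is an arbitrary choice for this small example.

```r
library(bigstatsr)

set.seed(1)
X <- FBM(20, 20, init = rnorm(400))
X[, 1] <- 0   # column with no variation

# Keep only the columns whose variance is non-zero
stats <- big_colstats(X)
ind.keep <- which(stats$var > 0)

svd3 <- big_randomSVD(X, big_scale(), ind.col = ind.keep, k = 5)
```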

Remove columns

Is it possible to add a method remove_columns() to FBMs that does the inverse of add_columns(), i.e. removes n columns from the end of an FBM? Or are there technical reasons why this is not implemented?
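
In the meantime, a possible substitute is to copy the columns to keep into a new FBM with big_copy(). Note that this is a sketch of a copy, not the in-place truncation asked for above, so it needs disk space for the second backing file.

```r
library(bigstatsr)

X <- FBM(5, 10, init = 1:50)

# Drop the last 2 columns by copying the first 8 into a new FBM
n_drop <- 2
X2 <- big_copy(X, ind.col = seq_len(ncol(X) - n_drop))

dim(X2)
```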

Low CPU, high RAM usage with big_parallelize

I have an FBM with 33k rows and 200k columns. The entries are actually chi square values with different degrees of freedom for different rows. I am converting them to standard normal values by using big_parallelize.
My machine:
macOS 10.13.6
128GB RAM
nb_cores() -> 12

Using the code below, big_parallelize forks several (12) processes which all stay at a CPU usage of 0.1% to 0.2%. Each process is initially allocated ~4GB of RAM. Over the course of an hour or two, RAM usage climbs, particularly for kernel_task, until most of the processes' RAM is compressed memory and all RAM on the machine is used. I'm forced to kill the R process at this point.

inter.fbm <- FBM(n.path.snps, n.host.snps, backingfile="../Data/interaction_matrix_new", create_bk=FALSE)
chi2z.par <- function(X, ind, df, Out.mat) {
    chi2Z <- function(chi, df) {
        qnorm(pchisq(chi, df=df))
    }
    chi2z.col <- function(M, df) {
        apply(M, 2, chi2Z, df=df)
    }
    Out.mat[ , ind] <- chi2z.col(X[ , ind], df=df)
    NULL
}

destfile <- "../Data/z_mat"
res.mat <- FBM(nrow(inter.fbm), ncol(inter.fbm), backingfile=destfile)
big_parallelize(inter.fbm, p.FUN=chi2z.par, ncores=nb_cores(), df=row.info$df, Out.mat=res.mat)
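
Independently of the memory question, the conversion itself can be written without apply(), since pchisq() and qnorm() are vectorized and a per-row df vector is recycled down each column of a matrix. The base R sketch below, on made-up toy data, checks that both forms agree; it is not a diagnosis of the behaviour described above.

```r
set.seed(1)
df <- c(1, 2, 3)                             # one df per row (toy values)
M <- matrix(rchisq(12, df = df), nrow = 3)   # 3 rows, 4 columns

# Column-by-column version, as in the code above
z_apply <- apply(M, 2, function(chi) qnorm(pchisq(chi, df = df)))

# Vectorized version: df is recycled along the rows of each column
z_vec <- qnorm(pchisq(M, df = df))

all.equal(z_apply, z_vec)
```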

Error in the p.combine of big_parallelize

I've been trying out the big_parallelize function on an FBM and my outputs were combined in a very strange way. The function passed to big_parallelize outputs a vector with as many elements as the input matrix has columns. I would like the output to be combined using rbind so that my output matrix has the same size as the input matrix. Strangely enough, big_parallelize treats each individual row as a vector and combines them using c(). rbind combines only the outputs of the different CPUs, and if their sizes differ (which happens when nrow/ncpus is not a whole number) some outputs are removed.
Any chance you can get that fixed? (I love your package otherwise.)
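
For context, a sketch of how the combination step appears to work, assuming big_parallelize() splits the column indices into one chunk per core and applies p.combine only across those chunks. Under that assumption, a function that returns a matrix for its block of columns can be reassembled with cbind. The doubling operation is a placeholder.

```r
library(bigstatsr)

X <- FBM(4, 10, init = 1:40)

# Each worker returns a 4 x length(ind) matrix for its chunk of columns;
# cbind puts the chunks back in column order.
res <- big_parallelize(
  X,
  p.FUN = function(X, ind) 2 * X[, ind, drop = FALSE],
  p.combine = 'cbind',
  ncores = 2
)

dim(res)
```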

a flexible linear model for bigSNP object

I'd like to fit a flexible linear model that takes into account interactions with external factors. Does your package provide a linear regression function like that (is big_univLinReg flexible with formulas)?
If not, can you recommend a flexible linear model function that can work with a bigSNP from the bigsnpr package?
If there is one, how does it work?
Thanks a lot

Remove parameters from big_sp***Reg()

Get rid of warn and return.all by always returning everything.
The more info, the better, without having to rerun everything.
Also return the number of predictors for each step.

FBM-by-FBM multiplication

Am I right that there is no direct way of getting x %*% y when both x and y are FBMs? To put it another way, what is the best way of multiplying two FBMs? Apologies if I have missed something obvious in the documentation, but from what I can see, %*% only works when you have an FBM and a vector/matrix. Any help would be much appreciated.
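
One way to approximate this with existing functions is to walk over blocks of columns of y and multiply x by each block with big_prodMat(). This is a sketch under the assumption that the full product fits in memory; the argument name lhs is made up for this example and is chosen so that it does not clash with the arguments of big_apply().

```r
library(bigstatsr)

set.seed(1)
x <- FBM(6, 4, init = rnorm(24))
y <- FBM(4, 8, init = rnorm(32))

# For each block of columns of y, compute x %*% y[, ind] and bind the pieces
xy <- big_apply(y, a.FUN = function(Y, ind, lhs) {
  big_prodMat(lhs, Y[, ind, drop = FALSE])
}, a.combine = 'cbind', block.size = 3, lhs = x)

all.equal(xy, x[] %*% y[])
```

If the product is itself too large for memory, each block could be written into a pre-allocated FBM instead of being returned.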

Ideas for storing/accessing large distance matrices

Dear Florian Privé,

Thanks for the great package!
I am attempting to write large dist objects to disk in parts and then read subsets of the distance matrix. Does bigstatsr support writing/reading symmetric matrices? Are there any packages/tools which already do it?
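
As an illustration of the idea, the sketch below stores a full square Euclidean distance matrix in an FBM, filled one block of columns at a time, and then reads a subset. It does not exploit symmetry (both triangles are stored), and the point coordinates and block width are made up.

```r
library(bigstatsr)

set.seed(1)
pts <- matrix(rnorm(200), nrow = 50)   # 50 points in 4 dimensions
n <- nrow(pts)

D <- FBM(n, n)                         # full square matrix on disk
sq <- rowSums(pts^2)

block <- 20
for (start in seq(1, n, by = block)) {
  ind <- start:min(start + block - 1, n)
  # Squared distances from all points to the points in this block
  d2 <- outer(sq, sq[ind], "+") - 2 * tcrossprod(pts, pts[ind, , drop = FALSE])
  D[, ind] <- sqrt(pmax(d2, 0))        # pmax guards against tiny negatives
}

D[1:3, 10:12]                          # read back only a subset
```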

Regards,
Srikanth KS

Reason for decoupling from bigmemory?

In your vignette about bigstatsr & bigmemory, you mention that you felt the need to remove bigmemory and implement a custom, bigmemory-like FBM. Could you explain why? The name "bigstatsr" implies that the package belongs to the "big" family of packages, yet it doesn't use bigmemory as its foundation, which is a bit confusing. It would also be great for users to understand why one might use FBM vs bigmemory.

Multi-traits GWAS

Find a smart solution for performing a quick multi-traits GWAS, based on a modified version of big_univ[Lin/Log]Reg().

Handle 0-dimension matrix in big_prodMat

reprex:

library(bigstatsr)

fbm <- FBM(10, 10, init = 1)
ind.col <- integer()
A <- matrix(1, nrow = 0, ncol = 10)

fbm[, ind.col] %*% A                     # 10x10 matrix with zeros

big_prodMat(fbm, A, ind.col = ind.col)   # obscure error

Support for float (single precision)

Is it possible to store FBM as float?

This would halve the file size compared with a double FBM. (The user could be warned about the loss of precision when an R numeric/matrix is converted during creation of the FBM.)
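
A sketch of what this looks like, assuming an installed bigstatsr version in which FBM() accepts type = "float" (older versions may not). It compares the sizes of the two backing files.

```r
library(bigstatsr)

X_double <- FBM(100, 100, type = "double")
X_float  <- FBM(100, 100, type = "float")   # assumes "float" is supported

file.size(X_double$backingfile)  # 8 bytes per element
file.size(X_float$backingfile)   # 4 bytes per element
```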
