
MendelIHT.jl's Introduction

MendelIHT

Iterative hard thresholding - a multiple regression approach for analyzing data from genome-wide association studies (GWAS)


Installation

Download and install Julia. Within Julia, copy and paste the following:

using Pkg
pkg"add MendelIHT"

This package supports Julia v1.6+ on Mac, Linux, and Windows machines.


Quick start

Sparse linear regression:

using MendelIHT, Random

# simulate data
n = 200    # sample size
p = 1000   # number of covariates
k = 10     # number of causal variables
x = randn(n, p) # simulate x
β = zeros(p) # simulate beta
β[1:k] .= randn(k)
shuffle!(β)
true_position = findall(!iszero, β)
y = x * β + randn(n) # simulate y

# run IHT
possible_k = collect(0:20)
mses = cv_iht(y, x, path=possible_k) # cross validate k = 0, 1, 2, ..., 20
result = fit_iht(y, x, k=possible_k[argmin(mses)]) # run IHT on best k
[result.beta[true_position] β[true_position]] # compare true vs estimated beta

10×2 Matrix{Float64}:
  0.41449    0.343562
 -0.248449  -0.222586
  0.0       -0.12781
 -0.89703   -0.927769
 -1.18703   -1.15052
  0.0       -0.0746511
  2.48838    2.4621
  0.0       -0.0712048
 -1.5504    -1.59528
 -1.01247   -1.05913

Sparse logistic regression:

using MendelIHT, Random, GLM

# simulate data
n = 200    # sample size
p = 1000   # number of covariates
k = 10     # number of causal variables
x = randn(n, p) # simulate x
β = zeros(p) # simulate beta
β[1:k] .= randn(k)
shuffle!(β)
true_position = findall(!iszero, β)
μ = GLM.linkinv.(LogitLink(), x * β)
y = [rand(Bernoulli(μi)) for μi in μ] |> Vector{Float64}

# run IHT
possible_k = collect(0:20)
mses = cv_iht(y, x, d=Bernoulli(), l=LogitLink(), path=possible_k)
result = fit_iht(y, x, k=possible_k[argmin(mses)], d=Bernoulli(), l=LogitLink())
[result.beta[true_position] β[true_position]]

10×2 Matrix{Float64}:
  0.0       0.315486
  1.27218   1.06696
  0.0       0.0819433
  0.0       0.381772
 -1.16612  -1.1422
  0.0      -0.260436
  0.0      -0.540831
 -2.23738  -2.37168
  0.0       0.43792
  1.50502   1.60719

GWAS Quick Start

The following uses data under the data directory. PLINK files are stored in normal.bed, normal.bim, normal.fam.

# load package
using MendelIHT
dir = normpath(MendelIHT.datadir()) * "/"

# select k SNPs in PLINK file, Gaussian phenotypes
result = iht(dir * "normal", 9, Normal) # run IHT with k = 9
result = iht(dir * "normal", 10, Normal, covariates=dir*"covariates.txt") # separately include covariates, k = 10
result = iht(dir * "normal", 10, Normal, covariates=dir*"covariates.txt", phenotypes=dir*"phenotypes.txt") # phenotypes are stored separately

# run cross validation to determine best k
mses = cross_validate(dir * "normal", Normal, path=1:20) # test k = 1, 2, ..., 20
mses = cross_validate(dir * "normal", Normal, path=[1, 5, 10, 15, 20]) # test k = 1, 5, 10, 15, 20
mses = cross_validate(dir * "normal", Normal, path=1:20, covariates=dir*"covariates.txt") # separately include covariates
mses = cross_validate(dir * "normal", Normal, path=1:20, covariates=dir*"covariates.txt", phenotypes=dir*"phenotypes.txt") # if phenotypes are in separate file

# Multivariate IHT for multiple quantitative phenotypes
result = iht(dir * "multivariate", 10, MvNormal, phenotypes=[6, 7]) # phenotypes stored in 6th and 7th column of .fam file
result = iht(dir * "multivariate", 10, MvNormal, phenotypes=dir*"multivariate.phen") # phenotypes stored separate file

# other distributions for single trait analysis (no test data available)
result = iht("datafile", 10, Bernoulli) # logistic regression with k = 10
result = iht("datafile", 10, Poisson) # Poisson regression with k = 10
result = iht("datafile", 10, NegativeBinomial, est_r=:Newton) # Negative Binomial regression + nuisnace parameter estimation

Please see our latest documentation for more detail.

Citation and Reproducibility:

For univariate analysis, please cite:

Chu BB, Keys KL, German CA, Zhou H, Zhou JJ, Sobel EM, Sinsheimer JS, Lange K. Iterative hard thresholding in genome-wide association studies: Generalized linear models, prior weights, and double sparsity. Gigascience. 2020 Jun 1;9(6):giaa044. doi: 10.1093/gigascience/giaa044. PMID: 32491161; PMCID: PMC7268817.

In the figures subfolder, one can find all the code to reproduce the figures and tables in our paper.

For multivariate analysis, please cite:

Chu BB, Ko S, Zhou JJ, Jensen A, Zhou H, Sinsheimer JS, Lange K. Multivariate genome-wide association analysis by iterative hard thresholding. Bioinformatics. 2023 Apr;39(4):btad193

In the manuscript subfolder, one can find all the code to reproduce the figures and tables in our paper.

Bug fixes and user support

If you encounter a bug or need user support, please open a new issue on GitHub. For bug reports, please provide as much detail as possible, ideally a reproducible sequence of code that leads to the error.

PRs and feature requests are welcome!

Acknowledgments

This project has been supported by the National Institutes of Health under awards R01GM053275, R01HG006139, R25GM103774, and 1R25HG011845.


MendelIHT.jl's Issues

Do not perform selection on non-genetic covariates

Currently, sparsity is enforced on both the genetic and non-genetic covariates. The standard practice is not to perform selection on non-genetic covariates, so we need one of the following:

  1. Change the projection routine to exclude non-genetic covariates from being projected
  2. Add a routine to automatically regress out the effect of non-genetic covariates before running IHT

Option 1 is probably better.
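
In the meantime, a user-side workaround in the spirit of option 2 can be done by hand before calling the package. This is a minimal sketch, not part of MendelIHT's API; the names, sizes, and effect values are illustrative:

# Sketch of option 2, done manually: regress y on the non-genetic covariates z,
# then run IHT on the residuals so that sparsity is only enforced on x.
# (Ignores any correlation between x and z; an exact treatment would residualize x too.)
using MendelIHT, Random

n, p, k = 200, 1000, 5
x = randn(n, p)                     # genetic predictors
z = [ones(n) randn(n, 2)]           # intercept + 2 non-genetic covariates
β = zeros(p); β[1:k] .= 1.0
y = x * β .+ 0.5 .* z[:, 2] .+ randn(n)

y_resid = y .- z * (z \ y)          # least-squares residualization on z
result = fit_iht(y_resid, x, k=k)   # sparsity applies only to columns of x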

Roadmap for future development

Major Features

  • Extend IHT to the Cox model.
  • Re-enable GPU computing as once provided by @klkeys.
  • Add an option to possibly constrain IHT estimates to specified upper and lower bounds.
  • Provide rowmask and colmask keywords telling IHT which samples/columns it is allowed to use.
  • Add routines to grab biologically relevant information as input arguments for sparse group projections and knowledge-aided projections, as previously investigated by @gdmosher.
  • Consider regularizing the covariance matrix for multivariate IHT.

Minor Features:

  • Documentation and unit testing (these will always be a priority).
  • Add a routine to handle interaction terms (SNP-SNP or SNP-environment).
  • Test and validate GLM code for Gamma, Inverse Gaussian, and Binomial regression.
  • Develop our own fit function for the debiasing step. Currently we rely on the fit function in the GLM.jl package, which can sometimes suffer unexpected crashes. One solution is to use the glm_regress function in MendelBase.jl.
  • Add routines to internally compute, and then use, the top principal components to account for sub-population structure (see the sketch after this list).
  • Add an option in wrapper functions to optionally work on Float32 matrices instead of always defaulting to Float64.
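
A hedged sketch of the principal-component item above, done by hand outside the package: compute the top PCs of a standardized numeric genotype matrix and feed them in through the existing covariates mechanism. The file name, delimiter, and commented wrapper call are assumptions, not a documented interface:

# Illustrative only: top principal components as non-genetic covariates.
using LinearAlgebra, Statistics, DelimitedFiles

G  = rand(0.0:1.0:2.0, 200, 1000)                          # stand-in numeric genotype matrix
Gs = (G .- mean(G, dims=1)) ./ (std(G, dims=1) .+ eps())   # column-standardize
U, S, V = svd(Gs)
pcs = U[:, 1:5] .* S[1:5]'                                 # top 5 principal components

writedlm("pcs.txt", pcs, ',')                              # hypothetical covariate file
# mses = cross_validate("plinkfile", Normal, path=1:20, covariates="pcs.txt")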

Parallel computing in `cv_iht` is slow for floating-point matrices

As noted, the parallel part of the cross validation routine is currently slower than a single core. I don't really know why at the moment, but I suspect it is caused by oversubscription when pmap and BLAS threading are mixed together.

For the moment, use `parallel = false` when you are using single/double-precision matrices.
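
A minimal usage sketch of that workaround, reusing y and x from the quick start (the keyword name is taken from this issue, so treat it as an assumption):

# Run cross validation serially when x is a dense floating-point matrix.
mses = cv_iht(y, x, path=collect(0:20), parallel=false)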

Upper limit on possible_k

I was trying a small toy example based on your "Sparse linear regression" instructions, and mistakenly set the upper limit of possible_k to a value larger than the number of columns of x. This led to a rather long error message that took me a while to understand properly.

So consider adding an informative error check in case someone else inadvertently makes this mistake.

possible_k = collect(0:9)
mses = cv_iht(y, x, path=possible_k)
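
A hypothetical sketch of such a guard (the function name and message are made up, not part of the package):

# Illustrative only: fail early with a readable message when the requested
# sparsity exceeds the number of available predictors.
function check_path(path::AbstractVector{<:Integer}, x::AbstractMatrix)
    kmax = maximum(path)
    kmax ≤ size(x, 2) || throw(ArgumentError(
        "maximum sparsity level $kmax exceeds the number of predictors $(size(x, 2))"))
    return nothing
end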

Also, in your "Sparse linear regression" and "Sparse logistic regression" examples, consider removing the embedded "julia >" prompts, so it would be easier to paste this example code into a Julia session.

Thank you,
Dan

Precompilation takes forever

Some recent changes caused precompilation to be unacceptably slow (200-500+ seconds). MWE:

using MendelIHT
cd(normpath(MendelIHT.datadir()))

julia> @time mses = cross_validate("normal", Normal, path=1:20); 
245.610546 seconds (152.45 M allocations: 6.990 GiB, 1.27% gc time, 1.42% compilation time)

julia> @time mses = cross_validate("normal", Normal, path=1:20); 
 11.852532 seconds (22.98 M allocations: 531.709 MiB, 2.26% gc time)

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Unit tests dependent on random seeds

There are some unit tests relying on specific seeds. This is discouraged because random number generation changes across Julia versions and package versions. In particular, there seems to be a change in the RNG from Distributions v0.24 to v0.25, and Julia changed its default RNG in version 1.7.

Documentation of the Random module (https://docs.julialang.org/en/v1/stdlib/Random/) reads:

Because the precise way in which random numbers are generated is considered an implementation detail, bug fixes and speed improvements may change the stream of numbers that are generated after a version change. Relying on a specific seed or generated stream of numbers during unit testing is thus discouraged - consider testing properties of the methods in question instead.
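
A hedged sketch of what a seed-independent test could look like, reusing fit_iht from the quick start (the sizes, effect strength, and correlation threshold are illustrative assumptions, not existing tests):

# Test structural properties that should hold for any RNG stream instead of
# pinning exact coefficient values behind a fixed seed.
using MendelIHT, Test, Statistics

n, p, k = 500, 1000, 5
x = randn(n, p)
β = zeros(p); β[1:k] .= 2.0          # strong effects ...
y = x * β + 0.1 .* randn(n)          # ... and low noise, so recovery is expected

result = fit_iht(y, x, k=k)
@test count(!iszero, result.beta) ≤ k    # the solution stays k-sparse
@test cor(result.beta, β) > 0.9          # estimates track the truth, not a seed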

Use SnpLinAlg for sparse models as well

Currently, xk in IHTVariable (and Xk in mIHTVariable) is an n-by-k Matrix{T}. For highly polygenic traits like height, k can be really large (>10000), which for large sample sizes requires a lot of memory.

For cross validation, we could recommend that users cross validate only one (or a few) values of k at a time. In the long run, however, we should consider using SnpLinAlg for xk.
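
An interim usage sketch of that recommendation with the wrapper from the GWAS quick start ("plinkfile" is a placeholder name):

# Cross validate a handful of large sparsity levels one at a time to keep the
# n-by-k working arrays small.
for path in ([2000], [5000], [10000])
    mses = cross_validate("plinkfile", Normal, path=path)
    println("k = ", path[1], ", mse = ", mses[1])
end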

JuliaCall - MendelIHT initialization warning

I am using JuliaCall to call MendelIHT from within R on an Apple MacBook Air M1.

When I first issue the command

using MendelIHT

in an RStudio Julia chunk, I get this warning/error message:

WARNING: both BGEN and SnpArrays export "maf"; uses of it in module MendelIHT must be qualified
Error: Error happens in Julia.
InitError: TaskFailedException

I guess it is just a 'warning' and so can be ignored, but of course the subsequent lines make it sound like it is an error that perhaps I should be concerned about.

But then if I do

using MendelIHT

a second time, it seems that MendelIHT is loaded OK.

This is with version 1.4.7:

> julia_installed_package("MendelIHT")
[1] "1.4.7"

But strangely I don't get this warning when starting up in the Terminal at the command line:

$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.5 (2023-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using MendelIHT

julia> 

Here's the full error message obtained when starting up MendelIHT via JuliaCall:

WARNING: both BGEN and SnpArrays export "maf"; uses of it in module MendelIHT must be qualified
Error: Error happens in Julia.
InitError: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:345 [inlined]
  [2] fetch
    @ ./task.jl:360 [inlined]
  [3] take!(pool::ThreadPools.QueuePool)
    @ ThreadPools ~/.julia/packages/ThreadPools/ANo2I/src/qpool.jl:96
  [4] tmap(fn::MendelIHT.var"#71#76"{Float64, Vector{Int64}, Bool, Bool, Int64, Int64, Bool, Vector{Float64}, Vector{Tuple{Int64, Int64}}, Nothing, Vector{MendelIHT.mIHTVariable{Float64, Transpose{Float64, SnpArrays.SnpLinAlg{Float64}}}}, Vector{BitVector}, Vector{BitVector}}, pool::ThreadPools.QueuePool, itr::UnitRange{Int64})
    @ ThreadPools ~/.julia/packages/ThreadPools/ANo2I/src/qpool.jl:202
  [5] tforeach(fn::Function, pool::ThreadPools.QueuePool, itr::UnitRange{Int64})
    @ ThreadPools ~/.julia/packages/ThreadPools/ANo2I/src/interface.jl:32
  [6] #70
    @ ~/.julia/packages/ThreadPools/ANo2I/src/macros.jl:18 [inlined]
  [7] twith
    @ ~/.julia/packages/ThreadPools/ANo2I/src/interface.jl:91 [inlined]
  [8] macro e
> julia_command("using InteractiveUtils; versioninfo()")
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 8
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.7.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] JuliaCall_0.17.5  pander_0.6.5      glmnet_4.1-6     
 [4] Matrix_1.5-3      parameters_0.20.2 tidylog_1.0.2    
 [7] lubridate_1.9.2   forcats_1.0.0     stringr_1.5.0    
[10] dplyr_1.1.0       purrr_1.0.1       readr_2.1.4      
[13] tidyr_1.3.0       tibble_3.1.8      ggplot2_3.4.1    
[16] tidyverse_2.0.0   knitr_1.42       

loaded via a namespace (and not attached):
 [1] shape_1.4.6        zoo_1.8-11         clisymbols_1.2.0  
 [4] tidyselect_1.2.0   xfun_0.37          splines_4.2.2     
 [7] lattice_0.20-45    colorspace_2.1-0   vctrs_0.5.2       
[10] generics_0.1.3     htmltools_0.5.4    yaml_2.3.7        
[13] utf8_1.2.3         survival_3.5-3     rlang_1.0.6       
[16] pillar_1.8.1       glue_1.6.2         withr_2.5.0       
[19] emmeans_1.8.4-1    multcomp_1.4-22    foreach_1.5.2     
[22] lifecycle_1.0.3    munsell_0.5.0      gtable_0.3.1      
[25] bayestestR_0.13.0  mvtnorm_1.1-3      evaluate_0.20     
[28] codetools_0.2-19   coda_0.19-4        fastmap_1.1.1     
[31] tzdb_0.3.0         datawizard_0.6.5   fansi_1.0.4       
[34] Rcpp_1.0.10        TH.data_1.1-1      xtable_1.8-4      
[37] scales_1.2.1       digest_0.6.31      hms_1.1.2         
[40] stringi_1.7.12     insight_0.19.0     grid_4.2.2        
[43] cli_3.6.0          tools_4.2.2        sandwich_3.0-2    
[46] magrittr_2.0.3     pkgconfig_2.0.3    ellipsis_0.3.2    
[49] MASS_7.3-58.2      estimability_1.4.1 timechange_0.2.0  
[52] rmarkdown_2.20     rstudioapi_0.14    iterators_1.0.14  
[55] R6_2.5.1           compiler_4.2.2   

Minimal Working Example:

> Sys.setenv(JULIA_NUM_THREADS = 8)
> library(JuliaCall)
> julia <- julia_setup(JULIA_HOME = "/Applications/Julia-1.8.app/Contents/Resources/julia/bin")
> julia_command("using MendelIHT")
WARNING: both BGEN and SnpArrays export "maf"; uses of it in module MendelIHT must be qualified
Error: Error happens in Julia.
InitError: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:345 [inlined]
  [2] fetch
    @ ./task.jl:360 [inlined]
  [3] take!(pool::ThreadPools.QueuePool)
    @ ThreadPools ~/.julia/packages/ThreadPools/ANo2I/src/qpool.jl:96
  [4] tmap(fn::MendelIHT.var"#71#76"{Float64, Vector{Int64}, Bool, Bool, Int64, Int64, Bool, Vector{Float64}, Vector{Tuple{Int64, Int64}}, Nothing, Vector{MendelIHT.mIHTVariable{Float64, Transpose{Float64, SnpArrays.SnpLinAlg{Float64}}}}, Vector{BitVector}, Vector{BitVector}}, pool::ThreadPools.QueuePool, itr::UnitRange{Int64})
    @ ThreadPools ~/.julia/packages/ThreadPools/ANo2I/src/qpool.jl:202
  [5] tforeach(fn::Function, pool::ThreadPools.QueuePool, itr::UnitRange{Int64})
    @ ThreadPools ~/.julia/packages/ThreadPools/ANo2I/src/interface.jl:32
  [6] #70
    @ ~/.julia/packages/ThreadPools/ANo2I/src/macros.jl:18 [inlined]
  [7] twith
    @ ~/.julia/packages/ThreadPools/ANo2I/src/interface.jl:91 [inlined]
  [8] macro e
