Git Product home page Git Product logo

saigegds's Introduction

SAIGEgds: Scalable Implementation of Generalized mixed models in PheWAS using GDS files

GPLv3 GNU General Public License, GPLv3

Features

Scalable implementation of generalized mixed mode with the support of Genomic Data Structure (GDS) files and highly optimized C++ implementation. It is designed for single variant tests in large-scale phenome-wide association studies (PheWAS) with millions of variants and hundreds of thousands of samples (e.g., UK Biobank genotype data), controlling for case-control imbalance and sample structure in single variant association studies.

The implementation of SAIGEgds is based on the original SAIGE R package (v0.29.4.4) [Zhou et al. 2018]. It is implemented with optimized C++ codes taking advantage of sparse structure of genotypes. All of the calculation with single-precision floating-point numbers in SAIGE are replaced by the double-precision calculation in SAIGEgds. SAIGEgds also implements some of the SPAtest functions in C to speed up the calculation of Saddlepoint approximation.

Benchmarks using the UK Biobank White British genotype data (N=430K) with coronary heart disease and simulated cases, show that SAIGEgds is 5 to 6 times faster than the SAIGE R package in the steps of fitting null models and p-value calculations. When used in conjunction with high-performance computing (HPC) clusters and/or cloud resources, SAIGEgds provides an efficient analysis pipeline for biobank-scale PheWAS.

Bioconductor:

Release Version: v1.12.1 (http://www.bioconductor.org/packages/SAIGEgds)

Package Maintainer

Xiuwen Zheng

Installation

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SAIGEgds")

The BiocManager::install() approach may require that you build from source, i.e. make and compilers must be installed on your system -- see the R FAQ for your operating system; you may also need to install dependencies manually.

  • Development version from Github (for developers/testers only)
library("devtools")
install_github("AbbVie-ComputationalGenomics/SAIGEgds")

Package vignette

If the package is installed from Bioconductor repository or package rebuilding, users can start R and enter to view documentation:

browseVignettes("SAIGEgds")

Examples

library(SeqArray)
library(SAIGEgds)

# open the GDS file for genetic relationship matrix (GRM)
grm_fn <- system.file("extdata", "grm1k_10k_snp.gds", package="SAIGEgds")
(grm_gds <- seqOpen(grm_fn))

# load phenotype
phenofn <- system.file("extdata", "pheno.txt.gz", package="SAIGEgds")
pheno <- read.table(phenofn, header=TRUE, as.is=TRUE)
head(pheno)
##   sample.id y     yy      x1 x2
## 1        s1 0 4.5542  1.5118  1
## 2        s2 0 3.7941  0.3898  1
## 3        s3 0 5.0411 -0.6212  1
## ...

# fit the null model
glmm <- seqFitNullGLMM_SPA(y ~ x1 + x2, pheno, grm_gds, trait.type="binary",
    sample.col="sample.id", num.thread=2)
## SAIGE association analysis:
## Filtering variants:
## [==================================================] 100%, completed, 0s
## Fit the null model: y ~ x1 + x2 + var(GRM)
##     # of samples: 1,000
##     # of variants: 9,976
##     using 2 threads
## ...

# close the file
seqClose(grm_gds)



################################

# open the GDS file for association testing
geno_fn <- system.file("extdata", "assoc_100snp.gds", package="SAIGEgds")
(geno_gds <- seqOpen(geno_fn))
## File: assoc_100snp.gds (10.5K)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 1000 LZMA_ra(12.6%), 625B }
## |--+ variant.id   { Int32 100 LZMA_ra(48.5%), 201B } *
## ...


# p-value calculation
assoc <- seqAssocGLMM_SPA(geno_gds, glmm, mac=10, parallel=2)
## SAIGE association analysis:
##     # of samples: 1,000
##     # of variants: 100
##     MAF threshold: NaN
##     MAC threshold: 10
##     missing threshold for variants: 0.1
##     p-value threshold for SPA adjustment: 0.05
##     variance ratio for approximation: 0.9391186
##     # of processes: 2
## [==================================================] 100%, completed, 0s
## # of variants after filtering by MAF, MAC and missing thresholds: 38
## Done.

head(assoc)
##   id chr pos rs.id ref alt AF.alt mac  num       beta        SE      pval pval.noadj converged
## 1  4   1   4   rs4   A   C 0.0100  20 1000  -0.074992  0.791685  0.924533   0.924533      TRUE
## 2 12   1  12  rs12   A   C 0.0150  30 1000  -0.091001  0.657140  0.889861   0.889861      TRUE
## 3 14   1  14  rs14   A   C 0.0375  75 1000  -0.075455  0.434152  0.862023   0.862023      TRUE
## ...


# close the file
seqClose(geno_gds)

Citations

Zheng X, Davis J.Wade. SAIGEgds -- an efficient statistical tool for large-scale PheWAS with mixed models. Bioinformatics (2020). DOI: 10.1093/bioinformatics/btaa731.

Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, LeFaive J, VandeHaar P, Gagliano SA, Gifford A, Bastarache LA, Wei WQ, Denny JC, Lin M, Hveem K, Kang HM, Abecasis GR, Willer CJ, Lee S. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet (2018). Sep;50(9):1335-1341. DOI: 10.1038/s41588-018-0184-y.

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D. SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics (2017). DOI: 10.1093/bioinformatics/btx145.

See Also

SeqArray: Data management of large-scale whole-genome sequence variant calls

gds2bgen: Format conversion from bgen to gds

saigegds's People

Contributors

jgrundstad avatar martin-g avatar zhengxw-ab avatar zhengxwen avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

saigegds's Issues

Using SAIGEgds for non additive models

Hi,

I use SAIGEgds for smaller scale analyses and think it's a great tool (SNP numbers approximately 1000). Iwas wondering if SAIGEgds could be used for alternative models of inheritance. So, recoding the dosage to reflect a dominant model. That also got me thinking if it could also be used for genetic risk scores, in effect having a single "SNP" that reflected the risk score. I can see how this would be achieved using the gds structure and recoding but just wondered if you had any insight into whether this approach invalidated the underlying principles of the analysis.

Best,
Richard

Version of SAIGE used in SAIGEgds ?

Hi, this is a great idea, and really useful package. 2 quesions:

  1. I note from the README.md that this package is "based on the original SAIGE package (v0.29.4.4)". But I'm not certain if that means this package uses SAIGEv0.29.4.4. Does this mean it has not been updated as SAIGE has progressed forwards to 0.43.3? I can see a number of beneficial updates in the ChangeLog for SAIGE and I'm wondering if any of these have been implemented in this package as I can see there have been seperate improvements in this package (such as the use of ACAT-O, where SAIGE only allows SKAT-O).

  2. In SAIGE for gene-based testing there is an option minMAFforGRM to ensure only MAF>0.05 or 0.01 are used for creating the GRM. Is this automatically done in the SAIGEgds::seqFitNullGLMM_SPA() function? I can't see an argument for it in the man pages. Although, I see that the argument "variant.id" is used to select variants to be used for the GRM (after the pruning step in the tutorial), so an additional subset step on this character vector using allele frequency derived from the pruned gds could solve the problem? Please do let me know if I am missing something in this function! Thank you.

Thank you so much for your work on this!

Error: glm.fit: fitted probabilities numerically 0 or 1 occurred

I am trying to fit NULL model using GRM generated seq-array from genotyped data. I have binary phenotype (0/1) for ~6K individuals.
When I try to fir null model I get error for probabilities numerically. Below is the code:

library(SeqArray)
library(SAIGEgds)

pheno <-read.table("phenotype_04.28.2021.txt", header=TRUE)

glmm <- seqFitNullGLMM_SPA(AD ~ SEX+ AGE + PC1 +PC2+ PC3 , pheno, "seqarray_grm_geno.gds" , trait.type="binary",sample.col="TOPMED_IID", num.thread=10)

Error:

	Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
terminate called after throwing an instance of 'std::invalid_argument'
  what():  Invalid tbb::this_task_arena::current_thread_index()!
Aborted (core dumped)

Is this related to input data type or low memory?

From vcf to bgen to gds? dosage value or genotype probability value?

Hi team,

I am interested in using this library for an admixed population with case-control imbalance. Data were imputed using TOPMED panel michigan imputation server.
I would like to know during analysis are dosage information used or, genotype probability, or genotype value used from the input GDS file?

Also, what is an ideal way to perform data transformation without much hiccups to the GDS? Using PLINK (v2.0) I can convert imputed data to the .bgen format. Is "--export bgen-1.2" right file format when using PLINK?

thank you,

GDS sample IDs problem

Hi!
I fitted the null model and calculated association p-values for one dataset and then tried with the same dataset but changing the sample IDs (.fam file was modified first and then converted into GDS format with the seqArray function).

Coefficients and variance ratio in the null model aren't the same even though the dataset is, and association results vary too much.
Only thing that is different between the .gds files is the "sample ID" annotation branch. I repeated the first analysis (with the original sample IDs) three times and I get the same results (which I think are ok), so I believe the issue comes with the new sample ID names.

I would appreciate any help.
Thank you

Results: beta (odds ratios) < 0

In the result, "beta" indicates : beta coefficient, odds ratio if binary outcomes (alternative allele vs. reference
allele). However, when I conduct GWAS on binary outcomes, there are some SNPs with beta < 0. It's impossible that the odds ratio is < 0. Would you mind me asking how to interpret it? Should I delete SNPs with beta < 0?

   id chr      pos rs.id ref alt       AF.alt   mac   num        beta         SE       pval
1:  3  22 16868201         C   A 0.0004218781    17 20148  0.95400134 1.41909658 0.50141843
2:  5  22 16868590         G   A 0.0005707763    23 20148 -1.06041310 1.18949813 0.37267203
3:  6  22 16868603         C   T 0.0003226127    13 20148 -1.06501842 1.50172130 0.47820133
4:  7  22 16868711         G   A 0.0002977963    12 20148 -1.05899912 1.67452636 0.52711436
5: 12  22 16869223         G   A 0.2621351995 10563 20148 -0.12863379 0.06312118 0.04156131

Error: is not a SeqArray GDS file || kinship using genotyped data

create GRM in .gds file format


library(SNPRelate)

bed.fn<-"mafed_LDed.bed"
fam.fn<-"mafed_LDed.fam"
bim.fn<-"mafed_LDed.bim"

# convert
snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, "LDed_kinship.gds")

# open
genofile <- openfn.gds("LDed_kinship.gds")
genofile

# output to a GDS file
snpgdsGRM(genofile, method="GCTA", out.fn="GDS_LDed_kinship.gds")
# close
closefn.gds(genofile)

Fit model

library(SeqArray)
library(SAIGEgds)

pheno <-read.table("phenotype_PCcleaned_04.28.2021.txt", header=TRUE)

glmm <- seqFitNullGLMM_SPA(PHENO_FINAL ~ as.factor(SEX)+ AGE + PC1 +PC2+ PC3 , pheno, "GDS_LDed_kinship.gds" , trait.type="binary",sample.col="TOPMED_IID", num.thread=10)

Error:

SAIGE association analysis:
Wed Apr 28 14:41:43 2021
Open 'LDed_kinship.gds'
Error in seqOpen(gdsfile) :
  'LDed_kinship.gds' is not a SeqArray GDS file ('FileFormat' should be 'SEQ_ARRAY').

Resource: https://uw-gac.github.io/topmed_workshop_2017/computing-a-grm.html

I understand I do not have SEQ data, but genotyped data. Error in seqOpen(gdsfile) cannot work on genotyped data.
I would not like to convert genotype data into vcf and re-run complete pipeline.

Any points to proceed further?

Build failure on Linux ARM64

Hello,

We are working on adding support for Linux ARM64 for Bioconductor and we faced the following build problem:

### Running command:
###
###   /home/biocbuild/bbs-3.17-bioc/R/bin/R CMD INSTALL SAIGEgds
###
##############################################################################
##############################################################################


* installing to library ‘/home/biocbuild/bbs-3.17-bioc/R/library’
* installing *source* package ‘SAIGEgds’ ...
** using staged installation
** libs
using C++ compiler: ‘g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0’
using C++11
g++ -std=gnu++11 -I"/home/biocbuild/bbs-3.17-bioc/R/include" -DNDEBUG  -I'/home/biocbuild/bbs-3.17-bioc/R/library/Rcpp/include' -I'/home/biocbuild/bbs-3.17-bioc/R/library/RcppArmadillo/include' -I'/home/biocbuild/bbs-3.17-bioc/R/library/RcppParallel/include' -I/usr/local/include    -fPIC  -g -O2  -Wall -c SPATest.cpp -o SPATest.o
SPATest.cpp:42:15: error: ‘COREARRAY_TARGET_CLONES’ does not name a type
   42 | inline static COREARRAY_TARGET_CLONES
      |               ^~~~~~~~~~~~~~~~~~~~~~~
SPATest.cpp:56:15: error: ‘COREARRAY_TARGET_CLONES’ does not name a type
   56 | inline static COREARRAY_TARGET_CLONES
      |               ^~~~~~~~~~~~~~~~~~~~~~~
SPATest.cpp:71:15: error: ‘COREARRAY_TARGET_CLONES’ does not name a type
   71 | inline static COREARRAY_TARGET_CLONES
      |               ^~~~~~~~~~~~~~~~~~~~~~~
SPATest.cpp:93:9: error: expected initializer before ‘getroot_K1’
   93 |         getroot_K1(const double g_pos, const double g_neg,
      |         ^~~~~~~~~~
SPATest.cpp:140:9: error: expected initializer before ‘getroot_K1_fast’
  140 |         getroot_K1_fast(const double g_pos, const double g_neg, double &root,
      |         ^~~~~~~~~~~~~~~
SPATest.cpp:189:9: error: expected initializer before ‘get_saddle_prob’
  189 |         get_saddle_prob(double t, size_t n_g, const double mu[], const double g[],
      |         ^~~~~~~~~~~~~~~
SPATest.cpp:212:9: error: expected initializer before ‘get_saddle_prob_fast’
  212 |         get_saddle_prob_fast(double t, size_t n_nonzero, const double mu[],
      |         ^~~~~~~~~~~~~~~~~~~~
SPATest.cpp:238:9: error: expected initializer before ‘Saddle_Prob’
  238 |         Saddle_Prob(double q, double m1, double var1, size_t n_g, const double mu[],
      |         ^~~~~~~~~~~
SPATest.cpp:300:9: error: expected initializer before ‘Saddle_Prob_Fast’
  300 |         Saddle_Prob_Fast(double q, double m1, double var1, size_t n_g,
      |         ^~~~~~~~~~~~~~~~
make: *** [/home/biocbuild/bbs-3.17-bioc/R/etc/Makeconf:198: SPATest.o] Error 1
ERROR: compilation failed for package ‘SAIGEgds’
* removing ‘/home/biocbuild/bbs-3.17-bioc/R/library/SAIGEgds’

https://yikun.github.io/bioconductor-0208/report/SAIGEgds/kunpeng1-buildsrc.html

Any idea what might be the problem ?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.