abbvie-computationalgenomics / saigegds Goto Github PK

View Code? Open in Web Editor NEW

7.0 7.0 5.0 2.67 MB

Scalable Implementation of generalized mixed models using GDS files in Phenome-Wide Association Studies

R 48.83% C++ 51.17%

gds mixed-model gwas phewas

saigegds's Introduction

SAIGEgds: Scalable Implementation of Generalized mixed models in PheWAS using GDS files

GNU General Public License, GPLv3

Features

Scalable implementation of generalized mixed mode with the support of Genomic Data Structure (GDS) files and highly optimized C++ implementation. It is designed for single variant tests in large-scale phenome-wide association studies (PheWAS) with millions of variants and hundreds of thousands of samples (e.g., UK Biobank genotype data), controlling for case-control imbalance and sample structure in single variant association studies.

The implementation of SAIGEgds is based on the original SAIGE R package (v0.29.4.4) [Zhou et al. 2018]. It is implemented with optimized C++ codes taking advantage of sparse structure of genotypes. All of the calculation with single-precision floating-point numbers in SAIGE are replaced by the double-precision calculation in SAIGEgds. SAIGEgds also implements some of the SPAtest functions in C to speed up the calculation of Saddlepoint approximation.

Benchmarks using the UK Biobank White British genotype data (N=430K) with coronary heart disease and simulated cases, show that SAIGEgds is 5 to 6 times faster than the SAIGE R package in the steps of fitting null models and p-value calculations. When used in conjunction with high-performance computing (HPC) clusters and/or cloud resources, SAIGEgds provides an efficient analysis pipeline for biobank-scale PheWAS.

Bioconductor:

Release Version: v1.12.1 (http://www.bioconductor.org/packages/SAIGEgds)

Package Maintainer

Xiuwen Zheng

Installation

Requires R (≥ v3.5.0), gdsfmt (≥ v1.20.0), SeqArray (≥ v1.32.0)
Recommend GNU GCC (≥ v6.0), requiring C++11
Bioconductor repository

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SAIGEgds")

The BiocManager::install() approach may require that you build from source, i.e. make and compilers must be installed on your system -- see the R FAQ for your operating system; you may also need to install dependencies manually.

Development version from Github (for developers/testers only)

library("devtools")
install_github("AbbVie-ComputationalGenomics/SAIGEgds")

Package vignette

If the package is installed from Bioconductor repository or package rebuilding, users can start R and enter to view documentation:

browseVignettes("SAIGEgds")

Examples

library(SeqArray)
library(SAIGEgds)

# open the GDS file for genetic relationship matrix (GRM)
grm_fn <- system.file("extdata", "grm1k_10k_snp.gds", package="SAIGEgds")
(grm_gds <- seqOpen(grm_fn))

# load phenotype
phenofn <- system.file("extdata", "pheno.txt.gz", package="SAIGEgds")
pheno <- read.table(phenofn, header=TRUE, as.is=TRUE)
head(pheno)
##   sample.id y     yy      x1 x2
## 1        s1 0 4.5542  1.5118  1
## 2        s2 0 3.7941  0.3898  1
## 3        s3 0 5.0411 -0.6212  1
## ...

# fit the null model
glmm <- seqFitNullGLMM_SPA(y ~ x1 + x2, pheno, grm_gds, trait.type="binary",
    sample.col="sample.id", num.thread=2)
## SAIGE association analysis:
## Filtering variants:
## [==================================================] 100%, completed, 0s
## Fit the null model: y ~ x1 + x2 + var(GRM)
##     # of samples: 1,000
##     # of variants: 9,976
##     using 2 threads
## ...

# close the file
seqClose(grm_gds)



################################

# open the GDS file for association testing
geno_fn <- system.file("extdata", "assoc_100snp.gds", package="SAIGEgds")
(geno_gds <- seqOpen(geno_fn))
## File: assoc_100snp.gds (10.5K)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 1000 LZMA_ra(12.6%), 625B }
## |--+ variant.id   { Int32 100 LZMA_ra(48.5%), 201B } *
## ...


# p-value calculation
assoc <- seqAssocGLMM_SPA(geno_gds, glmm, mac=10, parallel=2)
## SAIGE association analysis:
##     # of samples: 1,000
##     # of variants: 100
##     MAF threshold: NaN
##     MAC threshold: 10
##     missing threshold for variants: 0.1
##     p-value threshold for SPA adjustment: 0.05
##     variance ratio for approximation: 0.9391186
##     # of processes: 2
## [==================================================] 100%, completed, 0s
## # of variants after filtering by MAF, MAC and missing thresholds: 38
## Done.

head(assoc)
##   id chr pos rs.id ref alt AF.alt mac  num       beta        SE      pval pval.noadj converged
## 1  4   1   4   rs4   A   C 0.0100  20 1000  -0.074992  0.791685  0.924533   0.924533      TRUE
## 2 12   1  12  rs12   A   C 0.0150  30 1000  -0.091001  0.657140  0.889861   0.889861      TRUE
## 3 14   1  14  rs14   A   C 0.0375  75 1000  -0.075455  0.434152  0.862023   0.862023      TRUE
## ...


# close the file
seqClose(geno_gds)

Citations

Zheng X, Davis J.Wade. SAIGEgds -- an efficient statistical tool for large-scale PheWAS with mixed models. Bioinformatics (2020). DOI: 10.1093/bioinformatics/btaa731.

Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, LeFaive J, VandeHaar P, Gagliano SA, Gifford A, Bastarache LA, Wei WQ, Denny JC, Lin M, Hveem K, Kang HM, Abecasis GR, Willer CJ, Lee S. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet (2018). Sep;50(9):1335-1341. DOI: 10.1038/s41588-018-0184-y.

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D. SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics (2017). DOI: 10.1093/bioinformatics/btx145.

saigegds's People

Contributors

Stargazers

Watchers

Forkers

jgrundstad zhouhufeng rwarrencodes martin-g akhilpampana

saigegds's Issues

Using SAIGEgds for non additive models

Hi,

I use SAIGEgds for smaller scale analyses and think it's a great tool (SNP numbers approximately 1000). Iwas wondering if SAIGEgds could be used for alternative models of inheritance. So, recoding the dosage to reflect a dominant model. That also got me thinking if it could also be used for genetic risk scores, in effect having a single "SNP" that reflected the risk score. I can see how this would be achieved using the gds structure and recoding but just wondered if you had any insight into whether this approach invalidated the underlying principles of the analysis.

Best,
Richard

Version of SAIGE used in SAIGEgds ?

Hi, this is a great idea, and really useful package. 2 quesions:

I note from the README.md that this package is "based on the original SAIGE package (v0.29.4.4)". But I'm not certain if that means this package uses SAIGEv0.29.4.4. Does this mean it has not been updated as SAIGE has progressed forwards to 0.43.3? I can see a number of beneficial updates in the ChangeLog for SAIGE and I'm wondering if any of these have been implemented in this package as I can see there have been seperate improvements in this package (such as the use of ACAT-O, where SAIGE only allows SKAT-O).
In SAIGE for gene-based testing there is an option minMAFforGRM to ensure only MAF>0.05 or 0.01 are used for creating the GRM. Is this automatically done in the SAIGEgds::seqFitNullGLMM_SPA() function? I can't see an argument for it in the man pages. Although, I see that the argument "variant.id" is used to select variants to be used for the GRM (after the pruning step in the tutorial), so an additional subset step on this character vector using allele frequency derived from the pruned gds could solve the problem? Please do let me know if I am missing something in this function! Thank you.

Thank you so much for your work on this!

Error: glm.fit: fitted probabilities numerically 0 or 1 occurred

I am trying to fit NULL model using GRM generated seq-array from genotyped data. I have binary phenotype (0/1) for ~6K individuals.
When I try to fir null model I get error for probabilities numerically. Below is the code:

library(SeqArray)
library(SAIGEgds)

pheno <-read.table("phenotype_04.28.2021.txt", header=TRUE)

glmm <- seqFitNullGLMM_SPA(AD ~ SEX+ AGE + PC1 +PC2+ PC3 , pheno, "seqarray_grm_geno.gds" , trait.type="binary",sample.col="TOPMED_IID", num.thread=10)

Error:

	Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
terminate called after throwing an instance of 'std::invalid_argument'
  what():  Invalid tbb::this_task_arena::current_thread_index()!
Aborted (core dumped)

Is this related to input data type or low memory?

From vcf to bgen to gds? dosage value or genotype probability value?

Hi team,

I am interested in using this library for an admixed population with case-control imbalance. Data were imputed using TOPMED panel michigan imputation server.
I would like to know during analysis are dosage information used or, genotype probability, or genotype value used from the input GDS file?

Also, what is an ideal way to perform data transformation without much hiccups to the GDS? Using PLINK (v2.0) I can convert imputed data to the .bgen format. Is "--export bgen-1.2" right file format when using PLINK?

thank you,

Additional variables in output from seqAssocGLMM_SPA?

Is it possible to add additional variables to the output from seqAssocGLMM_SPA, i.e. INFO-score if present in the original VCF files?

Thank you for your help.

GDS sample IDs problem

Hi!
I fitted the null model and calculated association p-values for one dataset and then tried with the same dataset but changing the sample IDs (.fam file was modified first and then converted into GDS format with the seqArray function).

Coefficients and variance ratio in the null model aren't the same even though the dataset is, and association results vary too much.
Only thing that is different between the .gds files is the "sample ID" annotation branch. I repeated the first analysis (with the original sample IDs) three times and I get the same results (which I think are ok), so I believe the issue comes with the new sample ID names.

I would appreciate any help.
Thank you

Results: beta (odds ratios) < 0

In the result, "beta" indicates : beta coefficient, odds ratio if binary outcomes (alternative allele vs. reference
allele). However, when I conduct GWAS on binary outcomes, there are some SNPs with beta < 0. It's impossible that the odds ratio is < 0. Would you mind me asking how to interpret it? Should I delete SNPs with beta < 0?

   id chr      pos rs.id ref alt       AF.alt   mac   num        beta         SE       pval
1:  3  22 16868201         C   A 0.0004218781    17 20148  0.95400134 1.41909658 0.50141843
2:  5  22 16868590         G   A 0.0005707763    23 20148 -1.06041310 1.18949813 0.37267203
3:  6  22 16868603         C   T 0.0003226127    13 20148 -1.06501842 1.50172130 0.47820133
4:  7  22 16868711         G   A 0.0002977963    12 20148 -1.05899912 1.67452636 0.52711436
5: 12  22 16869223         G   A 0.2621351995 10563 20148 -0.12863379 0.06312118 0.04156131

Error: is not a SeqArray GDS file || kinship using genotyped data

create GRM in .gds file format


library(SNPRelate)

bed.fn<-"mafed_LDed.bed"
fam.fn<-"mafed_LDed.fam"
bim.fn<-"mafed_LDed.bim"

# convert
snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, "LDed_kinship.gds")

# open
genofile <- openfn.gds("LDed_kinship.gds")
genofile

# output to a GDS file
snpgdsGRM(genofile, method="GCTA", out.fn="GDS_LDed_kinship.gds")
# close
closefn.gds(genofile)

Fit model

library(SeqArray)
library(SAIGEgds)

pheno <-read.table("phenotype_PCcleaned_04.28.2021.txt", header=TRUE)

glmm <- seqFitNullGLMM_SPA(PHENO_FINAL ~ as.factor(SEX)+ AGE + PC1 +PC2+ PC3 , pheno, "GDS_LDed_kinship.gds" , trait.type="binary",sample.col="TOPMED_IID", num.thread=10)

Error:

SAIGE association analysis:
Wed Apr 28 14:41:43 2021
Open 'LDed_kinship.gds'
Error in seqOpen(gdsfile) :
  'LDed_kinship.gds' is not a SeqArray GDS file ('FileFormat' should be 'SEQ_ARRAY').

Resource: https://uw-gac.github.io/topmed_workshop_2017/computing-a-grm.html

I understand I do not have SEQ data, but genotyped data. Error in seqOpen(gdsfile) cannot work on genotyped data.
I would not like to convert genotype data into vcf and re-run complete pipeline.

Any points to proceed further?

Build failure on Linux ARM64

Hello,

We are working on adding support for Linux ARM64 for Bioconductor and we faced the following build problem:

### Running command:
###
###   /home/biocbuild/bbs-3.17-bioc/R/bin/R CMD INSTALL SAIGEgds
###
##############################################################################
##############################################################################


* installing to library ‘/home/biocbuild/bbs-3.17-bioc/R/library’
* installing *source* package ‘SAIGEgds’ ...
** using staged installation
** libs
using C++ compiler: ‘g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0’
using C++11
g++ -std=gnu++11 -I"/home/biocbuild/bbs-3.17-bioc/R/include" -DNDEBUG  -I'/home/biocbuild/bbs-3.17-bioc/R/library/Rcpp/include' -I'/home/biocbuild/bbs-3.17-bioc/R/library/RcppArmadillo/include' -I'/home/biocbuild/bbs-3.17-bioc/R/library/RcppParallel/include' -I/usr/local/include    -fPIC  -g -O2  -Wall -c SPATest.cpp -o SPATest.o
SPATest.cpp:42:15: error: ‘COREARRAY_TARGET_CLONES’ does not name a type
   42 | inline static COREARRAY_TARGET_CLONES
      |               ^~~~~~~~~~~~~~~~~~~~~~~
SPATest.cpp:56:15: error: ‘COREARRAY_TARGET_CLONES’ does not name a type
   56 | inline static COREARRAY_TARGET_CLONES
      |               ^~~~~~~~~~~~~~~~~~~~~~~
SPATest.cpp:71:15: error: ‘COREARRAY_TARGET_CLONES’ does not name a type
   71 | inline static COREARRAY_TARGET_CLONES
      |               ^~~~~~~~~~~~~~~~~~~~~~~
SPATest.cpp:93:9: error: expected initializer before ‘getroot_K1’
   93 |         getroot_K1(const double g_pos, const double g_neg,
      |         ^~~~~~~~~~
SPATest.cpp:140:9: error: expected initializer before ‘getroot_K1_fast’
  140 |         getroot_K1_fast(const double g_pos, const double g_neg, double &root,
      |         ^~~~~~~~~~~~~~~
SPATest.cpp:189:9: error: expected initializer before ‘get_saddle_prob’
  189 |         get_saddle_prob(double t, size_t n_g, const double mu[], const double g[],
      |         ^~~~~~~~~~~~~~~
SPATest.cpp:212:9: error: expected initializer before ‘get_saddle_prob_fast’
  212 |         get_saddle_prob_fast(double t, size_t n_nonzero, const double mu[],
      |         ^~~~~~~~~~~~~~~~~~~~
SPATest.cpp:238:9: error: expected initializer before ‘Saddle_Prob’
  238 |         Saddle_Prob(double q, double m1, double var1, size_t n_g, const double mu[],
      |         ^~~~~~~~~~~
SPATest.cpp:300:9: error: expected initializer before ‘Saddle_Prob_Fast’
  300 |         Saddle_Prob_Fast(double q, double m1, double var1, size_t n_g,
      |         ^~~~~~~~~~~~~~~~
make: *** [/home/biocbuild/bbs-3.17-bioc/R/etc/Makeconf:198: SPATest.o] Error 1
ERROR: compilation failed for package ‘SAIGEgds’
* removing ‘/home/biocbuild/bbs-3.17-bioc/R/library/SAIGEgds’

https://yikun.github.io/bioconductor-0208/report/SAIGEgds/kunpeng1-buildsrc.html

Any idea what might be the problem ?