zhanxw / rvtests

Rare variant test software for next generation sequencing data

Table of Contents

Introduction
Citation
Download
Quick tutorial
Input files
Models
Association test options
Sex chromosome analysis
Kinship generation
Resources
- UCSC RefFlat Genes
- Gencode Genes
Frequently Asked Questions (FAQ)
Feedback/Contact

(Updated: October 2017)

Introduction

Rvtests, which stands for Rare Variant tests, is a flexible software package for genetic association analysis of sequence datasets. Since its inception, rvtests has been developed as a comprehensive tool to support genetic association analysis and meta-analysis. It can analyze both unrelated and related (family-based) individuals for both quantitative and binary outcomes. It includes a variety of association tests (e.g. single variant score test, burden test, variable threshold test, SKAT test, fast linear mixed model score test). It takes VCF/BGEN/PLINK format files as genotype input and PLINK format files for phenotypes and covariates.

With a new implementation of the BOLT-LMM/MINQUE algorithm as well as a series of software engineering optimizations, our software package is capable of analyzing datasets of up to 1,000,000 individuals in linear mixed models on a computer workstation, which makes our tool one of the very few options for analyzing large biobank scale datasets, such as UK Biobank. RVTESTS supports both single variant and gene-level tests. It also allows for highly efficient generation of covariance matrices between score statistics in RAREMETAL format, which can be used to support the next wave of meta-analysis that incorporates large biobank datasets.

A (much) larger sample size can be handled using linear regression or logistic regression models.

Citation

Xiaowei Zhan, Youna Hu, Bingshan Li, Goncalo R. Abecasis, and Dajiang J. Liu

RVTESTS: An Efficient and Comprehensive Tool for Rare Variant Association Analysis Using Sequence Data

Bioinformatics 2016 32: 1423-1426. doi:10.1093/bioinformatics/btw079 (PDF)

Download

Source code can be downloaded from GitHub or the GitHub page. On Linux, you can use

git clone https://github.com/zhanxw/rvtests

to retrieve the latest distribution of rvtests. To install, go to the rvtests folder and type make. When compilation succeeds, the executable is placed under the executable folder. Typing executable/rvtest will get you started.

Alternatively, binary executable files (for Linux 64-bit platform) can be downloaded from here.

Quick tutorial

Here are quick examples of how to use the rvtests software in typical use cases.

Single variant tests

rvtest --inVcf input.vcf --pheno phenotype.ped --out output --single wald,score

This runs single variant Wald and score tests for every variant in the input.vcf file. The 6th column of the phenotype file, phenotype.ped, which is in PLINK format, is used as the phenotype. Rvtests automatically checks whether the phenotype is a binary trait or a quantitative trait. For binary traits, the recommended coding is controls as 1, cases as 2, and missing phenotypes as -9 or 0.

For other types of association tests, you can refer to Models.

Groupwise tests

Groupwise tests include three major kinds of tests.

Burden tests: group rare variants, usually those with a frequency below 1% or 5%, for association tests. This category includes the CMC test, Zeggini test, Madsen-Browning test, CMAT test, and rare-cover test.
Variable threshold tests: group variants under different frequency thresholds.
Kernel methods: suitable for testing rare variants that have different directions of effect. These include the SKAT test and the KBAC test.

All of the above tests require variants to be grouped into a unit. The simplest case is to use the gene as the grouping unit. For other grouping methods, see Grouping.

To perform rare variant tests by gene, use --geneFile to specify the gene ranges in refFlat format. We provide different gene definitions in the Resources section. You can use --gene to specify which gene(s) to test. For example, --gene CFH,ARMS2 will perform association tests on the CFH and ARMS2 genes. If the --gene option is not provided, all genes will be tested.

The following command line demonstrates how to use the CMC method, the variable threshold method (proposed by Price) and kernel based methods (SKAT by Shawn Lee and KBAC by Dajiang Liu) to test every gene listed in refFlat_hg19.txt.gz.

rvtest --inVcf input.vcf --pheno phenotype.ped --out output --geneFile refFlat_hg19.txt.gz --burden cmc --vt price --kernel skat,kbac

Related individual tests

To test related individuals, you will need to first create a kinship matrix:

vcf2kinship --inVcf input.vcf --bn --out output

The option --bn calculates the empirical kinship using the Balding-Nichols method. You can specify --ibs to obtain IBS kinship, or use --pedigree input.ped to calculate kinship from known pedigree information.
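As a rough illustration of what an empirical Balding-Nichols kinship looks like, here is a minimal Python sketch. It uses one common form of the estimator (genotypes centered by twice the allele frequency and scaled by 2p(1-p), then averaged over markers). Whether vcf2kinship uses exactly this scaling, and how it treats missing genotypes, is an assumption here; the function name is hypothetical and not part of rvtests.

```python
def balding_nichols_kinship(genotypes):
    """Sketch of an empirical kinship matrix (not the vcf2kinship source).

    genotypes: one list per sample of 0/1/2 alternate-allele counts,
    with no missing values (an assumption of this sketch).
    """
    n_samples = len(genotypes)
    n_markers = len(genotypes[0])
    kinship = [[0.0] * n_samples for _ in range(n_samples)]
    used = 0
    for m in range(n_markers):
        column = [sample[m] for sample in genotypes]
        p = sum(column) / (2.0 * n_samples)  # alternate allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic markers carry no information
        denom = 2.0 * p * (1.0 - p)
        centered = [g - 2.0 * p for g in column]
        for i in range(n_samples):
            for j in range(n_samples):
                kinship[i][j] += centered[i] * centered[j] / denom
        used += 1
    if used == 0:
        raise ValueError("no polymorphic markers")
    return [[value / used for value in row] for row in kinship]


# Two identical samples and one with opposite genotypes.
example = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]
kinship = balding_nichols_kinship(example)
```

In this toy input the first two samples are identical, so their off-diagonal entry equals their diagonal entries, while the third sample gets a negative kinship with both.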

Then you can use linear mixed model based association tests such as the Fast-LMM score test, the Fast-LMM LRT test and the Grammar-gamma test. An example command is shown:

rvtest --inVcf input.vcf --pheno phenotype.ped --out output --kinship output.kinship --single famScore,famLRT,famGrammarGamma

Meta-analysis tests

The meta-analysis models output association test results and a genotype covariance matrix. These summary statistics can be used in rare variant association analysis (details). We provide a single variant score test and generate a genotype covariance matrix. You can use this command:

rvtest --inVcf input.vcf --pheno phenotype.ped --meta score,cov --out output

In a more realistic scenario, you may want to adjust for covariates and inverse normalize the residuals obtained from the null model (link to our methodology paper). In that case, this command will work:

rvtest --inVcf input.vcf --pheno phenotype.ped --covar example.covar --covar-name age,bmi --inverseNormal --useResidualAsPhenotype  --meta score,cov --out output

Here --covar specifies a covariate file, and --covar-name specifies which covariates are used in the analysis. The covariate file format is described in the Covariate file section. --inverseNormal --useResidualAsPhenotype specifies the trait transformation method: a regression model of the phenotype on the covariates is fitted first (the intercept is added automatically), then the residuals are inverse normalized. Details can be found in the Trait transformation section.

We support both unrelated individuals and related individuals (e.g. family data). For related individuals, append --kinship input.kinship to the command line:

rvtest --inVcf input.vcf --pheno phenotype.ped --meta score,cov --out output --kinship input.kinship

The file input.kinship is calculated by the vcf2kinship program, whose usage is described in Related individual tests.

NOTE: by default, the covariance matrix is calculated in a sliding window of 1 million base pairs. You can change this setting via the option windowSize. For example, --meta cov[windowSize=500000] specifies a 500k-bp sliding window.
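To make the effect of windowSize concrete, the following Python sketch lists which variant pairs would fall inside a sliding window, given sorted positions on one chromosome. It only illustrates the idea; the helper name is hypothetical, and whether rvtests treats the window boundary as inclusive is an assumption.

```python
def pairs_within_window(positions, window_size=1_000_000):
    """Return index pairs (i, j), i <= j, at most window_size bp apart.

    positions: ascending base-pair positions on a single chromosome.
    The inclusive boundary (<=) is an assumption of this sketch.
    """
    pairs = []
    for i, start in enumerate(positions):
        j = i
        while j < len(positions) and positions[j] - start <= window_size:
            pairs.append((i, j))
            j += 1
    return pairs


positions = [100, 400_000, 900_000, 1_500_000]
default_pairs = pairs_within_window(positions)          # 1 Mbp window
narrow_pairs = pairs_within_window(positions, 500_000)  # windowSize=500000
```

A smaller window yields fewer pairs, and therefore a smaller covariance output, at the cost of dropping covariances between distant variants.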

Dominant models and recessive models

Dominant and recessive disease models are supported by appending "dominant" and/or "recessive" after the "--meta" option. For example, using "--meta dominant,recessive" will generate two sets of files. For the dominant model, they are "prefix.MetaDominant.assoc" and "prefix.MetaDominantCov.assoc.gz"; for the recessive model, they are "prefix.MetaRecessive.assoc" and "prefix.MetaRecessiveCov.assoc.gz". Internally, in dominant models, genotypes 0/1/2 are coded as 0/1/1; in recessive models, genotypes 0/1/2 are coded as 0/0/1. Missing genotypes will be imputed to the mean.
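The genotype recoding can be sketched in a few lines of Python. This is an illustration of the coding rules stated above, not the rvtests implementation; the function name is hypothetical, and imputing the mean after recoding (rather than before) is an assumption.

```python
# Coding tables taken from the text: 0/1/2 -> 0/1/1 and 0/1/2 -> 0/0/1.
CODING = {
    "dominant": {0: 0, 1: 1, 2: 1},
    "recessive": {0: 0, 1: 0, 2: 1},
}


def recode_genotypes(genotypes, model):
    """Recode 0/1/2 genotypes; None marks a missing genotype.

    Missing values are replaced by the mean of the recoded, observed
    genotypes (the order of recoding and imputation is assumed).
    """
    table = CODING[model]
    coded = [None if g is None else table[g] for g in genotypes]
    observed = [c for c in coded if c is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if c is None else float(c) for c in coded]


dominant = recode_genotypes([0, 1, 2, None], "dominant")
recessive = recode_genotypes([0, 1, 2, None], "recessive")
```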

Input files

Genotype files (VCF, BCF, BGEN, KGG)

Rvtests supports VCF (Variant Call Format) files in both plain text and gzipped formats. To use group-based rare variant tests, the VCF files must be indexed using tabix.

Here are the commands to convert plain text format to bgzipped VCF format:

(grep ^"#" $your_old_vcf; grep -v ^"#" $your_old_vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c > $your_vcf_file 
tabix -f -p vcf $your_vcf_file

The above commands will (1) remove the chr prefix from chromosome names; (2) sort VCF files by chromosome first, then by chromosomal positions; (3) compress using bgzip; (4) create tabix index.

Rvtests supports genotype dosages. Use --dosage DosageTag to specify the dosage tag. For example, if the VCF format field is "GT:EC" and an individual genotype field is "0/0:0.02", you can use --dosage EC, and rvtests will use the dosage 0.02 in the regression models.

Rvtests supports BGEN input format v1.0 through v1.3. Instead of using --inVcf, use --inBgen to specify a BGEN file and --inBgenSample to specify the accompanying SAMPLE file.

Rvtests supports the KGGSeq input format. This format is an extension of the binary PLINK formats. Use --inKgg in place of --inVcf.
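For readers unfamiliar with VCF FORMAT tags, this short Python sketch shows how a dosage tag maps to a value in a sample column. It mirrors the "GT:EC" example above; the function name is hypothetical and the handling of missing values (".") is an assumption, not a description of rvtests internals.

```python
def extract_dosage(format_field, sample_field, tag):
    """Return the numeric value of `tag` for one sample, or None if missing.

    format_field: the VCF FORMAT column, e.g. "GT:EC".
    sample_field: one sample column, e.g. "0/0:0.02".
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    if tag not in keys:
        raise KeyError(f"tag {tag!r} not found in FORMAT {format_field!r}")
    index = keys.index(tag)
    # VCF allows trailing fields to be dropped; treat those as missing.
    if index >= len(values) or values[index] in (".", ""):
        return None
    return float(values[index])


dosage = extract_dosage("GT:EC", "0/0:0.02", "EC")
```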

Phenotype file

You can use --mpheno $phenotypeColumnNumber or --pheno-name to specify a given phenotype.

An example phenotype file (example.pheno) has the following format:

fid iid fatid matid sex y1 y2 y3 y4
P1 P1 0 0 0 1.7642934435605 -0.733862638327895 -0.980843608339726 2
P2 P2 0 0 0 0.457111744989746 0.623297281416372 -2.24266162284447 1
P3 P3 0 0 0 0.566689682543218 1.44136462889459 -1.6490100777089 1
P4 P4 0 0 0 0.350528353203767 -1.79533911725537 -1.11916876241804 1
P5 P5 0 0 1 2.72675074738545 -1.05487747371158 -0.33586430010589 2

The phenotype file is specified by the option --pheno example.pheno. The default phenotype column header is "y1". If you want to use another column as the phenotype for association analysis (e.g. the column with header y2), you may specify the phenotype by column number or by name using either

--mpheno 2
--pheno-name y2

NOTE: to use "--pheno-name", the header line must start with "fid iid", as PLINK requires.
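The two ways of selecting a phenotype can be sketched as follows. This Python illustration assumes that the first five columns are fid, iid, fatid, matid and sex, so that --mpheno 1 corresponds to the 6th column; the function and its parameters are hypothetical stand-ins for the command-line options, not rvtests code.

```python
def select_phenotype(lines, mpheno=1, pheno_name=None):
    """Map each iid to the raw phenotype string of the chosen column.

    mpheno: 1-based phenotype index after the five leading columns
            (stand-in for --mpheno).
    pheno_name: column header to look up (stand-in for --pheno-name).
    """
    header = lines[0].split()
    has_header = header[:2] == ["fid", "iid"]
    if pheno_name is not None:
        if not has_header:
            raise ValueError('header line must start with "fid iid"')
        column = header.index(pheno_name)
    else:
        column = 5 + (mpheno - 1)  # 0-based index of the chosen phenotype
    rows = lines[1:] if has_header else lines
    return {row.split()[1]: row.split()[column] for row in rows if row.strip()}


example = [
    "fid iid fatid matid sex y1 y2 y3 y4",
    "P1 P1 0 0 0 1.7642934435605 -0.733862638327895 -0.980843608339726 2",
    "P2 P2 0 0 0 0.457111744989746 0.623297281416372 -2.24266162284447 1",
]
by_index = select_phenotype(example, mpheno=2)
by_name = select_phenotype(example, pheno_name="y2")
```

Both calls select the same y2 column for this example file.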

In phenotype file, missing values can be denoted by NA or any non-numeric values. Individuals with missing phenotypes will be automatically dropped from subsequent association analysis. For each missing phenotype value, a warning will be generated and recorded in the log file.

When the phenotype values are only 0, 1 and 2, rvtests will automatically treat the phenotype as a binary trait. If you want to treat it as a continuous trait instead, use the "--qtl" option.
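The rules above (non-numeric values are missing; values limited to 0, 1 and 2 indicate a binary trait unless --qtl is given) can be sketched in Python. This is a simplified illustration with hypothetical function names; it does not model the -9 missing code, and how rvtests handles strings such as "nan" is an assumption.

```python
import math


def parse_phenotype(text):
    """Return a float, or None when the value counts as missing."""
    try:
        value = float(text)
    except ValueError:
        return None  # e.g. "NA" or any other non-numeric string
    # Python parses "nan" and "inf"; this sketch treats them as missing.
    return value if math.isfinite(value) else None


def classify_trait(raw_values, force_qtl=False):
    """Return "binary" or "quantitative"; force_qtl stands in for --qtl."""
    observed = [v for v in map(parse_phenotype, raw_values) if v is not None]
    only_012 = bool(observed) and all(v in (0.0, 1.0, 2.0) for v in observed)
    if only_012 and not force_qtl:
        return "binary"
    return "quantitative"


trait_y4 = classify_trait(["2", "1", "1", "1", "2"])
trait_y1 = classify_trait(["1.7642934435605", "NA", "0.566689682543218"])
```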

Covariate file

You can use --covar and --covar-name to specify covariates that will be used for single variant association analysis. These are optional parameters. If you do not have covariates in the data, they can be omitted.

The covariate file (e.g. example.covar) has a format similar to that of the phenotype file:

fid iid fatid matid sex y1 y2 y3 y4
P1 P1 0 0 0 1.911 -1.465 -0.817 1
P2 P2 0 0 0 2.146 -2.451 -0.178 2
P3 P3 0 0 0 1.086 -1.194 -0.899 1
P4 P4 0 0 0 0.704 -1.052 -0.237 1
P5 P5 0 0 1 2.512 -3.085 -2.579 1

The covariate file is specified by the --covar option (e.g. --covar example.covar). To specify which covariates are used in the association analysis, use the option --covar-name. For example, when age, bmi and 3 PCs are used for association analysis, specify --covar example.covar --covar-name age,bmi,pc1,pc2,pc3.

Note: Missing data in the covariate file can be labeled by any non-numeric value (e.g. NA). They will be automatically imputed to the mean value in the data file.
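Mean imputation of a covariate column can be sketched as below. This Python snippet only illustrates the rule in the note; the function name is hypothetical, and taking the mean over the observed values of that column is an assumption about the exact definition.

```python
import math


def impute_covariate(column):
    """Replace non-numeric entries by the mean of the observed values."""
    parsed = []
    for text in column:
        try:
            value = float(text)
        except ValueError:
            value = None  # e.g. "NA"
        if value is not None and not math.isfinite(value):
            value = None  # "nan"/"inf" are treated as missing in this sketch
        parsed.append(value)
    observed = [v for v in parsed if v is not None]
    if not observed:
        raise ValueError("covariate column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in parsed]


imputed = impute_covariate(["1.0", "NA", "3.0"])
```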

Trait transformation

In this meta-analysis, we use inverse normal transformed residuals in the association analysis, which is achieved by combining --inverseNormal and --useResidualAsPhenotype. Specifically, we first fit the null model by regressing the phenotype on the covariates. The residuals are then inverse normal transformed (see Appendix A for a more detailed formula for the transformation). The transformed residuals are used to obtain score statistics.

In meta-analysis, an example command for using rvtests looks like the following:

./rvtest --inVcf $vcf --pheno example.pheno --covar example.covar --covar-name age,bmi --inverseNormal --useResidualAsPhenotype --meta score,cov --out $output_prefix
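The two-step transformation (residuals from the null model, then inverse normal transformation) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a single covariate with an intercept, a rank-based transform using the offset (rank - 0.5)/n, and ties broken by input order. The exact formula rvtests uses is the one in Appendix A, which may differ; the function names are hypothetical.

```python
from statistics import NormalDist, mean


def residuals_one_covariate(y, x):
    """Residuals of y regressed on one covariate x plus an intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def inverse_normal(values):
    """Rank-based inverse normal transform (offset 0.5 is an assumption)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    standard_normal = NormalDist()
    transformed = [0.0] * n
    for rank, index in enumerate(order, start=1):
        transformed[index] = standard_normal.inv_cdf((rank - 0.5) / n)
    return transformed


phenotype = [1.2, 2.9, 2.1, 4.8, 4.0]  # made-up values for illustration
age = [30.0, 40.0, 35.0, 60.0, 50.0]   # made-up covariate
transformed = inverse_normal(residuals_one_covariate(phenotype, age))
```

The transformed values depend only on the ranks of the residuals, which makes the resulting score statistics less sensitive to outliers in the phenotype.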

Models

Rvtests supports various association models.

Single variant tests

Single variant	Model(#)	Traits(##)	Covariates	Related / unrelated	Description
Score test	score	B, Q	Y	U	Only the null model is used to perform the test
Wald test	wald	B, Q	Y	U	Only the alternative model is fitted, and the effect size is estimated
Exact test	exact	B	N	U	Fisher's test (allelic test)
Dominant Exact test	dominantExact	B	N	U	Fisher's test (dominant codings)
Fam LRT	famLRT	Q	Y	R, U	Fast-LMM model style likelihood ratio test
Fam Score	famScore	Q	Y	R, U	Fast-LMM model style score test
Grammar-gamma	famGrammarGamma	Q	Y	R, U	Grammar-gamma method
Firth regression	firth	B	Y	U	Logistic regression with Firth correction by David Firth, discussed by Clement Ma.

(#) The Model column lists the names recognized by rvtests. For example, --single score will apply the score test.