Git Product home page Git Product logo

rvtests's Introduction

Table of Contents

Build Status

(Updated: October 2017)

Introduction

Rvtests, which stands for Rare Variant tests, is a flexible software package for genetic association analysis for sequence datasets. Since its inception, rvtests was developed as a comprehensive tool to support genetic association analysis and meta-analysis. It can analyze both unrelated individual and related (family-based) individuals for both quantitative and binary outcomes. It includes a variety of association tests (e.g. single variant score test, burden test, variable threshold test, SKAT test, fast linear mixed model score test). It takes VCF/BGEN/PLINK format as genotype input file and takes PLINK format phenotype file and covariate file.

With new implementation of the BOLT-LMM/MINQUE algorithm as well as a series of software engineering optimizations, our software package is capable of analyzing datasets of up to 1,000,000 individuals in linear mixed models on a computer workstation, which makes our tool one of the very few options for analyzing large biobank scale datasets, such as UK Biobank. RVTESTS supports both single variant and gene-level tests. It also allows for highly effcient generation of covariance matrices between score statistics in RAREMETAL format, which can be used to support the next wave of meta-analysis that incorporates large biobank datasets.

A (much) larger sample size can be handled using linear regression or logistic regression models.

Citation

Xiaowei Zhan, Youna Hu, Bingshan Li, Goncalo R. Abecasis, and Dajiang J. Liu

RVTESTS: An Efficient and Comprehensive Tool for Rare Variant Association Analysis Using Sequence Data

Bioinformatics 2016 32: 1423-1426. doi:10.1093/bioinformatics/btw079 (PDF)

Download

Source codes can be downloaded from github or github page. In Linux, you can use

git clone https://github.com/zhanxw/rvtests

to retrieve the latest distribution for rvtests. To install, go to the rvtests folder and type make. When compilation succeed, the executable is under the executable folder. Simply type executable/rvtests can get you started.

Alternatively, binary executable files (for Linux 64-bit platform) can be downloaded from here.

Quick tutorial

Here is a quick example of how to use rvtests software in typical use cases.

Single variant tests

rvtest --inVcf input.vcf --pheno phenotype.ped --out output --single wald,score

This specifies single variant Wald and score test for association tests for every variant in the input.vcf file. The 6th column of the phenotype file, phenotype.ped, which is in PLINK format, is used. Rvtests will automatically check whether the phenotype is binary trait or quantitative trait. For binary trait, the recommended way of coding is to code controls as 1, cases as 2, missing phenotypes as -9 or 0.

For other types of association tests, you can refer to Models.

Groupwise tests

Groupwise tests includes three major kinds of tests.

  • Burden tests: group variants, which are usually less than 1% or 5% rare variants, for association tests. The category includes: CMC test, Zeggini test, Madsen-Browning test, CMAT test, and rare-cover test.
  • Variable threshold tests: group variants under different frequency thresholds.
  • Kernel methods: suitable to tests rare variants having different directions of effects. These includes SKAT test and KBAC test.

All above tests requires to group variants into a unit. The simplest case is to use gene as grouping unit. For different grouping method, see Grouping.

To perform rare variant tests by gene, you need to use --geneFile to specify the gene range in a refFlat format. We provided different gene definitions in the Resources section. You can use --gene to specify which gene(s) to test. For example, specify --gene CFH,ARMS2 will perform association tests on CFH and ARMS2 genes. If there is no providing --gene option, all genes will be tests.

The following command line demonstrate how to use CMC method, variable threshold method(proposed by Price) and kernel based method (SKAT by Shawn Lee and KBAC by Dajiang Liu) to test every gene listed in refFlat_hg19.txt.gz.

rvtest --inVcf input.vcf --pheno phenotype.ped --out output --geneFile refFlat_hg19.txt.gz --burden cmc --vt price --kernel skat,kbac

Related individual tests

To test related individuals, you will need to first create a kinship matrix:

vcf2kinship --inVcf input.vcf --bn --out output

The option --bn means calculating empirical kinship using Balding-Nicols method. You can specify --ibs to obtain IBS kinship or use --pedigree input.ped to calculate kinship from known pedigree information.

Then you can use linear mixed model based association tests such as Fast-LMM score test, Fast-LMM LRT test and Grammar-gamma tests. An exemplar command is shown:

rvtest --inVcf input.vcf --pheno phenotype.ped --out output --kinship output.kinship --single famScore,famLRT,famGrammarGamma

Meta-analysis tests

The meta-analysis models outputs association test results and genotype covariance matrix. These calculated summary statistics can be used in rare variant association analysis (details). We provide single variant score test and generate a genotype covariance matrix. You can use this command:

rvtest --inVcf input.vcf --pheno phenotype.ped --meta score,cov --out output

In a more realistic scenario, you may want to adjust for covariates and want to inverse normalized residuals obtained in null model (link to our methodology paper), then this command will work:

rvtest --inVcf input.vcf --pheno phenotype.ped --covar example.covar --covar-name age,bmi --inverseNormal --useResidualAsPhenotype  --meta score,cov --out output

Here the --covar specify a covariate file, and --covar-name specify which covariates can used in the analysis. Covariate file format can be found [here](#Covariate file). --inverseNormal --useResidualAsPhenotype specifies trait transformation method. That means first fit a regression model of the phenotype on covariates (intercept automatically added), then the residuals are inverse normalized. Trait transformation details can be found [here](#Trait transformation).

We support both unrelated individuals and related individuals (e.g. family data). You need to append --kinship input.kinship to the command line:

rvtest --inVcf input.vcf --pheno phenotype.ped --meta score,cov --out output --kinship input.kinship

The file input.kinship is calculated by vcf2kinship program, and usage to this program is described in Related individual tests.

NOTE: by default, the covariance matrix are calculated in a sliding-window of 1 million base pairs. You can change this setting via the option windowSize. For example, --meta cov[windowSize=500000] specify a 500k-bp sliding window.

Dominant models and recessive models

Dominant and recessive disease models are supported by appending "dominant" and/or "recessive" after "--meta" option. For example, use "--meta dominant,recessive" will generate two sets of files. For dominant model, they are "prefix.MetaDominant.assoc" and "prefix.MetaDominantCov.assoc.gz"; for recessive model, they are "prefix.MetaRecessive.assoc" and "prefix.MetaRecessiveCov.assoc.gz". Internally, in dominant models, genotypes 0/1/2 are coded as 0/1/1; in recessive models, genotypes 0/1/2 are coded as 0/0/1. Missing genotypes will be imputed to the mean.

Input files

Genotype files (VCF, BCF, BGEN, KGG)

Rvtests supports VCF (Variant Call Format) files. Files in both plain text format or gzipped format are supported. To use group-based rare variant tests, indexed the VCF files using tabix are required.

Here are the commands to convert plain text format to bgzipped VCF format:

(grep ^"#" $your_old_vcf; grep -v ^"#" $your_old_vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c > $your_vcf_file 
tabix -f -p vcf $your_vcf_file

The above commands will (1) remove the chr prefix from chromosome names; (2) sort VCF files by chromosome first, then by chromosomal positions; (3) compress using bgzip; (4) create tabix index.

Rvtests support genotype dosages. Use --dosage DosageTag to specify the dosage tag. For example, if VCF format field is "GT:EC" and individual genotype fields is "0/0:0.02", you can use --dosage EC, and rvtests will use the dosage 0.02 in the regression models.

Rvtests suppport BGEN input format v1.0 throught v1.3. Instead of using --inVcf, use --inBgen to specify a BGEN file and --inBgenSample to specify the accompany SAMPLE file.

Rvtests support KGGSeq input format. This format is an extension to binary PLINK formats. Use --inKgg to replace --inVcf.

Phenotype file

You can use --mpheno $phenotypeColumnNumber or --pheno-name to specify a given phenotype.

An example phenotype file, (example.pheno), has the following format:

fid iid fatid matid sex y1 y2 y3 y4
P1 P1 0 0 0 1.7642934435605 -0.733862638327895 -0.980843608339726 2
P2 P2 0 0 0 0.457111744989746 0.623297281416372 -2.24266162284447 1
P3 P3 0 0 0 0.566689682543218 1.44136462889459 -1.6490100777089 1
P4 P4 0 0 0 0.350528353203767 -1.79533911725537 -1.11916876241804 1
P5 P5 0 0 1 2.72675074738545 -1.05487747371158 -0.33586430010589 2

Phenotype file is specified by the option --pheno example.pheno . The default phenotype column header is “y1”. If you want to use alternative columns as phenotype for association analysis (e.g the column with header y2), you may specify the phenotype by column or by name using either

  • --mpheno 2
  • --pheno-name y2

NOTE: to use “--pheno-name”, the header line must starts with “fid iid” as PLINK requires.

In phenotype file, missing values can be denoted by NA or any non-numeric values. Individuals with missing phenotypes will be automatically dropped from subsequent association analysis. For each missing phenotype value, a warning will be generated and recorded in the log file.

When the phenotype values are only 0, 1 and 2, rvtests will automatically treat it as binary traits. However, if you want to treat it as continuous trait, please use "--qtl" option.

Covariate file

You can use --covar and --covar-name to specify covariates that will be used for single variant association analysis. This is an optional parameter. If you do not have covariate in the data, this option can be ignored.

The covariate file, (e.g. example.covar) has a similar format as the phenotype file:

fid iid fatid matid sex y1 y2 y3 y4
P1 P1 0 0 0 1.911 -1.465 -0.817 1
P2 P2 0 0 0 2.146 -2.451 -0.178 2
P3 P3 0 0 0 1.086 -1.194 -0.899 1
P4 P4 0 0 0 0.704 -1.052 -0.237 1
P5 P5 0 0 1 2.512 -3.085 -2.579 1

The covariate file is specified by the --covar option (e.g. --covar example.covar). To specify covariates that will be used in the association analysis, the option --covar-name can be used. For example, when age, bmi and 3 PCs are used for association analysis, the following option can be specified for the rvtests program, i.e. --covar example.covar --covar-name age,bmi,pc1,pc2,pc3.

Note: Missing data in the covariate file can be labeled by any non-numeric value (e.g. NA). They will be automatically imputed to the mean value in the data file.

Trait transformation

In this meta-analysis, we use inverse normal transformed residuals in the association analysis, which is achieved by using a combination of --inverseNormal and --useResidualAsPhenotype. Specifically, we first fit the null model by regressing phenotype on covariates. The residuals are then inverse normal transformed (see Appendix A more detailed formula for transformation). Transformed residuals will be used to obtain score statistics.

In meta analysis, an exemplar command for using rvtests looks like the following:

./rvtest --inVcf $vcf --pheno $example.pheno --covar example.covar --covar-name age,bmi --inverseNormal --useResidualAsPhenotype  --meta score,cov --out $output_prefix  

Models

Rvtests support various association models.

Single variant tests

Single variant Model(#) Traits(##) Covariates Related / unrelated Description
Score test score B, Q Y U Only null model is used to performed the test
Wald test wald B, Q Y U Only fit alternative model, and effect size will be estimated
Exact test exact B N U Fisher's test (allelic test)
Dominant Exact test dominantExact B N U Fisher's test (dominant codings)
Fam LRT famLRT Q Y R, U Fast-LMM model
Fam Score famScore Q Y R, U Fast-LMM model style likelihood ratio test
Grammar-gamma famGrammarGamma Q Y R, U Grammar-gamma method
Firth regression firth B Y U Logistic regression with Firth correction by David Firth, discussed by Clement Ma.

(#) Model columns list the recognized names in rvtests. For example, use --single score will apply score test.

(##) In trait column, B or Q stand for binary or quantitative trait, respectively.

Burden tests

Burden tests Model(#) Traits(##) Covariates Related / unrelated Description
CMC cmc B, Q Y U Collapsing and combine rare variants by Bingshan Li.
Zeggini zeggini B, Q Y U Aggregate counts of rare variants by Morris Zeggini.
Madsen-Browning mb B Y U Up-weight rare variant using inverse frequency from controls by Madsen.
Fp fp B Y U Up-weight rare variant using inverse frequency from controls by Danyu Lin.
Exact CMC exactCMC B N U Collapsing and combine rare variants, then perform Fisher's exact test.
CMC Wald cmcWald B, Q Y U Collapsing and combine rare variants, then perform Wald test.
RareCover rarecover B N U Find optimal grouping unit for rare variant tests by Thomas Hoffman.
CMAT cmat B N U Test non-coding variants by Matt Zawistowski.
FamCMC famcmc Q Y R Collapsing and combine rare variants extended to related samples.
FamZeggini famzeggini Q Y R Aggregate counts of rare variants extended to related samples.

(#) Model columns list the recognized names in rvtests. For example, use --burden cmc will apply CMC test.

(##) In trait column, B or Q stand for binary or quantitative trait, respectively.

Variable threshold models

Single variant Model(#) Traits(##) Covariates Related / unrelated Description
Variable threshold model by permutation price B, Q N U Every rare-variant frequency cutoffs are tests by Alkes Price.
Variable threshold model by analytic form analytic Q Y U Every rare-variant frequency cutoffs are tests by Danyu Lin.
Variable threshold model by analytic form famAnalytic Q Y R Every rare-variant frequency cutoffs are tests by Dajiang Liu.

(#) Model columns list the recognized names in rvtests. For example, use --vt price will variable threshold test. NOTE: our implementation of Price's test diffs from the original method described in Price's publication. We test every minor allele frequency cutoff (instead of reference allele counts) and this is a two-sided test (instead of one-sided test).

(##) In trait column, B or Q stand for binary or quantitative trait, respectively.

Kernel models

Kernel Model(#) Traits(##) Covariates Related / unrelated Description
SKAT skat B, Q Y U Sequencing kernel association test by Shawn Lee.
SKATO skato B, Q Y U Optimal sequencing kernel association test (SKAT-O) by Shawn Lee. (###)
KBAC kbac B N U Kernel-based adaptive clustering model by Dajiang Liu.
FamSKAT famSkat Q Y R Sequencing kernel association test extended to related individuals by Han Chen.

(#) Model columns list the recognized names in rvtests. For example, use --kernel skat will apply SKAT test. To further customize SKAT test, you can use --kernel skat[nPerm=100:alpha=0.001:beta1=1:beta2=20] to specify permutation counts, type-1 error, beta distribution parameters for up-weighting rare variants. Rvtests will output a message showing:

[INFO]  SKAT test significance will be evaluated using 10000 permutations at alpha = 0.001 (beta1 = 1.00, beta2 = 20.00)

Tip: When specifying beta1=1 and beta2 = 1, the associate test is equivalent to an unweighted SKAT test.

(##) In trait column, B or Q stand for binary or quantitative trait, respectively.

(###) SKAT-O implementation may have slightly different results compared with outputs from SKAT R package. That's probably due to numerical stability.

Meta-analysis models

Type Model(#) Traits(##) Covariates Related / unrelated Description
Score test score B,Q Y R, U standard score tests
Dominant model dominant B,Q Y R, U score tests and covariance matrix under dominant disease model
Recessive model recessive B,Q Y R, U score tests and covariance matrix under recessive disease model
Covariance cov B,Q Y R, U covariance matrix
BOLT-LMM score test bolt Q Y R BOLT-LMM based score tests (###)
BOLT-LMM covariance boltCov Q Y R BOLT-LMM based score tests (###)

(#) Model columns list the recognized names in rvtests. For example, use --meta score,cov will generate score statistics and covariance matrix for meta-analysis.

(##) In trait column, B or Q stand for binary or quantitative trait, respectively.

(###) This is an experimental feature. This method requires LD-pruned genotype data in the binary PLINK format specified by --boltPlink. A minimal example to run BOLT-LMM based score tests is: rvtest --inVcf $inputVCFfile --boltPlink $binaryPlinkPrefix --pheno $phenotype --meta bolt. To prune your genotype data, an example command line is plink --vcf $inputVCFFile --maf 0.05 --indep-pairwise 50 5 0.5 --make-bed $binaryPlinkPrefix. Please note RVTESTS additionally prohibit duplicated rs IDs in the PLINK BIM file. To remove duplications, you can list all duplicated cut -f2 $binaryPlinkBIMfile|sort |uniq -d > duplicate.rsid, and then use plink --vcf $inputVCFFile --maf 0.05 --indep-pairwise 50 5 0.5 --exclude duplicate.rsid --make-bed $binaryPlinkPrefix. An example script to run BOLT-LMM model can be found under example/experimental directory.

The above models are suitable to generate summary statistics which can be later meta-analyzed (see Dajiang Liu (2014) Nature Genetics). Rvtests implemented the above methods and the results can be further analyzed by RareMetals (link) for quantitative trait and RareMetals2 (link). It also worth to mention that our group offers another toolset for meta analysis (link).

Explanation of outputs

The meta-analysis results come as two files: summary score statistics files (prefix.MetaScore.assoc.gz) and covariance files (prefix.MetaCov.assoc.gz).

In summary score statistics files, you will obtain these columns:

  • N_INFORMATIVE: Number of samples that are analyzed for association.
  • AF: allele frequency. For related individuals, we use BLUE estimator. For case-control study, we list overall frequency (adjusted by relatedness if possible), case frequency and control frequency separated by a colon.
  • INFORMATIVE_ALT_AC: The number of alternative alleles in the analyzed samples.
  • CALL_RATE: The fraction of non-missing alleles. For case-control study, we calculate call rate for all samples, case samples and control samples separated by a colon.
  • HWE_PVALUE: Hardy-Weinberg equilibrium. For related individuals, this statistic can be inflated. For case-control study, we calculate HWE pvalues for all samples, case samples and controls samples separated by a colon.
  • N_REF/N_HET/N_ALT: Number of samples carrying homozygous reference/heterozygous/homozygous alternative alleles. For case-control study, we calculate these three statistics for all samples, case samples and controls samples separated by a colon.
  • U_STAT, SQRT_V_STAT: U and V statistics are score statistics and their covariance matrix. Details can be found in Dajiang Liu (2014) Nature Genetics.
  • ALT_EFFSIZE: for continuous outcome, this is the estimated effect size; for binary outcome, this is the estimated log odds-ratio. We apply a new correction method when binary trait associations for related individuals are analyzed in standard linear mixed models. The log odds ratio is approximately correct for related individual analysis as well.

The covariance files stores covariances in sliding windows. You will obtain these columns:

  • CHROM: the chromosome name
  • START_POS/END_POS: within one sliding window, the first/last variant chromosomal position.
  • NUM_MARKER: number of markers in one sliding window.
  • MARKER_POS: all chromosomal positions in the sliding window.
  • COV: covariances between markers. When the sliding window has markers G.1, G.2, ..., G.N, this column will store "covariances" of (G.1, G.1), (G.1, G.2), ..., (G.1, G.N); for binary outcomes, this column uses colons as separators and additionally stores covariances between N genotypes and K covariates (G.1, C.1), (G.1, C.2), ..., (G.1, C.K), as well as covariances between K covariates (C.1, C.1), (C.1, C.2), ... (C.1, C.K), (C.2, C.2), ..., (C.2, C.K), ..., (C.K, C.K).

Utility models

Rvtests has an convenient option --outputRaw. When specifying this, rvtests can output genotypes, phenotype, covariates (if any) and collapsed genotype to tabular files. These files can be imported into other software (e.g. R) for further analyses.

Association test options

Sample inclusion/exclusion

Rvtests can flexibly specify which sample(s) to include or exclude:

       --peopleIncludeID : give IDs of people that will be included in study
     --peopleIncludeFile : from given file, set IDs of people that will be included in study
       --peopleExcludeID : give IDs of people that will be included in study
     --peopleExcludeFile : from given file, set IDs of people that will be included in study

--peopleIncludeID and --peopleExcludeID are used to include/exclude samples from command line. For example, specify --peopleIncludeID A,B,C will include A, B and C sample from the VCF files if they exists. --peopleIncludeID and --peopleExcludeID followed by a file name will include or exclude the IDs in the file. So to include sample A, B and C, you can provide a file, people.txt, which looks like:

A
B
C

Then use --peopleIncludeFile people.txt to include them in the analysis.

Variant site filters

It is common that different frequency cutoffs are applied in rare-variant analysis. Therefore, rvtests specify frequency cutoffs.

Frequency Cutoff

             --freqUpper : Specify upper minor allele frequency bound to be included in analysis
             --freqLower : Specify lower minor allele frequency bound to be included in analysis

If you specify --freqLower 0.01 --freqUpper 0.05, only the variants with minor allele frequency between 0.01 and 0.05 (boundary inclusive) will be analyzed.

Similar to sample inclusion/exclusion options, you can specify a range of variants to be included by specifying --rangeList option. For example --rangeList 1:100-200 will include the chromosome 1 position 100bp to 200bp region. Alternatively, use a separate file, range.txt, and --rangeFile range.txt to specify association tests range.

             --rangeList : Specify some ranges to use, please use chr:begin-end format.
             --rangeFile : Specify the file containing ranges, please use chr:begin-end format.
              --siteFile : Specify the file containing sites to include, please use "chr pos" format.

It is supported to filter variant site by site depth, minor allele count or annotation (annotated VCF file is needed).

          --siteDepthMin : Specify minimum depth(inclusive) to be included in analysis
          --siteDepthMax : Specify maximum depth(inclusive) to be included in analysis
            --siteMACMin : Specify minimum Minor Allele Count(inclusive) to be included in analysis
              --annoType : Specify annotation type that is followed by ANNO= in the VCF INFO field, regular expression is allowed

NOTE: --annoType can filter variants based on regular expression. For example, --annoType Nonsynonymous will only analyze non-synonymous variants where they have ANNO=Nonsynonymous in the INFO field. To extract more than one annotation types, use --annoType 'Start_Gain|Stop_Loss|Start_Loss|Essential_Splice_Site|Stop_Gain|Normal_Splice_Site|Synonymous|Nonsynonymous will extract LOF (loss of function) mutations. To generate an annotated VCF file, please read on the next section Annotation.

Annotation

We define a VCF file with annotating information as an annotated VCF here. The annotation step can be done using ANNO, a fast and accurate annotation software. At minimal, there are two steps to annotate a VCF file:

  1. Install ANNO and its resources files

    git clone https://github.com/zhanxw/anno.git; cd anno; make; cd resources; ./download.sh

  2. Run the following script:

    anno -i input.vcf -o output.vcf.gz -r anno/resources/hs37d5.fa -g anno/resources/refFlat_hg19.txt.gz -p anno/priority.txt -c anno/codon.txt --indexOutput

You will then obtain an annotated VCF file output.vcf.gz and its tabix index output.vcf.gz.tbi.

Genotype filters

Genotype with low depth or low quality can be filtered out by these options:

          --indvDepthMin : Specify minimum depth(inclusive) of a sample to be included in analysis
          --indvDepthMax : Specify maximum depth(inclusive) of a sample to be included in analysis
           --indvQualMin : Specify minimum depth(inclusive) of a sample to be included in analysis

When genotypes are filtered, they are marked as missing genotypes. Consequently, samples with missing genotype may or may not be included in the analysis. That means samples with filtered genotypes may be dropped (--impute drop) or may still be included (--impute mean or --impute hwe). By default, genotypes are imputed to the mean value. Please note that --impute drop is usually not recommended due to the incurred computational cost, as the null model may be estimated for each marker.

See next section about how you want to handle missing genotypes.

Handle missing genotypes and phenotypes

When genotypes are missing (e.g. genotype = "./.") or genotypes are filtered out, there are three options to handle them: (1) impute to its mean(default option); (2) impute by HWE equilibrium; (3) remove from the model. Use --impute [mean|hwe|drop] to specify which option to use.

When quantitative phenotypes are missing, for example, some samples have genotype files, but not phenotypes, rvtests can impute missing phenotype to its mean.

NOTE: Do not use --imputePheno for binary trait.

In summary, the following two options can be used:

           --impute : Specify either of mean, hwe, and drop
      --imputePheno : Impute phenotype to mean by those have genotypes but no
                      phenotypes

Specify groups (e.g burden unit)

Rare variants association tests are usually performed in groups of variants. The natural grouping unit is gene. Rvtests can read gene definition file in refFlat format, and perform association for each gene. Use --geneFile option to specify the gene file name. For example, --geneFile refFlat_hg19.txt.gz will use refFlat_hg19.txt.gz as gene definition file, and then perform association tests for every gene. Use --gene to specify a subset of genes to test. For example, --gene CFH will only test CFH gene.

Alternative grouping unit can be specified as set. These sets are treated similar to gene. You can thus use --setFile to define sets (similar to --geneFile option), and use --set to define a specific set (similar to --gene option). Additionally, use --setList can specify a set to test from command line.

The format of a set file is: (1) set names; (2) ranges (e.g. chrom:begin-end); For example, you have a set file, example.set, like this:

set1 1:100-200,1:250-300
set2 2:500-600

You can specify --setFile example.set --set set2 to group variants within chromosome 2, position 500 to 600bp. If you want to test a particular region, for example, chromosome 2, position 500 to 550bp, but do not want to make another file, you can use --setList 2:500-600.

In summary, options related to Grouping Unit are listed below:

         --geneFile : specify a gene file (for burden tests)
             --gene : specify which genes to test
          --setList : specify a list to test (for burden tests)
          --setFile : specify a list file (for burden tests, first two columns:
                      setName chr:beg-end)
              --set : specify which set to test (1st column)

Sex chromosome analysis

Rvtests support X chromosome analysis. In human X chromosome, there is PAR (pseudoautosomal region) and non-PAR region. For males, there are two X allele in PAR region and one allele in non-PAR region. While the PAR region is treated in the same way as autosomes, rvtests treat non-PAR region differently. Below we will describe the details about how rvtests handles non-PAR region.

Prepare data. According to VCF standard, male genotype needs to coded as 0 or 1. For compatibility, rvtests also support 0/0 or 1/1 coding. In VCF files, male genotypes can be written as "0", "1", "0|0", "0/0", "1|1", "1/1". All other genotypes will be treated as missing.

Genotype in regression models. For consistence, male genotypes are converted to 0 or 2. When male dosages are provided, we expect the values in the VCF file are between 0.0 and 1.0, and then will model them as twice the value.

MetaScore results. If --meta score is specified, the output file prefix.MetaScore.assoc.gz includes both PAR-region and non-PAR region analysis. However, in the non-PAR region, the difference is that Hardy-Weinberg P-value and homozygous-reference/heterzygous/homozygous-alternative sample sizes are calculated using female samples only.

Related individuals. Just append --xHemi to the vcf2kinship (more details in Kinship generation) and rvtest command lines. Rvtests can recognize non-PAR region kinship file and use it in the analysis.

PAR region. PAR region is defined as two regions X:60001-2699520 and X:154931044-155260560. Use --xLabel can specify which chromosome has PAR region (default: 23 or X) and use --xParRegion to specify PAR region (default: hg19, meaning '60001-2699520,154931044-155260560' in the UCSC build hg19, specify "hg18" will use PAR region definition in the UCSC build hg18, or specify "hg38" will use UCSC build 38).

Kinship generation

Analysis of related individual usually requires estimation of kinship. You can a separate tool, vcf2kinship. vcf2kinship is usually included in rvtests binary distribution or can be built from software source codes.

vcf2kinship can calculate pedigree kinship using a pedigree input file (PED format, see Phenotype file, use option --ped). The output file name is specified by --prefix option. If you use --prefix output then the output files will include output.kinship.

It can also calculate empirical kinship using genotype input file (VCF format, see Genotype file (VCF), use option --inVcf). For empirical kinship, you also need to specify the kinship model, either Balding-Nicols model (use option --bn) or Identity-by-state model (use option --ibs).

In sex chromosome analysis, it is often required to generate kinship on X chromosome regions, then you need to specify --xHemi. If your input VCF file has different X chromosome label (e.g. chromosome name is '23' instead of 'X'), you can use --xLabel 23.

If principal component decomposition (PCA) results are needed, you can use option --pca. Then output files with suffix '.pca' include PCA results.

When dealing with large input files, it is often preferred to use multiple CPU to speed up calculation using the option --thread N in which N is the number of CPU.

For example, to generate pedigree-based kinship (--ped) on both autosomal region and X chromosome (--xHemi) region, the command line is:

vcf2kinship --ped input.ped --xHemi --out output

To generate empirical kinship (--inVcf) on both autosomal region and X chromosome (--xHemi) region using Balding-Nicols model, the command line is:

vcf2kinship --inVcf input.vcf.gz --ped input.ped --bn --xHemi --out output

NOTE: you need to provide a pedigree file (PED) in the above case, as vcf2kinship needs the sex information of samples to construct kinship for sex chromosome.

For modern genetic datasets, genotype data is often stored by chromosomes, with each chromosome stored in a separate VCF file. In this case, kinship matrix can be first calcualted separately for each chromosome, and then combined using the python script combineKinship.py. Specifically, if you generated two kinship matrices for chr1 chr2 (chr1.kinship, and chr2.kinship), you can run the python script to combine them, i.e.

python combineKinship.py -o prefix chr1.kinship chr2.kinship

The resulting kinship matrix is equivalent to the kinship matrix calculated using the merged vcf files.

Resources

UCSC RefFlat Genes

refFlat_hg19.txt.gz

UCSC gene definition file in the refFlat format (Details) in NCBI genome build 37 .

File link: http://qbrc.swmed.edu/zhanxw/seqminer/data/refFlat_hg19.txt.gz

Gencode Genes

refFlat.gencode.v19.gz

Gencode gene definition version 19 in the refFlat format (Details). We have also previous versions of gene files and can provide upon request.

File link: http://qbrc.swmed.edu/zhanxw/seqminer/data/refFlat.gencode.v19.gz

Frequently Asked Questions (FAQ)

  • Does rvtests support binary traits of related-individuals?

Yes for meta-analysis models and no for most other models. Proper analyses of related-individual are supported in meta-analysis model. You can refer to section Meta-analysis tests. For many other association models, we notice that supporting binary traits for related individuals is complex and at current stage we have not found good solutions.

  • Can you provide a list of command line options?

Rvtests have build-in help that can be found by executing rvtest --help. We also put all available options in this link.

  • Can you provide standard error (SE) or confidence interval (CI) for the estimated Beta in the score model?

In the output of MetaScore model (--meta score), the standard error is the inverse of SQRT_V_STAT. For example, if SQRT_V_STAT = 2, that means the standard error of estimated beta is 1/2 = 0.5.

  • Why the INFORMATIVE_ALT_AC, N_REF and N_ALT columns have zero counts for certain chromosome X regions in meta-analysis models?

These counts are calculated from female individuals. If your study only has male samples, rvtests cannot report these counts. Because if a male carries a non-reference allele, we cannot conclude that this is heterozygous (0/1) site or homozygous alternatives (1/1) site.

  • What happens when P-values equals to -1?

If rvtests fails to fit a certain model, it also fails to calculate P-value reliably. Rvtests will output P-value as -1 instead of any number between 0 and 1 to indicate that an error has occurred. However, this should rarely happen. Please contact us if you have further questions.

  • Why SKAT Q-values reported by rvtests are different from the SKAT R package?

We strictly follow the notations in the SKAT publication (Wu et al. (2011) AJHG). However, in SKAT R package, its implementation is slightly different. For example, in quantitative trait analysis, Q is divided by (2 * \hat{sigma2}) in the R package, but not in rvtests. Although Q values can be different, the P-values from the two software packages should match (only in rare cases, numerical accuracy may cause minor differences).

  • How rvtests handle multi-allelic variants?

In rvtests, we focus on bi-allelic variants, and thus by-default treat multi-allelic variants as bi-allelic variants. Any genotype that includes other than reference allele and the first alternative allele will be treated as missing. For example, when the reference allele is 'A' in the VCF REF column and the alternative alleles are 'T,G' in the VCF ALT column, the genotype '0/2' will be treated as a missing genotype.

To properly analyze multi-allelic sites, we recommend coding multiple alternative alleles in separate lines. For example, a tri-allelic site A/T/G can be represented in two lines: the first line uses A as the reference allele, T as the alternative allele, and 0 or 1 to represent the frequency of alternative alleles; similary, the second line uses A as the reference allele, G as the alternative allele, and 0 or 1 to represent the frequency of alternative allele G. The following table can be helpful to understand this coding scheme.

Genotype A/A A/T A/G T/T T/G G/G
1st line (REF = A, ALT = T) 0/0 0/1 0/0 1/1 1/0 0/0
2nd line (REF = A, ALT = G) 0/0 0/0 0/1 0/0 0/1 1/1

The rationale of coding multi-allelic sites as multiple lines of bi-allelic sites is as follows: (1) the majority of existing analysis software does not support multi-allelic sites coded in one line; (2) imputation outputs (e.g. Michigan Imputation Server) multi-allelic sites in multiple lines. If your genotype file cannot be encoded as multiple bi-allelic variants, rvtests has an experimental option --multipleAllele. Specifying this option will enable the analysis of multi-allelic variants similar to the anlaysis of multiple bi-alleleic variants on-the-fly.

  • How to adjust for multiple comparisons?

Rvtests does not provide multiple comparison adjustment. However, users can use p.adjust() function provided in R to calculate adjust p-values for multiple comparisons. For example, users can choose "bonferroni" or "BH" (Benjamini and Hochberg). For advanced users, qvalues can be calculated from the qvalue package provided by BioConductor.

  • How to use multiple threads?

You can use --numThread N for rvtest, and --thread N for vcf2kinship. Please note when the sample size is small, multiple threads may be slower than a single thread.

Feedback/Contact

Questions and requests can be sent to Github issue page (link) or Xiaowei Zhan ([email protected]) or Dajiang Liu ([email protected]) or Goncalo Abecasis ([email protected]).

Rvtests is a collaborative effort from Xiaowei Zhan, Youna Hu, Bingshan Li, Dajiang Liu and Goncalo Abecasis.

We want to thank rvtests users and especially those who have provided valuable feedbacks. These users include: Xueling Sim, Scott Verize, Shuang Feng, Kevin Lu, Ruth Loos, Tessel Galesloot, Valerie Turcot, Stefan Gustafsson, Corbin Quick, Adam Locke, Michael Nalls, Jie Huang, Haitao Zhang, the GIANT consortium, the GLGC consortium and the GSCAN consortium.

rvtests's People

Contributors

bitdeli-chef avatar dajiangliu avatar steadylight avatar the-x-at avatar zhanxw avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

rvtests's Issues

AF column in results and HWE

Dear Xiaowei, I was looking at the columns in my rvtest output but somehow the AF (I guess standing for allele frequency) does not match with the report in the columns N_REF, N_HET and N_ALT.

For example

N_REF=1200; N_HET=125; N_ALT=0; reports in the AF column 0.305

How could this be? from my manual calculation Freq_REF=0.953 and Freq_Alt=1-0.953

Also, the reported HWE Pval is 0.11, how could this be?

I have done the excercise for 17 SNPs (header of one of my files and none of the AF was similar to the reported one
caro_rvtest.xlsx

gsl: fsolver.c:126: ERROR: endpoints do not enclose a minimum

Hi,
I get this error:
gsl: fsolver.c:126: ERROR: endpoints do not enclose a minimum
when running MEN on chr22 (but chrX works).

My command:

rvtest --inVcf "/mnt/isilon/grant_lab/imputation_BMDCS_ALL_1KGv3/chr22.imputed.poly.vcf.gz" --out "BMDCS_EUR_1KGP3_imputation_BMI_INV_MEN_chr22" --pheno "/mnt/isilon/grant_lab/chesia/analyses/GIANT_anthropometrics/analysis/ALL_1KG/BMDCS_EUR_1KG_MALE_pheno.ped" --pheno-name "BMI_INV" --dosage "DS" --meta "score" --kinship "/mnt/isilon/grant_lab/chesia/analyses/GIANT_anthropometrics/analysis/ALL_1KG/kinship_matrix/kinship_matrix_BMDCS_EUR_1KG.kinship"

The same analysis works on FEMALES and ALL, with the same kinship matrix.
Can you please help?

Thanks!
log.txt

Error while running --meta cov

I am getting this same error repeatedly for all --meta cov runs, except that line number is different for the errors. I manually checked the suggested line number in the vcf file and i see nothing abnormal in terms of formatting. Its the same for both plink generated and bcftools generated vcf files. Other analylsis like --single wald or --meta score are fine. Can u please help me here. I need to generate the covariance files to run RAREMETAL.

Effective Options
    --inVcf sample.chr22.vcf.gz
    --out genie_rare/study1/study1_sample.chr22.vcf.gz/meta/sample.chr22.vcf.gz_cov
    --pheno sample1.fam
    --inverseNormal
    --useResidualAsPhenotype
    --meta cov
    --geneFile /faststorage/home/veera/pipelines/genie//resources/rare/refFlat_hg19.txt.gz

[INFO]  Program version: 20170228
[INFO]  Analysis started at: Sat Apr 15 14:48:38 2017
[INFO]  Loaded [ 2504 ] samples from VCF files
[INFO]  Loaded [ 1000 ] sample pheontypes
[WARN]  Total [ 1504 ] samples are dropped from VCF file due to missing phenotype.
[WARN]  Drop sample [ HG02597_HG02597 ] from VCF file due to missing phenotype
[WARN]  Drop sample [ HG02600_HG02600 ] from VCF file due to missing phenotype
[WARN]  Drop sample [ HG02601_HG02601 ] from VCF file due to missing phenotype
[WARN]  Drop sample [ HG02603_HG02603 ] from VCF file due to missing phenotype
[WARN]  Drop sample [ HG02604_HG02604 ] from VCF file due to missing phenotype
[WARN]  Drop sample [ HG02610_HG02610 ] from VCF file due to missing phenotype
[WARN]  Drop sample [ HG02611_HG02611 ] from VCF file due to missing phenotype
[WARN]  Drop sample [ HG02613_HG02613 ] from VCF file due to missing phenotype
[WARN]  Drop sample [ HG02614_HG02614 ] from VCF file due to missing phenotype
[WARN]  Drop sample [ HG02620_HG02620 ] from VCF file due to missing phenotype
[WARN]  Skip outputting additional [ 1494 ] samples due to missing phenotype.
[INFO]  Loaded 0 male, 0 female and 1000 sex-unknonw samples from sample1.fam
[INFO]  Loaded 474 cases, 526 controls, and 0 missing phenotypes
[WARN]  -- Enabling binary phenotype mode --
[WARN]  WARNING: Skip transforming binary phenotype, although you want to use residual as phenotype!
[WARN]  WARNING: Skip transforming binary phenotype, although you required inverse normalization!
[INFO]  Analysis begins with [ 1000 ] samples...
[INFO]  Meta analysis uses window size 1,000,000 to produce covariance statistics under additive model
[INFO]  Loaded [ 22480 ] genes.
[INFO]  Impute missing genotype to mean (by default)
[INFO]  Analysis started
[INFO]  Analyzed [ 57207 ] variants from [ 22480 ] genes/regions
[ti_index_core] the file out of order at line 53

Output specification

Hello,

This could sound silly, but could you please provide a detailed description about each column in the output files?, I have read all the documentation but I couldn't find anything for the output coming from CMC, SKAT, etc... there is only explanation for the meta-analysis output. For example, what does NumPolyVar means in the burden CMC output?

Thank you

VT analytic

Hi Zhanxw,

thank you for your very valuable tool!
I have a question regarding the Variable threshold analytic option.
In the manual it is stated this option supports both binary and quantitative phenotypes.
However when I try to run the analysis, a warning appears that the analytic option doesn't work on a binary phenotype, and all my values are "NA".

Is this merely a mistake in the manual, or is there something else going on?

Best,
Roos

Pvalue shows "-nan" in output of gene level association test

Most of gene-level outputs are reasonable, but there are a few lines show "-nan" in Pvalue.

Gene RANGE N_IN--geneFile ${geneF} --kernel skato
FORMATIVE NumVar NumPolyVar Q rho Pvalue
ILF2 1:153634513-153643479 8493 8 8 4027.77 0.8 -nan
DNAJB11 3:186288466-186303589 8493 3 3 591.324 0 -nan

std::logic_error when generating kinship matrix

Hello all,

I ran into an odd problem when trying to generate a kinship matrix using Rvtest. The dataset I am working on actually consists of two subcohorts that were genotyped and imputed together, but need to be analysed separately. As such, and because I only have limited disk space available, I decided to generate separate kinship matrices for the subcohorts. I had no problem generating a kinship matrix for the larger of the two, but when I did so for the smaller one, rvtest gave the following output:

Effective Options
    --inVcf chrALL.imputed.poly.vcf.gz
    --out kinship_matrix_CC
    --xHemi
    --xLabel X
    --ped dataF_P90c_TRAILS_CC_anthro.txt
    --bn
    --minMAF 0.05
    --thread 16
 
[INFO]   Program version: 20170210
[INFO]   Analysis started at: Mon May 29 16:47:51 2017
[INFO]   Multiple ( 16 ) threads will be used.
[INFO]   Empiricial kinship will be calculated.
strtol: Invalid argument
[INFO]   Start creating empirical kinship from VCF file.
[INFO]   Using default maximum missing rate = 0.05
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
/var/spool/torque/mom_priv/jobs/547534.batch1.lisa.surfsara.nl.SC: line 19: 32228 Aborted                 ~/Software/rvtests/executable/vcf2kinship --inVcf chrALL.imputed.poly.vcf.gz --ped dataF_P90c_TRAILS_CC_anthro.txt --bn --out kinship_matrix_CC --xLabel X --xHemi --minMAF .05 --thread 16

(The “strol: Invalid argument” line is actually repeated exactly 1900 times - I just left the one entry to show where it appears.)

As ped file I use the phenotype file, containing only samples of the smaller subcohort. The command line used is the same as for the larger subcohort, except where it specifies a different ped file and output filename. The phenotypes files of both subcohorts were created in the same way, so it seems unlikely the cause lies there. What does this error mean?

frequency rvtest single variant analysis

Dear Zhan, I am runing a single variant analysis so my command:

../executable/rvtest --pheno pheno --inVcf example.vcf --single wald --out out1

And this command does not output does not report allele frequency , here the header

CHROM POS REF ALT N_INFORMATIVE Test Beta SE Pvalue

I work with a multiethnic cohort so lets say the allele frequency will depend on the subsample for which the phenotype is available. And in order to submit the results for meta-analysis I need to report the freq and it could be handy thus that the frequency is in the output of the program and not that I have to calculated in other software (e.g, PLINK) in parallel everytime I use rvtest

Missing col error in version: 20170228

I am using RVtest Program version: 20170228 and creating a covariance matrix with window size of 500KB, to reduce the size of the matrix. I see the following error in my log files, but the analysis runs to completion.

[ERROR] Missing 'u1141' column!
[WARN] Failed to load spectral decomposition results of the kinship matrix!
[INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 3.4 ] seconds.

Wondering if anyone else has seen this and if this could be a reason why my Covariance matrices are > 400GB (for HRC imputed data) in spite of the requested 500kb window size.
Thanks

Problem using the --meta bolt,boltCov option

When I tried to run rvtests using the --meta bolt,boltCov options I get this output:

MetaFamQtlBolt model started Report 'Cannot open binary PLINK file!

Does anyone understand what caused this and how I can avoid it?

Thank you very much!

What is the file format of gene range file?

I use the range file
FLG NM_002016 chr1 152276780 155000000

But I got the following message:
[INFO] Loaded [ 2 ] genes.
[INFO] Impute missing genotype to mean (by default)
[INFO] Analysis started
[INFO] Gene FLG has 0 variants, skipping

Is there any mistake in my range file?

Besides, I try to modify vcf2ld_gene.cpp file to print more information, but after compiling, the output doesn't change.

If I want to print out more information, what I should do?

For gene level models, do they work with rare variants only (i.e., is it required to set --freqUpper 0.05)?

  1. In the example folder of rvtest package, you did not specify --freqUpper 0.05 in cmd.sh. Here is the example code. If I use --freqUpper 0.05 in the same command, the gene based results will be different. Is it OK if I run gene based analysis without using --freqUpper 0.05?
    ../executable/rvtest --pheno pheno --inVcf example.vcf.gz --setFile setFile --burden cmc,cmcWald,zeggini,zegginiWald --out out9
  2. In the revtests wiki, you mentioned "To extract more than one annotation types, use --annoType 'Start_Gain|Stop_Loss|Start_Loss|Essential_Splice_Site|Stop_Gain|Normal_Splice_Site|Synonymous|Nonsynonymous will extract LOF (loss of function) mutations". Synonymous should not be included in the LOF mutations. Could you give me a reference paper (or resources) to find the definition of LOF (loss of function) mutations? I need some references about LOF mutations for my paper.
    Thanks!

Segmentation fault when reading BGEN file

Dear Xiaowei:

When I use the latest version of rvtest and the BGEN input file, why the log says "Loaded [ 0 ] samples from VCF files"? There is NO VCF file at all. Also, the BGEN file has two part, one is .bgen and the other part is .sample. So, how do I specify the .sample file?

In the phenotype file, does my header line must have the following five as the header names: "fid iid motid fatid sex"? It seems that the command options in RVTEST is very similar to PLINK, such as --covar-name --pheno-name. But in PLINK, the phenotype file only needs the first two columns "FID IID" and then the rest could be anything.

Thank you very much & best regards,
jie

Below is the LOG file:

Thank you for using rvtests (version: 20170818, git: 72ab223)
....
....
Effective Options
--inBgen ukb_imp_chr20_v2.bgen
--out chr20.rvtest
--pheno bmi.EUR.ped
--pheno-name bmi
--single score

[INFO] Program version: 20170818
[INFO] Analysis started at: Wed Sep 13 21:46:39 2017
[INFO] Loaded [ 0 ] samples from VCF files
[INFO] Loaded [ 334348 ] sample pheontypes
Segmentation fault

vcf2kinship not finding samples in pedigree

The latest update (20161018) available here, now sends an error when trying to generate kinship while including the ped file.

When running the following command:
../executable/vcf2kinship --inVcf example.vcf.gz --ped pheno --bn --xHemi --out example.output
Exclude [ 500 ] samples from VCF files because they do not exist in pedigree file or do not have sex
[ERROR] No samples left for analysis, aborting...

Earlier versions of the software did not have this problem.

AF is incorrect in SingleScore.assoc file

Hi, I found that the AF column in the SingleScore.assoc file has negative values and positive values greater than 1. So, I think there is something wrong. Yesterday I recommended RVTESTS to a large audience of our consortium. Can someone please fix this issue before more and more others notice this issue?

Thanks!

Jie

unbelievable slow speed for --siteFile

Please see my comment on #25, it took 43356 seconds to run a regression on 93 SNPs when I used --siteFile. But when i use bcftools first to extract those 93 SNPs to create a new VCF, which takes a minute, then it only took 109 second to run the same analysis

So, i think there is something VERY WRONG with this --siteFile option. Just want to point this out so that others don't run into the same issue.

best regards,
Jie

BGEN format on UK Biobank data does not work properly

Hi, I use RVTESTS latest version to read the UK biobank BGEN file, a command like this: rvtest --inBgen ukb_imp_chr1_v2.bgen --inBgenSample ukb_imp.sample --pheno bmi.pheno.ped --pheno-name bmi --out chr1.rvtest --single score.

Then I compared the results from PLINK and from Ben Neale's group. I found that the results from RVTESTS is way off from the other two, while the other two are pretty much the same. So, I don't know if there is some issue, maybe you could test run it and find out. BTW, the sample order in my bmi.pheno.ped file is different from the ukb_imp.sample file. I hope that is no problem.

Attached below is a comparison on EAF, Beta, and P-value.

Best regards,
Jie
plink-rvtests

wald single test bug?

Dear Dr. Zhan,
Thanks for providing us such an excellent program, rvtests.

  1. I used 6 covariates (age,sex,mds1,mds2,mds3,mds4) in wald single test in rvtests. It seems rvtest reports covariates in the weired order in the output file. For example, age is missing and mds4 was reported twice.

command

rvtest --inVcf ${VCF} --out ${OUT}.single --single wald --numThread 8 --pheno ${phenF} --pheno-name zud5 --covar ${covF} --covar-name age,sex,mds1,mds2,mds3,mds4

output

CHROM POS REF ALT N_INFORMATIVE Test Beta SE Pvalue
1 762485 C A 3018 1:762485 0.000704894 0.0564456 0.990036
1 762485 C A 3018 sex -0.0130407 0.00244367 9.47384e-08
1 762485 C A 3018 mds1 -0.809614 0.0776035 1.75835e-25
1 762485 C A 3018 mds2 -8.40048 2.10399 6.53407e-05
1 762485 C A 3018 mds4 4.49323 11.3983 0.693432
1 762485 C A 3018 mds3 2.85347 13.3588 0.830857
1 762485 C A 3018 mds4 9.50253 13.3434 0.47637

  1. Could you please clarify the missing value in phenotype/covariate files in rvtest? In the document of "Single variant tests", it states that "for binary trait, the recommended way of coding is to code controls as 1, cases as 2, missing phenotypes as -9 or 0". In the description of phenotype file, it states that "In phenotype file, missing values can be denoted by NA or any non-numeric values". In covariate file part, it states that "Missing data in the covariate file can be labeled by any non-numeric value (e.g. NA)".
    Does "NA" always indicate missing in any phen/cov file? Is there a different definition of missing in binary/quantitative traits?
  2. It would be very helpful if rvtests could generate Manhatten plot and QQ plot by default.
    Thank you very much!

Allele frequency and covariate outputs

Dear all, as usually the format required for the meta-analysis centers includes the allele frequency I was wondering if there is a flag to get this column in the output. This might be really important when analysing for example multi-ethnic cohorts as usually a frequency from a sample will not represent the frequency of the whole population (which can be easily obtained from .info files).

Also as in plink there is a flag "--hide-covar" to supress the covariates coefficients from the output... I was wondering if this flag exist or could be implemented in the program.

Thanks a lot. Carolina

Empirical Pvalue

Hi,

Thank you for the very useful tool, I was wondering if it is possible to add the permutation based P value generation for the burden tests as well in addtion to kernel based methods like for example CMC??

Regards,
Dheeraj.

extreme slow speed for "--covar"

Dear Xiaowei:

Please see the two screenshots below. It took me 54,799 seconds to analyze 225,216 samples and 94,238 SNPs, for a 5MB imputed chunk.

Now, when I use "--siteFile" to limit my analysis to 92 SNPs that are genome-wide significant in a 1MB region and use "--covar" to condition on the imputed dosage of the lead SNP. After 12 hours, the message "Analysis started" has not shown up yet.

So, my SNP number goes down from 94,238 to 92, but the running time might be even longer, after I simply used one covariate! I don't know if RVTESTS can be optimized for this type of analysis. If not, can I adjust the covariate value in R first to create phenotype residual and then run RVTESTS without the "--covar" option?

Thank you & best regards,
Jie

capture

capture2

Core-dump possibly due to out of memory

Hi zhanxw,

Thank you for developing and maintaining RVTests.

When I use RVTests v1.9.9-29 downloaded July 29, 2016 with the '--kernel SKATO' option and a relatively large dataset insists of 1400 samples, 240,000 variants and 190,000 line set file results core-dump in a medium memory (48Gbyte) machine. Back trace information generated by gdb is attached: core_gdb.log.txt

This does not happen in a lager memory machine. Decreasing VCF file size or the '--peopleExcludeFile' option also solved the problem even in medium memory machine.

Thus, I guess this core-dump is related to out of memory stuff and I believe that it is better to return 'out of memory error'.

I am happy if my report helps RVTests development.

Best wishes,
Hiro.

--siteFile does not work ...

hi, Xiaowei:

please see the attached screenshot. I used --siteFile to limit to a list of 33 SNPs for analysis, but RVTESTS simply ignored that option and proceed analyzing all variants and therefore take a long long time.

can you please take a look to make sure that --siteFile does work?

best regards,
Jie
capture1

SnpEff annotated VCF

Hello,

I have been playing around with the rvtests docker container for a couple of days and things have been running smoothly, thanks for the nice utility!

Yesterday I attempted using a VCF file annotated with SnpEff for burden testing and I was unable to successfully use the --annoType flag to filter variants (I put in a specific annotation type and none were identified in the tested gene even though I know that there are some - --annoType missense_variant, i also tried puting missense in quotes, etc.). I noticed that you suggest using the anno package that you have developed for VCF annotation, which outputs a different ANN field in the annotated VCF than SnpEff.

SnpEff:
ANN=C|frameshift_variant|HIGH|WBGene00022365|WBGene00022365|transcript|Y92H12BL.4|protein_coding|3/6|c.349delA|p.Arg117fs|349/1194|349/1194|117/397||;LOF=(WBGene00022365|WBGene00022365|1|1.00)

anno (from your site):
ANNO=Nonsynonymous:GENE1|GENE3;ANNOFULL=GENE1/CODING_GENE:+:Nonsynonymous(CCT/Pro/P->CAT/His/H:Base3/30:Codon1/10:Exon1/5):Normal_Splice_Site:Exon|GENE3/CODING_GENE:-:Nonsynonymous(AGG/Arg/R->ATG/Met/M:Base30/30:Codon10/10:Exon5/5):Normal_Splice_Site:Exon|GENE2/NON_CODING_GENE:+:Upstream

The paper associated with the rvtests package says it can operate on annotations from SnpEff, so I am a bit confused. Maybe it is just an issue with not finding the filed correctly because they have different names?

I am using the latest version of SnpEff and the version of rvtetsts in the docker container.

Similarly, I am wondering if given the refFlat file, rvtests can make predictions on the fly regarding higher impact variants based on the transcription start and stop site, and the exon starts and stops.

I think it would be nice to select on high or moderate predicted effects (as predicted by SnpEff) for a given gene. This operation can be easily done using SnpSift, but I am wondering if I am missing some built-in functionality of rvtests.

Thanks for your time

Optimized SKAT (SKAT-O)

Hi Rvtest team,
Thanks for creating wonderful tool.
I am wondering how to use rvtest for Optimized SKAT (SKAT-O)
thanks

thread count for rvtest

Hi folks,

Although the rvtest binary seems to only use a single core for much of its work, we're seeing thread counts jump to take all available cores on the system. Is there a way to limit this? The context is so that we can know what resources to ask for when submitting rvtest jobs to HPC systems where the number of CPUs required must be provided.

Cheers

Extra column Kinship matrix

Dear Xiaowei, for the GIANT/GLC project I have been running all my analysis using rvtest Program version: 20170307. Somehow all of them produce the same log file where an error is reported
"[ERROR] Unexpected column 'u5401'!". For all the 3 cohorts I am managing

But then after the program seems to continue normally. Adam Locke suggested the problem is just affecting chrX, so that I should submit all the autosomals results and communicate with you before submitting chrX results (more than 60 GWAS have already ran). Thus, I need to know my results are valid, and where the problem came from.

I am attaching the log from the kinship creation, an example for the autosomals log file and one for chrX. Please confirm the validity of the files.

Thanks a lot. Carolina
RS2_CEU_1000G_imputation_TC_INV_MALE_chr17_log.txt
RS2_CEU_1000G_imputation_TC_INV_MALE_chrX_log.txt
RS2_kinship_matrix.vcf2kinship_log.txt

Segmentation fault with condition option

Hi,

I'm new to RVTest and genetics, so if this is something simple I apologize in advance. I am trying to run a conditional analysis with one SNP as the covariate. My command line is below, as well as the error message. This runs fine without the condition option. When I add the condition option, I get a few seconds of it running and then the following occurs:

Include sample [ 49679653 ].
./condtest.sh: line 40: 22942 Segmentation fault (core dumped) $RVTEST --inVcf $VCFIN --pheno $PEDIN --dosage DS --covar $PEDIN --pheno-name $PHEN_NAME1 --covar-name $COV_NAME --condition $COND_RANGE --peopleIncludeFile $COMP_LIST --out $ASSCOUT1 --single $STAT_TEST

There are 12333 samples, and the log file has 24736 lines of output, so it appears to have gotten into some processing before the error. I have also experimented with widening the range on the condition, same result.

Any help you can provide would be appreciated.

Thanks,

John

$RVTEST --inVcf $VCFIN --pheno $PEDIN --dosage DS --covar $PEDIN --pheno-name $PHEN_NAME1 --covar-name $COV_NAME --condition 6:32632605-32632605 --peopleIncludeFile $COMP_LIST --out $ASSCOUT1 --single $STAT_TEST

Empty individual column - very strange!!

Hi, Xiaowei:

I got the following error for an imputed chunk. I was able to use "bcftools -l" to extract the sample ID columns without any problem. So, exactly what RVTESTS is complaining about?

BTW, I found that the new version of RVTESTS increased the run time from 25247 seconds to 33770 seconds, for one of my chunks, after you fixed the AF issue. So, are there any other features added that make the software run slow?

Thank you & best regards,
Jie

capture

bgen utilities in executable binary files

  1. Could you please add bgen2vcf, bgenFileInfo utitlies into pre-built executable binary package (for Linux 64bit)?

  2. In the example folder of rvtest package, you did not specify --freqUpper 0.05 in cmd.sh. Here is the example code. If I use --freqUpper 0.05 in the same command, the gene based results will be different. Is it OK if I run gene based analysis without using --freqUpper 0.05?
    ../executable/rvtest --pheno pheno --inVcf example.vcf.gz --setFile setFile --burden cmc,cmcWald,zeggini,zegginiWald --out out9

  3. Summary statistics and covariance matrices of score statistics generated by rvtests meta-analysis models do NOT work with raremetal (v.4.14.1; https://github.com/traxexx/Raremetal). By the way, here I am talking about command line version of raremetal (i.e, it is NOT R package of meta-analysis program). For example, the following example dataset is generated by rvtests, but it cannot be used in raremetal (v.4.14.1; https://github.com/traxexx/Raremetal)

Here is a list of commands for the third question.

example dataset

http://genome.sph.umich.edu/wiki/RareMETALS2

$ more example.summary.files

example1.MetaScore.assoc.gz

example2.MetaScore.assoc.gz

[zhangh13@headnode try_meta]$ more example.cov.files
###example1.MetaCov.assoc.gz

example2.MetaCov.assoc.gz

#!/bin/bash
raremetal --summaryFiles example.summary.files --covFiles example.cov.files --prefix out_example
raremetal --summaryFiles example.summary.files --covFiles example.cov.files --writeVcf --prefix example_pool
bgzip example_pool.pooled.variants.vcf
tabix -pvcf -f example_pool.pooled.variants.vcf.gz
epacts anno --in example_pool.pooled.variants.vcf.gz --out example_pool.pooled.variants_epacts_anno.vcf.gz
gunzip example_pool.pooled.variants_epacts_anno.vcf.gz
raremetal --summaryFiles example.summary.files --covFiles example.cov.files --annotatedVcf example_pool.pooled.variants_epacts_anno.vcf --burden --MB --SKAT --VT --prefix out_example_epacts_anno

Segmentation fault?? Can not run rvtest!!

Hi, Xiaowei:

We have prepared tabixed vcf.gz file and PED file and kinship matrix and prepare to run the rvtest. But when we run the rvtest it response the mistake about Segmentation fault as follows:
[INFO] Program version: 20170210
[INFO] Analysis started at: Mon Apr 24 10:14:36 2017
[INFO] Loaded [ 3788 ] samples from VCF files
strtod: Invalid argument
[WARN] Skip: Missing or invalid phenotype type, skipping line 602 [ 922000077 922000077 0 0 NA NA ] ...
strtod: Invalid argument
[WARN] Skip: Missing or invalid phenotype type, skipping line 1243 [ 922003426 922003426 0 0 NA NA ] ...
strtod: Invalid argument
[WARN] Skip: Missing or invalid phenotype type, skipping line 1456 [ 922003311 922003311 0 0 NA NA ] ...
strtod: Invalid argument
[WARN] Skip: Missing or invalid phenotype type, skipping line 1624 [ 922001212 922001212 0 0 NA NA ] ...
strtod: Invalid argument
[WARN] Skip: Missing or invalid phenotype type, skipping line 1625 [ 922000100 922000100 0 0 NA NA ] ...
strtod: Invalid argument
[WARN] Skip: Missing or invalid phenotype type, skipping line 1996 [ 922000199 922000199 0 0 NA NA ] ...
[INFO] Loaded [ 3782 ] sample pheontypes
[WARN] Total [ 6 ] samples are dropped from VCF file due to missing phenotype.
[WARN] Drop sample [ 922000077 ] from VCF file due to missing phenotype
[WARN] Drop sample [ 922003426 ] from VCF file due to missing phenotype
[WARN] Drop sample [ 922003311 ] from VCF file due to missing phenotype
[WARN] Drop sample [ 922001212 ] from VCF file due to missing phenotype
[WARN] Drop sample [ 922000100 ] from VCF file due to missing phenotype
[WARN] Drop sample [ 922000199 ] from VCF file due to missing phenotype
[INFO] Loaded 1850 male, 1932 female and 0 sex-unknonw samples from phenotypes_all_f4.ped
[INFO] Analysis begins with [ 3782 ] samples...
[INFO] load 1 parameters.
[INFO] Meta analysis uses window size 500,000 to produce covariance statistics under additive model
[INFO] Family-based model specified. Loading kinship file...
[INFO] DONE: Loaded kinship file [ S4F4all_kinship_matrix.kinship ] successfully in [ 20.8 ] seconds.
[INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 268.2 ] seconds.
[WARN] Autosomal kinship loaded, but no hemizygote region kinship provided, some sex chromosome tests will be skipped.
[INFO] Impute missing genotype to mean (by default)
[INFO] Analysis started
Segmentation fault

Why this happen ? and How can we fix it ?
This is part of GIANT project and the deadline is approaching.

Thank you very much!
Best regards

Zishan

"unparsed option" when trying to use *bgen file in version 20171010

Hello,

 Based on comments in thread 38,  I am using the 20171010 rvtest version with a viable *bgen file and phenotype file in *ped format (NOT using sample file as was stated these are not supported yet).  

However the program does not seem to recognize the --inBgen flag, please advise. Relevant output is as follows:

Effective Options
--out test1000_bgen_gp.out
--covar EA.cov
--covar-name age,pc1,pc2,pc3
--sex
--pheno EA.ped
--pheno-name EA
--dosage GP
--meta score
--impute drop
--noweb

Unparsed arguments: --inBgen testrv1000_GP.bgen

best regards,
Polly

AF issue still not fixed in the April 14 version

Dear Xiaowei:

I just sent an email to your GMAIL. I downloaded your new RVTESTS posted yesterday. For one imputed 5MB chunk, I run RVTESTS using the same command and the same data as before, but I still found negative values in the AF column. The score test based on the previous version of RVTESTS used 25247 seconds, but the new version took 59095 seconds. The wald test in the new version takes 31316 seconds.

Also, somehow the new version still shows “version: 20170228”.

So, can you please take a good look, to make sure that the AF issue is indeed resolved? It would be great if you change the version date to the release date so that we are sure that we are using the latest version.

Thank you & best regards,
Jie
screenshot1
screenshot2

LOG message goes to standard error

Hi, Xiaowei:

When i use "bsub -o my.LOG -e my.ERR" to run RVTESTS, i found that the log message goes to the my.ERR file instead of the my.LOG file. Usually, I check whether my.ERR file is empty to determine if there is anything wrong. Now, i have no easy way to determine if my RVTESTS jobs finished successfully, when I run hundreds of jobs. So, can you please make the standard log message goes to the "-o" file?

Also, i mentioned this previously a couple of times. The log file always shows the following two lines:
connect() error: Operation timed out
Retrieve remote version failed, use '--noweb' to skip
However, RVTESTS does not have an "--noweb" option. I think RVTESTS is still using PLINK and this message is generated by PLINK. If so, can users have a way to specify PLINK2 to be used instead?

Thanks!

Jie

AF reported as 0 for all variants with FamScore Test

Hello,

My command is as follows:
rvtest --inVcf vcf --pheno pheno --covar covariate --out out --kinship kinship.output.kinship --dosage DS --single FamScore

The AF column is 0 for all variants.

If the last flag is changed to --meta score, AF is reported.
rvtest --inVcf vcf --pheno pheno --covar covariate --out out --kinship kinship.output.kinship --dosage DS --meta score

AF shows up and all other statistics are the same as with FamScore (p-value, ustat, etc).

Thank you!

no variant name in *.SingleScore.assoc file

Dear Xiaowei:

For the *.SingleScore.assoc. file generated by RVTESTS, it contains "CHROM POS REF ALT N_INFORMATIVE" etc. but, the "ID" field in the VCF is not output. Don't know when you will release a new version, it would be good to display those "ID" field as well. This is the unique identifier for a SNP, right?

Best regards,
jie

output of single variant association results

Hi, Xiaowei,

We noted that single variant association results are a bit over-simplified. I wonder if you could update the software to output something similar to MetaScore files. For example, it would be very useful to output test statistic, genetic effect estimate, the standard deviation of the genetic effect estimate, allele frequencies, p-values, hardy weinberg, call rate etc. Sometimes, users would like to meta-analyze the output with softwares such as METAL, where information such as SD and genetic effect estimates are needed. Thanks!

Dajiang

huge EFFECTS difference between SCORE and WALD test on binary trait

Dear Xiaowei:

Below is a message from my colleague, who identified very different EFFECTS between SCORE and WALD tests.
"For the --single wald test, it took much longer, i.e. 73439 seconds (20.4 hours), but the effect size, SE, and p-values are all quite consistent with what I got from SNPTEST. Then, I looked at the wiki page of Rvtest http://genome.sph.umich.edu/wiki/Rvtests, it says the "score" option only fits null model and the "wald" option fits alternative model. So I think I will need to use the "wald" option anyways for the future GWAS analyses. So it will take about 20.4 hours for 150 jobs, maybe 4 days for 550 jobs.

Then I compared the two tests using both continuous and binary traits. For continuous trait, the EFFFECTS and PVALUES are exactly the same, but for binary trait with covariates including 10 PCs, they are very different. Please see below. Also, I noticed some of the EFFECTS are huge, >5000. Is the EFFECTS here BETA, does that look reasonable/possible?

Thank you very much & best regards,
Jie

picture2

Allele frequency column does not match N_REF,N_HET,N_ALT column

Dear Xiaowei,
thank you for developing rvtests. I am using your tool and have realized that from version 20160504 on the allele frequencies reported and the INFORMATIVE_ALT_AC column do not match the numbers reported in the N_REF,N_HET and N_ALT columns if the dosage tag (--dosage) is used. Also the results in the allele frequency column, the INFORMATIVE_ALT_AC column and the ALT_EFFSIZE column differ between version 20151007 and newer versions when using the --dosage tag and the same input file. The N_REF,N_HET and N_ALT columns, pvalues and the U_STAT,SQRT_V_STAT columns remain the same between versions. Could you tell me where the differences occur from?
Thank you very much and best wishes,
Sarah

could not specify multiple phenotypes.

Hi, RVTESTS allow "--covar-name cov1,cov2,cov3", but I found that I could not use "--pheno-name trait1,trait2,trait3". It gave me an ERROR : cannot locate phenotype header [trait1,trait2,trai3].

I did see the "--multiplePheno" option. I tried it and it did not work either. So, can you please let me know how i could specify multiple phenotypes to run at the same command?

Thank you very much!

Jie

VCF header have LESS people than VCF content for kinshipmatrix!!--Why this happen??

Dear Xiaowei

I am using vcf2kinship command to generate kinship matrix, the command are as below:
vcf2kinship --inVcf chrall_notmonomorph.vcf.gz --ped phenotypes_all.ped --bn --minMAF 0.050000 --thread 12 --out all_kinship_matrix

rvtest start to generate kinshipmatrix and give the information as follow:
[INFO] Empiricial kinship will be calculated.
[WARN] Warning: Specified parameter --ped has no effect.
strtol: Invalid argument
strtol: Invalid argument
strtol: Invalid argument
strtol: Invalid argument
strtol: Invalid argument
strtol: Invalid argument
[INFO] Start creating empirical kinship from VCF file.
[INFO] Using default maximum missing rate = 0.05
[INFO] Exclude [ 6 ] samples from VCF files because they do not exist in pedigree file or do not have sex:
922000077 922003426 922003311 922001212 922000100 922000199
[INFO] Total [ 3782 ] individuals from VCF are used.
Total [ 16210000 ] VCF records ( Expected 3788 individual but only have 852 individualve processed.
Report 'VCF header have LESS people than VCF content!' to [email protected]
Error line [ 14 ... ]

It said 'VCF header have LESS people than VCF content!' why this error happen and how can I fix it?
In addition, I use command as follow to bgzip VCF file and tabix VCF.gz before I generate kinship matrix :
(grep ^"#" chr1_notmonomorph.vcf;grep -v ^"#" chr1_notmonom orph.vcf| sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c > chr1_notmonomorph.vcf.gz

tabix -f -p vcf chr1_notmonomorph.vcf.gz

Thank you very much for your help!

Zishan

compile fails in ubuntu 14.04

hello! there seems to be a problem with "make" finding the header for pcreposix. below is the relevant compiler output, including an unrelated warning which might also be an issue:

/home/biocog/projects/rvtests/third/samtools/libbam.a(knetfile.o): In function `socket_connect':
/home/biocog/projects/rvtests/third/samtools/knetfile.c:98: warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

Main.o: In function `main':
Main.cpp:(.text.startup+0x172d): undefined reference to `pcreposix_regcomp'
Main.cpp:(.text.startup+0x3992): undefined reference to `pcreposix_regfree'
Main.cpp:(.text.startup+0x48a2): undefined reference to `pcreposix_regerror'
/home/biocog/projects/rvtests/libVcf/lib-vcf.a(VCFExtractor.o): In function `VCFExtractor::passFilter()':
VCFExtractor.cpp:(.text+0x10db): undefined reference to `pcreposix_regexec'
VCFExtractor.cpp:(.text+0x1105): undefined reference to `pcreposix_regerror'
/home/biocog/projects/rvtests/libVcf/lib-vcf.a(VCFFilter.o): In function `VCFSiteFilter::~VCFSiteFilter()':
VCFFilter.cpp:(.text+0xb4): undefined reference to `pcreposix_regfree'
/home/biocog/projects/rvtests/libVcf/lib-vcf.a(VCFFilter.o): In function `VCFSiteFilter::~VCFSiteFilter()':
VCFFilter.cpp:(.text+0x124): undefined reference to `pcreposix_regfree'
collect2: error: ld returned 1 exit status
make[1]: *** [/home/biocog/projects/rvtests/executable/rvtest] Error 1
make[1]: Leaving directory `/home/biocog/projects/rvtests/src'
make: *** [release] Error 2

the pre-compiled binary seems to run fine on my machine but only on one core. i am working with very large files so multi-threading ability is necessary.

thank you for your excellent contribution to opensource bioinformatics software!

mike d

SNP ID/rsID in output

Would it be possible to add a column in the results output for SNP ID/rsID for the marker?

impute drop cause Segmentation fault (core dumped)

Hello,

I try to run rvtests for firth logistic regression. Every time I choose --impute drop, I got the segmentation fault.

Also I noticed that it seems only the first SNP got p-value calculated.All the rest of SNPs got -nan in the p-value column except the first SNP.

still negative AF, for binary trait

Hi, I just noticed that there are still negative AF value generated by RVTESTS, when my phenotype is a binary trait.

Can you please look into this?

Best regards,
Jie

VCF Line does not have correct column number and exsiting!

Dear Xiaowei

I have checked single chrosome vcf.gz file and bgziped chr.all vcf.gz file for generating kinship matrix.
It seems that in chr.all vcf.gz file some VCF line does not have correct column number as follows:

04-30 14:54 Line [ 8869314 ] does not have correct column number, exiting!
04-30 14:54 Current line has 2251 columns.
05-02 23:51 Line [ 16287134 ] does not have correct column number, exiting!
05-02 23:51 Current line has 861 columns.

Because we already checked single VCF.gz, so this error may be due to the bgzip code for bgziping all chromosome ?

The code for bgzip all chromosome are as follows:
(zcat chr1_notmonomorph.vcf.gz;
zgrep -v '^#' chr2_notmonomorph.vcf.gz;
zgrep -v '^#' chr3_notmonomorph.vcf.gz;
zgrep -v '^#' chr4_notmonomorph.vcf.gz;
zgrep -v '^#' chr5_notmonomorph.vcf.gz;
zgrep -v '^#' chr6_notmonomorph.vcf.gz;
zgrep -v '^#' chr7_notmonomorph.vcf.gz;
zgrep -v '^#' chr8_notmonomorph.vcf.gz;
zgrep -v '^#' chr9_notmonomorph.vcf.gz;
zgrep -v '^#' chr10_notmonomorph.vcf.gz;
zgrep -v '^#' chr11_notmonomorph.vcf.gz;
zgrep -v '^#' chr12_notmonomorph.vcf.gz;
zgrep -v '^#' chr13_notmonomorph.vcf.gz;
zgrep -v '^#' chr14_notmonomorph.vcf.gz;
zgrep -v '^#' chr15_notmonomorph.vcf.gz;
zgrep -v '^#' chr16_notmonomorph.vcf.gz;
zgrep -v '^#' chr17_notmonomorph.vcf.gz;
zgrep -v '^#' chr18_notmonomorph.vcf.gz;
zgrep -v '^#' chr19_notmonomorph.vcf.gz;
zgrep -v '^#' chr20_notmonomorph.vcf.gz;
zgrep -v '^#' chr21_notmonomorph.vcf.gz;
zgrep -v '^#' chr22_notmonomorph.vcf.gz;
zgrep -v '^#' chrX_notmonomorph.vcf.gz)| bgzip -c > HRC_chrall.vcf.gz

Is this correct?

In addition, another thing came to my mind:
our statistic plan ask us to use follow command to bgzip and index VCF file:
(grep ^"#" $your_old_vcf; grep -v ^"#" $your_old_vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c > $your_vcf_file
tabix -f -p vcf $your_vcf_file

this command are used for group-based rare variant tests.

If we don't conduct group-based rare variant tests, should we just use command as follow:
bgzip -c file.vcf > file.vcf.gz
tabix -p vcf file.vcf.gz

Thank you very much!

Best regards

Zishan

bed bim fam as inputs

how to use rvtest with .bed .bim .fam files as inputs ?
Or I am i missing simple thing?

Missing and unexpected column when running chrX analysis.

Dear Xiaowei

I have started running chrX analysis using generated kinship and xHemi Kinship. But when I running the command as follow :
rvtest-new --inVcf ChrX.combined.reorder.vcf.gz --pheno phenotypes0602.ped --pheno-name men_INV --kinship kinship_matrix.kinship --xLabel X --xHemiKinship chrx.kinship_matrix.xHemi.kinship --meta score,cov[windowSize=500000] --dosage DS --out HRCinv_men_chr23

I found some strange message as follows:
[INFO] Loaded 1386 male, 1490 female and 0 sex-unknonw samples from phenotypes.ped
[INFO] Analysis begins with [ 2876 ] samples...
[INFO] Family-based model specified. Loading kinship file...
[INFO] DONE: Loaded kinship file [ kinship_matrix.kinship ] successfully in [ 11.0 ] seconds.
[INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 124.7 ] seconds.
[INFO] Kinship eigen-decomposition file [ .xHemi.eigen ] detected and will be used for for X chromosome analysis
[INFO] DONE: Loaded kinship file [ chrx.kinship_matrix.xHemi.kinship ] successfully in [ 20.9 ] seconds.
[ERROR] Missing 'u1805' column!
[WARN] Failed to load spectral decomposition results of the kinship matrix!
[INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 202.1 ] seconds.
[INFO] Impute missing genotype to mean (by default)
[INFO] Use dosage genotype from VCF flag DS.
[INFO] Analysis started

It seems there are some error about missing column in kinship matrix and lead to fail to load spectral decomposition results of the kinship matrix, but after this warning message it said "Spectral decomposition of the kinship matrix succeeded".
Is this Ok to run chrX analysis ?
The strange thing is when I check chrx kinship matrix manually I didn't find this missing column in 'u1805'...

In addition , some times I also receive error massage about 'Unexpected column' as the previous issue...

Thank you for your help!

zishan

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.