Git Product home page Git Product logo

tensorqtl's Introduction

tensorQTL

tensorQTL is a GPU-based QTL mapper, enabling ~200-300 fold faster cis- and trans-QTL mapping compared to CPU-based implementations.

If you use tensorQTL in your research, please cite the following paper: Taylor-Weiner, Aguet, et al., Genome Biol. 20:228, 2019.

Empirical beta-approximated p-values are computed as described in FastQTL (Ongen et al., 2016).

Install

You can install tensorQTL using pip:

pip3 install tensorqtl

or directly from this repository:

$ git clone [email protected]:broadinstitute/tensorqtl.git
$ cd tensorqtl
# set up virtual environment and install
$ virtualenv venv
$ source venv/bin/activate
(venv)$ pip install -r install/requirements.txt .

Requirements

tensorQTL requires an environment configured with a GPU. Instructions for setting up a virtual machine on Google Cloud Platform are provided here.

Input formats

tensorQTL requires three input files: genotypes, phenotypes, and covariates. Phenotypes must be provided in BED format (phenotypes x samples), and covariates as a text file (covariates x samples). Both are in the format used by FastQTL. Genotypes must currently be in PLINK format, and can be converted as follows:

plink2 --make-bed \
    --output-chr chrM \
    --vcf ${plink_prefix_path}.vcf.gz \
    --out ${plink_prefix_path}

Examples

For examples illustrating cis- and trans-QTL mapping, please see tensorqtl_examples.ipynb.

Running tensorQTL from the command line

This section describes how to run tensorQTL from the command line. For a full list of options, run

python3 -m tensorqtl --help

cis-QTL mapping

Phenotype-level summary statistics with empirical p-values:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode cis

All variant-phenotype associations:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode cis_nominal

This will generate a parquet file for each chromosome. These files can be read using pandas:

import pandas as pd
df = pd.read_parquet(file_name)

Conditionally independent cis-QTL (as described in GTEx Consortium, 2017):

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --cis_output ${cis_output_file} \
    --mode cis_independent

trans-QTL mapping

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode trans

For trans-QTL mapping, tensorQTL generates sparse output by default (associations with p-value < 1e-5). cis-associations are filtered out. The output is in parquet format, with four columns: phenotype_id, variant_id, pval, maf.

Running tensorQTL as a Python module

TensorQTL can also be run as a module to more efficiently run multiple analyses:

import pandas as pd
import tensorqtl
from tensorqtl import genotypeio, cis, trans

Loading input files

Load phenotypes and covariates:

phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(phenotype_bed_file)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T  # samples x covariates

Phenotypes must be provided in BED format (for compatibility with FastQTL), with a single header line and the first four columns containing: chr, start, end, phenotype_id. end is assumed to correspond to the TSS (or center of the cis-window). The remaining columns correspond to samples. covariates_file is assumed to be tab-delimited and in the format covariates x samples.

Genotypes can be loaded as follows, where plink_prefix_path is the path to the VCF in PLINK format:

pr = genotypeio.PlinkReader(plink_prefix_path)
# load genotypes and variants into data frames
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]

To save memory when using genotypes for a subset of samples, you can specify the samples as follows (this is not strictly necessary, since tensorQTL will select the relevant samples from genotype_df otherwise):

pr = genotypeio.PlinkReader(plink_prefix_path, select_samples=phenotype_df.columns)

cis-QTL mapping: permutations

cis_df = cis.map_cis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, covariates_df)
tensorqtl.calculate_qvalues(cis_df, qvalue_lambda=0.85)

cis-QTL mapping: summary statistics for all variant-phenotype pairs

cis.map_nominal(genotype_df, variant_df, phenotype_df, phenotype_pos_df,
                covariates_df, prefix, output_dir='.')

cis-QTL mapping: conditionally independent QTLs

This requires the output from the permutations step (map_cis) above.

indep_df = cis.map_independent(genotype_df, variant_df, cis_df,
                               phenotype_df, phenotype_pos_df, covariates_df)

cis-QTL mapping: interactions

Instead of mapping the standard linear model (p ~ g), includes an interaction term (p ~ g + i + gi) and returns full summary statistics for this model. The interaction term is a pd.Series mapping sample ID to interaction value. With the run_eigenmt=True option, eigenMT-adjusted p-values are computed.

cis.map_nominal(genotype_df, variant_df, phenotype_df, phenotype_pos_df, covariates_df, prefix,
                interaction_s=interaction_s, maf_threshold_interaction=0.05,
                group_s=None, run_eigenmt=True, output_dir='.')

Full summary statistics are saved as parquet files for each chromosome, in ${output_dir}/${prefix}.cis_qtl_pairs.${chr}.parquet, and the top association for each phenotype is saved to ${output_dir}/${prefix}.cis_qtl_top_assoc.txt.gz. In these files, the columns b_g, b_g_se, pval_g are the effect size, standard error, and p-value of g in the model, with matching columns for i and gi. In the *.cis_qtl_top_assoc.txt.gz file, tests_emt is the effective number of independent variants in the cis-window estimated with eigenMT, i.e., based on the eigenvalue decomposition of the regularized genotype correlation matrix (Davis et al., AJHG, 2016). pval_emt = pval_gi * tests_emt, and pval_adj_bh are the Benjamini-Hochberg adjusted p-values corresponding to pval_emt.

trans-QTL mapping

trans_df = trans.map_trans(genotype_df, phenotype_df, covariates_df, return_sparse=True)

tensorqtl's People

Contributors

francois-a avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.