broadinstitute / gtex-v8

Notebooks and scripts for reproducing analyses and figures from the V8 GTEx Consortium paper

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 99.93% Python 0.07%

gtex-v8's Introduction

GTEx V8

This repository contains notebooks and scripts for reproducing analyses and figures from the paper The GTEx Consortium atlas of genetic regulatory effects across human tissues, Science, 2020.

Requirements

The following Python modules are needed to run the notebooks: numpy, pandas, scipy, ipython, jupyter, matplotlib, seaborn, qtl
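A quick way to confirm that these modules are importable before opening the notebooks is sketched below. The import names are assumptions: ipython is normally imported as `IPython`, and the remaining modules are assumed to be importable under the names in the list above.

```python
import importlib.util

# Import names assumed for the modules listed above; ipython is imported as "IPython".
REQUIRED_MODULES = [
    "numpy", "pandas", "scipy", "IPython",
    "jupyter", "matplotlib", "seaborn", "qtl",
]


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

The check only looks the modules up without importing them, so it is fast and has no side effects.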

The notebooks require data from the GTEx Portal to run (by default, the data is assumed to be accessible in the data directory of this repository). Running the following commands will download the relevant files:

cd data

# QTLs
wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL.tar && \
    tar xf GTEx_Analysis_v8_eQTL.tar && rm GTEx_Analysis_v8_eQTL.tar

wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_sQTL.tar && \
    tar xf GTEx_Analysis_v8_sQTL.tar && rm GTEx_Analysis_v8_sQTL.tar

wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL_independent.tar && \
    tar xf GTEx_Analysis_v8_eQTL_independent.tar && rm GTEx_Analysis_v8_eQTL_independent.tar

wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_sQTL_independent.tar && \
    tar xf GTEx_Analysis_v8_sQTL_independent.tar && rm GTEx_Analysis_v8_sQTL_independent.tar

wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_trans_eGenes_fdr05.txt

wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_trans_sGenes_fdr05.txt

wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL_expression_matrices.tar && \
    tar xf GTEx_Analysis_v8_eQTL_expression_matrices.tar && rm GTEx_Analysis_v8_eQTL_expression_matrices.tar

wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL_covariates.tar.gz && \
    tar xf GTEx_Analysis_v8_eQTL_covariates.tar.gz && rm GTEx_Analysis_v8_eQTL_covariates.tar.gz

wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_sQTL_groups.tar.gz && \
    tar xf GTEx_Analysis_v8_sQTL_groups.tar.gz && rm GTEx_Analysis_v8_sQTL_groups.tar.gz

# fine mapping results
wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_v8_finemapping_CAVIAR.tar && \
    tar xf GTEx_v8_finemapping_CAVIAR.tar && rm GTEx_v8_finemapping_CAVIAR.tar
wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_v8_finemapping_CaVEMaN.tar && \
    tar xf GTEx_v8_finemapping_CaVEMaN.tar && rm GTEx_v8_finemapping_CaVEMaN.tar
wget https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_v8_finemapping_DAPG.tar && \
    tar xf GTEx_v8_finemapping_DAPG.tar && rm GTEx_v8_finemapping_DAPG.tar

# annotation
wget https://storage.googleapis.com/gtex_analysis_v8/reference/gencode.v26.GRCh38.genes.gtf
wget https://storage.googleapis.com/gtex_analysis_v8/reference/WGS_Feature_overlap_collapsed_VEP_short_4torus.MAF01.txt.gz
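After the downloads finish, a short check like the following can confirm that the files saved directly by the commands above are present. This is a minimal sketch: it covers only the four files that are not unpacked from an archive (the contents of the tar archives are not listed here, so they are not checked), and the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# Files that the wget commands above save directly (not unpacked from a tar archive).
EXPECTED_FILES = [
    "GTEx_Analysis_v8_trans_eGenes_fdr05.txt",
    "GTEx_Analysis_v8_trans_sGenes_fdr05.txt",
    "gencode.v26.GRCh38.genes.gtf",
    "WGS_Feature_overlap_collapsed_VEP_short_4torus.MAF01.txt.gz",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # "data" is the default location named in this README.
    print("Missing files:", missing_files("data") or "none")
```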

A subset of the figures require genotype information. The VCF can be obtained from dbGaP (accession phs000424) or from AnVIL (requires dbGaP authentication).


gtex-v8's Issues

PEER Correction

Hello all,

We are currently calculating PEER factors for GTEx v8. For pre-processing, we applied almost the same filters that you did, except for the TPM filter and the sample selection; the normalisation method is also different. We wanted to redo PEER because our final sample set is not the same as the complete dataset.

With 1000 iterations, half of the tissues did not converge. We decided to run them again with a larger number of permitted iterations (100000). It has been running for almost a month so far and it is still not converging. The number of completed iterations currently ranges from 1500 to over 9000.

We were wondering whether it is normal for convergence to take this long.
Thanks a lot for your help.
Best,

Jean-Christophe

qval vs pval_beta in .egenes.txt.gz and .signif_variant_gene_pairs.txt.gz

Hi,

First off, thank you for all of the awesome work with GTEx v8. I am trying to compare my eQTL results to the GTEx v8 database (to see which eQTLs are replicated and which are novel to my cell type).

I think it would be optimal to compare my eQTLs with q-value < 0.05 to the *.signif_variant_gene_pairs.txt.gz files; however, I am struggling to find (or derive) q-values for each of these variant-gene pairs in GTEx:
https://gtexportal.org/home/datasets

As I understand the documentation from:
https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/README_eQTL_v8.txt
the p-values are adjusted in the following order:
pval_nominal -> pval_perm -> pval_beta -> qval

In the files I see:

The *.allpairs.txt.gz files only contain pval_nominal.
The *.signif_variant_gene_pairs.txt.gz files contain adjustments through pval_beta. Since pval_beta tends to exceed 0.05 in these files, I assume that this list of "significant variant-gene pairs" is the subset of *.allpairs.txt.gz with qval ≤ 0.05; however, there is no qval column.
The *.egenes.txt.gz files contain adjustments through qval.

I am confused as to why qval isn't included in all 3 file types. Given that the "eGenes are the rows with qval ≤ 0.05," I assume I should be using qval to identify which eQTLs replicate between my analysis and GTEx. It would be really helpful if the *.allpairs.txt.gz files contained a qval column for this comparison (or if there were code with which I could derive the qval column). At first I tried to compute qvalue(pval_beta) for the subset of significant variant-gene pairs, but this fails to reproduce the qval values because it uses only a subset of the variant-gene pairs.

Thanks in advance for any help you might be able to provide,
Chris
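To illustrate why q-values computed on a subset do not match those computed on the full set, here is a minimal pure-Python sketch of Storey q-values with the null proportion pi0 estimated at a single fixed lambda. It is an illustration under stated assumptions, not the GTEx pipeline code: the function name `storey_qvalues` is hypothetical, the p-values are made up, and the default lambda of 0.85 is taken from the supplementary text quoted in a later issue on this page.

```python
def storey_qvalues(pvals, lam=0.85):
    """Storey q-values with pi0 estimated at a single fixed lambda.

    Returns (qvals, pi0), with qvals in the same order as pvals.
    """
    m = len(pvals)
    if m == 0:
        raise ValueError("no p-values given")
    # pi0: estimated fraction of true nulls, from the share of p-values at or above lambda.
    pi0 = min(1.0, sum(p >= lam for p in pvals) / (m * (1.0 - lam)))
    if pi0 == 0.0:
        raise ValueError("pi0 estimate is 0: no p-values at or above lambda")
    # Walk from the largest p-value down, keeping a running minimum of p * m / rank.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    qvals = [0.0] * m
    running_min = 1.0
    for k, idx in enumerate(order):
        rank = m - k  # rank of this p-value in ascending order (1 = smallest)
        running_min = min(running_min, pvals[idx] * m / rank)
        qvals[idx] = pi0 * running_min
    return qvals, pi0


if __name__ == "__main__":
    # Made-up gene-level p-values (one per tested gene), for illustration only.
    pval_beta = [0.0001, 0.0005, 0.001, 0.004, 0.01, 0.02, 0.03, 0.08, 0.15, 0.22,
                 0.31, 0.40, 0.48, 0.55, 0.63, 0.70, 0.76, 0.82, 0.90, 0.97]
    qvals, pi0 = storey_qvalues(pval_beta)
    print(f"pi0 = {pi0:.3f}")
    print("genes with q <= 0.05:", sum(q <= 0.05 for q in qvals))
    # Re-running on only the significant genes changes both m and pi0,
    # so the q-values from the full set cannot be recovered.
    try:
        storey_qvalues([p for p, q in zip(pval_beta, qvals) if q <= 0.05])
    except ValueError as err:
        print("subset:", err)
```

Both the number of tests and the pi0 estimate come from the complete list of p-values, so a list restricted to significant results carries too little information to reproduce the original q-values.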

dominant and recessive modelling for GTEx-v8

Hi François,
Do we have dominant and recessive modeling results for GTEx-v8?
Thanks
Shicheng

Is it possible for some eGenes with qval <= 0.05 in the .egenes.txt file not to appear in the .signif_variant_gene_pairs.txt file?

Hi!
After searching and carefully reading all the materials online, I still have this problem:

To simplify

Is it normal that some of the eGenes selected with qval <= 0.05 from *.genes.txt are not included in *.signif_variant_gene_pairs.txt? Or does it mean something went wrong?

Detailed description

I have used the gtex-v8 pipeline to call eQTLs on my own dataset, and I successfully got the final results:
*.genes.txt (which I confirmed is the same as the *.egenes.txt file).

Using this file, I selected the significant eGenes by filtering on qval <= 0.05.
I also used annotate_outputs.py to generate *.signif_variant_gene_pairs.txt from my *.allpairs.txt and *.genes.txt.

My aim is to check all the significant variants for the eGenes with qval <= 0.05. (I understand that the original *.genes.txt shows only the most significant variant for each gene.)

The strange thing is that some eGenes with qval <= 0.05 did not appear in *.signif_variant_gene_pairs.txt.

I then checked the code in annotate_outputs.py and found that the eGenes missing from *.signif_variant_gene_pairs.txt share a common feature: their pval_nominal is larger than their pval_nominal_threshold (I checked both in *.genes.txt). That is why they were dropped from the significant variant-gene pairs file.

From my reading, you have mentioned that

the *.signif_variant_gene_pairs.txt files contain all pairs that pass the significance threshold for each eGene. So all the eGenes (that is, the genes with qval < 0.05 in the *.genes.txt file?) should have at least one pair in *.signif_variant_gene_pairs.txt. But they do not.

I also read Section 4.2 of the supplementary materials for the paper https://www.science.org/doi/full/10.1126/science.aaz1776 (which you mentioned in issue #2). It seems that the criteria for eGenes (qval <= 0.05) and for significant variant-gene pairs are different:

The beta distribution-extrapolated empirical P-values from FastQTL were used
to calculate gene-level q-values [73] with a fixed P-value interval for the estimation of π0 (the ‘lambda’ parameter was set to 0.85).

For each gene, variants with a nominal P-value below the gene-level threshold (fig. S8) were
considered significant and included in the final list of variant-gene pairs

Does that mean different statistical approaches were used for these two tables? I have tried my best to understand the statistical model described, but I may be misunderstanding something. Thanks for reading; I hope to hear from you!
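The two-step selection described in this issue can be sketched as follows. This is a minimal illustration, not the code in annotate_outputs.py: the function names are hypothetical, the rows are made up, the column names `gene_id` and `variant_id` are assumed, and the strict `<` comparison follows the word "below" in the quoted supplementary text. It reproduces the reported situation without settling whether that situation is expected.

```python
def significant_pairs(allpairs, genes, fdr=0.05):
    """Keep variant-gene pairs whose nominal p-value is below their gene's threshold.

    allpairs: dicts with "gene_id", "variant_id", "pval_nominal"
    genes:    dicts with "gene_id", "qval", "pval_nominal_threshold"
    """
    # Step 1: eGenes are the genes with a gene-level q-value at or below the FDR cutoff.
    thresholds = {g["gene_id"]: g["pval_nominal_threshold"]
                  for g in genes if g["qval"] <= fdr}
    # Step 2: within each eGene, keep variants below that gene's nominal threshold.
    return [p for p in allpairs
            if p["gene_id"] in thresholds
            and p["pval_nominal"] < thresholds[p["gene_id"]]]


def egenes_without_pairs(allpairs, genes, fdr=0.05):
    """List eGenes (qval <= fdr) that contribute no row to the significant-pairs list."""
    kept = {p["gene_id"] for p in significant_pairs(allpairs, genes, fdr)}
    return sorted(g["gene_id"] for g in genes
                  if g["qval"] <= fdr and g["gene_id"] not in kept)


if __name__ == "__main__":
    # Made-up values for illustration only.
    genes = [
        {"gene_id": "geneA", "qval": 0.01, "pval_nominal_threshold": 1e-4},
        {"gene_id": "geneB", "qval": 0.04, "pval_nominal_threshold": 1e-5},
        {"gene_id": "geneC", "qval": 0.30, "pval_nominal_threshold": 1e-5},
    ]
    allpairs = [
        {"gene_id": "geneA", "variant_id": "var1", "pval_nominal": 2e-6},
        {"gene_id": "geneA", "variant_id": "var2", "pval_nominal": 5e-3},
        {"gene_id": "geneB", "variant_id": "var3", "pval_nominal": 3e-5},
        {"gene_id": "geneC", "variant_id": "var4", "pval_nominal": 1e-8},
    ]
    print(significant_pairs(allpairs, genes))
    print(egenes_without_pairs(allpairs, genes))
```

In this toy input, geneB passes the q-value filter but its only variant is not below the gene's nominal threshold, so it has no row in the filtered list, which is the pattern reported above.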

sb-eQTLs dataset in GTEx portal

Hello,

If I understand correctly, the sb-eQTLs available in the portal are single-tissue conditionally independent cis-eQTLs. In the genes I have looked up so far, there seems to be one eQTL per gene per tissue, but I was expecting to sometimes see more than one independent eQTL per gene within a tissue. Was the sex-biased analysis performed only on significant cis-eQTLs from the sex-combined analysis? Or was some prior filtering done? Any information that helps explain how that dataset was generated would be appreciated!

Thanks in advance!!

Data for Supplementary Figure S34

Hello,

I would like to know whether you have made available the correlation data shown as heatmaps in Supplementary Figure S34 (i.e. pairwise tissue sharing). I am especially interested in the correlation data for panels A, B and F.

Thanks in advance!

Frida

INDEL, Structural Variation in eQTL, sQTL

Dear Team,

I am wondering whether indels, CNVs and structural variants are included in the eQTL and sQTL analyses, or whether only SNPs are included.

Thanks.

Shicheng