TreeWAS

License: MIT

This repository contains R code to perform TreeWAS analysis. For a full description of the method, see the preprint listed under Citation, which is also available within this repository.

Install

The code is available as an R package. To install it, clone the repository and run:

R CMD INSTALL TreeWAS

Alternatively, install it from within an R session:

library(devtools)
install_github("mcveanlab/TreeWAS/TreeWAS")

Input File Formats

Data for TreeWAS analysis is encoded in three files, plus an optional sample inclusion file.

Sample file

The sample file has two columns and a header. The first column must contain the sample IDs. The second column contains either a genotype (0/1/2) or a continuous variable such as a genetic risk score (GRS).

ID GRS
1 2.91109682117209
2 3.58725581286855
3 4.0187625426125
4 4.25960701964853
5 3.38657230426473
6 4.27097017945458
7 3.72455637560711
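
The layout described above can be checked before an analysis is started. The following is a minimal sketch and not part of TreeWAS: it writes a small hypothetical file (sample_demo.txt) and uses awk to flag rows that do not have exactly two fields, or whose second field is not a plain decimal number. Values written in scientific notation would need a looser pattern.

```shell
# Hypothetical demo file in a temporary directory; point sample_file
# at your own sample file instead.
workdir=$(mktemp -d)
sample_file="$workdir/sample_demo.txt"
printf 'ID GRS\n1 2.91\n2 3.58\n3 4.01\n' > "$sample_file"

# Collect the line numbers of rows that break the expected layout:
# exactly two fields and, after the header, a numeric second field.
bad_rows=$(awk '
    NF != 2 { print NR; next }
    NR > 1 && $2 !~ /^-?[0-9]+(\.[0-9]+)?$/ { print NR }
' "$sample_file")

if [ -z "$bad_rows" ]; then
    echo "sample file looks OK"
else
    echo "rows to check: $bad_rows"
fi
```

Genotype files coded 0/1/2 pass the same check, since integers match the pattern.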

Phenotype file

The phenotype file has a header and multiple columns, with one row per individual. The column named ID contains the sample IDs and does not need to be the first column. All other columns are assumed to be phenotypes coded as 0 or 1. Their column names are phenotype codes, which should match entries in the coding column of the tree file.

ID R198 R104 M512 L720 M8414 L031 K802 S662 K267
1 1 1 1 0 0 0 0 0 0 
2 0 0 0 1 0 0 0 0 0 
3 0 0 0 0 1 0 0 0 0 
4 0 0 0 0 0 1 0 0 0 
5 0 0 0 0 0 0 1 1 0 
6 0 1 0 0 0 0 0 0 1 
7 0 0 0 0 0 0 0 0 0 
8 0 0 0 0 0 0 0 0 0 
9 0 0 0 0 0 0 0 0 0 
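
Because every phenotype column name should appear in the tree file, it can help to list the names that do not. The sketch below is an illustration under assumptions, not a TreeWAS utility: the two demo files and the code Z999 are hypothetical, and the tree rows are copied from the ICD-10 excerpt in this README. It prints the phenotype codes that are absent from the first tab-separated column of the tree file.

```shell
workdir=$(mktemp -d)
tree_file="$workdir/tree_demo.txt"
pheno_file="$workdir/pheno_demo.txt"

# Tab-delimited tree rows (hypothetical miniature tree).
{
    printf 'coding\tmeaning\tnode_id\tparent_id\tselectable\n'
    printf 'A000\tA00.0 Cholera due to Vibrio cholerae 01, biovar cholerae\t287\t286\tY\n'
    printf 'A001\tA00.1 Cholera due to Vibrio cholerae 01, biovar el tor\t288\t286\tY\n'
} > "$tree_file"

# Phenotype file; Z999 is deliberately absent from the tree.
printf 'ID A000 A001 Z999\n1 1 0 0\n2 0 1 0\n' > "$pheno_file"

missing=$(awk '
    # First file (tree): remember each value of the first tab-separated column.
    NR == FNR {
        if (FNR > 1) { split($0, f, "\t"); known[f[1]] = 1 }
        next
    }
    # Second file (phenotypes): inspect the header line only.
    FNR == 1 {
        for (i = 1; i <= NF; i++)
            if ($i != "ID" && !($i in known)) print $i
    }
' "$tree_file" "$pheno_file")

echo "codes missing from the tree: ${missing:-none}"
```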

Tree file

The tree file defines the topology of the diagnosis tree. Tree files can be downloaded from the UK Biobank Showcase website. For example, the tree describing the encoding of Non-cancer Illnesses (data-field 20002) uses Data-Coding 6 of the UK Biobank. The file is assumed to be tab-delimited. A function provided in the package parses and sorts the tree for TreeWAS analysis. The first few lines of the tree for Data-Coding 19 (the ICD-10 tree) are shown below.

coding	meaning	node_id	parent_id	selectable
A00	A00 Cholera	286	23	N
A000	A00.0 Cholera due to Vibrio cholerae 01, biovar cholerae	287	286	Y
A001	A00.1 Cholera due to Vibrio cholerae 01, biovar el tor	288	286	Y
A009	A00.9 Cholera, unspecified	289	286	Y
A01	A01 Typhoid and paratyphoid fevers	290	23	N
A010	A01.0 Typhoid fever	291	290	Y
A011	A01.1 Paratyphoid fever A	292	290	Y
A012	A01.2 Paratyphoid fever B	293	290	Y
A013	A01.3 Paratyphoid fever C	294	290	Y
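
The node_id and parent_id columns are enough to cut a sub-tree out of a larger file. The sketch below is one way to do that with awk, under stated assumptions: the demo file names are hypothetical, the rows are taken from the excerpt above, and whether the package's tree parser accepts a pruned file without the ancestors of the chosen root is not covered in this README.

```shell
workdir=$(mktemp -d)
tree_file="$workdir/tree_demo.txt"
subtree_file="$workdir/subtree_demo.txt"

# Tab-delimited rows taken from the Data-Coding 19 excerpt.
{
    printf 'coding\tmeaning\tnode_id\tparent_id\tselectable\n'
    printf 'A00\tA00 Cholera\t286\t23\tN\n'
    printf 'A000\tA00.0 Cholera due to Vibrio cholerae 01, biovar cholerae\t287\t286\tY\n'
    printf 'A009\tA00.9 Cholera, unspecified\t289\t286\tY\n'
    printf 'A01\tA01 Typhoid and paratyphoid fevers\t290\t23\tN\n'
    printf 'A010\tA01.0 Typhoid fever\t291\t290\tY\n'
} > "$tree_file"

# Keep the header, the node whose node_id equals root, and all descendants.
awk -F '\t' -v root=286 '
    NR == 1 { header = $0; next }
    { line[NR] = $0; id[NR] = $3; parent[NR] = $4 }
    END {
        keep[root] = 1
        # Repeat until no new descendant is found, so row order does not matter.
        do {
            changed = 0
            for (i = 2; i <= NR; i++)
                if ((parent[i] in keep) && !(id[i] in keep)) {
                    keep[id[i]] = 1
                    changed = 1
                }
        } while (changed)
        print header
        for (i = 2; i <= NR; i++)
            if (id[i] in keep) print line[i]
    }
' "$tree_file" > "$subtree_file"

cat "$subtree_file"
```

With root=286 the output keeps the header and the A00, A000 and A009 rows, and drops the A01 branch.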

Sample inclusion file

A list of sample IDs to include in the analysis can be passed to the scripts with the --keep argument. The file should contain one sample ID per line.
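
A sample inclusion file can be derived from the other inputs. As a rough sketch (the file names and demo values are hypothetical, and this is not a script shipped with TreeWAS), the following keeps the IDs that occur in both the sample file and the phenotype file, locating the ID column of the phenotype file by its header name since it need not come first.

```shell
workdir=$(mktemp -d)
sample_file="$workdir/sample_demo.txt"
pheno_file="$workdir/pheno_demo.txt"
keep_file="$workdir/keep_demo.txt"

printf 'ID GRS\n1 2.91\n2 3.58\n3 4.01\n' > "$sample_file"
# ID is deliberately the second column here.
printf 'R198 ID R104\n1 2 0\n0 3 1\n0 9 0\n' > "$pheno_file"

awk '
    # First file: remember every sample ID (first column, header skipped).
    NR == FNR { if (FNR > 1) in_sample[$1] = 1; next }
    # Second file, header: find which column is named ID.
    FNR == 1 { for (i = 1; i <= NF; i++) if ($i == "ID") idcol = i; next }
    # Second file, data rows: print IDs that also occur in the sample file.
    ($idcol in in_sample) { print $idcol }
' "$sample_file" "$pheno_file" > "$keep_file"

cat "$keep_file"
```

The resulting file has one ID per line, which is the layout the --keep argument expects.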

Running TreeWAS

Three scripts are provided to run TreeWAS analysis with different types of genetic variation and/or genetic models.

Analysing a genetic risk score

The script grs_tree_analysis.R performs TreeWAS analysis on a GRS. The script takes the following arguments:

Argument     Description
sample_file  Sample file. Cannot be null.
pheno_file   Phenotype file. Cannot be null.
tree_file    Tree file. Cannot be null.
outprefix    Prefix to use for results filenames. Defaults to “out”.
theta        Prior on the mutation rate. Defaults to 1/3.
p1           Prior on the proportion of active nodes in the tree. Defaults to 0.001.
keep         Sample inclusion filename. Optional.
num.cores    Number of cores to use. Defaults to 1. If greater than one, the parallel package will be used.
b1_max_mag   The prior is symmetric around zero. This parameter controls the range of the effect size.
b1_spac      The grid size.

To do a GRS analysis on the test data, use the following command.

./scripts/grs_tree_analysis.R \
    --sample_file=example_data/sample_file_grs.txt \
    --tree_file=example_data/tree_example_ICD10_Chap_VI.txt \
    --pheno_file=example_data/phenotype_file.txt \
    --outprefix=test_grs.res \
    --num.cores=1

Case-control study

The scripts cc_snp_tree_analysis.R and cc_snp_tree_analysis_additive.R perform case-control association analysis. The scripts take the following arguments:

Argument        Description
sample_file     Sample file. Cannot be null.
pheno_file      Phenotype file. Cannot be null.
tree_file       Tree file. Cannot be null.
outprefix       Prefix to use for results filenames. Defaults to “out”.
theta           Prior on the mutation rate. Defaults to 1/3.
p1              Prior on the proportion of active nodes in the tree. Defaults to 0.001.
keep            Sample inclusion filename. Optional.
num.cores       Number of cores to use. Defaults to 1. If greater than one, the parallel package will be used.
b{1,2}_max_mag  The prior is symmetric around zero. This parameter controls the range of the effect sizes (b1 for the heterozygous genotype and b2 for the homozygous genotype).
b{1,2}_spac     The grid size.

To run the analysis on the test data with an additive model, use:

./scripts/cc_snp_tree_analysis_additive.R \
    --sample_file='example_data/sample_file_gen.txt' \
    --tree_file='example_data/tree_example_ICD10_Chap_VI.txt' \
    --pheno_file='example_data/phenotype_file.txt' \
    --outprefix='test_gen.res' \
    --b1_max_mag=2 \
    --b1_spac=0.02 \
    --num.cores=1

or with a full genetic model:

./scripts/cc_snp_tree_analysis.R \
    --sample_file='example_data/sample_file_gen.txt' \
    --tree_file='example_data/tree_example_ICD10_Chap_VI.txt' \
    --pheno_file='example_data/phenotype_file.txt' \
    --outprefix='test_gen2.res' \
    --theta=0.33333 \
    --p1=0.001 \
    --b1_max_mag=3 \
    --b2_max_mag=3 \
    --b1_spac=0.02 \
    --b2_spac=0.02 \
    --num.cores=1
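
The max_mag and spac values in the two commands above determine how many effect sizes are evaluated. The arithmetic below is a sketch under an assumption that this README does not confirm: that the grid runs from -max_mag to +max_mag in steps of spac, and that the full model evaluates every (b1, b2) pair. The helper name grid_points is hypothetical.

```shell
# Assumption (not stated in this README): the effect-size grid runs from
# -max_mag to +max_mag in steps of spac.
grid_points() {
    awk -v max_mag="$1" -v spac="$2" \
        'BEGIN { printf "%d\n", int(2 * max_mag / spac + 0.5) + 1 }'
}

n_additive=$(grid_points 2 0.02)   # additive example: b1 only
n_b1=$(grid_points 3 0.02)         # full model example: b1
n_b2=$(grid_points 3 0.02)         # full model example: b2
n_full=$((n_b1 * n_b2))            # assumed: every (b1, b2) combination

echo "additive model grid points: $n_additive"
echo "full model grid points:     $n_full"
```

If the assumption holds, halving the spacing roughly doubles the work for the additive model and quadruples it for the full model, which is worth keeping in mind when choosing these values.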

Citation

If you use TreeWAS in your work, please cite us:

Cortes A., et al. (2017) Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank. bioRxiv 105122. doi: https://doi.org/10.1101/105122

Contributors

ac257, shajoezhu


Issues

A feature request: phenotype transformation and --include option

Hi, there:

This tool is really nice. However, I have difficulty following the scripts to run it.

First, my phenotype file, produced by running ukbconv, looks like this:
[image: ICD]

Could you please provide a script that converts such data into the format below?
ID R198 R104 M512 L720 M8414 L031 K802 S662 K267
1 1 1 1 0 0 0 0 0 0
2 0 0 0 1 0 0 0 0 0

Since there are ~20,000 ICD codes and 500,000 individuals, it might be too big to put all the codes in a single file. So, it would be necessary to have an option to specify a "phenotype inclusion file". For example, I might want to analyze the sub-tree for circulatory diseases, i.e., all ICD-codes starting with letter "I".

Please kindly let me know if that is doable.

Best regards,
Jie
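
One possible shape for such a conversion is sketched below. It is not an official TreeWAS script, and the input layout is an assumption, because the ukbconv output from the request is not reproduced here: the sketch expects a hypothetical long-format file with a header and one "ID code" pair per line. The prefix variable restricts the output to codes that start with a given string, such as "I" for the circulatory chapter.

```shell
workdir=$(mktemp -d)
long_file="$workdir/long_demo.txt"
wide_file="$workdir/wide_demo.txt"

# Hypothetical long-format input: one diagnosis per line.
printf 'ID code\n1 I10\n1 I251\n2 J45\n3 I10\n' > "$long_file"

awk -v prefix="I" '
    NR == 1 { next }                                   # skip the header
    !($1 in seen_id) { seen_id[$1] = 1; ids[++n_id] = $1 }
    index($2, prefix) == 1 {                           # code starts with prefix
        if (!($2 in seen_code)) { seen_code[$2] = 1; codes[++n_code] = $2 }
        has[$1, $2] = 1
    }
    END {
        # Header row: ID followed by the retained codes in order of appearance.
        printf "ID"
        for (j = 1; j <= n_code; j++) printf " %s", codes[j]
        printf "\n"
        # One row per individual, 1 where the code was recorded and 0 otherwise.
        for (i = 1; i <= n_id; i++) {
            printf "%s", ids[i]
            for (j = 1; j <= n_code; j++) {
                v = ((ids[i], codes[j]) in has) ? 1 : 0
                printf " %d", v
            }
            printf "\n"
        }
    }
' "$long_file" > "$wide_file"

cat "$wide_file"
```

Individuals with no retained code, such as ID 2 in the demo, still get a row of zeros, so the output keeps one row per individual.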

How to include known covariates?

Many thanks for developing the interesting TreeWAS method and putting together this wonderful set of scripts!

I wonder if I could easily add a few known covariates (e.g. sex, age, genotype-based PCs) in the model given the current implementation. I couldn't find a corresponding file to input the covariate information.

If not, what are the recommended ways of dealing with these covariates when performing TreeWAS analysis?

Thanks in advance for your time and help on this issue!

Issue with "'s" in tree_example_ICD10_Chap_VI.txt in example_data

Hello, I wanted to inform you that there is an issue with "'s" in the tree_example_ICD10_Chap_VI.txt in the example_data folder. When running the grs_tree_analysis.R file with the example files given, the dimensions of the results are 122x8 instead of 155x8. We fixed this issue by replacing all "'s" with "s".
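
The symptom is consistent with how R's read.table handles quotes: by default it treats both the double quote and the apostrophe as quote characters, so an unpaired apostrophe in the meaning column can swallow the lines that follow. Whether the package reads the tree file this way is an assumption. The sketch below reproduces the workaround from the report on a hypothetical two-row tree (the node IDs are made up): it counts the lines containing an apostrophe, replaces every "'s" with "s", and counts again.

```shell
workdir=$(mktemp -d)
tree_in="$workdir/tree_demo.txt"
tree_out="$workdir/tree_demo_clean.txt"

# Hypothetical tab-delimited rows; node_id and parent_id are made up.
{
    printf "coding\tmeaning\tnode_id\tparent_id\tselectable\n"
    printf "G20\tG20 Parkinson's disease\t10\t1\tY\n"
    printf "G30\tG30 Alzheimer's disease\t11\t1\tN\n"
} > "$tree_in"

# Count lines that contain an apostrophe before cleaning.
n_before=$(grep -c "'" "$tree_in" || true)

# Replace every "'s" with "s", as described in the report.
sed "s/'s/s/g" "$tree_in" > "$tree_out"

# grep -c exits non-zero when the count is 0, hence the "|| true".
n_after=$(grep -c "'" "$tree_out" || true)

echo "lines with apostrophes: before=$n_before after=$n_after"
```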

Candidate genes and prevalence

Dear author

I am Dr Nicola Palmieri from the Vetmeduni Vienna (Austria). I performed a GWAS using 2024 E. coli avian strains by fitting the phenotype "pathogenic in chicken" encoded as 1 (pathogenic) and 0 (non-pathogenic) and using gene presence/absence as genotype input.

I have got a list of 65 candidate genes, however, when I compute the prevalence of each candidate gene in each phenotype, I also get genes with high prevalence in non-pathogenic strains vs pathogenic. So my question is: Why do I get genes with high prevalence in non-pathogenic strains and low prevalence in pathogenic strains? Is treeWAS detecting candidates in both directions? If yes, can I simply filter the candidates selecting only the ones that have at least 50% prevalence in pathogenic strains and a ratio of pathogenic/non-pathogenic > 1 or 1.something?

Thank you!
Nicola

Phenotypes influencing each other

Say you hypothetically had three phenotypes - cough, headache, and congestion. These phenotypes will all highly correlate with each other among individuals. Wouldn't this correlation inflate the Bayes factor? How can I account for correlation among tested phenotypes?
