TreeWAS

This repository contains R code to perform TreeWAS analysis. For a full description of the method see preprint here or available within this repository.

Install

The code is available as an R package. To install the package clone the repository and install:

R CMD INSTALL TreeWAS

or alternatively, do this from within an R session

library(devtools)
install_github("mcveanlab/TreeWAS/TreeWAS")

Input File Formats

Data for TreeWAS analysis is encoded in three files.

Sample file

Assumes two columns with header. First column has samples IDs and second column contains either a genotype (0/1/2) or a continuous variable such as a genetic risk score (GRS). Sample ID must be the first column.

ID GRS
1 2.91109682117209
2 3.58725581286855
3 4.0187625426125
4 4.25960701964853
5 3.38657230426473
6 4.27097017945458
7 3.72455637560711

Phenotype file

A file with multiple columns and a header. One row per individual. The column ID contains sample IDs and it doesn’t need to be the first column. All other columns are assumed to be phenotypes coded as 0 or 1, column names are phenotype codes which should be present in the tree file (matching column coding in the tree file).

ID R198 R104 M512 L720 M8414 L031 K802 S662 K267
1 1 1 1 0 0 0 0 0 0 
2 0 0 0 1 0 0 0 0 0 
3 0 0 0 0 1 0 0 0 0 
4 0 0 0 0 0 1 0 0 0 
5 0 0 0 0 0 0 1 1 0 
6 0 1 0 0 0 0 0 0 1 
7 0 0 0 0 0 0 0 0 0 
8 0 0 0 0 0 0 0 0 0 
9 0 0 0 0 0 0 0 0 0

Tree file

File defining the topology of the diagnosis tree. These files can be downloaded from the UK Biobank Showcase website. For example, the tree describing the encoding of Non-cancer Illnesses (data-field 20002) is encoded using Data-Coding 6 of the UK Biobank. This file can be downloaded here. File is assumed to be tab-delimited. A function provided in the package will parse and sort the tree for TreeWAS analysis. The first few lines of the tree for Data-Coding 19 (corresponding to the ICD-10 tree) are shown below.

coding	meaning	node_id	parent_id	selectable
A00	A00 Cholera	286	23	N
A000	A00.0 Cholera due to Vibrio cholerae 01, biovar cholerae	287	286	Y
A001	A00.1 Cholera due to Vibrio cholerae 01, biovar el tor	288	286	Y
A009	A00.9 Cholera, unspecified	289	286	Y
A01	A01 Typhoid and paratyphoid fevers	290	23	N
A010	A01.0 Typhoid fever	291	290	Y
A011	A01.1 Paratyphoid fever A	292	290	Y
A012	A01.2 Paratyphoid fever B	293	290	Y
A013	A01.3 Paratyphoid fever C	294	290	Y

Sample inclusion file

A list of sample IDs to include in the analysis can be parsed to the script with the --keep argument. We assume one sample ID per line.

Running TreeWAS

Three scripts are provided to run TreeWAS analysis with different type of genetic variation and/or genetic models.

Analysing a genetic risk score

The script grs_tree_analysis.R performs TreeWAS analysis on a GRS. The script takes the following arguments:

Argument	Description
sample_file	Sample file. Cannot be null.
pheno_file	Phenotype file. Cannot be null.
tree_file	Tree file. Cannot be null.
outprefix	Prefix to use for results filenames. Defaults to “out”.
theta	Prior on the mutation rate. Defaults to 1/3.
p1	Prior on the proportion of active nodes in the tree. Defaults to 0.001.
keep	Sample inclusion filename. Optional.
num.cores	Number of cores to use. Defaults to 1. If greater than one the `parallel` package will be used.
b1_max_mag	The prior is symmetric around zero. This parameter controls the range of the effect size.
b1_spac	The grid size.

To do a GRS analysis on the test data, use the following command.

./scripts/grs_tree_analysis.R \
    --sample_file=example_data/sample_file_grs.txt \
    --tree_file=example_data/tree_example_ICD10_Chap_VI.txt \
    --pheno_file=example_data/phenotype_file.txt \
    --outprefix=test_grs.res \
    --num.cores=1

Case-control study

The scripts cc_snp_tree_analysis.R and cc_snp_tree_analysis_additive.R perform case-control association analysis. The scripts take the following arguments:

Argument	Description
sample_file	Sample file. Cannot be null.
pheno_file	Phenotype file. Cannot be null.
tree_file	Tree file. Cannot be null.
outprefix	Prefix to use for results filenames. Defaults to “out”.
theta	Prior on the mutation rate. Defaults to 1/3.
p1	Prior on the proportion of active nodes in the tree. Defaults to 0.001.
keep	Sample inclusion filename. Optional.
num.cores	Number of cores to use. Defaults to 1. If greater than one the `parallel` package will be used.
b{1,2}_max_mag	The prior is symmetric around zero. This parameter controls the range of the effect sizes (b1 for the het genotype and b2 for the hom).
b{1,2}_spac	The grid size.

To Run the analysis with the test data fitting an additive model do:

./scripts/cc_snp_tree_analysis_additive.R \
    --sample_file='example_data/sample_file_gen.txt' \
    --tree_file='example_data/tree_example_ICD10_Chap_VI.txt' \
    --pheno_file='example_data/phenotype_file.txt' \
    --outprefix='test_gen.res' \
    --b1_max_mag=2 \
    --b1_spac=0.02 \
    --num.cores=1

or with a full genetic model:

./scripts/cc_snp_tree_analysis.R \
    --sample_file='example_data/sample_file_gen.txt' \
    --tree_file='example_data/tree_example_ICD10_Chap_VI.txt' \
    --pheno_file='example_data/phenotype_file.txt' \
    --outprefix='test_gen2.res' \
    --theta=0.33333 \
    --p1=0.001 \
    --b1_max_mag=3 \
    --b2_max_mag=3 \
    --b1_spac=0.02 \
    --b2_spac=0.02 \
    --num.cores=1

Citation

If you use TreeWAS in your work, please cite us:

Cortes A., et al. (2017) Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank. bioRxiv 105122. doi: https://doi.org/10.1101/105122

mcveanlab / treewas Goto Github PK

treewas's Introduction

TreeWAS

Install

Input File Formats

Sample file

Phenotype file

Tree file

Sample inclusion file

Running TreeWAS

Analysing a genetic risk score

Case-control study

Citation

treewas's People

Contributors

Stargazers

Watchers

Forkers

treewas's Issues

Recommend Projects

Recommend Topics

Recommend Org