Watershed-SV [Construction in progress]

Watershed-SV extends Watershed to model the impact of rare SVs (DUP, DEL, DUP-CNV, DEL-CNV, INV, INS) on nearby gene expressions outlier. For running Watershed model, please refer to Watershed GitHub. This repository contains:

the pipeline and associated scripts used for generating structural variations (rare and common) annotations with respect to nearby genes.
the scripts for generating expression outliers.
the approach to merge annotations AND the expression outliers, and finally format data into desired format for evaluate_watershed.R and predict_watershed.R.

Instructions for collecting annotations.

Watershed-SV pipeline currently uses bash script for simplicity. The key scripts are scripts/executable_scripts/generate_annotations.sh and scripts/executable_scripts/generate_annotations_ABC.sh depending on whether you want to run region-agnostic model (10kb model in the paper) or region-aware model (100kb model in the paper). To replicate the environment for collecting annotations, see: WatershedSV.yml.

Parameter breakdown:

-p | --pipeline: which pipeline to use, select from population, smallset. If data is of sufficient size, ie > 100, select population, allowing for option --filter-ethnicity, --filter rare. Otherwise, select smallset.
-v | --input-vcf: input vcf file, it has to have at least 1 sample column. We only consider SVTYPEs: DUP, DEL, DUP_CNV, DEL_CNV, CNV, INS, INV.
-f | --filters: if variant record in vcf have these filters, keep for further analysis.
-k | --flank: how much flanking up and downstream of genes to consider. usually use 100000, 10000.
-r | --rareness: rareness, if --filter-rare == True, then this is the MAF threshold to set to filter for rare variants, 0.01 recommended or lower.
-l | --liftover-bed: if you have a crossmap/liftover SV coordinate you want to use, ie, if VCF is in older build, you lifted over coordinates to HG38, then provide the bed file in addition to original VCF to convert coordinates.
-o | --outdir: output directory name for annotations
-b | --genome-bound-file: a file depicting the chromosome/contig name, start and end coordinates.
-g | --gencode-genes: gencode transcript model file.
-c | --vep-cache-dir: vep_cache_dir for running vep annotations. we recommend setting up vep offline to run our pipeline smoothly.
-a | --metadata: metadata file for filtering ethnicity. In our case, training data is GTEx, we used GTEx metadata file from dbGaP.
-e | --filter-ethnicity: filter by ethnicity? GTEx relic, True to only train on EUR individuals.
-i | --filter-rare: filter rare variants if using population model

jasonbhn / watershed-sv Goto Github PK

watershed-sv's Introduction

Watershed-SV [Construction in progress]

Instructions for collecting annotations.

Parameter breakdown:

Evaluating Watershed-SV model against WGS-only model

Using Watershed-SV model learned on a reference dataset to prioritize gene-SV pairs in an independent test dataset.

watershed-sv's People

Contributors

Watchers

watershed-sv's Issues

High medium low processing

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent