Git Product home page Git Product logo

tobiasrausch / vc Goto Github PK

View Code? Open in Web Editor NEW
15.0 3.0 1.0 99 KB

A tutorial on structural variant calling for short read sequencing data

Home Page: https://www.youtube.com/watch?v=zO9WCOaq3aQ

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 14.00% Makefile 33.20% Shell 7.29% R 45.51%
variant ngs alignment mutations gear-genomics delly structural-variation genomics-visualization rearrangements

vc's Introduction

Structural variant calling tutorial using delly.

Installation - Linux

Installation of required libraries depends on your linux distribution, e.g., for Ubuntu:

apt-get update

apt-get install -y autoconf build-essential cmake g++ git libcurl4-gnutls-dev libbz2-dev libdeflate-dev libgl1-mesa-dev libncurses-dev liblzma-dev pkg-config zlib1g-dev

Then clone the repository and install all dependencies using mamba:

git clone --recursive https://github.com/tobiasrausch/vc

cd vc

make all

Download the course data:

make FILE=1nxzjQFt33ch1P_nWF66q1tkHv5gxTaKb download

Installation - Mac OS

Load the conda environment with all required tools

conda env create -f environment.yml

Download and unpack the course data

cd data/ && gdown 1nxzjQFt33ch1P_nWF66q1tkHv5gxTaKb && tar -xzf sv.tar.gz && rm sv.tar.gz

Load the software stack

Load the mamba environment with all required tools

if [ ! -z ${CONDA_PREFIX+x} ]; then conda deactivate; fi
export PATH=`pwd`/mamba/bin:${PATH}

SV Calling

Discovery of chromothripsis in a cancer sample

In this practical we will analyze germline and somatic structural variants (SVs) of a chromothripsis sample from this cancer study. The anonymized data was filtered for chr2 to speed up all subsequent analysis. The tumor genome alignment file is named tumor.bam and the control genome alignment file is named control.bam.

Structural variant alignment quality control

Before each discovery of structural variants, you should assess the quality of the data, as, for example, paired-end methods are hampered by skewed insert size distributions, read-depth methods by uneven coverage, and split-read methods by high sequencing error rates. Common quality criteria are the percentage of reads mapped, the duplicate rate, number of properly paired reads and the shape of the insert size and coverage distributions. Picard, SAMtools, FastQC and Alfred are commonly used quality control tools that compute some of these alignment statistics as shown below for the tumor sample.

cd data/sv/
samtools flagstat tumor.bam
alfred qc -r chr2.fa -o qc.tsv.gz -j qc.json.gz tumor.bam
zcat qc.tsv.gz | grep ^ME | datamash transpose

Instead of parsing the tab-delimited file, you can also upload the JSON file qc.json.gz to the Alfred web application available on gear-genomics.com.

As you can see from the QC results, the data has been downsampled to 7x coverage to speed up all analyses. This implies that some SVs will have only weak support due to low coverage. In terms of QC interpretation, there are some general things to look out for, such as mapping percentages below 70%, >20% duplicates, or multiple peaks in the insert size distribution. Please note that many alignment statistics vary greatly depending on the protocol used, so it's usually best to compare several different sequencing runs from the same protocol (DNA-seq, RNA-seq, ChIP-seq, paired-end, single-end, or mate-pair) to highlight outliers.

Exercises

  • What is the median coverage of the data set?
  • Given the insert size distribution, what would be a suitable cutoff to define deletion supporting paired-ends?

Germline Structural Variants

Before we dive into SV calling, let's get an idea of what structural variants (SVs) look like in short-read sequencing data. For this I have prepared a BED file with some "simple" germline structure variants like deletions and some more complex examples.

cat svs.bed

Using IGV we can browse these SVs interactively.

igv -g chr2.fa

Once IGV has started use 'File' and 'Load from File' to load the tumor.bam and control.bam alignment file. Then import the svs.bed file from your working directory using 'Regions' and 'Import Regions'. You can then easily navigate to the structural variants with 'Regions' and 'Region Navigator'. Select a structural variant in the region navigator and click 'View', which will center the IGV alignment view on the selected structural variant. You can zoom in and out using the '+' and '-' signs in the toolbar at the top. To highlight the abnormal paired-ends please right click in IGV on the BAM file and activate 'View as pairs'. In the same menu, please open 'Color alignments by' and then switch to "pair orientation' for inversions and duplications. For deletions, you want to color the alignments by "insert size".

Plotting structural variants

IGV is excellent for interactive browsing but for large numbers of SVs you can use command-line tools such as wally to plot multiple SVs in batch.

wally region -R svs.bed -cp -g chr2.fa tumor.bam control.bam

Complex structural variants

Even in germline genomes we can observe complex SVs and two example regions are in the svs.bed file.

cat svs.bed | grep "complex"

As part of the 1000 Genomes SV consortium we validated some of the above complex SVs using PacBio. The reads are in a separate FASTA file called pacbio.sv1.fa and pacbio.sv2.fa. We need the subsequence of the reference to create a pairwise dotplot of the PacBio read against the reference. SAMtools is a convenient tool to extract such subsequences of a FASTA file.

samtools faidx chr2.fa chr2:18905691-18907969 | sed 's/^>.*$/>reference/' > sv1.fa
samtools faidx chr2.fa chr2:96210505-96212783 | sed 's/^>.*$/>reference/' > sv2.fa

Please create a dot plot of the above genomic reference subsequences sv1.fa and sv2.fa against the respective PacBio read pacbio.sv1.fa and pacbio.sv2.fa using wally.

cat pacbio.sv1.fa >> sv1.fa
wally dotplot sv1.fa
cat pacbio.sv2.fa >> sv2.fa
wally dotplot sv2.fa

Exercises

  • What kind of SV is present in the region chr2:18905691-18907969 ?
  • What kind of SV is present in the region chr2:96210505-96212783 ?

Delly structural variant calling

Delly is a method for detecting structural variants. Using the tumor and normal genome alignment, delly calculates structural variants and outputs them as a BCF file, the binary encoding of VCF. You can also provide a text file with regions to exclude from structural variant analysis. Delly's default exclusion map removes the telomeric and centromeric regions of all human chromosomes because these repetitive regions cannot be analyzed with short-read data.

delly call -q 20 -g chr2.fa -x hg19.ex -o sv.bcf tumor.bam control.bam

VCF encoding of structural variants

VCF was originally designed for short variants and that's why all SV callers heavily use the VCF INFO fields to encode additional information about the SV such as the structural variant end (INFO:END) and the SV type (INFO:SVTYPE). You can look at the header of the BCF file using grep where '-A 2' includes the first two structural variant records after the header in the file:

bcftools view sv.bcf | grep "^#" -A 2

Delly uses the VCF:INFO fields for structural variant site information, such as how confident the structural variant prediction is and how accurate the breakpoints are. The genotype fields contain the actual sample genotype, its genotype quality and genotype likelihoods and various count fields for the variant and reference supporting reads and spanning pairs. If you browse through the VCF file you will notice that a subset of the Delly structural variant predictions have been refined using split-reads. These precise variants are flagged in the VCF info field with the tag 'PRECISE', all others are listed as 'IMPRECISE'. Please note that this BCF file contains germline and somatic structural variants but also false positives caused by repeat-induced mis-mappings or incomplete reference sequences.

Querying VCF files

Bcftools offers many possibilities to query and reformat SV calls. For instance, to output a table with the chromosome, start, end, identifier and genotype of each SV we can use:

bcftools query -f "%CHROM\t%POS\t%INFO/END\t%ID[\t%GT]\n" sv.bcf | head

This initial SV calling cannot differentiate somatic and germline structural variants. For instance, the 2 complex variants we looked at before are still present in the delly output. A proximal duplication causes 2 paired-end signatures (deletion-type and duplication-type):

bcftools view sv.bcf chr2:18905691-18907969 | awk '$2>=18905691 && $2<=18907969'

Exercises

  • What is the fraction of deletions that has been called precisely (at single nucleotide resolution) by Delly?
  • Is the other complex structural variant still present in Delly's output file?

Somatic structural variant filtering

Delly's somatic filtering requires a sample file listing tumor and control sample names from the VCF file.

cat spl.tsv

There are many parameters available to tune the somatic structural variant filtering. Below we require a minimum variant allele frequency of 25%, no support in the matched normal and an overall confident structural variant site prediction with the VCF filter field being equal to PASS.

delly filter -p -f somatic -o somatic.bcf -a 0.25 -s spl.tsv sv.bcf

As expected, the somatic SVs have a homozygous reference genotype in the control sample.

bcftools query -f "%CHROM\t%POS\t%INFO/END\t%ID[\t%GT]\n" somatic.bcf

Exercises

  • What is the average size of the somatic SVs?
  • Is there any SV type that is enriched among the somatic SVs?

Complex structural variant visualization

IGV and wally support so-called split views to visualize the breakpoints of long-range SVs greater than 10,000kbp.

bcftools query -f "%CHROM\t%POS\t%INFO/END\t%ID\n" somatic.bcf | awk '$3-$2>10000 {print $1"\t"($2-500)"\t"($2+500)"\t"$4"L\n"$1"\t"($3-500)"\t"($3+500)"\t"$4"R";}' > somatic.bp.bed
wally region -R somatic.bp.bed -s 2 -cp -g chr2.fa tumor.bam control.bam

Such SV breakpoint views cannot reveal any higher-order SV class such as chromothripsis so we need to integrate read-depth with structural variant predictions to get a better overview of complex somatic rearrangements. Let's first create a simple read-depth plot.

delly cnv -u -z 10000 -o cnv.bcf -c cnv.cov.gz -g chr2.fa -m chr2.map.fa tumor.bam 
Rscript cnBafSV.R cnv.cov.gz

Now we can overlay the somatic structural variants on top of the read-depth information.

bcftools query -f "%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE\t%ID\n" somatic.bcf > svs.tsv
Rscript cnBafSV.R cnv.cov.gz svs.tsv

Lastly, we can augment the plots with the allelic depth of SNPs. For amplifications, we would expect the variant allele frequencies to deviate from the expected 50% for heterozygous variants. To speed up things, we only compute SNPs in the region 1 - 50Mbp.

bcftools mpileup -d 50 -r chr2:1-50000000 -a FORMAT/AD -f chr2.fa tumor.bam | bcftools call -mv -Ob -o calls.bcf
bcftools view -g het -m2 -M2 -v snps calls.bcf | bcftools query -f "%POS,[%AD\n]" - | awk 'BEGIN {FS=","} $2+$3>10 {print $1"\t"$3/($2+$3);}' > baf.tsv
Rscript cnBafSV.R cnv.cov.gz svs.tsv baf.tsv 

vc's People

Contributors

maalbadri avatar tobiasrausch avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

tliu76

vc's Issues

some questions

Hi, @tobiasrausch
Thank you for your work and I have some questions about this SV pipline:

coral call
  1. Where can I find this software coral ?
cnBafSV.R
  1. Where can I find this code?

Thank you very much.
Xiucz.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.