The intel-gatk4-germline-snps-indels from ambarishk

Intel Optimized GATK4 Germline SNPs and Indels Variant Calling Workflow.

WORKFLOWS AND JSONS

This repository contains several workflow (WDL) and input (JSON) files, each tuned for a specific requirement.

├── 2T_PairedSingleSampleWf_optimized.inputs.json → WGS Throughput JSON file
├── 56T_PairedSingleSampleWf_optimized.inputs.json → WGS Latency JSON file
├── Exome_2T_PairedSingleSampleWf_optimized.inputs.json → WES Throughput JSON file
├── Exome_56T_PairedSingleSampleWf_optimized.inputs.json → WES Latency JSON file
├── PairedSingleSampleWf_noqc_nocram_optimized.wdl → WGS WDL optimized for on-prem
├── PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl → WGS WDL optimized for on-prem with cleanup of output results (for throughput analysis)
└── Exome_PairedSingleSampleWf_noqc_nocram_optimized.wdl → WES WDL optimized for on-prem

Modify the following lines in the WDL files to reflect the paths where datasets reside in your cluster:

In the JSON files, modify the paths to the datasets and tools so that they match where these reside in your cluster.
Example: modify 56T_PairedSingleSampleWf_optimized.inputs.json to point to your tools directory.
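The path edits described above can also be scripted. The following is a minimal Python sketch, not part of this repository, that replaces a leading path prefix in every string value of an inputs JSON file. The function names `rebase_paths` and `rebase_file` and the prefixes `/old/root` and `/mnt/shared` are hypothetical and used only for illustration; check the result against your own JSON before running the workflow.

```python
import json


def rebase_paths(value, old_prefix, new_prefix):
    """Return a copy of `value` in which every string that starts with
    `old_prefix` has that prefix replaced by `new_prefix`.
    Dicts and lists are walked recursively; other values are returned as-is."""
    if isinstance(value, dict):
        return {k: rebase_paths(v, old_prefix, new_prefix) for k, v in value.items()}
    if isinstance(value, list):
        return [rebase_paths(v, old_prefix, new_prefix) for v in value]
    if isinstance(value, str) and value.startswith(old_prefix):
        return new_prefix + value[len(old_prefix):]
    return value


def rebase_file(json_path, old_prefix, new_prefix):
    """Rewrite the inputs JSON at `json_path` in place and return the new content."""
    with open(json_path) as fh:
        data = json.load(fh)
    updated = rebase_paths(data, old_prefix, new_prefix)
    with open(json_path, "w") as fh:
        json.dump(updated, fh, indent=2)
        fh.write("\n")
    return updated
```

For example, `rebase_file("56T_PairedSingleSampleWf_optimized.inputs.json", "/old/root", "/mnt/shared")` would repoint every value that begins with `/old/root`. Because the match is a plain prefix comparison, include a trailing path separator in `old_prefix` if sibling directories share the same leading characters.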

FPGA CHANGES

Assuming the environment has been set up to offload the pairhmm kernel of HaplotypeCaller to an FPGA, the changes below must be made in the WDL/JSON files (guided by the comments in those files) to make use of the FPGA.

a. In the WDL files, in the runtime section of the HaplotypeCaller task, uncomment the line: require_fpga: "yes"

b. In the JSON file, change the value of "PairedEndSingleSampleWorkflow.gatk_gkl_pairhmm_implementation" from "AVX_LOGLESS_CACHING" to "EXPERIMENTAL_FPGA_LOGLESS_CACHING".
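Step b can be applied programmatically. The sketch below is an illustration only and assumes the inputs JSON is a flat object keyed by fully qualified input names, as the key in step b suggests. The function names are hypothetical; the key and the two values are the ones quoted above.

```python
import json

PAIRHMM_KEY = "PairedEndSingleSampleWorkflow.gatk_gkl_pairhmm_implementation"
CPU_IMPL = "AVX_LOGLESS_CACHING"
FPGA_IMPL = "EXPERIMENTAL_FPGA_LOGLESS_CACHING"


def set_pairhmm_implementation(inputs, use_fpga):
    """Return a copy of the inputs dict with the pairhmm implementation
    set to the FPGA value when `use_fpga` is true, else to the AVX value."""
    updated = dict(inputs)
    updated[PAIRHMM_KEY] = FPGA_IMPL if use_fpga else CPU_IMPL
    return updated


def set_pairhmm_in_file(json_path, use_fpga):
    """Apply the change to the JSON file at `json_path` in place."""
    with open(json_path) as fh:
        inputs = json.load(fh)
    updated = set_pairhmm_implementation(inputs, use_fpga)
    with open(json_path, "w") as fh:
        json.dump(updated, fh, indent=2)
        fh.write("\n")
    return updated
```

This covers only the JSON side; the `require_fpga` line in the WDL runtime section (step a) still has to be uncommented separately.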

DATASETS

The datasets for the WGS workflow can be obtained from: https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped/.

Contact Broad/Intel for access to the WES data needed for this workflow.

The other reference files and resource files can be downloaded from the locations listed in the table below.

Datasets Recommended for Setup and Testing this workflow

| Data Type | Input Name | Filename | File Path |
|---|---|---|---|
| Reference Genome | ref_dict | Homo_sapiens_assembly38.dict | https://console.cloud.google.com/storage/browser/broad-references/hg38/v0 |
| | ref_fasta | Homo_sapiens_assembly38.fasta | |
| | ref_fasta_index | Homo_sapiens_assembly38.fasta.fai | |
| | ref_alt | Homo_sapiens_assembly38.fasta.64.alt | |
| | ref_sa | Homo_sapiens_assembly38.fasta.64.sa | |
| | ref_amb | Homo_sapiens_assembly38.fasta.64.amb | |
| | ref_bwt | Homo_sapiens_assembly38.fasta.64.bwt | |
| | ref_ann | Homo_sapiens_assembly38.fasta.64.ann | |
| | ref_pac | Homo_sapiens_assembly38.fasta.64.pac | |
| | contamination_sites_ud | Homo_sapiens_assembly38.contam.UD | |
| | contamination_sites_bed | Homo_sapiens_assembly38.contam.bed | |
| | contamination_sites_mu | Homo_sapiens_assembly38.contam.mu | |
| Resource Files | dbSNP_vcf | Homo_sapiens_assembly38.dbsnp138.vcf | |
| | dbSNP_vcf_index | Homo_sapiens_assembly38.dbsnp138.vcf.idx | |
| | known_snps_sites_vcf | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | |
| | known_snps_sites_vcf_index | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi | |
| | known_indels_sites_VCFs | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | |
| | known_indels_sites_VCFs | Homo_sapiens_assembly38.known_indels.vcf.gz | |
| | known_indels_sites_indices | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi | |
| | known_indels_sites_indices | Homo_sapiens_assembly38.known_indels.vcf.gz.tbi | |
| Interval Files | wgs_calling_interval_list | wgs_calling_regions.hg38.interval_list (see NOTE below) | |
| | wgs_coverage_interval_list | wgs_coverage_regions.hg38.interval_list | |
| | wgs_evaluation_interval_list | wgs_evaluation_regions.hg38.interval_list | |
| Small Test Input Datasets | flowcell_unmapped_bams | H06HDADXX130110.1.ATCACGAT.20k_reads.bam | https://console.cloud.google.com/storage/browser/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/ |
| | | H06HDADXX130110.2.ATCACGAT.20k_reads.bam | |
| | | H06JUADXX130110.1.ATCACGAT.20k_reads.bam | |

NOTE: The Exome Interval file whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list is hosted at https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/.
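The locations in this section are Google Cloud Console browser URLs. A console URL of the form `https://console.cloud.google.com/storage/browser/<bucket>/<path>` corresponds to the object path `gs://<bucket>/<path>`, which `gsutil cp` accepts. The sketch below is illustrative, not part of this repository: it converts a console URL and builds the copy commands as strings without running them. The function names are hypothetical, and whether each bucket is still publicly readable has not been verified here.

```python
from urllib.parse import urlparse

CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs_uri(url):
    """Convert a Cloud Console storage browser URL to a gs:// URI."""
    path = urlparse(url).path
    if not path.startswith(CONSOLE_PREFIX):
        raise ValueError(f"not a storage browser URL: {url}")
    # Drop the console prefix and any trailing slash.
    return "gs://" + path[len(CONSOLE_PREFIX):].rstrip("/")


def gsutil_copy_commands(console_url, filenames, dest_dir):
    """Build one `gsutil cp` command string per file (commands are not executed)."""
    base = console_url_to_gs_uri(console_url)
    return [f"gsutil cp {base}/{name} {dest_dir}/" for name in filenames]


if __name__ == "__main__":
    url = "https://console.cloud.google.com/storage/browser/broad-references/hg38/v0"
    files = ["Homo_sapiens_assembly38.dict", "Homo_sapiens_assembly38.fasta"]
    for cmd in gsutil_copy_commands(url, files, "/path/to/shared_filesystem"):
        print(cmd)
```

Printing the commands first makes it easy to review them before pasting them into a shell on a machine that has `gsutil` installed.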

TOOLS

For on-prem runs, the workflow uses non-dockerized tools. To match the exact versions released by Broad for their best practices workflow, we copy the tools from the Docker image to our shared file system.

Run the command:

docker run -v /path/to/shared_filesystem:/path/to/shared_filesystem -it broadinstitute/genomes-in-the-cloud:2.3.1-1504795437 /bin/bash

This command pulls the Docker image (if it is not already available locally) and opens a shell inside the container, from which you can copy the tools needed for the workflow.

root@54754360159e:/usr/gitc# cp -r /usr/local/bin/samtools bwa picard.jar /path/to/shared_filesystem
root@54754360159e:/usr/gitc# exit

Similarly, copy the gatk4 folder from the Docker image broadinstitute/gatk:4.0.0.0.
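After copying, it can help to confirm that the tools actually landed on the shared file system. The following is a small Python sketch under stated assumptions: the names `samtools`, `bwa`, and `picard.jar` come from the `cp` command above, while the function name `missing_tools` is hypothetical. The name of the copied gatk4 folder is not given here, so pass it explicitly if you want it checked.

```python
from pathlib import Path

# Tool names taken from the cp command shown in this section.
EXPECTED_TOOLS = ("samtools", "bwa", "picard.jar")


def missing_tools(tools_dir, expected=EXPECTED_TOOLS):
    """Return the expected names that do not exist directly under `tools_dir`."""
    root = Path(tools_dir)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_tools("/path/to/shared_filesystem")
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All expected tools are present.")
```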
