
intel-gatk4-germline-snps-indels's Introduction

Intel Optimized GATK4 Germline SNPs and Indels Variant Calling Workflow.

WORKFLOWS AND JSONS

This repository contains several workflow (WDL) and input (JSON) files, each tuned for a specific requirement.

├── 2T_PairedSingleSampleWf_optimized.inputs.json               WGS Throughput JSON file
├── 56T_PairedSingleSampleWf_optimized.inputs.json              WGS Latency JSON file
├── Exome_2T_PairedSingleSampleWf_optimized.inputs.json         WES Throughput JSON file
├── Exome_56T_PairedSingleSampleWf_optimized.inputs.json        WES Latency JSON file
├── PairedSingleSampleWf_noqc_nocram_optimized.wdl              WGS WDL optimized for on-prem
├── PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl  WGS WDL optimized for on-prem with cleanup of output results (for throughput analysis)
└── Exome_PairedSingleSampleWf_noqc_nocram_optimized.wdl        WES WDL optimized for on-prem

Modify the following lines in the WDL files to reflect the paths where datasets reside in your cluster:

In the JSON files, modify the paths to the datasets and tools where they reside in your cluster.
Example: modify 56T_PairedSingleSampleWf_optimized.inputs.json for tools directory.
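
The path changes in the JSON files can also be scripted. The following is a minimal sketch, not part of this repository: it assumes the dataset and tool paths in an inputs JSON share a common prefix that can be swapped for the location on your cluster. The function names and the example prefixes are hypothetical.

```python
import json
from typing import Any


def rewrite_paths(value: Any, old_prefix: str, new_prefix: str) -> Any:
    """Return a copy of value with old_prefix replaced at the start of every string.

    Strings nested inside lists and dicts are handled recursively;
    numbers, booleans, and nulls are returned unchanged.
    """
    if isinstance(value, str):
        if value.startswith(old_prefix):
            return new_prefix + value[len(old_prefix):]
        return value
    if isinstance(value, list):
        return [rewrite_paths(v, old_prefix, new_prefix) for v in value]
    if isinstance(value, dict):
        return {k: rewrite_paths(v, old_prefix, new_prefix) for k, v in value.items()}
    return value


def rewrite_inputs_file(src: str, dst: str, old_prefix: str, new_prefix: str) -> None:
    """Read an inputs JSON file, rewrite its path prefixes, and write the result to dst."""
    with open(src) as fh:
        inputs = json.load(fh)
    with open(dst, "w") as fh:
        json.dump(rewrite_paths(inputs, old_prefix, new_prefix), fh, indent=2)
```

For example, `rewrite_inputs_file("56T_PairedSingleSampleWf_optimized.inputs.json", "my.inputs.json", "/old/prefix", "/path/to/shared_filesystem")` would write a modified copy and leave the original file untouched. Both prefixes here are placeholders.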

FPGA CHANGES

Assuming the environment has been set up to offload the PairHMM kernel of HaplotypeCaller to an FPGA, the following changes must be made in the WDL and JSON files (see the comments in those files) to make use of the FPGA.

a. In the WDL files, in the runtime section of the HaplotypeCaller task, uncomment the line: require_fpga: "yes"

b. In the JSON file, change the value of "PairedEndSingleSampleWorkflow.gatk_gkl_pairhmm_implementation" from "AVX_LOGLESS_CACHING" to "EXPERIMENTAL_FPGA_LOGLESS_CACHING".
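
Step b can be applied programmatically. This is a minimal sketch assuming the inputs JSON has already been loaded into a Python dict (for example with `json.load`); the key and the two values are taken from the text above, while the function name is hypothetical. The WDL change in step a still has to be made by hand.

```python
PAIRHMM_KEY = "PairedEndSingleSampleWorkflow.gatk_gkl_pairhmm_implementation"
CPU_IMPL = "AVX_LOGLESS_CACHING"
FPGA_IMPL = "EXPERIMENTAL_FPGA_LOGLESS_CACHING"


def enable_fpga_pairhmm(inputs: dict) -> dict:
    """Return a copy of the inputs with the PairHMM implementation set to the FPGA one.

    Raises ValueError if the key is missing or holds an unexpected value,
    so that an unrelated JSON file is not modified silently.
    """
    current = inputs.get(PAIRHMM_KEY)
    if current not in (CPU_IMPL, FPGA_IMPL):
        raise ValueError(f"unexpected value for {PAIRHMM_KEY}: {current!r}")
    updated = dict(inputs)  # shallow copy; the caller's dict is left unchanged
    updated[PAIRHMM_KEY] = FPGA_IMPL
    return updated
```

Returning a copy keeps the original CPU configuration available, which makes it simple to write both variants to separate files.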

DATASETS

The datasets for the WGS workflow can be obtained from: https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped/.

Contact Broad/Intel for access to the WES data needed for this workflow.

The other reference files and resource files can be downloaded from:

Datasets Recommended for Setup and Testing this Workflow

Data Type                     Filename

Reference Genome, Resource Files, and Interval Files
File Path: https://console.cloud.google.com/storage/browser/broad-references/hg38/v0

Reference Genome
ref_dict                      Homo_sapiens_assembly38.dict
ref_fasta                     Homo_sapiens_assembly38.fasta
ref_fasta_index               Homo_sapiens_assembly38.fasta.fai
ref_alt                       Homo_sapiens_assembly38.fasta.64.alt
ref_sa                        Homo_sapiens_assembly38.fasta.64.sa
ref_amb                       Homo_sapiens_assembly38.fasta.64.amb
ref_bwt                       Homo_sapiens_assembly38.fasta.64.bwt
ref_ann                       Homo_sapiens_assembly38.fasta.64.ann
ref_pac                       Homo_sapiens_assembly38.fasta.64.pac
contamination_sites_ud        Homo_sapiens_assembly38.contam.UD
contamination_sites_bed       Homo_sapiens_assembly38.contam.bed
contamination_sites_mu        Homo_sapiens_assembly38.contam.mu

Resource Files
dbSNP_vcf                     Homo_sapiens_assembly38.dbsnp138.vcf
dbSNP_vcf_index               Homo_sapiens_assembly38.dbsnp138.vcf.idx
known_snps_sites_vcf          Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
known_snps_sites_vcf_index    Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
known_indels_sites_VCFs       Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
                              Homo_sapiens_assembly38.known_indels.vcf.gz
known_indels_sites_indices    Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
                              Homo_sapiens_assembly38.known_indels.vcf.gz.tbi

Interval Files
wgs_calling_interval_list     wgs_calling_regions.hg38.interval_list *SEE NOTE BELOW
wgs_coverage_interval_list    wgs_coverage_regions.hg38.interval_list
wgs_evaluation_interval_list  wgs_evaluation_regions.hg38.interval_list

Small Test Input Datasets
File Path: https://console.cloud.google.com/storage/browser/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/

flowcell_unmapped_bams        H06HDADXX130110.1.ATCACGAT.20k_reads.bam
                              H06HDADXX130110.2.ATCACGAT.20k_reads.bam
                              H06JUADXX130110.1.ATCACGAT.20k_reads.bam

NOTE: The Exome Interval file whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list is hosted at https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/.
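
The dataset locations above are given as Cloud Console browser URLs. Command-line tools such as gsutil address the same buckets with gs:// URIs, where the part after /storage/browser/ is the bucket name followed by the object path. The helper below is a small sketch of that conversion; the function name is hypothetical and the code only manipulates strings, it does not contact Google Cloud.

```python
from urllib.parse import urlparse

CONSOLE_HOST = "console.cloud.google.com"
BROWSER_PREFIX = "/storage/browser/"


def console_url_to_gs_uri(url: str) -> str:
    """Convert a Cloud Console storage browser URL into the equivalent gs:// URI."""
    parsed = urlparse(url)
    if parsed.netloc != CONSOLE_HOST or not parsed.path.startswith(BROWSER_PREFIX):
        raise ValueError(f"not a Cloud Storage browser URL: {url}")
    # What remains is "<bucket>/<object path>"; drop leading/trailing slashes.
    bucket_and_path = parsed.path[len(BROWSER_PREFIX):].strip("/")
    if not bucket_and_path:
        raise ValueError(f"no bucket name in URL: {url}")
    return "gs://" + bucket_and_path
```

Applied to the reference genome location, this yields `gs://broad-references/hg38/v0`.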

TOOLS

For on-prem, the workflow uses non-dockerized tools. To match the exact versions released by Broad for their best practices workflow, we copy the tools out of the Docker image onto our shared file system.

Run the command:

docker run -v /path/to/shared_filesystem:/path/to/shared_filesystem -it broadinstitute/genomes-in-the-cloud:2.3.1-1504795437 /bin/bash

This command pulls the Docker image (if it is not already available locally) and opens a shell inside the container, from which you can copy the tools needed for the workflow.

root@54754360159e:/usr/gitc# cp -r /usr/local/bin/samtools bwa picard.jar /path/to/shared_filesystem
root@54754360159e:/usr/gitc# exit

Similarly, copy the gatk4 folder from the Docker image broadinstitute/gatk:4.0.0.0.
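
After copying, it can be useful to confirm that the tools landed where the JSON files expect them. The check below is a minimal sketch: it assumes the three names from the cp command above (samtools, bwa, picard.jar) sit directly in the target directory. The function name is hypothetical, and the gatk4 folder is not checked because its name on disk is not specified here.

```python
from pathlib import Path

# Names taken from the cp command above; extend as needed for your setup.
EXPECTED_TOOLS = ("samtools", "bwa", "picard.jar")


def missing_tools(tools_dir, expected=EXPECTED_TOOLS):
    """Return the expected tool names that are not present in tools_dir."""
    base = Path(tools_dir)
    return [name for name in expected if not (base / name).exists()]
```

An empty list from `missing_tools("/path/to/shared_filesystem")` means all three entries were found.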

intel-gatk4-germline-snps-indels's People

Contributors

aprabhak2, knoblett
