This repository contains several files, each tuned for a specific requirement:
├── 2T_PairedSingleSampleWf_optimized.inputs.json → WGS Throughput JSON file
├── 56T_PairedSingleSampleWf_optimized.inputs.json → WGS Latency JSON file
├── Exome_2T_PairedSingleSampleWf_optimized.inputs.json → WES Throughput JSON file
├── Exome_56T_PairedSingleSampleWf_optimized.inputs.json → WES Latency JSON file
├── PairedSingleSampleWf_noqc_nocram_optimized.wdl → WGS WDL optimized for on-prem
├── PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl → WGS WDL optimized for on-prem with cleanup of output results (for throughput analysis)
└── Exome_PairedSingleSampleWf_noqc_nocram_optimized.wdl → WES WDL optimized for on-prem
In each of the following WDL files, modify the lines that set dataset paths so they reflect where the datasets reside in your cluster:
- PairedSingleSampleWf_noqc_nocram_optimized.wdl
- PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl
- Exome_PairedSingleSampleWf_noqc_nocram_optimized.wdl
In the JSON files, update the paths to the datasets and tools so they match their locations in your cluster.
Example: edit 56T_PairedSingleSampleWf_optimized.inputs.json to set the tools directory.
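The path update can also be scripted instead of edited by hand. The following is a minimal sketch, assuming the inputs JSON stores paths as plain strings or lists of strings that share a common prefix. The helper name `rewrite_paths`, the keys in the sample dictionary, and both prefixes are hypothetical and are not taken from the actual inputs files.

```python
import json


def rewrite_paths(inputs, old_prefix, new_prefix):
    """Return a copy of `inputs` in which every string value that starts
    with `old_prefix` has that prefix replaced by `new_prefix`.
    Lists and nested dictionaries are handled recursively; other values
    (numbers, booleans) are returned unchanged."""
    def fix(value):
        if isinstance(value, str) and value.startswith(old_prefix):
            return new_prefix + value[len(old_prefix):]
        if isinstance(value, list):
            return [fix(item) for item in value]
        if isinstance(value, dict):
            return {key: fix(item) for key, item in value.items()}
        return value

    return fix(inputs)


# Hypothetical sample: these keys and paths are illustrative only.
sample_inputs = {
    "Workflow.tool_path": "/old/location/tools",
    "Workflow.unmapped_bams": ["/old/location/data/sample.bam"],
    "Workflow.threads": 56,
}

updated = rewrite_paths(sample_inputs, "/old/location", "/path/to/shared_filesystem")
print(json.dumps(updated, indent=2))
```

To apply this to a real file, load it with `json.load`, pass the resulting dictionary through `rewrite_paths`, and write the result back with `json.dump`. Keep a backup of the original JSON file first.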
If the environment has been set up to offload the pairhmm kernel of HaplotypeCaller to an FPGA, make the following changes in the WDL/JSON files (guided by the comments in those files) to use the FPGA:
a. In the WDL files, in the runtime section of the HaplotypeCaller task, uncomment the line: require_fpga: "yes"
b. In the JSON file, change the value of "PairedEndSingleSampleWorkflow.gatk_gkl_pairhmm_implementation" from "AVX_LOGLESS_CACHING" to "EXPERIMENTAL_FPGA_LOGLESS_CACHING".
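Both FPGA changes can be sketched in a few lines of Python. The JSON key and the two implementation values come from the steps above. The helper names are hypothetical, and the exact form of the commented-out line in the WDL (a leading `#` before `require_fpga: "yes"`) is an assumption that should be checked against the actual WDL files.

```python
import re

PAIRHMM_KEY = "PairedEndSingleSampleWorkflow.gatk_gkl_pairhmm_implementation"


def enable_fpga_in_inputs(inputs):
    """Return a copy of the inputs dictionary with the pairhmm
    implementation switched to the FPGA variant."""
    updated = dict(inputs)
    updated[PAIRHMM_KEY] = "EXPERIMENTAL_FPGA_LOGLESS_CACHING"
    return updated


def uncomment_require_fpga(wdl_text):
    """Remove the leading '#' from a commented-out require_fpga line.
    Assumption: the line looks like  #require_fpga: "yes"  with optional
    indentation; text that does not match is returned unchanged."""
    pattern = r'^([ \t]*)#[ \t]*(require_fpga:[ \t]*"yes")'
    return re.sub(pattern, r"\1\2", wdl_text, flags=re.MULTILINE)


# In-memory demonstration with a hypothetical runtime section.
inputs = {PAIRHMM_KEY: "AVX_LOGLESS_CACHING"}
wdl = 'runtime {\n    #require_fpga: "yes"\n}\n'

print(enable_fpga_in_inputs(inputs)[PAIRHMM_KEY])
print(uncomment_require_fpga(wdl))
```

Reverting to the CPU path is the inverse: set the key back to "AVX_LOGLESS_CACHING" and comment the `require_fpga` line out again.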
The datasets for the WGS workflow can be obtained from: https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped/.
Contact Broad/Intel for access to the WES data needed for this workflow.
The other reference files and resource files can be downloaded from:
| Data Type | Filename | File Path |
|---|---|---|
| Reference Genome | | |
| ref_dict | Homo_sapiens_assembly38.dict | https://console.cloud.google.com/storage/browser/broad-references/hg38/v0 |
| ref_fasta | Homo_sapiens_assembly38.fasta | |
| ref_fasta_index | Homo_sapiens_assembly38.fasta.fai | |
| ref_alt | Homo_sapiens_assembly38.fasta.64.alt | |
| ref_sa | Homo_sapiens_assembly38.fasta.64.sa | |
| ref_amb | Homo_sapiens_assembly38.fasta.64.amb | |
| ref_bwt | Homo_sapiens_assembly38.fasta.64.bwt | |
| ref_ann | Homo_sapiens_assembly38.fasta.64.ann | |
| ref_pac | Homo_sapiens_assembly38.fasta.64.pac | |
| contamination_sites_ud | Homo_sapiens_assembly38.contam.UD | |
| contamination_sites_bed | Homo_sapiens_assembly38.contam.bed | |
| contamination_sites_mu | Homo_sapiens_assembly38.contam.mu | |
| Resource Files | | |
| dbSNP_vcf | Homo_sapiens_assembly38.dbsnp138.vcf | |
| dbSNP_vcf_index | Homo_sapiens_assembly38.dbsnp138.vcf.idx | |
| known_snps_sites_vcf | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | |
| known_snps_sites_vcf_index | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi | |
| known_indels_sites_VCFs | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | |
| | Homo_sapiens_assembly38.known_indels.vcf.gz | |
| known_indels_sites_indices | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi | |
| | Homo_sapiens_assembly38.known_indels.vcf.gz.tbi | |
| Interval Files | | |
| wgs_calling_interval_list | wgs_calling_regions.hg38.interval_list *SEE NOTE BELOW | |
| wgs_coverage_interval_list | wgs_coverage_regions.hg38.interval_list | |
| wgs_evaluation_interval_list | wgs_evaluation_regions.hg38.interval_list | |
| Small Test Input Datasets | | |
| flowcell_unmapped_bams | H06HDADXX130110.1.ATCACGAT.20k_reads.bam | |
| | H06HDADXX130110.2.ATCACGAT.20k_reads.bam | |
| | H06JUADXX130110.1.ATCACGAT.20k_reads.bam | |
NOTE: The Exome Interval file whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list
is hosted at https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/.
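After downloading, it can help to confirm that nothing is missing before launching a workflow. The following is a minimal sketch, assuming all reference genome files were placed in a single local directory. The helper name `missing_files` and the directory in the example are hypothetical; the filenames are the reference genome entries listed above.

```python
from pathlib import Path

# Reference genome filenames as listed in the table above.
REFERENCE_FILES = [
    "Homo_sapiens_assembly38.dict",
    "Homo_sapiens_assembly38.fasta",
    "Homo_sapiens_assembly38.fasta.fai",
    "Homo_sapiens_assembly38.fasta.64.alt",
    "Homo_sapiens_assembly38.fasta.64.sa",
    "Homo_sapiens_assembly38.fasta.64.amb",
    "Homo_sapiens_assembly38.fasta.64.bwt",
    "Homo_sapiens_assembly38.fasta.64.ann",
    "Homo_sapiens_assembly38.fasta.64.pac",
]


def missing_files(directory, filenames):
    """Return the names from `filenames` that do not exist as regular
    files directly inside `directory`, preserving the input order."""
    root = Path(directory)
    return [name for name in filenames if not (root / name).is_file()]


# Hypothetical location; replace with the directory used on your cluster.
for name in missing_files("/path/to/shared_filesystem/references", REFERENCE_FILES):
    print("missing:", name)
```

The same function works for the resource, interval, and test input files by passing a different list of names.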
For on-prem, the workflow uses non-dockerized tools. To match the exact tool versions that Broad releases for its best practices workflow, we copy the tools out of the docker image onto our shared file system.
Run the command:
docker run -v /path/to/shared_filesystem:/path/to/shared_filesystem -it broadinstitute/genomes-in-the-cloud:2.3.1-1504795437 /bin/bash
This command pulls the docker image (if it is not already available locally) and opens a shell inside the container, from which you can copy the tools needed for the workflow:
root@54754360159e:/usr/gitc# cp -r /usr/local/bin/samtools bwa picard.jar /path/to/shared_filesystem
root@54754360159e:/usr/gitc# exit
Similarly, copy the gatk4 folder from the docker image broadinstitute/gatk:4.0.0.0.