scilifelab / sarek

This project forked from nf-core/sarek


Detect germline or somatic variants from normal or tumour/normal whole-genome or targeted sequencing

Home Page: https://nf-co.re/sarek

License: MIT License

Python 14.71% R 26.63% Groovy 2.31% Shell 6.30% Awk 1.42% Nextflow 47.89% Dockerfile 0.74%
nextflow workflow pipeline bioinformatics genomics cancer next-generation-sequencing docker singularity reproducible-science reproducible-research containers germline germline-variants somatic somatic-variants

sarek's Introduction

Sarek

This pipeline has moved!

This pipeline has been moved to the new nf-core project. You can now find it here:

https://github.com/nf-core/sarek

If you have any problems with the pipeline, please create an issue at the above repository instead.

To find out more about nf-core, visit http://nf-co.re/

This repository will be archived to maintain the released versions for future reruns, in the spirit of full reproducibility.

If you have any questions, please get in touch: [email protected]

// Maxime Garcia, 2020-01-27

An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing

Badges: Nextflow version · Travis build status · Gitter chat (https://gitter.im/SciLifeLab/Sarek) · MIT License · Sarek version · DOI · Install with bioconda · Docker container available

Introduction

Previously known as the Cancer Analysis Workflow (CAW), Sarek is a workflow designed to run analyses on WGS data from regular samples or tumour / normal pairs, including relapse samples if required.

It's built using Nextflow, a domain-specific language for workflow building. Software dependencies are handled using Docker or Singularity, container technologies that provide excellent reproducibility and ease of use; Singularity was designed specifically for high-performance computing environments. Although Sarek was primarily designed for use with the Swedish UPPMAX HPC systems, it should run on any system that supports either of these two tools.
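As a minimal sketch, switching container engines uses the standard Nextflow -profile mechanism (the profile names and the --sample flag below are illustrative, not confirmed against Sarek's actual parameters):

    # pick the container engine per run; flags illustrative
    nextflow run SciLifeLab/Sarek --sample samples.tsv -profile docker
    # on shared HPC systems without root access, use Singularity instead
    nextflow run SciLifeLab/Sarek --sample samples.tsv -profile singularity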

Sarek was developed at the National Genomics Infrastructure and National Bioinformatics Infrastructure Sweden, which are both platforms at SciLifeLab. It is listed on the Elixir - Tools and Data Services Registry.

Workflow steps

Sarek is built with several workflow scripts. A wrapper script contained within the repository makes it easy to run the different workflow scripts as a single job. To test your installation, follow the tests documentation.

Raw FastQ files or aligned BAM files (with or without realignment & recalibration) can be used as inputs. You can choose which variant callers to use, and the pipeline can accommodate additional variant calling software or CNV callers if required.

The workflow steps and tools used are as follows (a sketch of how they chain together follows the list):

  1. Preprocessing - main.nf (based on GATK best practices)
  2. Germline variant calling - germlineVC.nf
  3. Somatic variant calling - somaticVC.nf (optional)
  4. Annotation - annotate.nf (optional)
  5. Reporting - runMultiQC.nf
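A hedged sketch of how these scripts chain together; the flag names are illustrative, see the command line parameters documentation for the real ones:

    # hypothetical end-to-end run (flags illustrative)
    nextflow run main.nf --sample samples.tsv           # 1. preprocessing
    nextflow run germlineVC.nf --tools HaplotypeCaller  # 2. germline variant calling
    nextflow run somaticVC.nf --tools MuTect2,Strelka   # 3. somatic variant calling
    nextflow run annotate.nf --tools snpEff             # 4. annotation
    nextflow run runMultiQC.nf                          # 5. reporting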

Documentation

The Sarek pipeline comes with documentation in the docs/ directory:

  1. Installation documentation
  2. Installation documentation specific for UPPMAX rackham
  3. Installation documentation specific for UPPMAX bianca
  4. Tests documentation
  5. Reference files documentation
  6. Configuration and profiles documentation
  7. Intervals documentation
  8. Running the pipeline
  9. Running the pipeline using Conda
  10. Command line parameters
  11. Examples
  12. Input files documentation
  13. Processes documentation
  14. Documentation about containers
  15. Complementary information about ASCAT
  16. Complementary information about annotations
  17. Output documentation structure

Contributions & Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on Gitter or contact us: [email protected], [email protected]

CHANGELOG

Credits

Main authors:

Helpful contributors:


SciLifeLab NGI NBIS

sarek's People

Contributors

alneberg · apeltzer · arontommi · ewels · gongyixiao · j35p312 · malinlarsson · marcelm · maxulysse · nilesh-tawari-elanco · pallolason · pditommaso · sebastian-d · waffle-iron


sarek's Issues

Use recalibrated files

For the DREAM challenge we have recalibrated files already. We need an option to use recalibrated files to run benchmarks.

VarDictJava

Problem with VarDictJava and our test data: it doesn't even want to launch. Based on the help text, I believe the problem is the -z 1 parameter:

-z 0/1
   Indicate whether is zero-based coordinates, as IGV does.  Default: 1 for BED file or amplicon BED file.  Use 0 to turn it off.
   When use -R option, it's set to 0

Because we're using it with -R chr17:1000000-1100000

But when I remove -z 1, I get a lot of errors everywhere:

Use of uninitialized value $exon in substitution (s///) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 567.
Use of uninitialized value in addition (+) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value $exon in substr at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value in addition (+) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value $exon in substitution (s///) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 567.
Use of uninitialized value in addition (+) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value $exon in substr at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value in addition (+) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.

and then a whole lot of:

Use of uninitialized value in join or string at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 1489.
Use of uninitialized value in join or string at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 1490.
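A hedged sketch of the suspected fix: per the help text above, coordinates are already treated as 0-based when -R is given, so -z should simply be left out (sample name, BAM paths and reference are illustrative):

    # illustrative VarDict invocation: drop -z when using -R
    vardict -G human_g1k_v37_decoy.fasta -N tumor \
        -b "tumor.bam|normal.bam" -R chr17:1000000-1100000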

Watch and correct efficiency

I had a look at your hours and it looks like you are running at about 50% efficiency overall, so roughly half the hours I give you will not be used. Many of the Nextflow processes you run should probably be modified to run on a single core instead of a whole node; the mergeBam and MarkDuplicates jobs, for example, should probably run on only 1-2 cores. Please have a look at this.
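For example, a hedged sketch of what the directive change could look like (process name and values illustrative, not the actual Sarek code):

    process MarkDuplicates {
        cpus 2          // hypothetical cap: these jobs don't scale past a couple of cores,
        memory '16 GB'  // so don't reserve a whole node for them

        script:
        """
        ...
        """
    }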

ASCAT

Get ASCAT to work with .bam files as input, and call it in Nextflow

Use GATK merge

Instead of samtools: the point is that the older samtools versions found on some platforms treat read groups incorrectly, so we have to use MergeSamFiles from Picard instead.
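A minimal sketch with Picard's MergeSamFiles (file names illustrative):

    # Picard merges per-lane BAMs while handling read groups correctly
    java -jar picard.jar MergeSamFiles \
        INPUT=sample_L001.bam INPUT=sample_L002.bam \
        OUTPUT=sample.merged.bam SORT_ORDER=coordinate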

problem with fastq files

Instrument IDs need to be fixed in:
HCC1954.normal_S22_L001_R1_001.fastq.gz
HCC1954.normal_S22_L001_R2_001.fastq.gz
HCC1954.tumor2_S24_L002_R1_001.fastq.gz
HCC1954.tumor2_S24_L002_R2_001.fastq.gz
HCC1954.tumor3_S25_L001_R1_001.fastq.gz
HCC1954.tumor3_S25_L001_R2_001.fastq.gz

Write a comprehensive README

The README.md should contain a guide about the stages: how to start preprocessing, how to run variant calling without preprocessing, etc.

TCGA benchmark S1

Start benchmarking on the simplest dataset S1 for pure normal/tumor data

Add MuTect1

It seems MuTect2 is in beta, and will be for a while. To get SNPs, we have to add MuTect1.

Simulate SV data for validation

To validate Manta/Delly/Lumpy we need a set of known structural variants. To keep it relatively fast, we first need a data set with known variants, including deletions, insertions, inversions, translocations and duplications on a small reference:

  • three chromosomes only
  • each from HG19
  • each chromosome is a ~1Mb subset
  • reads are simulated by either wgsim or stampy (see the sketch after this list)
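A hedged wgsim sketch (read counts, lengths and file names illustrative):

    # simulate paired-end reads from the small reference
    # -N = number of read pairs, -1/-2 = read lengths
    wgsim -N 100000 -1 100 -2 100 small_ref.fasta sim_R1.fastq sim_R2.fastq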

Update cosmic version

Updated the COSMIC database version from v54 to v74 in /pica/data/uppnex/ToolBox/ReferenceAssemblies/hg38make/bundle/2.8/b37 by adding b37_cosmic_v74.vcf and b37_cosmic_v74.vcf.idx.

Change the pointers in milou.config to the new files.
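A hedged sketch of the milou.config change (the parameter names here are hypothetical, not the actual config keys):

    // illustrative config pointers to the new COSMIC files
    params.cosmic      = "${params.genomeDir}/b37_cosmic_v74.vcf"
    params.cosmicIndex = "${params.genomeDir}/b37_cosmic_v74.vcf.idx"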

Strelka has problem with the reference index

Tried to run Strelka on our test samples and ran into this error:
ERROR: Unexpected format for line number '1' of fasta index file: '/pica/data/uppnex/ToolBox/ReferenceAssemblies/hg38make/bundle/2.8/b37/human_g1k_v37_decoy.fasta.fai' Re-running fasta indexing may fix the issue. To do so, run: "samtools faidx /pica/data/uppnex/ToolBox/ReferenceAssemblies/hg38make/bundle/2.8/b37/human_g1k_v37_decoy.fasta"

Check UNIX pipes

We are using UNIX pipes (e.g. in Mapping), and when an early stage fails, only the exit status of the last stage is picked up by Nextflow. Better to check PIPESTATUS (or set pipefail) and fail early.
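A minimal sketch of the fix, assuming bash inside the process script (tool arguments illustrative):

    # with pipefail, the pipe's exit status is that of the first failing stage,
    # so Nextflow sees the failure instead of the last command's status
    set -o pipefail
    bwa mem -t 8 reference.fasta reads_R1.fastq reads_R2.fastq | samtools sort -o mapped.bam -
    # alternatively, inspect every stage right after the pipe:
    # echo "exit codes per stage: ${PIPESTATUS[@]}"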

Remove hard-coded parameters

Reporting some issues I've had with the workflow.

First: Line 1086 of MultiFQtoVC.nf should probably be erased.
https://github.com/SciLifeLab/CAW/blob/master/MultiFQtoVC.nf#L1086

Then, would it be possible to take all the blocks like

    cpus 8
    memory { 16.GB * task.attempt }
    time { 20.h * task.attempt }
    errorStrategy { task.exitStatus == 143 ? 'retry' : 'terminate' }
    maxRetries 3
    maxErrors '-1'

(for example: https://github.com/SciLifeLab/CAW/blob/master/MultiFQtoVC.nf#L324-L329)
and turn them into a nodeSetup() function, or similar?

So that it is not hard-coded.

Or something like

    cpus params.nodeCPUs
    memory { params.nodeRAM * task.attempt }
    time { params.nodeTimeSlice * task.attempt }
    errorStrategy { task.exitStatus == 143 ? 'retry' : 'terminate' }
    maxRetries 3
    maxErrors '-1'
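Alternatively, a hedged sketch of pulling the defaults into nextflow.config through the process scope, so the scripts themselves carry no hard-coded values:

    // nextflow.config sketch: one place to tune resources for every process
    process {
        cpus = 8
        memory = { 16.GB * task.attempt }
        time = { 20.h * task.attempt }
        errorStrategy = { task.exitStatus == 143 ? 'retry' : 'terminate' }
        maxRetries = 3
        maxErrors = -1
    }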

Set of variant callers integrated

List of somatic callers at https://omictools.com/somatic-snp-detection-category

DREAM Challenge runs

These runs are to get results from Strelka and MuTect1 from all the DREAM datasets

Use GATK 3.6 only

Now we are using different GATK versions for recalibration/realignment and for variant calling (HaplotypeCaller and MuTect2). The v3.6 jar should contain all these tools, so we should use only that, and only the variable gatkHome (not mutect2Home).
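A hedged sketch of the result (paths illustrative, remaining arguments elided): every tool comes from the same v3.6 jar, addressed through the single gatkHome variable:

    # one jar, one variable; -T selects the tool
    java -jar $gatkHome/GenomeAnalysisTK.jar -T BaseRecalibrator ...
    java -jar $gatkHome/GenomeAnalysisTK.jar -T HaplotypeCaller ...
    java -jar $gatkHome/GenomeAnalysisTK.jar -T MuTect2 ...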

Make an error if a file is missing in the tsv file

I moved my .tsv file, so the links to the fastq files were not right anymore, but the workflow still managed to run until the CreateIntervals process.
So we should definitely add a verification of the files listed in the .tsv file.
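A minimal sketch of such a check, in DSL1 style (helper name hypothetical):

    // hypothetical helper: abort before any process starts if a path is missing
    def checkTsvPath(path) {
        def f = file(path)
        if (!f.exists()) exit 1, "Missing file listed in TSV: ${path}"
        return f
    }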

Skipping IndelRealignment

According to multiple blog posts, indel realignment is no longer necessary for many variant callers. We should investigate once a couple of VCs are integrated.

Assemble new test data based on TCGA benchmark

Find suitable samples/files from the TCGA dataset that can be used as a defined test set. It should include a normal sample and several tumor samples from the same individual. We should split the fastq files into multiple files per sample, and rename them so that they mimic the file structure at NGI. Based on this we will then parse the read group information from the fastq file name according to the description in CAW/doc/files_and_folder_info.md.

The full TCGA data set is described here: http://hgwdev.soe.ucsc.edu/~ewingad/benchmark4/instructions.pdf

The HCC1954 tumor sample has a mutation in TP53, which could be a good test case.
This somatic mutation was detected in a previous MuTect1 run (normal in the first genotype column, tumor in the second):
7578442 rs148924904 T C . PASS AC=5;AN=11;DB;SF=0,1;SOMATIC;VT=SNP GT:BQ:DP:FA:SS:AD 0:.:87:0.00:0:86,0 0/1:34:59:1.00:2:0,59

Good 100 000 bp region to extract is: chr17:7528442-7628442

Data will be extracted from the following bam files:
"normal"
/sw/data/uppnex/ToolBox/TCGAbenchmark/34c9ff85-c2f8-45dc-b4aa-fba05748e355/G15512.HCC1954_BL.1.bam
"tumor1"
/sw/data/uppnex/ToolBox/TCGAbenchmark/6d8044f7-3f63-487c-9191-adfeed4e74d3/G15512.HCC1954.1.bam
"tumor2" (60% tumor, 40% normal)
/sw/data/uppnex/ToolBox/TCGAbenchmark/02d8b3de-b043-4bfa-9130-adc18195313f/HCC1954.mix1.n40t60.bam
"tumor3" (20% tumor, 80% normal)
/sw/data/uppnex/ToolBox/TCGAbenchmark/360b4736-6c5e-48df-af58-c1cf51609350/HCC1954.mix1.n80t20.bam

Data extraction
Extract bam files for the above region:

samtools view -bh /sw/data/uppnex/ToolBox/TCGAbenchmark/34c9ff85-c2f8-45dc-b4aa-fba05748e355/G15512.HCC1954_BL.1.bam 17:7528442-7538442 | samtools sort -n -m 4G - HCC1954.part.normal
samtools view -bh /sw/data/uppnex/ToolBox/TCGAbenchmark/6d8044f7-3f63-487c-9191-adfeed4e74d3/G15512.HCC1954.1.bam 17:7528442-7538442 | samtools sort -n -m 4G - HCC1954.part.tumor1
samtools view -bh /sw/data/uppnex/ToolBox/TCGAbenchmark/02d8b3de-b043-4bfa-9130-adc18195313f/HCC1954.mix1.n40t60.bam 17:7528442-7538442 | samtools sort -n -m 4G - HCC1954.part.tumor2
samtools view -bh /sw/data/uppnex/ToolBox/TCGAbenchmark/360b4736-6c5e-48df-af58-c1cf51609350/HCC1954.mix1.n80t20.bam 17:7528442-7538442 | samtools sort -n -m 4G - HCC1954.part.tumor3

Convert to fastq:

bedtools bamtofastq -i HCC1954.part.normal.bam -fq HCC1954.part.normal_R1.fastq -fq2 HCC1954.part.normal_R2.fastq
bedtools bamtofastq -i HCC1954.part.tumor1.bam -fq HCC1954.part.tumor1_R1.fastq -fq2 HCC1954.part.tumor1_R2.fastq
bedtools bamtofastq -i HCC1954.part.tumor2.bam -fq HCC1954.part.tumor2_R1.fastq -fq2 HCC1954.part.tumor2_R2.fastq
bedtools bamtofastq -i HCC1954.part.tumor3.bam -fq HCC1954.part.tumor3_R1.fastq -fq2 HCC1954.part.tumor3_R2.fastq

Split each fastq into two smaller files that mimic "lanes", with file names that look like what comes out of NGI:
sample_index_lane_R1_001.fastq.gz
for example
5-D-91-01_S22_L005_R1_001.fastq.gz
Where
5-D-91-01 = sample
S22 = index
L005 = lane
and the ending "_001.fastq.gz" should always be there

The fastq files have the following numbers of lines:
HCC1954.part.normal_R1.fastq 20012 (same for HCC1954.part.normal_R2.fastq)
HCC1954.part.tumor1_R1.fastq 13116
HCC1954.part.tumor2_R1.fastq 6592
HCC1954.part.tumor3_R1.fastq 7316

head -n 10000 HCC1954.part.normal_R1.fastq > HCC1954.normal_S22_L001_R1_001.fastq
head -n 10000 HCC1954.part.normal_R2.fastq > HCC1954.normal_S22_L001_R2_001.fastq 
tail -n 10012 HCC1954.part.normal_R1.fastq >  HCC1954.normal_S22_L002_R1_001.fastq 
tail -n 10012 HCC1954.part.normal_R2.fastq >  HCC1954.normal_S22_L002_R2_001.fastq 

head -n 7000 HCC1954.part.tumor1_R1.fastq > HCC1954.tumor1_S23_L001_R1_001.fastq 
head -n 7000 HCC1954.part.tumor1_R2.fastq > HCC1954.tumor1_S23_L001_R2_001.fastq
tail -n 6116 HCC1954.part.tumor1_R1.fastq >  HCC1954.tumor1_S23_L002_R1_001.fastq 
tail -n 6116 HCC1954.part.tumor1_R2.fastq >  HCC1954.tumor1_S23_L002_R2_001.fastq 

head -n 3296 HCC1954.part.tumor2_R1.fastq > HCC1954.tumor2_S24_L001_R1_001.fastq 
head -n 3296 HCC1954.part.tumor2_R2.fastq > HCC1954.tumor2_S24_L001_R2_001.fastq
tail -n 3296 HCC1954.part.tumor2_R1.fastq >  HCC1954.tumor2_S24_L002_R1_001.fastq 
tail -n 3296 HCC1954.part.tumor2_R2.fastq >  HCC1954.tumor2_S24_L002_R2_001.fastq 

head -n 3600 HCC1954.part.tumor3_R1.fastq > HCC1954.tumor3_S25_L001_R1_001.fastq 
head -n 3600 HCC1954.part.tumor3_R2.fastq > HCC1954.tumor3_S25_L001_R2_001.fastq
tail -n 3716 HCC1954.part.tumor3_R1.fastq >  HCC1954.tumor3_S25_L002_R1_001.fastq 
tail -n 3716 HCC1954.part.tumor3_R2.fastq >  HCC1954.tumor3_S25_L002_R2_001.fastq 

Finally
gzip *_001.fastq

The final gzipped files are stored on Milou in
/sw/data/uppnex/ToolBox/TCGAbenchmark/chr17_multilane/

Annotation of results

Select/implement a set of tools that annotate the result VCFs before ranking (a usage sketch follows the list), such as:

  • snpEff
  • vcfanno
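For instance, a minimal snpEff sketch (database and file names illustrative):

    # annotate a result VCF against a snpEff database
    java -jar snpEff.jar GRCh37.75 results.vcf > results.snpeff.vcf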

Get rid of warning Duplicate output channel name

WARN: Duplicate output channel name: 'tiny' in the script context -- it's worth to rename it to avoid possible conflicts
tiny is the patientId. The warning message is not very helpful, so I don't know where this problem comes from; still looking...
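If the collision turns out to be between output channel names, a hedged DSL1 sketch of the usual fix is to give every process a distinct into target instead of reusing an identifier like the patientId:

    // channel names are global in DSL1, so make each output name unique
    output:
    set idPatient, file("${idPatient}.bam") into bamForGermlineVC  // not into 'tiny'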

Process arbitrary number of fastq input files

We may have to process an arbitrary number of fastq pairs:

  • pair sample (from multiplexing)
  • multiple samples: normal, tumor, tumor 2 ... tumor N

We probably need a channel of fastq pairs as input, and we need to keep track of read group (RG) info and tumor/normal status...
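A hedged DSL1 sketch of such a channel, with a hypothetical TSV column order:

    // hypothetical columns: patient, tumor/normal status, sample, lane, R1, R2
    fastqPairs = Channel.fromPath(params.sample)
        .splitCsv(sep: '\t')
        .map { row ->
            def (idPatient, status, idSample, idLane, fq1, fq2) = row
            tuple(idPatient, status, idSample, idLane, file(fq1), file(fq2))
        }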

remove threading from MuTect2.

MuTect2 has a bug in its threading mode (-nct).
It is possible to use the threading parameter -nct 8 in the call to MuTect2, but in my experience (Malin) it doesn't work. I always get the following error (or similar) when I use the -nct parameter:

ERROR --
ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Somehow the requested coordinate is not covered by the read. Alignment
60055114 | 38S113M
at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:580)
at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:436)
at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:427)
at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:548)
at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:197)
at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipToRegion(ReadClipper.java:377)
at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipToRegion(ReadClipper.java:352)
at org.broadinstitute.gatk.utils.activeregion.ActiveRegion.trim(ActiveRegion.java:482)
at org.broadinstitute.gatk.utils.activeregion.ActiveRegion.trim(ActiveRegion.java:437)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ActiveRegionTrimmer$Result.getCallableRegion(ActiveRegionTrimmer.java:383)
at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:551)
at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:176)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Alignment 60055114 | 38S113M
ERROR ------------------------------------------------------------------------------------------

The problem is discussed on GATK forum:
http://gatkforums.broadinstitute.org/gatk/discussion/6936/mutect2-error-message-somehow-the-requested-coordinate-is-not-covered-by-the-read-too-many-deleti

Solution: Remove the -nct parameter from the call to MuTect2 and instead manually split the analysis into one job per chromosome arm. I think we already do this, don't we?
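A hedged sketch of the per-interval approach (channel, parameter and file names illustrative):

    // one MuTect2 job per interval instead of -nct threading
    process RunMutect2 {
        input:
        set idPatient, file(bamNormal), file(bamTumor) from bamPairs
        each interval from intervals

        script:
        """
        java -jar GenomeAnalysisTK.jar -T MuTect2 \\
            -R ${params.genomeFasta} \\
            -I:normal ${bamNormal} \\
            -I:tumor ${bamTumor} \\
            -L ${interval} \\
            -o mutect2.${interval}.vcf
        """
    }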

Add trace and timeline default

Nextflow can generate trace (process state) and timeline files via the -with-trace and -with-timeline options. These should be switched on by default.
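A hedged sketch of switching them on permanently via nextflow.config (file names illustrative):

    // nextflow.config: always write the trace and timeline reports
    trace {
        enabled = true
        file = 'trace.txt'
    }
    timeline {
        enabled = true
        file = 'timeline.html'
    }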
