scilifelab / sarek

This project forked from nf-core/sarek


Detect germline or somatic variants from normal or tumour/normal whole-genome or targeted sequencing

Home Page: https://nf-co.re/sarek

License: MIT License

Python 14.71% R 26.63% Groovy 2.31% Shell 6.30% Awk 1.42% Nextflow 47.89% Dockerfile 0.74%
nextflow workflow pipeline bioinformatics genomics cancer next-generation-sequencing docker singularity reproducible-science reproducible-research containers germline germline-variants somatic somatic-variants

sarek's Introduction

Sarek

This pipeline has moved!

This pipeline has been moved to the new nf-core project. You can now find it here:

https://github.com/nf-core/sarek

If you have any problems with the pipeline, please create an issue at the above repository instead.

To find out more about nf-core, visit http://nf-co.re/

This repository will be archived to maintain the released versions for future reruns, in the spirit of full reproducibility.

If you have any questions, please get in touch: [email protected]

// Maxime Garcia, 2020-01-27

An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing

Badges: Nextflow version · Travis build status · Gitter chat (https://gitter.im/SciLifeLab/Sarek) · MIT License · Sarek version · DOI · Install with bioconda · Docker container available

Introduction

Previously known as the Cancer Analysis Workflow (CAW), Sarek is a workflow designed to run analyses on WGS data from regular samples or tumour / normal pairs, including relapse samples if required.

It's built using Nextflow, a domain-specific language for workflow building. Software dependencies are handled using Docker or Singularity, container technologies that provide excellent reproducibility and ease of use; Singularity was designed specifically for high-performance computing environments. Although Sarek was primarily designed for use with the Swedish UPPMAX HPC systems, it should run on any system that supports either of these two tools.
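As a minimal sketch, switching container engines uses the standard Nextflow -profile mechanism (the profile names and the --sample flag below are illustrative, not confirmed against Sarek's actual parameters):

    # pick the container engine per run; flags illustrative
    nextflow run SciLifeLab/Sarek --sample samples.tsv -profile docker
    # on shared HPC systems without root access, use Singularity instead
    nextflow run SciLifeLab/Sarek --sample samples.tsv -profile singularity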

Sarek was developed at the National Genomics Infrastructure and National Bioinformatics Infrastructure Sweden, which are both platforms at SciLifeLab. It is listed on the Elixir - Tools and Data Services Registry.

Workflow steps

Sarek is built with several workflow scripts. A wrapper script contained within the repository makes it easy to run the different workflow scripts as a single job. To test your installation, follow the tests documentation.

Raw FastQ files or aligned BAM files (with or without realignment & recalibration) can be used as inputs. You can choose which variant callers to use, and the pipeline can accommodate additional variant calling software or CNV callers if required.

The workflow steps and tools used are as follows (a sketch of how they chain together follows the list):

  1. Preprocessing - main.nf (based on GATK best practices)
  2. Germline variant calling - germlineVC.nf
  3. Somatic variant calling - somaticVC.nf (optional)
  4. Annotation - annotate.nf (optional)
  5. Reporting - runMultiQC.nf
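A hedged sketch of how these scripts chain together; the flag names are illustrative, see the command line parameters documentation for the real ones:

    # hypothetical end-to-end run (flags illustrative)
    nextflow run main.nf --sample samples.tsv           # 1. preprocessing
    nextflow run germlineVC.nf --tools HaplotypeCaller  # 2. germline variant calling
    nextflow run somaticVC.nf --tools MuTect2,Strelka   # 3. somatic variant calling
    nextflow run annotate.nf --tools snpEff             # 4. annotation
    nextflow run runMultiQC.nf                          # 5. reporting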

Documentation

The Sarek pipeline comes with documentation in the docs/ directory:

  1. Installation documentation
  2. Installation documentation specific for UPPMAX rackham
  3. Installation documentation specific for UPPMAX bianca
  4. Tests documentation
  5. Reference files documentation
  6. Configuration and profiles documentation
  7. Intervals documentation
  8. Running the pipeline
  9. Running the pipeline using Conda
  10. Command line parameters
  11. Examples
  12. Input files documentation
  13. Processes documentation
  14. Documentation about containers
  15. Complementary information about ASCAT
  16. Complementary information about annotations
  17. Output documentation structure

Contributions & Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on Gitter or contact us: [email protected], [email protected]

CHANGELOG

Credits

Main authors:

Helpful contributors:


SciLifeLab NGI NBIS

sarek's People

Contributors

alneberg · apeltzer · arontommi · ewels · gongyixiao · j35p312 · malinlarsson · marcelm · maxulysse · nilesh-tawari-elanco · pallolason · pditommaso · sebastian-d · waffle-iron


sarek's Issues

Use recalibrated files

For the DREAM challenge we have recalibrated files already. We need an option to use recalibrated files to run benchmarks.

VarDictJava

Problem with VarDictJava and our test data: it doesn't even want to launch. Based on the help text, I believe the problem is the -z 1 parameter:

-z 0/1
   Indicate whether is zero-based coordinates, as IGV does.  Default: 1 for BED file or amplicon BED file.  Use 0 to turn it off.
   When use -R option, it's set to 0

Because we're using it with -R chr17:1000000-1100000

But when I remove -z 1, I get a lot of errors everywhere:

Use of uninitialized value $exon in substitution (s///) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 567.
Use of uninitialized value in addition (+) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value $exon in substr at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value in addition (+) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value $exon in substitution (s///) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 567.
Use of uninitialized value in addition (+) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value $exon in substr at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.
Use of uninitialized value in addition (+) at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 568.

and then a whole lot of:

Use of uninitialized value in join or string at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 1489.
Use of uninitialized value in join or string at /sw/apps/bioinfo/VarDictJava/1.4.5/milou/VarDictJava/VarDict/vardict line 1490.
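A hedged sketch of the suspected fix: per the help text above, coordinates are already treated as 0-based when -R is given, so -z should simply be left out (sample name, BAM paths and reference are illustrative):

    # illustrative VarDict invocation: drop -z when using -R
    vardict -G human_g1k_v37_decoy.fasta -N tumor \
        -b "tumor.bam|normal.bam" -R chr17:1000000-1100000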

Watch and correct efficiency

I had a look at your hours and it looks like you are running at about 50% efficiency overall, so roughly half the hours I give you will not be used. Many of the Nextflow processes you run should probably be modified to run on a single core instead of a whole node; the mergeBam and MarkDuplicates jobs, for example, should probably run on only 1-2 cores. Please have a look at this.
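For example, a hedged sketch of what the directive change could look like (process name and values illustrative, not the actual Sarek code):

    process MarkDuplicates {
        cpus 2          // hypothetical cap: these jobs don't scale past a couple of cores,
        memory '16 GB'  // so don't reserve a whole node for them

        script:
        """
        ...
        """
    }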

ASCAT

Get ASCAT to work with .bam files as input, and call it in Nextflow

Use GATK merge

Instead of samtools: the point is that the older samtools versions found on some platforms treat read groups incorrectly, so we have to use MergeSamFiles from Picard instead.
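A minimal sketch with Picard's MergeSamFiles (file names illustrative):

    # Picard merges per-lane BAMs while handling read groups correctly
    java -jar picard.jar MergeSamFiles \
        INPUT=sample_L001.bam INPUT=sample_L002.bam \
        OUTPUT=sample.merged.bam SORT_ORDER=coordinate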

problem with fastq files

Instrument IDs need to be fixed in:
HCC1954.normal_S22_L001_R1_001.fastq.gz
HCC1954.normal_S22_L001_R2_001.fastq.gz
HCC1954.tumor2_S24_L002_R1_001.fastq.gz
HCC1954.tumor2_S24_L002_R2_001.fastq.gz
HCC1954.tumor3_S25_L001_R1_001.fastq.gz
HCC1954.tumor3_S25_L001_R2_001.fastq.gz

Write a comprehensive README

The README.md should contain a guide about the stages: how to start preprocessing, how to run variant calling without preprocessing, etc.

TCGA benchmark S1

Start benchmarking on the simplest dataset S1 for pure normal/tumor data

Add MuTect1

It seems MuTect2 is in beta, and will be for a while. To get SNPs, we have to add MuTect1.

Simulate SV data for validation

To validate Manta/Delly/Lumpy we need a set of known structural variants. To keep it relatively fast, we first need a data set with known variants, including deletions, insertions, inversions, translocations and duplications on a small reference:

  • three chromosomes only
  • each from HG19
  • each chromosome is a ~1Mb subset
  • reads are simulated by either wgsim or stampy (see the sketch after this list)
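A hedged wgsim sketch (read counts, lengths and file names illustrative):

    # simulate paired-end reads from the small reference
    # -N = number of read pairs, -1/-2 = read lengths
    wgsim -N 100000 -1 100 -2 100 small_ref.fasta sim_R1.fastq sim_R2.fastq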

Update cosmic version

Updated the COSMIC database version from v54 to v74 in /pica/data/uppnex/ToolBox/ReferenceAssemblies/hg38make/bundle/2.8/b37 by adding b37_cosmic_v74.vcf and b37_cosmic_v74.vcf.idx.

Change the pointers in milou.config to the new files.
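A hedged sketch of the milou.config change (the parameter names here are hypothetical, not the actual config keys):

    // illustrative config pointers to the new COSMIC files
    params.cosmic      = "${params.genomeDir}/b37_cosmic_v74.vcf"
    params.cosmicIndex = "${params.genomeDir}/b37_cosmic_v74.vcf.idx"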

Strelka has problem with the reference index

Tried to run Strelka on our test samples and ran into this error:
ERROR: Unexpected format for line number '1' of fasta index file: '/pica/data/uppnex/ToolBox/ReferenceAssemblies/hg38make/bundle/2.8/b37/human_g1k_v37_decoy.fasta.fai' Re-running fasta indexing may fix the issue. To do so, run: "samtools faidx /pica/data/uppnex/ToolBox/ReferenceAssemblies/hg38make/bundle/2.8/b37/human_g1k_v37_decoy.fasta"

Check UNIX pipes

We are using UNIX pipes (e.g. in Mapping), and when an early stage fails, only the exit status of the last stage is picked up by Nextflow. Better to check PIPESTATUS (or set pipefail) and fail early.
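A minimal sketch of the fix, assuming bash inside the process script (tool arguments illustrative):

    # with pipefail, the pipe's exit status is that of the first failing stage,
    # so Nextflow sees the failure instead of the last command's status
    set -o pipefail
    bwa mem -t 8 reference.fasta reads_R1.fastq reads_R2.fastq | samtools sort -o mapped.bam -
    # alternatively, inspect every stage right after the pipe:
    # echo "exit codes per stage: ${PIPESTATUS[@]}"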

Remove hard-coded parameters

Reporting some issues I've had with the workflow.

First: Line 1086 of MultiFQtoVC.nf should probably be erased.
https://github.com/SciLifeLab/CAW/blob/master/MultiFQtoVC.nf#L1086

Then, would it be possible to take all the blocks like

    cpus 8
    memory { 16.GB * task.attempt }
    time { 20.h * task.attempt }
    errorStrategy { task.exitStatus == 143 ? 'retry' : 'terminate' }
    maxRetries 3
    maxErrors '-1'

(for example: https://github.com/SciLifeLab/CAW/blob/master/MultiFQtoVC.nf#L324-L329)
and turn them into a nodeSetup() function, or similar?

So that it is not hard-coded.

Or something like

    cpus params.nodeCPUs
    memory { params.nodeRAM * task.attempt }
    time { params.nodeTimeSlice * task.attempt }
    errorStrategy { task.exitStatus == 143 ? 'retry' : 'terminate' }
    maxRetries 3
    maxErrors '-1'
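Alternatively, a hedged sketch of pulling the defaults into nextflow.config through the process scope, so the scripts themselves carry no hard-coded values:

    // nextflow.config sketch: one place to tune resources for every process
    process {
        cpus = 8
        memory = { 16.GB * task.attempt }
        time = { 20.h * task.attempt }
        errorStrategy = { task.exitStatus == 143 ? 'retry' : 'terminate' }
        maxRetries = 3
        maxErrors = -1
    }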

Set of variant callers integrated

List of somatic callers at https://omictools.com/somatic-snp-detection-category

DREAM Challenge runs

These runs are to get results from Strelka and MuTect1 from all the DREAM datasets

Use GATK 3.6 only

Now we are using different GATK versions for recalibration/realignment and for variant calling (HaplotypeCaller and MuTect2). The v3.6 jar should contain all these tools, so we should use only that, and only the variable gatkHome (not mutect2Home).
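A hedged sketch of the result (paths illustrative, remaining arguments elided): every tool comes from the same v3.6 jar, addressed through the single gatkHome variable:

    # one jar, one variable; -T selects the tool
    java -jar $gatkHome/GenomeAnalysisTK.jar -T BaseRecalibrator ...
    java -jar $gatkHome/GenomeAnalysisTK.jar -T HaplotypeCaller ...
    java -jar $gatkHome/GenomeAnalysisTK.jar -T MuTect2 ...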

Make an error if a file is missing in the tsv file

I moved my .tsv file, so the links to the fastq files were not right anymore, but the workflow still managed to run until the CreateIntervals process.
So we should definitely add a verification of the files listed in the .tsv file.
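A minimal sketch of such a check, in DSL1 style (helper name hypothetical):

    // hypothetical helper: abort before any process starts if a path is missing
    def checkTsvPath(path) {
        def f = file(path)
        if (!f.exists()) exit 1, "Missing file listed in TSV: ${path}"
        return f
    }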

Skipping IndelRealignment

According to multiple blog posts, indel realignment is no longer necessary for many variant callers. We should investigate once a couple of VCs are integrated.

Assemble new test data based on TCGA benchmark

Find suitable samples/files from the TCGA dataset that can be used as a defined test set. It should include a normal sample and several tumor samples from the same individual. We should split the fastq files into multiple files per sample, and rename them so that they mimic the file structure at NGI. Based on this we will then parse the read group information from the fastq file name according to the description in CAW/doc/files_and_folder_info.md.

The full TCGA data set is described here: http://hgwdev.soe.ucsc.edu/~ewingad/benchmark4/instructions.pdf

The HCC1954 tumor sample has a mutation in TP53, which could be a good test case.
This somatic mutation was detected in a previous MuTect1 run (normal in the first genotype column, tumor in the second):
7578442 rs148924904 T C . PASS AC=5;AN=11;DB;SF=0,1;SOMATIC;VT=SNP GT:BQ:DP:FA:SS:AD 0:.:87:0.00:0:86,0 0/1:34:59:1.00:2:0,59

Good 100 000 bp region to extract is: chr17:7528442-7628442

Data will be extracted from the following bam files:
"normal"
/sw/data/uppnex/ToolBox/TCGAbenchmark/34c9ff85-c2f8-45dc-b4aa-fba05748e355/G15512.HCC1954_BL.1.bam
"tumor1"
/sw/data/uppnex/ToolBox/TCGAbenchmark/6d8044f7-3f63-487c-9191-adfeed4e74d3/G15512.HCC1954.1.bam
"tumor2" (60% tumor, 40% normal)
/sw/data/uppnex/ToolBox/TCGAbenchmark/02d8b3de-b043-4bfa-9130-adc18195313f/HCC1954.mix1.n40t60.bam
"tumor3" (20% tumor, 80% normal)
/sw/data/uppnex/ToolBox/TCGAbenchmark/360b4736-6c5e-48df-af58-c1cf51609350/HCC1954.mix1.n80t20.bam

Data extraction
Extract bam files for the above region:

samtools view -bh /sw/data/uppnex/ToolBox/TCGAbenchmark/34c9ff85-c2f8-45dc-b4aa-fba05748e355/G15512.HCC1954_BL.1.bam 17:7528442-7538442 | samtools sort -n -m 4G - HCC1954.part.normal
samtools view -bh /sw/data/uppnex/ToolBox/TCGAbenchmark/6d8044f7-3f63-487c-9191-adfeed4e74d3/G15512.HCC1954.1.bam 17:7528442-7538442 | samtools sort -n -m 4G - HCC1954.part.tumor1
samtools view -bh /sw/data/uppnex/ToolBox/TCGAbenchmark/02d8b3de-b043-4bfa-9130-adc18195313f/HCC1954.mix1.n40t60.bam 17:7528442-7538442 | samtools sort -n -m 4G - HCC1954.part.tumor2
samtools view -bh /sw/data/uppnex/ToolBox/TCGAbenchmark/360b4736-6c5e-48df-af58-c1cf51609350/HCC1954.mix1.n80t20.bam 17:7528442-7538442 | samtools sort -n -m 4G - HCC1954.part.tumor3

Convert to fastq:

bedtools bamtofastq -i HCC1954.part.normal.bam -fq HCC1954.part.normal_R1.fastq -fq2 HCC1954.part.normal_R2.fastq
bedtools bamtofastq -i HCC1954.part.tumor1.bam -fq HCC1954.part.tumor1_R1.fastq -fq2 HCC1954.part.tumor1_R2.fastq
bedtools bamtofastq -i HCC1954.part.tumor2.bam -fq HCC1954.part.tumor2_R1.fastq -fq2 HCC1954.part.tumor2_R2.fastq
bedtools bamtofastq -i HCC1954.part.tumor3.bam -fq HCC1954.part.tumor3_R1.fastq -fq2 HCC1954.part.tumor3_R2.fastq

Split each fastq into two smaller files that mimic "lanes", with file names that look like what comes out of NGI:
sample_index_lane_R1_001.fastq.gz
for example
5-D-91-01_S22_L005_R1_001.fastq.gz
Where
5-D-91-01 = sample
S22 = index
L005 = lane
and the ending "_001.fastq.gz" should always be there

The fastq files have the following numbers of lines:
HCC1954.part.normal_R1.fastq 20012 (same for HCC1954.part.normal_R2.fastq)
HCC1954.part.tumor1_R1.fastq 13116
HCC1954.part.tumor2_R1.fastq 6592
HCC1954.part.tumor3_R1.fastq 7316

head -n 10000 HCC1954.part.normal_R1.fastq > HCC1954.normal_S22_L001_R1_001.fastq
head -n 10000 HCC1954.part.normal_R2.fastq > HCC1954.normal_S22_L001_R2_001.fastq 
tail -n 10012 HCC1954.part.normal_R1.fastq >  HCC1954.normal_S22_L002_R1_001.fastq 
tail -n 10012 HCC1954.part.normal_R2.fastq >  HCC1954.normal_S22_L002_R2_001.fastq 

head -n 7000 HCC1954.part.tumor1_R1.fastq > HCC1954.tumor1_S23_L001_R1_001.fastq 
head -n 7000 HCC1954.part.tumor1_R2.fastq > HCC1954.tumor1_S23_L001_R2_001.fastq
tail -n 6116 HCC1954.part.tumor1_R1.fastq >  HCC1954.tumor1_S23_L002_R1_001.fastq 
tail -n 6116 HCC1954.part.tumor1_R2.fastq >  HCC1954.tumor1_S23_L002_R2_001.fastq 

head -n 3296 HCC1954.part.tumor2_R1.fastq > HCC1954.tumor2_S24_L001_R1_001.fastq 
head -n 3296 HCC1954.part.tumor2_R2.fastq > HCC1954.tumor2_S24_L001_R2_001.fastq
tail -n 3296 HCC1954.part.tumor2_R1.fastq >  HCC1954.tumor2_S24_L002_R1_001.fastq 
tail -n 3296 HCC1954.part.tumor2_R2.fastq >  HCC1954.tumor2_S24_L002_R2_001.fastq 

head -n 3600 HCC1954.part.tumor3_R1.fastq > HCC1954.tumor3_S25_L001_R1_001.fastq 
head -n 3600 HCC1954.part.tumor3_R2.fastq > HCC1954.tumor3_S25_L001_R2_001.fastq
tail -n 3716 HCC1954.part.tumor3_R1.fastq >  HCC1954.tumor3_S25_L002_R1_001.fastq 
tail -n 3716 HCC1954.part.tumor3_R2.fastq >  HCC1954.tumor3_S25_L002_R2_001.fastq 

Finally
gzip *_001.fastq

The final gzipped files are stored on Milou in
/sw/data/uppnex/ToolBox/TCGAbenchmark/chr17_multilane/

Annotation of results

Select/implement a set of tools that annotate the result VCFs before ranking (a usage sketch follows the list), such as:

  • snpEff
  • vcfanno
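For instance, a minimal snpEff sketch (database and file names illustrative):

    # annotate a result VCF against a snpEff database
    java -jar snpEff.jar GRCh37.75 results.vcf > results.snpeff.vcf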

Get rid of warning Duplicate output channel name

WARN: Duplicate output channel name: 'tiny' in the script context -- it's worth to rename it to avoid possible conflicts
tiny is the patientId. The warning message is not very helpful, so I don't know where this problem comes from; still looking...
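If the collision turns out to be between output channel names, a hedged DSL1 sketch of the usual fix is to give every process a distinct into target instead of reusing an identifier like the patientId:

    // channel names are global in DSL1, so make each output name unique
    output:
    set idPatient, file("${idPatient}.bam") into bamForGermlineVC  // not into 'tiny'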

Process arbitrary number of fastq input files

We may have to process an arbitrary number of fastq pairs:

  • pair sample (from multiplexing)
  • multiple samples: normal, tumor, tumor 2 ... tumor N

We probably need a channel of fastq pairs as input, and we need to keep track of read group (RG) info and tumor/normal status...
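A hedged DSL1 sketch of such a channel, with a hypothetical TSV column order:

    // hypothetical columns: patient, tumor/normal status, sample, lane, R1, R2
    fastqPairs = Channel.fromPath(params.sample)
        .splitCsv(sep: '\t')
        .map { row ->
            def (idPatient, status, idSample, idLane, fq1, fq2) = row
            tuple(idPatient, status, idSample, idLane, file(fq1), file(fq2))
        }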

remove threading from MuTect2.

MuTect2 has a bug in its threading mode (-nct).
It is possible to use the threading parameter -nct 8 in the call to MuTect2, but in my experience (Malin) it doesn't work. I always get the following error (or similar) when I use the -nct parameter:

ERROR --
ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Somehow the requested coordinate is not covered by the read. Alignment
60055114 | 38S113M
at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:580)
at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:436)
at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:427)
at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:548)
at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:197)
at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipToRegion(ReadClipper.java:377)
at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipToRegion(ReadClipper.java:352)
at org.broadinstitute.gatk.utils.activeregion.ActiveRegion.trim(ActiveRegion.java:482)
at org.broadinstitute.gatk.utils.activeregion.ActiveRegion.trim(ActiveRegion.java:437)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ActiveRegionTrimmer$Result.getCallableRegion(ActiveRegionTrimmer.java:383)
at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:551)
at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:176)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Alignment 60055114 | 38S113M
ERROR ------------------------------------------------------------------------------------------

The problem is discussed on GATK forum:
http://gatkforums.broadinstitute.org/gatk/discussion/6936/mutect2-error-message-somehow-the-requested-coordinate-is-not-covered-by-the-read-too-many-deleti

Solution: Remove the -nct parameter from the call to MuTect2 and instead manually split the analysis into one job per chromosome arm. I think we already do this, don't we?
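A hedged sketch of the per-interval approach (channel, parameter and file names illustrative):

    // one MuTect2 job per interval instead of -nct threading
    process RunMutect2 {
        input:
        set idPatient, file(bamNormal), file(bamTumor) from bamPairs
        each interval from intervals

        script:
        """
        java -jar GenomeAnalysisTK.jar -T MuTect2 \\
            -R ${params.genomeFasta} \\
            -I:normal ${bamNormal} \\
            -I:tumor ${bamTumor} \\
            -L ${interval} \\
            -o mutect2.${interval}.vcf
        """
    }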

Add trace and timeline default

Nextflow can generate trace (process state) and timeline files via the -with-trace and -with-timeline options. These should be switched on by default.
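A hedged sketch of switching them on permanently via nextflow.config (file names illustrative):

    // nextflow.config: always write the trace and timeline reports
    trace {
        enabled = true
        file = 'trace.txt'
    }
    timeline {
        enabled = true
        file = 'timeline.html'
    }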
