cnr-ibba / nf-resequencing-mem

Nextflow resequencing pipeline with bwa-mem and freebayes

License: MIT License

Nextflow 54.86% Python 18.87% Groovy 26.27%

bcftools bwa-mem freebayes nextflow nextflow-dsl2 nextflow-pipeline resequencing

nf-resequencing-mem's Issues

:bug: check_samplesheet can't detect header in certain conditions

check_samplesheet.py fails to detect the header in certain cases, even when it is provided correctly. This could be due to the csv.Sniffer.has_header method, which is known to produce both false positives and false negatives. For example, the following samplesheet:

sample,fastq_1,fastq_2
200-1-5,/home/ngs/freeclimb_resequencing/IGA_2022-01/delivery_20220110/raw_sequences/1_ID2101_200-1-5-H9H05KWZ-H1_S1_L001_R1_001.fastq.gz,/home/ngs/freeclimb_resequencing/IGA_2022-01/delivery_20220110/raw_sequences/1_ID2101_200-1-5-H9H05KWZ-H1_S1_L001_R2_001.fastq.gz
201-1-9,/home/ngs/freeclimb_resequencing/IGA_2022-01/delivery_20220110/raw_sequences/1_ID2101_201-1-9-H9H05KWZ-A2_S1_L001_R1_001.fastq.gz,/home/ngs/freeclimb_resequencing/IGA_2022-01/delivery_20220110/raw_sequences/1_ID2101_201-1-9-H9H05KWZ-A2_S1_L001_R2_001.fastq.gz
202-1-10,/home/ngs/freeclimb_resequencing/IGA_2022-01/delivery_20220110/raw_sequences/1_ID2101_202-1-10-H9H05KWZ-B2_S1_L001_R1_001.fastq.gz,/home/ngs/freeclimb_resequencing/IGA_2022-01/delivery_20220110/raw_sequences/1_ID2101_202-1-10-H9H05KWZ-B2_S1_L001_R2_001.fastq.gz

is not recognized as having a header.
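One way to avoid the heuristic entirely is to validate the first row against the expected column names instead of asking csv.Sniffer. The following is only a minimal sketch, not the code used in check_samplesheet.py: the function name `read_samplesheet` and the `REQUIRED_COLUMNS` tuple are hypothetical, and the column names are taken from the example above.

```python
import csv
import io

# Hypothetical list of required columns, copied from the example samplesheet.
REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def read_samplesheet(handle, required=REQUIRED_COLUMNS):
    """Return the data rows as dictionaries, or raise ValueError on a bad header."""
    reader = csv.reader(handle)
    try:
        # The first row must be the header: no sniffing involved.
        header = [column.strip() for column in next(reader)]
    except StopIteration:
        raise ValueError("samplesheet is empty")
    missing = [name for name in required if name not in header]
    if missing:
        raise ValueError(f"missing header columns: {', '.join(missing)}")
    # Skip completely empty lines, map every other row onto the header.
    return [dict(zip(header, row)) for row in reader if row]


if __name__ == "__main__":
    text = "sample,fastq_1,fastq_2\n200-1-5,/data/R1.fastq.gz,/data/R2.fastq.gz\n"
    for row in read_samplesheet(io.StringIO(text)):
        print(row["sample"], row["fastq_1"], row["fastq_2"])
```

Checking the header explicitly makes the failure mode deterministic: a samplesheet without the expected columns is rejected with a clear message, whatever the data rows look like.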

:sparkles: add a step for sample coverage

Add a step that reports the coverage for each sample

:zap: replace `picard/markduplicates` with `samtools/markdup`

According to the samtools/markdup documentation, additional steps are required before calling markdup. These steps are performed by the bam_markduplicates_samtools nf-core subworkflow; however, they need to be performed before the sort step (which is done during alignment)

install samtools/markdup module
patch samtools/markdup module to produce reports
~~fix and rename cram_markduplicates_picard local subworkflow~~
install bam_markduplicates_samtools nf-core workflow
remove picard/markduplicates modules and configuration
check samtools/markdup with MultiQC config
test --save_cram option

:sparkles: Annotate VCF file with SnpEff

add snpeff to the pipeline
conditional check for snpeff analysis
snpeff using community databases
snpeff using custom database
compress and index VCF file

:wrench: configure local resources like current nextflow pipelines

add config/base.config
add config/module.config
move stuff in config files
check for local resource availability
test CI within github workflow
~~add a default configuration for core environment~~

:sparkles: check that reads IDs are unique

If there are duplicate IDs in FASTQ files, the markduplicate step will fail.

add seqkit step to check for IDs duplicate
test stuff
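The issue plans to use seqkit for this check. Purely as an illustration of what the check does, here is a plain Python sketch (not seqkit, and not the pipeline's code) that reports read IDs occurring more than once in a FASTQ file. The function name `duplicated_read_ids` is hypothetical, and the sketch assumes strict four-line FASTQ records and enough memory to hold all IDs.

```python
import gzip
from collections import Counter


def duplicated_read_ids(fastq_path):
    """Return the sorted read IDs that occur more than once in a FASTQ file."""
    # Transparently read gzip-compressed files, based on the file extension.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Assumption: every record spans exactly four lines,
            # so header lines are lines 0, 4, 8, ...
            if line_number % 4 != 0:
                continue
            # Drop the leading '@' and keep the ID up to the first whitespace.
            fields = line[1:].split()
            if fields:
                counts[fields[0]] += 1
    return sorted(read_id for read_id, seen in counts.items() if seen > 1)
```

An empty list means that every read ID in the file is unique. For real sequencing runs a dedicated tool such as seqkit is the better choice, since holding every ID in memory does not scale well.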

:arrow_up: upgrade module dependencies

:heavy_plus_sign: switch to nf-validation plugin

update or remove samplesheet_check
add nf-validation plugin
update assets
test parameters and validation

:zap: replace `picard/markduplicates` with `sambamba/markdup`

Try to replace markduplicates with sambamba as described here

install sambamba/markdup module
fix and rename cram_markduplicates_picard local subworkflow
remove picard/markduplicates modules and configuration
test --save_cram option

:zap: optimize `samtools/depth` step

~~use bgzip with multi-threads when creating samtools/depth~~
try to split samtools/depth by chromosomes
raise samtools/depth required time

:wrench: update pipeline limits

Try to increase the pipeline's default parameters to better exploit resources

:sparkles: Test in AWS environment

Test this pipeline in AWS environment

~~move test data to S3~~
call pipeline on AWS
~~collect data from S3~~

:sparkles: deal with compressed genomes

Freebayes can't run on a compressed genome sequence. Add a new step to generate a FAI index to be used in freebayes step
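In a pipeline this step would normally be handled by existing tools (for example decompressing the genome and running `samtools faidx`). The sketch below only illustrates the idea in plain Python: it writes an uncompressed copy of the genome and computes the five columns of a FAI index (name, length, offset, line bases, line width). The function names `decompress_fasta` and `fai_records` are hypothetical, and the sketch assumes a well-formed FASTA in which every header line contains a sequence name.

```python
import gzip
import shutil


def decompress_fasta(gz_path, fasta_path):
    """Write an uncompressed copy of a gzip-compressed FASTA file."""
    with gzip.open(gz_path, "rb") as source, open(fasta_path, "wb") as target:
        shutil.copyfileobj(source, target)


def fai_records(fasta_path):
    """Return (name, length, offset, linebases, linewidth) tuples, one per sequence."""
    records = []
    current = None  # record of the sequence currently being read
    position = 0    # byte offset of the current line within the file
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if current is not None:
                    records.append(tuple(current))
                # The sequence name is the header text up to the first whitespace.
                name = line[1:].split()[0].decode()
                # The sequence data starts right after the header line.
                current = [name, 0, position + len(line), 0, 0]
            elif current is not None:
                bases = len(line.rstrip(b"\r\n"))
                if current[3] == 0 and bases:
                    current[3] = bases      # bases per line
                    current[4] = len(line)  # bytes per line, newline included
                current[1] += bases
            position += len(line)
    if current is not None:
        records.append(tuple(current))
    return records
```

Writing each tuple as a tab-separated line gives a file with the same layout as a `.fai` index, although relying on the index produced by samtools itself is the safer option.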

:bookmark: release this pipeline to the public

Follow the "Using nf-core components outside nf-core" tutorial and release the pipeline to the public

:zap: call freebayes in different processes

import the freebayes_parallel subworkflow
change pipeline structure and parameter
test with real data

:arrow_up: upgrade module dependencies and test for `-stub` option

update module dependencies
update custom module dependencies
remove the enable_conda params / or remove conda support
fix changed steps
fix the default container registry for custom modules (prepend docker.io to container URL)
test for -stub option

:zap: improve performance of coverage step

Try to optimize the coverage step

:zap: replace `bwa` with `bwa-mem2` modules

bwa-mem2 seems to be an improved implementation of bwa mem. Replace old bwa modules in pipeline and test data

replace bwa/index with bwamem2/index
replace bwa/mem with bwamem2/mem

:arrow_up: upgrade modules

upgrade nf-core modules

:sparkles: deal with CRAM alignments

use samtools view to convert CRAM into BAM -> CRAM convert with SAMTOOLS/CRAM module:

:bug: solve linting issues

Can't lint the pipeline anymore. Did something change in the modules? Check the private modules structure

Test in PBS environment

Test this pipeline in a PBS environment

install singularity
test with conda using PBS
test with singularity using PBS

:boom: launch freebayes on all samples

move freebayes/single to freebayes/multi
~~try to split by chromosomes and calculate freebayes on each chromosome (with all samples)~~
chunk chromosomes as described in freebayes parallelization
collect all data in a unique VCF file
accept bwa indexes and genome indexes as parameters
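The chunking step in the list above can be sketched as follows. This is not the code used by the pipeline or by freebayes: the function name `chunk_regions` is hypothetical, and the `name:start-end` region strings assume zero-based starts with an exclusive end, which should be checked against the freebayes documentation before use.

```python
def chunk_regions(chrom_lengths, chunk_size):
    """Split every chromosome into consecutive regions of at most chunk_size bases.

    chrom_lengths is an iterable of (name, length) pairs, for example the first
    two columns of a FAI index.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    regions = []
    for name, length in chrom_lengths:
        for start in range(0, length, chunk_size):
            # The last chunk of a chromosome may be shorter than chunk_size.
            end = min(start + chunk_size, length)
            # Assumption: zero-based start, exclusive end.
            regions.append(f"{name}:{start}-{end}")
    return regions


if __name__ == "__main__":
    for region in chunk_regions([("Chr01", 250), ("Chr02", 100)], 100):
        print(region)
```

Each region can then be passed to a separate freebayes process running on all samples, and the per-region VCF files are concatenated into a single VCF file at the end.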

:bug: update pipeline to support latest nf-core/tools

nf-core/tools v2.7.1 cannot lint the pipeline anymore: there are some changes which could affect the CI tests. The pipeline needs to be updated and the lint tests should pass without warnings.

Additional information

The pipeline name causes an issue: .nf-core.yml should be changed like this:

lint:
  actions_awsfulltest: False
  pipeline_todos: False
  files_exist: False
  nextflow_config:
    - manifest.name
  files_unchanged:
    - .gitignore
    - .github/workflows/linting.yml
  actions_ci: False
  merge_markers: False
  multiqc_config: False
repository_type: pipeline

v2.7.1 has issues parsing this file; the dev branch of nf-core/tools parses this config file correctly

:construction_worker: update CI system

CI fails because of new templates

:zap: split chromosome relying on BAM size while running freebayes

use split_ref_by_bai_datasize.py in freebayes/multi module

:sparkles: use MultiQC with all supported tools

Try to generate a MultiQC report with all supported tools

:sparkles: accept input data as samples data file

This pipeline currently cannot deal with single-end libraries. Moreover, it is difficult to manage single-end and paired-end samples in the same run. Currently, nextflow pipelines accept input data from a samples file, where single-end and paired-end files are clearly stated and managed properly. Adapt this pipeline to accept data in this format

:bug: markduplicates has issues with cram files

By visually inspecting the alignments (using samtools tview), there are some regions which seem to have bad alignments: the reason seems to be that MarkDuplicates changes the sequence in the aligned file. For example, before markduplicates we have:

samtools view WT.cram Chr01:60000-60001 --reference Pvulgaris_442_v2.0.fa | head -n1
A01083:294:H3LHWDSXC:2:2553:27751:35556 147     Chr01   59859   60      150M    =       59547   -462    CGCCGCGTCTTTTAAGAAAATAGCGGGAGAAGAAACTTCGATTTTCAATAACAATGAAGGTAAATTAAATTGATAAATTTTATATTCAATTGATAGCAATAAATCACGCAAATATGTAAATTGAAATATTTATTTTAAAGTTTCGATAAC FFF:,FFF:F:FFF,FFFFFF:,F,,,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFMC:Z:150M       AS:i:145        XS:i:20 MD:Z:24T125     NM:i:1  RG:Z:WT

And after markduplicates we have:

samtools view WT.md.cram Chr01:60000-60001 --reference Pvulgaris_442_v2.0.fa | head -n1
A01083:294:H3LHWDSXC:2:2553:27751:35556 147     Chr01   59859   60      150M    =       59547   -462    NCNNCNCGNGGGGCCCCCCCGCCNCCCCCCCCCCCNGGNCCGGGGNCCGCCNCCGCCCCCGCCCGGCCCGGCCGCCCGGGGCGCGGNCCGGCCGCCNCCGCCCGNCNCNCCCGCGCGCCCGGCCCCGCGGGCGGGGCCCCGGGNCCGCCN FFF:,FFF:F:FFF,FFFFFF:,F,,,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFAS:i:145        MC:Z:150M       PG:Z:MarkDuplicates     XS:i:20 MD:Z:0C0G0C0C0G0C0G0T0C0T0T0T0T0A0A0G0A0A0A0A0T0A0G0C0T0G0G0A0G0A0A0G0A0A0A0C0T0T0C0G0A0T0T0T0T0C0A0A0T0A0A0C0A0A0T0G0A0A0G0G0T0A0A0A0T0T0A0A0A0T0T0G0A0T0A0A0A0T0T0T0T0A0T0A0T0T0C0A0A0T0T0G0A0T0A0G0C0A0A0T0A0A0A0T0C0A0C0G0C0A0A0A0T0A0T0G0T0A0A0A0T0T0G0A0A0A0T0A0T0T0T0A0T0T0T0T0A0A0A0G0T0T0T0C0G0A0T0A0A0C0    NM:i:150        RG:Z:WT

The number of matches is identical; however, markduplicates adds 150 mismatches, and the changed sequence in column 10 is the sequence visualized by samtools tview. This behaviour does not affect all genome regions. It is not clear how this affects the calling process. Markduplicates should be removed as described in #71
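To look for affected reads without inspecting them visually, one option is to count the mismatches encoded in the MD tag and compare the value before and after duplicate marking. The sketch below is only an illustration: the function name `md_mismatches` is hypothetical, and it counts mismatched bases only, skipping the deleted bases that the MD tag marks with `^`.

```python
import re

# An MD tag is a sequence of match counts, deletions (^ followed by bases)
# and single mismatched reference bases.
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def md_mismatches(md_value):
    """Count the mismatched bases in the value of an MD tag (without 'MD:Z:')."""
    mismatches = 0
    for match in MD_TOKEN.finditer(md_value):
        # Group 3 is a single reference base, which denotes one mismatch.
        if match.group(3):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # MD value of the record before markduplicates, as shown above.
    print(md_mismatches("24T125"))
```

For the record before markduplicates (MD:Z:24T125) this gives a single mismatch, in agreement with NM:i:1. A read whose count jumps to its full length after duplicate marking, as in the second record, is a candidate for the problem described here.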