
mycosnp-nf's Introduction

CDCgov/mycosnp-nf


Introduction

MycoSNP (CDCgov/mycosnp-nf) is a bioinformatics best-practice analysis pipeline: a portable workflow for performing whole-genome sequencing analysis of fungal organisms, including Candida auris. The workflow prepares the reference, performs quality control, and calls variants against that reference. MycoSNP generates several output files that are compatible with downstream analytic tools, such as those used for phylogenetic tree-building and gene variant annotation.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
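
For context, Nextflow itself can be installed with its official one-line installer (a minimal sketch; it assumes a recent Java is already available, and that ~/bin is on your PATH):

    # download the nextflow launcher into the current directory, then move it onto your PATH
    curl -s https://get.nextflow.io | bash
    mv nextflow ~/bin/   # or any other directory on your PATH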

Pipeline summary

Reference Preparation

Prepares a reference FASTA file for BWA alignment and GATK variant calling by masking repeats in the reference and generating the BWA index.

  • Genome repeat identification and masking (nucmer)
  • BWA index generation (bwa)
  • FAI and DICT file creation (Picard, Samtools)
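
As a rough illustration of what this stage does, the steps map onto standalone commands along these lines (a minimal sketch assuming a file named reference.fasta; the pipeline's actual module invocations and parameters may differ):

    # self-align the reference to find repeats (MUMmer/nucmer), then dump the coordinates
    nucmer --maxmatch -p repeats reference.fasta reference.fasta
    show-coords -r -T repeats.delta > repeats.coords
    # (the pipeline converts the repeat coordinates to intervals and masks them in the reference,
    #  e.g. a BED of repeat regions can be applied with bedtools maskfasta)

    # index the (masked) reference for alignment and variant calling
    bwa index reference.fasta
    samtools faidx reference.fasta
    picard CreateSequenceDictionary R=reference.fasta O=reference.dict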

Sample QC and Processing

Prepares samples (paired-end FASTQ files) for GATK variant calling by aligning the samples to a BWA reference index and ensuring that the BAM files are correctly formatted. This step also provides different quality reports for sample evaluation.

  • Combine FASTQ files when a sample was provided across multiple lanes.
  • Filter unpaired reads from FASTQ files (SeqKit).
  • Downsample FASTQ files to a desired coverage or sampling rate (SeqTK).
  • Trim reads and assess quality (FaQCs).
  • Generate a QC report by extracting data from the FaQCs reports.
  • Align FASTQ reads to a reference (BWA).
  • Sort BAM files (SAMTools).
  • Mark and remove duplicates in the BAM file (Picard).
  • Clean the BAM file (Picard "CleanSam").
  • Fix mate information in the BAM file (Picard "FixMateInformation").
  • Add read groups to the BAM file (Picard "AddOrReplaceReadGroups").
  • Index the BAM file (SAMTools).
  • FastQC - Filtered reads QC.
  • Qualimap mapping quality report.
  • MultiQC - Aggregate report describing results and QC from the whole pipeline.
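
For orientation, the core of the per-sample processing corresponds roughly to commands like the following (a sketch with hypothetical file names; the pipeline wires these steps together through nf-core modules with its own parameters):

    # align trimmed reads and sort the alignment
    bwa mem -t 4 reference.fasta sample_R1.trimmed.fastq.gz sample_R2.trimmed.fastq.gz | samtools sort -o sample.sorted.bam -
    # mark/remove duplicates, clean, fix mate information, and add read groups
    picard MarkDuplicates I=sample.sorted.bam O=sample.dedup.bam M=sample.dup_metrics.txt REMOVE_DUPLICATES=true
    picard CleanSam I=sample.dedup.bam O=sample.clean.bam
    picard FixMateInformation I=sample.clean.bam O=sample.fixmate.bam
    picard AddOrReplaceReadGroups I=sample.fixmate.bam O=sample.final.bam RGID=sample RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=sample
    samtools index sample.final.bam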

Variant calling and analysis

Calls variants and generates a multi-FASTA file and phylogeny.

  • Call variants (GATK HaplotypeCaller).
  • Combine gVCF files from the HaplotypeCaller into a single VCF (GATK CombineGVCFs).
  • Call genotypes (GATK GenotypeGVCFs).
  • Filter the variants (GATK VariantFiltration) [default (but customizable) filter: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || DP < 10'].
  • Run a customized VCF filtering script (Broad Institute).
  • Split the filtered VCF file by sample.
  • Select only SNPs from the VCF files (GATK SelectVariants).
  • Split the VCF file with SNPs by sample.
  • Create a multi-FASTA file from the VCF SNP positions using a custom script (Broad).
  • Create a distance matrix file from the multi-FASTA file (snp-dists).
  • Create a phylogeny from the multi-FASTA file (rapidNJ, FastTree2, quicksnp, RAxML (optional), IQ-TREE (optional)).
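
The equivalent standalone GATK commands look roughly like this (a sketch for haploid samples with placeholder file names; the filter expression is the pipeline default quoted above):

    gatk HaplotypeCaller -R reference.fasta -I sample.final.bam -O sample.g.vcf.gz -ERC GVCF --sample-ploidy 1
    gatk CombineGVCFs -R reference.fasta --variant sample1.g.vcf.gz --variant sample2.g.vcf.gz -O combined.g.vcf.gz
    gatk GenotypeGVCFs -R reference.fasta -V combined.g.vcf.gz -O combined.vcf.gz
    gatk VariantFiltration -R reference.fasta -V combined.vcf.gz \
        --filter-expression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || DP < 10' --filter-name default_filter \
        -O combined.filtered.vcf.gz
    gatk SelectVariants -R reference.fasta -V combined.filtered.vcf.gz --select-type-to-include SNP -O combined.snps.vcf.gz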

Variant annotation analysis (currently available for C. auris B11205 genome only)

  • Annotate the VCF file (SnpEff).
  • Generate a combined output report (SnpEffR).
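
Annotation with SnpEff follows the usual pattern (a sketch; the database identifier below is a placeholder, since this step currently supports only the C. auris B11205 genome configured by the pipeline):

    # <snpeff_db> is a placeholder for the SnpEff database identifier used by the pipeline
    snpEff -csvStats snpeff_stats.csv <snpeff_db> combined.filtered.vcf > combined.annotated.vcf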

Quick Start

  1. Install Nextflow (>=21.10.3)

  2. Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (please only use Conda as a last resort; see docs)

  3. Install Python (version 3.0 or later is required).

  4. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run CDCgov/mycosnp-nf -profile test,YOURPROFILE

    Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (YOURPROFILE in the example command above). You can chain multiple config profiles in a comma-separated string.

    • The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
    • Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
    • If you are using singularity and are persistently observing issues downloading Singularity images directly due to timeout or network issues, then you can use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. Alternatively, you can use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
    • If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
  5. Start running your own analysis!

    nextflow run CDCgov/mycosnp-nf -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input samplesheet.csv --fasta c_auris.fasta
  6. It is advisable to delete large temporary or log files after the successful completion of the run. They take up a lot of space and may cause issues in future runs.
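
For reference, the --input samplesheet is a plain CSV with a sample,fastq_1,fastq_2 header, one row per paired-end sample, for example:

    sample,fastq_1,fastq_2
    sample01,/data/sample01_R1.fastq.gz,/data/sample01_R2.fastq.gz
    sample02,/data/sample02_R1.fastq.gz,/data/sample02_R2.fastq.gz

After a successful run, most of the disk space referred to in step 6 sits in the Nextflow work directory; removing it (for example with rm -rf work/, or nextflow clean -f for the most recent run) frees that space.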

Pre-configured Nextflow development environment using Gitpod

Open CDCgov/mycosnp-nf in Gitpod

Once the pod launches, it presents a VS Code interface and comes with Nextflow, Conda, and Docker pre-installed.

Documentation

The nf-core/mycosnp pipeline comes with documentation about the pipeline usage, parameters and output.

Credits

nf-core/mycosnp was originally written by CDC.

We thank the following people for their extensive assistance in the development of this pipeline:

Special thanks to the StaPH-B Slack workspace for open-source collaborations and discussions.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the MycoSNP and nf-core publications as follows:

Bagal UR, Phan J, Welsh RM, Misas E, Wagner D, Gade L, Litvintseva AP, Cuomo CA, Chow NA.

MycoSNP: A Portable Workflow for Performing Whole-Genome Sequencing Analysis of Candida auris.

Methods Mol Biol. 2022; 2517:215-228. doi: 10.1007/978-1-0716-2417-3_17

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

CDCgov GitHub Organization Open Source Project Template

Template for clearance: This project serves as a template to aid projects in starting up and moving through clearance procedures. To start, create a new repository and implement the required open practices, train on and agree to adhere to the organization's rules of behavior, and send a request through the create repo form using language from this template as a Guide.

General disclaimer This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.

Related documents

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

Source code forked from other open-source projects inherits the license of the original project.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.

mycosnp-nf's People

Contributors

cjjossart, drpatelh, hseabolt, jwarnn, leebrian, leuthrasp, mciprianocdc, mjcipriano, rpetit3, sateeshperi, urbagal, zmudge3


mycosnp-nf's Issues

Minimum read length parameter

Is your feature request related to a problem? Please describe.
I have tried to use mycosnp for some of the reference Candida auris clade-specific SRA datasets plus my own Candida auris sample, which is a paired-end Illumina dataset but with an atypical read length, i.e. 120 bp for Read 1 and 30 bp for Read 2.

Describe the solution you'd like
Please add a minimum read length parameter for the read preprocessing step. I have checked the pipeline's output files, and my atypical dataset is trimmed by FaQCs so heavily that no paired reads remain after the trimming step.

Describe alternatives you've considered
I cannot find the minimum read length parameter in the nextflow pipeline modules so I cannot suggest any alternatives.

Additional context
Please let me know how to configure the trimming step in the pipeline, or add the suggested parameter.

Best regards,
Jan

Samplesheet CSV file with dos line endings does not recognize second sequence

Describe the bug
Samplesheet CSV file with dos line endings does not recognize second sequence

Impact
Sequences will not be processed correctly

To Reproduce
Steps to reproduce the behavior:
Create a samplesheet in notepad on windows and use that, it will contain dos line endings.

Expected behavior
The pipeline should detect the dos line endings and ignore them.

Screenshots

Logs
Fails on pair step.

running this in Google Cloud?

Sorry for the possibly snarky question, but has this workflow been tested in Google Cloud?

Is anyone running it in Google Cloud? We continue to have substantial and repeated problems getting it to run to completion there.

Formatting error in qc_report.txt

Description:
Ran the sra_large.csv dataset (22 samples). The resulting qc_report.txt has a formatting error and has unexpected text in some of the rows (e.g., CP060343.1 1028427).

Impact:
The qc results do not correspond to the file headers in some columns, making it difficult to interpret.

To reproduce:
nextflow run CDCgov/mycosnp-nf -profile singularity --add_sra_file sra_large.csv --fasta GCA_016772135.1_ASM1677213v1_genomic.fna.gz

File attachments
sample file: sra_large.csv
resulting qc_report: qc.txt
qc report opened in excel: qc_excel.xlsx

Need an error handling strategy for probably corrupted files from SRA

Error executing process > 'NFCORE_MYCOSNP:MYCOSNP:BWA_PREPROCESS:SEQKIT_PAIR (B13520)'
Caused by:
Process `NFCORE_MYCOSNP:MYCOSNP:BWA_PREPROCESS:SEQKIT_PAIR (B13520)` terminated with an error exit status (255)
Command executed:
 seqkit \
pair \
-1 SRR7909394_1.fastq.gz \
-2 SRR7909394_2.fastq.gz \
\
--threads 2;

cat <<-END_VERSIONS > versions.yml
"NFCORE_MYCOSNP:MYCOSNP:BWA_PREPROCESS:SEQKIT_PAIR":
seqkit: $( seqkit | sed '3!d; s/Version: //' )
END_VERSIONS
Command exit status:
255
Command output:
(empty)
Command error:
[ERRO] SRR7909394_2.fastq.gz: unexpected EOF
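
As a stopgap until the pipeline handles this itself, truncated downloads like the one above can be caught before launching the run with a gzip integrity check (a sketch, run in the directory holding the FASTQ files):

    # gzip -t exits non-zero on a truncated or corrupt archive ("unexpected EOF")
    for fq in *.fastq.gz; do
        gzip -t "$fq" || echo "corrupt or truncated: $fq"
    done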

NFCORE_MYCOSNP:MYCOSNP:BWA_PREPROCESS:QUALIMAP_BAMQC

Describe the bug
Process NFCORE_MYCOSNP:MYCOSNP:BWA_PREPROCESS:QUALIMAP_BAMQC (SAMPLE) terminated with an error exit status (137)

command to run nextflow:
./nextflow run mycosnp-nf/main.nf -profile docker --input auris-testing/new_samplesheet.csv --fasta auris-testing/B8441.fasta

Impact
Trying to understand, build and compare similar pipeline with nextflow. Now, the comparison is delayed.

Expected behavior
Expected the pipeline to run successfully. The test run shows no error, but clearer details about usage are required.

Desktop

  • OS: Ubuntu 20.04
  • Nextflow: 23.04.0
  • System: 16GB RAM, 8 core CPU

Additional context
I faced the same issue for some other samples. I tried running a single sample a total of 4 times; only 1 of those runs completed without error. I have also seen this error: unknown recognition error type: groovyjarjarantlr4.v4.runtime.LexerNoViableAltException

Though the pipeline runs, the errors are constant.

add `nextflow pull` to usage docs

If running mycosnp directly from GitHub, users have to run `nextflow pull CDCgov/mycosnp-nf` first to update the locally cached copy of the pipeline, and can check the version number by running `nextflow run CDCgov/mycosnp-nf --help`.
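
Something along these lines in the usage docs would cover it:

    # update the locally cached copy of the pipeline, then confirm the version in use
    nextflow pull CDCgov/mycosnp-nf
    nextflow run CDCgov/mycosnp-nf --help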

qc_report formatting error

Sample Name     # Reads Before Trimming GC Before Trimming      Average Phred Before Trimming   Coverage Before Trimming        # Reads After Trimming  # Paired Reads After Trimming   # Unpaired Reads After Trimming GC After Trimming       Average Phred After Trimming    Coverage After Trimming
2014730900_GAD5441A14_TCCGGAGA-TAATCTTA 3283574 43.87%  35.10   3266756 (99.49 %)       3249992 (99.49 %)       16764 (0.51 %)  43.76%  33.96   197.05840522188217
2014730904_GAD5441A16_TCCGGAGA-GTACTGAC 2741120 43.76%  34.91   2725121 (99.42 %)       2709152 (99.41 %)       15969 (0.59 %)  43.61%  33.70   164.50390206579954
3000656785_GAD5582A1_AGCGATAG-TATAGCCT  2988822 43.72%  35.55   2974093 (99.51 %)       2959384 (99.51 %)       14709 (0.49 %)  43.58%  34.53   179.3693386572303
3000726373_GAD5913A3_TCCGCGAA-CCTATCCT  3024298 44.02%  35.30   3006200 (99.40 %)       2988144 (99.40 %)       18056 (0.60 %)  43.83%  33.82   181.49837366105587
3000806103_CHO6670A4_TAATGCGC-GTACTGAC  3494818 43.54%  33.45   3470610 (99.31 %)       3447810 (99.34 %)       22800 (0.66 %)  43.27%  29.56   251.5159281752104
3000807880_GAD6955A17_CGCTCATT-TATAGCCT 2853186 43.76%  35.07   2844098 (99.68 %)       2835040 (99.68 %)       9058 (0.32 %)   43.61%  33.86   171.2293625669472
3000813650_GAD6973A7_CTGAAGCT-CCTATCCT  3490930 45.42%  34.58   3482741 (99.77 %)       3474586 (99.77 %)       8155 (0.23 %)   45.27%  33.26   209.50254160290743
3001026848_S8   2715622 44.39%  31.87   2715397 (99.99 %)       2715174 (99.99 %)       223 (0.01 %)    44.40%  31.87   162.97368066182096
3001033245_S7   3489200 43.72%  34.45   3489111 (100.00 %)      3489022 (100.00 %)      89 (0.00 %)     43.72%  34.45   209.39871843917368
3001164744_S3   3493418 45.33%  34.01   3493341 (100.00 %)      3493266 (100.00 %)      75 (0.00 %)     45.33%  34.01   209.6518549158378
3001164751_S2   3497476 45.33%  33.53   3497397 (100.00 %)      3497318 (100.00 %)      79 (0.00 %)     45.33%  33.53   209.89538925019127
3001235204_S4   3502972 42.61%  34.94   3502863 (100.00 %)      3502754 (100.00 %)      109 (0.00 %)    42.61%  34.94   210.2252228385616
3001396883_S2   3096090 45.66%  34.50   3095998 (100.00 %)      3095906 (100.00 %)      92 (0.00 %)     45.66%  34.50   185.8068549158378
3001743891_S7   3483352 45.13%  35.71   3483206 (100.00 %)      3483060 (100.00 %)      146 (0.00 %)    45.13%  35.71   209.04776013771996
3001757322_S7   3084544 44.30%  34.88   3084130 (99.99 %)       3083850 (99.99 %)       280 (0.01 %)    44.30%  34.88   185.1139403213466
3002006067_S8   2976534 45.62%  35.07   2976410 (100.00 %)      2976288 (100.00 %)      122 (0.00 %)    45.62%  35.07   178.63189412777353
B16328_S9       3920608 43.46%  34.91   3918391 (99.94 %)       3916502 (99.95 %)       1889 (0.05 %)   43.47%  34.90   235.28897475133894
B16329_S8       3910890 44.20%  35.00   3908813 (99.95 %)       3907070 (99.96 %)       1743 (0.04 %)   44.21%  34.99   234.70576463274674
CP060340.1 2308870
CP060340.1 2308870
CP060340.1 2308870
CP060340.1 2308870
CP060340.1 2308870
CP060340.1 2308870
CP060340.1 2308870
CP060340.1 2308870

related Issue: #37

Raxmlng - replace * with - for input file

Describe the bug
In the raxml-ng step, "*" characters in the FASTA file need to be replaced with "-" or else it fails.

Impact
Fail to run

To Reproduce
Any alignment which has a *

Expected behavior
Complete tree file

Screenshots

Logs

-[nf-core/mycosnp] Pipeline completed with errors-
Error executing process > 'NFCORE_MYCOSNP:MYCOSNP:CREATE_PHYLOGENY:RAXMLNG (1)'

Caused by:
Process NFCORE_MYCOSNP:MYCOSNP:CREATE_PHYLOGENY:RAXMLNG (1) terminated with an error exit status (1)

Command executed:

raxml-ng
--all --model GTR+G --bs-trees 1000
--msa combined_vcf-to-fasta.fasta
--threads 2
--prefix output

cat <<-END_VERSIONS > versions.yml
"NFCORE_MYCOSNP:MYCOSNP:CREATE_PHYLOGENY:RAXMLNG":
raxmlng: $(echo $(raxml-ng --version 2>&1) | sed 's/^.RAxML-NG v. //; s/released.$//')
END_VERSIONS

Command exit status:
1

Command output:
ERROR: Invalid character in sequence 2 at position 1020: *
ERROR: Invalid character in sequence 2 at position 1600: *
ERROR: Invalid character in sequence 2 at position 1703: *
ERROR: Invalid character in sequence 2 at position 2043: *
ERROR: Invalid character in sequence 2 at position 3069: *
ERROR: Invalid character in sequence 2 at position 8679: *
ERROR: Invalid character in sequence 2 at position 11069: *
ERROR: Invalid character in sequence 2 at position 20803: *
ERROR: Invalid character in sequence 2 at position 21924: *
ERROR: Invalid character in sequence 2 at position 21925: *
ERROR: Invalid character in sequence 2 at position 25064: *
ERROR: Invalid character in sequence 2 at position 28986: *
ERROR: Invalid character in sequence 2 at position 29056: *
ERROR: Invalid character in sequence 2 at position 31209: *
ERROR: Invalid character in sequence 2 at position 36528: *
ERROR: Invalid character in sequence 2 at position 37319: *
ERROR: Invalid character in sequence 2 at position 37967: *
ERROR: Invalid character in sequence 2 at position 39231: *
ERROR: Invalid character in sequence 2 at position 39880: *
ERROR: Invalid character in sequence 2 at position 42175: *
ERROR: Invalid character in sequence 2 at position 42176: *
ERROR: Invalid character in sequence 3 at position 5206: *
ERROR: Invalid character in sequence 3 at position 9832: *
ERROR: Invalid character in sequence 3 at position 12678: *
ERROR: Invalid character in sequence 3 at position 14401: *
ERROR: Invalid character in sequence 3 at position 14402: *
ERROR: Invalid character in sequence 3 at position 14430: *
ERROR: Invalid character in sequence 3 at position 28941: *
ERROR: Invalid character in sequence 3 at position 28942: *
ERROR: Invalid character in sequence 3 at position 31911: *
ERROR: Invalid character in sequence 3 at position 32935: *
ERROR: Invalid character in sequence 3 at position 33868: *
ERROR: Invalid character in sequence 3 at position 36010: *
ERROR: Invalid character in sequence 3 at position 42164: *
ERROR: Invalid character in sequence 3 at position 42165: *
ERROR: Invalid character in sequence 4 at position 3069: *
ERROR: Invalid character in sequence 4 at position 8679: *
ERROR: Invalid character in sequence 4 at position 11069: *
ERROR: Invalid character in sequence 4 at position 20799: *
ERROR: Invalid character in sequence 4 at position 21924: *
ERROR: Invalid character in sequence 4 at position 21925: *
ERROR: Invalid character in sequence 4 at position 25064: *
ERROR: Invalid character in sequence 4 at position 29056: *
ERROR: Invalid character in sequence 4 at position 36528: *
ERROR: Invalid character in sequence 4 at position 37319: *
ERROR: Invalid character in sequence 4 at position 37967: *
ERROR: Invalid character in sequence 4 at position 39880: *

ERROR: Alignment check failed (see details above)!

Command wrapper:
(same alignment-check errors as in the command output above)


file "combined_vcf-to-fasta.fasta" contains asterisks "*"

Describe the bug
file "combined_vcf-to-fasta.fasta" contains asterisks "*"

Impact
The file can not be used in MEGA to generates phylogenetic trees

Additional context
-As far as I know the asterisks could be representing spanning deletions

-This issue was previously fixed in mycoSNP-genflow when version 0.19 was released.
(from a John's email message:) "mycosnp-gatk-variants now filters asterisks with the awk commands provided by Broad"

-Sharing part of an email from Broad about this:
"
For example, this command will remove all sites of the SNP spanning deletion
awk '$5 !~/[GATC],*/' test.vcf > test_clean_step1.vcf

There possibly could be another more rare class of sites that has the reverse listing (*, [GATC]), and these could be removed by a more general awk command:
awk '$5 !~/*/' test_clean_step1.vcf > test_clean.vcf
""

[feature request] consider checking for read-level contamination with Kraken2

Is your feature request related to a problem? Please describe.
Yes, many public health laboratories have limited experience working with Candida auris in the lab (identifying a species via wet lab techniques, maldi-tof, colony morphology, etc.) which may lead to sequencing of samples that are not pure cultures/isolates.

One example: mixed culture of C. auris and C. parapsilosis. We recently looked at a sample that had roughly 36.6% reads assigned to C. auris and 57.7% of reads assigned to C. parapsilosis. A de novo assembly of the FASTQs from this sample resulted in a genome size of roughly 25 Mbases, which also shows evidence of a mixed isolate.

MycoSNP currently does not check for read-level contamination AFAIK. Additionally, with consensus/reference-based assembly, it may be difficult to identify mixed/contaminated samples given the current QC outputs from MycoSNP. The above-mentioned sample did fail MycoSNP QC due to a low GC content of 40.3%, but it was not obvious what was going on with the sample.

kraken2 (along with a proper database & fine-tuned parameters) could be used to screen the reads for potential contamination and ensure that the reads going into assembly are indeed from C. auris alone.

Describe the solution you'd like

  • add a step early on in the workflow that runs the fastqs through kraken2
  • check the output kraken2 report and look for a significant percentage of reads (e.g. >80%) that map to Candida auris and low-to-no percentage of reads (e.g. <10%) to another Candida species or other contaminating species.
  • Provide the kraken2 report file as an output from the workflow

Describe alternatives you've considered
None.

Additional context
I had a bad time using the standard Kraken2 databases built from RefSeq sequences; it seems no C. auris assemblies are included and few other Candida species are present. Nearly all of my test FASTQs were assigned as unclassified by kraken2.

I had good luck with the pre-built k2 database called "EuPathDB48" for Eukaryotic pathogens found here: https://benlangmead.github.io/aws-indexes/k2#:~:text=.txt-,EuPathDB48,-3

If you visit this link and CTRL+F for "Candida" you can see all Candida species present in the database https://genome-idx.s3.amazonaws.com/kraken/k2_eupathdb48_20201113/EuPathDB48_Contents.txt

The downside of this database is that it is huge: it required 34 GB of RAM to run and would be cumbersome for users to routinely download and use, so it is not practical for routine screening of FASTQs.

One alternative idea is to create a custom and small kraken2 database, potentially hosted on Azure cloud storage, Zenodo, or some other archival service, that is built using high quality Candida auris (and other Candida spp.) reference genomes and could be used to identify contamination between Candida species as well as other common contaminants (human? others?)

Example usage & results

$ ls k2-db-EuPathDB48/
EuPathDB48_Contents.txt       database150mers.kmer_distrib  database250mers.kmer_distrib  database50mers.kmer_distrib  hash.k2d     k2_eupathdb48_20201113.tar.gz  seqid2taxid.map
database100mers.kmer_distrib  database200mers.kmer_distrib  database300mers.kmer_distrib  database75mers.kmer_distrib  inspect.txt  opts.k2d                       taxo.k2d

# launch staphb kraken2 v2.1.2 docker image; fastq files and EuPathDB database files are in PWD
$ docker run --rm=True -u $(id -u):$(id -g) -v $(pwd):/data -ti staphb/kraken2:2.1.2-no-db

# mostly standard parameters
$ kraken2 --db k2-db-EuPathDB48/ --threads 8 --gzip-compressed \
  --paired mixed-sample*.fastq.gz \
  --output mixed-sample.k2-EuPathDB48.out \
  --report mixed-sample.EuPathDB48.report.out

$ kraken2 --db k2-db-EuPathDB48/ --threads 8 --gzip-compressed \
  --paired good-Cauris-sample*.fastq.gz \
  --output good-Cauris-sample.k2-EuPathDB48.out \
  --report good-Cauris-sample.EuPathDB48.report.out

$ head -n 30 mixed-sample.EuPathDB48.report.out
  4.11  119260  119260  U       0       unclassified
 95.89  2785280 0       R       1       root
 95.89  2785280 0       R1      131567    cellular organisms
 95.89  2785280 1935    D       2759        Eukaryota
 95.72  2780203 0       D1      33154         Opisthokonta
 95.72  2780203 386     K       4751            Fungi
 95.70  2779695 282     K1      451864            Dikarya
 95.65  2778276 5       P       4890                Ascomycota
 95.64  2777974 25      P1      716545                saccharomyceta
 95.38  2770257 0       P2      147537                  Saccharomycotina
 95.38  2770257 0       C       4891                      Saccharomycetes
 95.38  2770257 119     O       4892                        Saccharomycetales
 57.73  1676934 0       F       766764                        Debaryomycetaceae
 57.73  1676934 0       F1      1535325                         Candida/Lodderomyces clade
 57.73  1676934 41      G       1535326                           Candida
 57.68  1675417 0       S       5480                                Candida parapsilosis
 57.68  1675417 1675417 S1      578454                                Candida parapsilosis CDC317
  0.04  1294    35      S       5476                                Candida albicans
  0.04  1163    1163    S1      237561                                Candida albicans SC5314
  0.00  96      96      S1      294748                                Candida albicans WO-1
  0.01  182     0       S       5482                                Candida tropicalis
  0.01  182     182     S1      294747                                Candida tropicalis MYA-3404
 37.63  1093076 0       F       27319                         Metschnikowiaceae
 37.63  1093076 517     G       36910                           Clavispora
 37.43  1087135 1289    G1      1540022                           Clavispora/Candida clade
 36.59  1062774 1062774 S       498019                              [Candida] auris
  0.48  13932   13932   S       45357                               [Candida] haemulonis
  0.31  9140    9140    S       1231522                             [Candida] duobushaemulonis
  0.19  5424    0       S       36911                             Clavispora lusitaniae
  0.19  5424    5424    S1      306902                              Clavispora lusitaniae ATCC 42720

$ head -n 30 good-Cauris-sample.EuPathDB48.report.out
  7.12  107209  107209  U       0       unclassified
 92.88  1398000 0       R       1       root
 92.88  1398000 0       R1      131567    cellular organisms
 92.88  1398000 884     D       2759        Eukaryota
 92.71  1395407 0       D1      33154         Opisthokonta
 92.71  1395407 0       K       4751            Fungi
 92.70  1395389 1       K1      451864            Dikarya
 92.70  1395371 3       P       4890                Ascomycota
 92.69  1395139 26      P1      716545                saccharomyceta
 92.52  1392677 0       P2      147537                  Saccharomycotina
 92.52  1392677 0       C       4891                      Saccharomycetes
 92.52  1392677 90      O       4892                        Saccharomycetales
 92.46  1391641 0       F       27319                         Metschnikowiaceae
 92.46  1391641 556     G       36910                           Clavispora
 92.13  1386684 1478    G1      1540022                           Clavispora/Candida clade
 90.84  1367342 1367342 S       498019                              [Candida] auris
  0.73  10992   10992   S       45357                               [Candida] haemulonis
  0.46  6872    6872    S       1231522                             [Candida] duobushaemulonis
  0.29  4401    0       S       36911                             Clavispora lusitaniae
  0.29  4401    4401    S1      306902                              Clavispora lusitaniae ATCC 42720
  0.06  872     0       F       766764                        Debaryomycetaceae
  0.06  872     0       F1      1535325                         Candida/Lodderomyces clade
  0.06  872     1       G       1535326                           Candida
  0.03  499     7       S       5476                                Candida albicans
  0.03  492     492     S1      237561                                Candida albicans SC5314
  0.02  348     0       S       5480                                Candida parapsilosis
  0.02  348     348     S1      578454                                Candida parapsilosis CDC317
  0.00  24      0       S       5482                                Candida tropicalis
  0.00  24      24      S1      294747                                Candida tropicalis MYA-3404
  0.00  58      0       F       4893                          Saccharomycetaceae

Issue with masking step

Describe the bug
Used another pathogen for testing. The Nextflow run stops at the masking step with an error:

-[CDCgov/mycosnp-nf] Pipeline completed with errors-
Error executing process > 'NFCORE_MYCOSNP:MYCOSNP:BWA_REFERENCE:NUCMER (reference)'

Caused by:
Process NFCORE_MYCOSNP:MYCOSNP:BWA_REFERENCE:NUCMER (reference) terminated with an error exit status (140)

Impact
Pipeline completed with error.

To Reproduce
ran mycosnp command:
nextflow run CDCgov/mycosnp-nf --custom_config_base /scicomp/groups/OID/NCEZID/DFWED/MDB/nf-core-custom -profile singularity,sge --input All_input.csv --fasta GCF_000003855.fasta

Expected behavior
The Nextflow run should complete without any error.

Logs
If applicable, please attach logs to help describe your problem.

Execution cancelled -- Finishing pending tasks before exit
-[CDCgov/mycosnp-nf] Pipeline completed with errors-
Error executing process > 'NFCORE_MYCOSNP:MYCOSNP:BWA_REFERENCE:NUCMER (reference)'

Caused by:
Process NFCORE_MYCOSNP:MYCOSNP:BWA_REFERENCE:NUCMER (reference) terminated with an error exit status (140)

`mycosnp_create_sample_sheet.pl` does not include header line

Describe the bug
When running the mycosnp_create_sample_sheet.pl script it runs as expected but does not include the header line (sample,fastq_1,fastq_2)

Impact
When using the produced sample sheet, the first line is skipped. This makes the number of samples processed n-1.

To Reproduce

~/repos/mycosnp-nf/bin/mycosnp_create_sample_sheet.pl -i .
SRR7909249,./SRR7909249_1.fastq.gz,./SRR7909249_2.fastq.gz
SRR7909282,./SRR7909282_1.fastq.gz,./SRR7909282_2.fastq.gz

nextflow run ~/repos/mycosnp-nf/main.nf --input ../samples.csv --ref_dir ../docker/reference/GCA_016772135/
-profile docker --skip_phylogeny
....
[1a/4018ee] process > NFCORE_MYCOSNP:MYCOSNP:INPUT_CHECK:LANE_MERGE (SRR7909249)          [100%] 1 of 1 ✔
...

Expected behavior

~/repos/mycosnp-nf/bin/mycosnp_create_sample_sheet.pl -i .
sample,fastq_1,fastq_2
SRR7909249,./SRR7909249_1.fastq.gz,./SRR7909249_2.fastq.gz
SRR7909282,./SRR7909282_1.fastq.gz,./SRR7909282_2.fastq.gz

nextflow run ~/repos/mycosnp-nf/main.nf --input samples.csv --ref_dir ../docker/reference/GCA_016772135/ -profile docker --skip_phylogeny
...
[3f/fe3dc6] process > NFCORE_MYCOSNP:MYCOSNP:INPUT_CHECK:LANE_MERGE (SRR7909282)          [100%] 2 of 2 ✔
...

Screenshots
If applicable, add screenshots to help explain your problem.

Logs
If applicable, please attach logs to help describe your problem.

Desktop (please complete the following information):

  • OS: Debian Buster

Urgent: downsample_rate.nf incorrectly counts total basepairs and reads

Describe the bug
The calculations for READS_LEN and NUM_READS in downsample_rate.nf count lines that start with the @ symbol. Unfortunately, @ is also a valid quality score character and can occur at the start of a quality line.

Impact
As the number of quality lines that start with @ goes up, READS_LEN becomes more undercounted and NUM_READS becomes more overcounted.

Example

Unaffected FASTQ

@SRR13965548.14
GTCCTAGATGCAGCTCGCCGCCGTGATAATGGCGAAAACATTAGTACAGAAGCATCTTGCGCAAAAATGTTTGCGACTGAAATGTGCGGACGTGTTGCTGACCGCTGTGTACAGATTCACGGTGGTGCGGGCTATATCAGTGAATATGCCA
+
BBBBBFFBF5@45CGB22EEEEEECDHFBHHFBEGEGGHHHHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHHGGGGGHHGHHHHHGGGGGGGGHHHHHHHHGGGGGHHHHHHHHHHHHHGGHGGGHGGGGGHHHHHGHGHHHHHGHGHG
@SRR13965548.15
TGGTTTCTCCTTGACTCTCTCTTAATCTAGCTTCAGATCAAGATAAAAAGGAATACTCCATGAAAAAGCCTATTGTTCTCGTACTTTCAACATTAATGTTGGGTATGTCGGCAACTGCGATGGCCGATTCAAATCATCGTTGGGATAACTA
+
AAAAAFFFFFFF11AFEGEGDGHBBGHHBBGHHHHHBHHHHGGHHHHHHGGHHGHHHHHHHHHHHHHGHHHHHHGHHHHHGGGHHHHHHHHHHHHHHHFHHHGGHHHHHHGGGGHHHHGGHGGHHGGGGHHHFHHHHHHHHGHGHHGHHHH

A FASTQ with a @ at the start of a quality line

@SRR13965548.14
GTCCTAGATGCAGCTCGCCGCCGTGATAATGGCGAAAACATTAGTACAGAAGCATCTTGCGCAAAAATGTTTGCGACTGAAATGTGCGGACGTGTTGCTGACCGCTGTGTACAGATTCACGGTGGTGCGGGCTATATCAGTGAATATGCCA
+
@BBBBFFBF5@45CGB22EEEEEECDHFBHHFBEGEGGHHHHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHHGGGGGHHGHHHHHGGGGGGGGHHHHHHHHGGGGGHHHHHHHHHHHHHGGHGGGHGGGGGHHHHHGHGHHHHHGHGHG
@SRR13965548.15
TGGTTTCTCCTTGACTCTCTCTTAATCTAGCTTCAGATCAAGATAAAAAGGAATACTCCATGAAAAAGCCTATTGTTCTCGTACTTTCAACATTAATGTTGGGTATGTCGGCAACTGCGATGGCCGATTCAAATCATCGTTGGGATAACTA
+
AAAAAFFFFFFF11AFEGEGDGHBBGHHBBGHHHHHBHHHHGGHHHHHHGGHHGHHHHHHHHHHHHHGHHHHHHGHHHHHGGGHHHHHHHHHHHHHHHFHHHGGHHHHHHGGGGHHHHGGHGGHHGGGGHHHFHHHHHHHHGHGHHGHHHH

Using the calculations for READS_LEN and NUM_READS

READS_LEN=\$(zcat ${reads} | awk '/^@/ {getline; len+=length(\$0)} END {print len}')
NUM_READS=\$(zcat ${reads[0]} | awk '/^@/ {lines+=1} END {print lines}')

With the unaffected fastq we get:

# READS_LEN
cat good.fastq | awk '/^@/ {getline; len+=length($0)} END {print len}'
302

# NUM_READS
cat good.fastq | awk '/^@/ {lines+=1} END {print lines}'
2

For the FASTQ with an @ symbol we get:

# READS_LEN
cat bad.fastq | awk '/^@/ {getline; len+=length($0)} END {print len}'
166

# NUM_READS
cat bad.fastq | awk '/^@/ {lines+=1} END {print lines}'
3

When a quality line starts with @, the awk ends up adding the length of the following header line rather than the read length (and the next read's sequence is never counted at all). In the example above this caused a difference of 136 base pairs between the two FASTQs. It also caused the read count to increase.

Currently, whether --rate or --coverage is used or not, every sample goes through the DOWNSAMPLE_RATE process and is affected by this, making any validation based on these numbers inaccurate.
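
One way to make the counts robust (a sketch with hypothetical file names) is to count by record position rather than by the leading @:

    # every 1st line of a 4-line FASTQ record is the header, every 2nd line is the sequence
    NUM_READS=$(zcat sample_R1.paired.fastq.gz | awk 'NR % 4 == 1 {n++} END {print n}')
    READS_LEN=$(zcat sample_R1.paired.fastq.gz sample_R2.paired.fastq.gz | awk 'NR % 4 == 2 {len += length($0)} END {print len}')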

Reference genome inherited at positions with zero read support

Describe the bug
In the current consensus genome assembly processes, positions with low coverage are being filled in with reference genome rather than being noted as positions of ambiguity.

Impact
Downstream analysis on MycoSNP assemblies may be misleading, especially for samples that have been poorly sequenced.

To Reproduce
Steps to reproduce the behavior:

  • Input raw read data for a different organism than the defined reference genome and the output consensus fasta file will be a near-match to the defined reference genome, e.g. if E. coli read data are used as input with a C. auris reference genome, MycoSNP will generate a C. auris assembly that passes all suggested QC criteria.

Expected behavior
Positions of low coverage should be noted with an N to denote ambiguity at the position rather than assuming a conserved site relative to the reference.

Screenshots
This was made apparent when we began calculating assembly size and number of Ns in the MycoSNP assemblies using the Global Microbial Identifier - WG3 C.auris dataset; all samples from the dataset--as well as a sample of E.coli read data--contained the same exact number of Ns:

Screen Shot 2022-04-18 at 10 48 01 AM

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version v1.1

Additional context
IMHO, this is a critical issue that needs to be addressed as results could be critically misleading for laboratories performing further sample characterization and/or genomic epidemiology analysis using MycoSNP outputs.

Running via docker on Mac M1

When running the pipeline (test) locally on the Mac M1 chip computers, we get the following docker error (well-known by now for M1 users):

Command:
nextflow run CDCgov/mycosnp-nf -profile test,docker
Gives the following error:

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

Impact
Can't run locally on the laptop!

Desktop (please complete the following information):

  • OS: MacBook Pro Apple M1 Max, Memory: 32 GB

Additional context
Some info online:
https://stackoverflow.com/questions/66662820/m1-docker-preview-and-keycloak-images-platform-linux-amd64-does-not-match-th
https://github.com/google/cadvisor/issues/2763
Happy to help test this with y'all and work together to solve the issue!
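
One workaround that may be worth trying (untested here) is forcing amd64 emulation for the containers Docker pulls, for example by exporting Docker's default platform before launching Nextflow:

    # ask Docker Desktop to run amd64 images under emulation on Apple silicon
    export DOCKER_DEFAULT_PLATFORM=linux/amd64
    nextflow run CDCgov/mycosnp-nf -profile test,docker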

Fetch SRA ids from NCBI

It would be helpful to have a way to fetch SRA IDs for samples that can be used as control samples.
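
The pipeline already accepts an SRA list via --add_sra_file (used elsewhere on this page), so something like the following may already cover this, assuming a CSV of run accessions in the format the usage docs describe:

    nextflow run CDCgov/mycosnp-nf -profile docker --add_sra_file sra_controls.csv --fasta c_auris.fasta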

add checkexists for params.skip_samples_files in the beginning


Read count differences for few samples

Describe the bug

The read count in the qc_report.txt does not match the real read count of the fastq input files.
This varies from sample to sample and suggests an issue with the way the reads are counted.

Impact
discarding wrong samples for downstream analysis.

To Reproduce
Since the qc_report.txt is created after the downsampling step, the issue lies in the downsample_rate.nf file. See the screenshot.

Change the command for counting reads from a FASTQ file in downsample_rate.nf from:
zcat BXXXXX_1.fastq.gz|awk '/^@/ {lines+=1} END {print lines}'
1616686
to
echo $(zcat BXXXXX_1.fastq.gz|wc -l)/4|bc
1607577

Expected behavior
Correct read counts and other statistics in the qc_report should be helpful in removing bad samples only.

Screenshots
Screenshot (80)

Output vcf-to-fasta sequences of different lengths

In the output vcf-to-fasta.fasta, MycoSNP GeneFlow always produces sequences of exactly the same length. The short sequences provided to test mycosnp-nf also produced alignments of the same length, but with a larger number of samples (138), mycosnp-nf generated sequences of different lengths.
135counts.csv

Updating QC_REPORT with mean depth and % mapped reads

Is your feature request related to a problem? Please describe.
It may be beneficial to include mean depth and percent mapped reads in the QC report, either replacing or in addition to the coverage metric. These are better metrics for determining the coverage of samples relative to the reference.

Describe the solution you'd like
One way this could be done:

  • Move the QC_REPORT module down stream in the BWA_PREPROCESS subworkflow.
  • Update the current QC_REPORT module with inputs from SAMTOOLS_STATS, SAMTOOLS_FLAGSTAT, and/or QUALIMAP_BAMQC.
  • Update script in QC_REPORT and QC_REPORTSHEET modules to include these new metrics.

Describe alternatives you've considered
Another alternative is adding a SAMTOOLS_COVERAGE module to calculate mean depth and percent reads mapped, and using the output from this module as the input for these metrics in the QC_REPORT module.

Additional context
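For illustration, both metrics can be pulled directly from a per-sample BAM with samtools (a sketch with a hypothetical file name):

    # the 'meandepth' column gives mean depth per contig; flagstat reports the percentage of mapped reads
    samtools coverage sample.final.bam
    samtools flagstat sample.final.bam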

changes to the paths of custom scripts in local modules for cloud computing

Is your feature request related to a problem? Please describe.

When running the CDCgov/mycosnp-nf v 1.4 pipeline in the cloud (AWS-Batch through Nextflow-tower) an error is raised due to not being able to find the custom scripts used in local modules (i.e. no such file or directory).
I believe this is happening because $projectDir is used as part of the file path for these scripts. Locally this would be fine.

Describe the solution you'd like

A potential solution for this issue could be:

  1. Move all custom scripts to the bin folder, if not there already
    - move scripts and contents of broad-vcf-filter folder up one directory into the bin folder.
  2. Execute scripts by name (i.e. qc_report_stats.py) in local modules instead of using python command and path to script (i.e. $projectDir/bin/qc_report_stats.py).
  3. Update any instances of changed file paths for custom scripts.


Additional context

This possible solution works locally and in the cloud with aws-batch.
I can submit a PR with these changes if this makes sense.
Please reach out if there is a better solution to this issue and if there is a reason for the current way the custom scripts are called in local modules.

Thanks and sincerely,

CJ

raxml erroring out with `skip_samples` param

Describe the bug

Error executing process > 'NFCORE_MYCOSNP:MYCOSNP:CREATE_PHYLOGENY:RAXMLNG (1)'

Caused by:
  Process `NFCORE_MYCOSNP:MYCOSNP:CREATE_PHYLOGENY:RAXMLNG (1)` terminated with an error exit status (1)

Command executed:

  raxml-ng \
      --all --model GTR+G --bs-trees 1000 \
      --msa combined_vcf-to-fasta.fasta \
      --threads 2 \
      --prefix output
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_MYCOSNP:MYCOSNP:CREATE_PHYLOGENY:RAXMLNG":
      raxmlng: $(echo $(raxml-ng --version 2>&1) | sed 's/^.*RAxML-NG v. //; s/released.*$//')
  END_VERSIONS

Command exit status:
  1

Command output:
  
  RAxML-NG v. 1.0.3 released on 21.07.2021 by The Exelixis Lab.
  Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
  Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
  Latest version: https://github.com/amkozlov/raxml-ng
  Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml
  
  System: Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz, 4 cores, 7 GB RAM
  
  RAxML-NG was called at 09-Mar-2022 18:44:28 as follows:
  
  raxml-ng --all --model GTR+G --bs-trees 1000 --msa combined_vcf-to-fasta.fasta --threads 2 --prefix output
  
  Analysis options:
    run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)
    start tree(s): random (10) + parsimony (10)
    bootstrap replicates: 1000
    random seed: 1646869468
    tip-inner: OFF
    pattern compression: ON
    per-rate scalers: OFF
    site repeats: ON
    branch lengths: proportional (ML estimate, algorithm: NR-FAST)
    SIMD kernels: AVX
    parallelization: coarse-grained (auto), PTHREADS (2 threads), thread pinning: OFF
  
  [00:00:00] Reading alignment from file: combined_vcf-to-fasta.fasta
  [00:00:00] Loaded alignment with 3 taxa and 45 sites
  
  ERROR: Your alignment contains less than 4 sequences! 
  
  ERROR: Alignment check failed (see details above)!

Command wrapper:
  (same RAxML-NG output and alignment-check error as above)

Work dir:
  /mycosnp-nf/work/b3/8cb204c43b712f83738a16d8ed3b6a

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Impact

To Reproduce
Steps to reproduce the behavior:
nextflow run main.nf -profile singularity,test --skip_samples ERR2172266 -resume

Read count for the same sample varies in the qc_report

Describe the bug
When the same samples were used for two different runs, the number of reads before trimming and after trimming varied between the runs.

Impact
The filtered read counts used for downstream alignment and variant calling will not be correct.

To Reproduce

Ran the command to run mycosnp/nextflow and checked the qc_report.txt file.

Screenshot (71)

Expected behavior
The qc_report for these samples from both the runs should be identical as the same reference was used.

Screenshots

Screenshot (72)

qc_report_run1.txt

qc_report_run2.txt

FEATURE ADD: multiple workflow entry points


Prefer system TMPDIR over project directory?

Currently, mycosnp-nf defaults --tmpdir to "$projectDir/tmp", which in cloud settings might be on a space limited partition.

I realize I can just do --tmpdir $TMPDIR , but would you consider setting the default to the system wide environment variable and falling back on the project directory in the event TMPDIR is not set?

Cheers,
Robert

Add snpEff sub-workflow for variant annotation

Is your feature request related to a problem? Please describe.
Addition of a new feature to create richer, more informative output reports from the variant calling steps of MycoSNP.

Describe the solution you'd like
Would like to annotate potential variants of interest in CDS regions.

Change default value of --min_depth

Is your feature request related to a problem? Please describe.
The min_depth parameter is fixed at 50. After a number of tests, we realized this value is very conservative and prevents a number of SNPs from being included in the final vcf-to-fasta file. Eventually, this affects the final tree output.

Describe the solution you'd like
Change the default value of the --min_depth parameter to 10.
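
In the meantime, the value can be overridden on the command line like any other pipeline parameter, for example:

    nextflow run CDCgov/mycosnp-nf -profile docker --input samplesheet.csv --fasta c_auris.fasta --min_depth 10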

update configuration file for vcftools_filter clarity

The current nextflow.config lists filter criteria for filterGatkGenotypes.py under the param vcftools_filter. It would be good to have a separate gatkgenotypes_filter so that advanced users know which syntax to use for filtering.

// Mycosnp options
save_reference             = true
save_alignment             = true
sample_ploidy              = 1
coverage                   = 70 // Desired sample coverage used in seqtk_sample
rate                       = ''
gvcfs_filter               = 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || DP < 10'
vcftools_filter            = '--min_GQ "50" --keep_GQ_0_refs --min_percent_alt_in_AD "0.8" --min_total_DP "10" --keep_all_ref'
max_amb_samples            = 10000000
max_perc_amb_samples       = 10
publish_dir_mode           = 'copy'
rapidnj                    = true
fasttree                   = true
iqtree                     = false
raxmlng                    = false
save_debug                 = false
mask                       = true
tmpdir                     = "$projectDir/tmp"
skip_samples               = ""
skip_samples_file          = null
skip_combined_analysis     = false
skip_phylogeny             = false

This would result in two params options, vcftools_filter and gatkgenotypes_filter, and provide better insight into how the filter criteria are applied within the workflow.

modify `DOWNSAMPLE_RATE` module to `shell` block from `script`

Caused by:
Process `NFCORE_MYCOSNP:MYCOSNP:BWA_PREPROCESS:DOWNSAMPLE_RATE (B16329_S8)` terminated with an error exit status (2)

Command executed:

REFERENCE_LEN=$(awk '!/^>/ {len+=length($0)} END {print len}' < reference.fa)
READS_LEN=$(zcat B16329_S8_R1.paired.fastq.gz B16329_S8_R2.paired.fastq.gz | awk '/^@/ {getline; len+=length($0)} END {print len}')
SAMPLE_RATE=$(echo "70 ${READS_LEN} ${REFERENCE_LEN}" | awk '{x=$1/($2/$3); x=(1<x?1:x)} END {print x}')

# Calculate number of reads
NUM_READS=$(zcat B16329_S8_R1.paired.fastq.gz | awk '/^@/ {lines+=1} END {print lines}')
SAMPLED_NUM_READS=$(echo "${NUM_READS} ${SAMPLE_RATE}" | awk '{x=$1*$2} END {printf "%.0f", x}')

Command exit status:
2

Command output:
(empty)

Command error:
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted

seen when samples are located in a GWA

related module: DOWNSAMPLE_RATE

also recommend switching from script to shell in the process block for DOWNSAMPLE_RATE
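
A small defensive tweak to the rate calculation would also avoid the hard failure when no read bases are counted (a sketch only; falling back to a rate of 1, i.e. no downsampling, is an assumption):

    # guard against READS_LEN being zero or empty before dividing
    SAMPLE_RATE=$(echo "70 ${READS_LEN:-0} ${REFERENCE_LEN}" | awk '{x = ($2 > 0) ? $1 / ($2 / $3) : 1; x = (1 < x ? 1 : x)} END {print x}')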

Default params.rate runs unneeded processes

Describe the bug
The nextflow.config defines params.rate = 1; this parameter affects the behavior of the local module DOWNSAMPLE_RATE, which also sends variables to SEQTK_SAMPLE to subsample the FASTQ files. When params.rate = 1, DOWNSAMPLE_RATE tells SEQTK_SAMPLE to randomly sample the exact number of reads available.

Impact
At the very least this is wasted computation that should be avoided. It has also caused some cloud computing problems for my organization, UPHL, where the SEQTK_SAMPLE process uses lots of resources, causing time and cost issues.

To Reproduce
This is the default behavior of the workflow; using the nextflow.config file.

Expected behavior
I would think that the default config does not need to change, but the sub-workflow could be restructured to avoid this issue. Something like:
mycosnp-nf/subworkflows/local/bwa-pre-process.nf
line 43-50

SEQKIT_PAIR(reads)
if (params.rate == 1) {
    FAQCS(SEQKIT_PAIR.out.reads)
} else {
    DOWNSAMPLE_RATE(SEQKIT_PAIR.out.reads, reference[0], params.coverage, params.rate)

    ch_seq_samplerate = SEQKIT_PAIR.out.reads.join(DOWNSAMPLE_RATE.out.downsampled_rate.map{ meta, sr, snr -> [ meta, snr ] })

    SEQTK_SAMPLE(ch_seq_samplerate)
    FAQCS(SEQTK_SAMPLE.out.reads)
}
BWA_MEM(FAQCS.out.reads, reference[2], true)

Lines 69 to 80 would also need to be modified because SEQTK_SAMPLE.out.versions would be undefined, maybe SEQTK_SAMPLE.out.versions.ifEmpty('NA').

Additional Details
If my understanding is correct and you agree with the issue, I would be happy to submit a PR to address it. Thank you.
