mlst-nf's Introduction

mlst-nf

A Nextflow pipeline for running mlst on a set of assemblies.

flowchart TD
  assembly --> quast(quast)
  quast --> assembly_qc
  assembly --> mlst(mlst)
  mlst --> mlst.json
  mlst --> parse_alleles(parse_alleles)
  parse_alleles --> alleles.csv
  parse_alleles --> sequence_type.csv

Usage

nextflow run BCCDC-PHL/mlst-nf \
  --assembly_input </path/to/assemblies> \
  --outdir </path/to/outdir>

The pipeline also supports a 'samplesheet input' mode. Pass a samplesheet.csv file with the headers ID, ASSEMBLY:

nextflow run BCCDC-PHL/mlst-nf \
  --samplesheet_input </path/to/samplesheet.csv> \
  --outdir </path/to/outdir>
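
A samplesheet can be written by hand, but for a directory of assemblies it may be easier to generate one. The following is a minimal sketch, not part of the pipeline: the function name `build_samplesheet`, the list of accepted extensions, and the choice to use the filename without its extension as the sample ID are all assumptions made for illustration.

```python
import csv
from pathlib import Path

# Assumption: assemblies are uncompressed FASTA files with one of these extensions.
FASTA_EXTENSIONS = (".fa", ".fasta", ".fna")


def build_samplesheet(assembly_dir, samplesheet_path):
    """Write a samplesheet with ID and ASSEMBLY columns; return the number of rows."""
    rows = []
    for path in sorted(Path(assembly_dir).iterdir()):
        if path.is_file() and path.suffix in FASTA_EXTENSIONS:
            # Assumption: the sample ID is the filename without its extension.
            rows.append({"ID": path.stem, "ASSEMBLY": str(path.resolve())})
    with open(samplesheet_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["ID", "ASSEMBLY"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The resulting file can then be passed to the pipeline with --samplesheet_input.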

Outputs

Outputs for each sample will be written to a separate directory under the output directory, named using the sample ID.

The following output files are produced for each sample.

sample-01
├── sample-01_20211202154752_provenance.yml
├── sample-01_alleles.csv
├── sample-01_mlst.json
└── sample-01_sequence_type.csv

If the --versioned_outdir flag is used, then a sub-directory will be created below each sample, named with the pipeline name and minor version:

sample-01
    └── mlst-nf-v0.1-output
        ├── sample-01_20211202154752_provenance.yml
        ├── sample-01_alleles.csv
        ├── sample-01_mlst.json
        └── sample-01_sequence_type.csv
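
To make the naming rule concrete, here is a small sketch of how such a sub-directory name could be derived from the pipeline name and version. The helper `versioned_outdir_name` is hypothetical and only mirrors the pattern shown in the listing above; it is not taken from the pipeline's code.

```python
def versioned_outdir_name(pipeline_name, pipeline_version):
    """Build a name like 'mlst-nf-v0.1-output' from a pipeline name and version."""
    short_name = pipeline_name.split("/")[-1]  # drop the organisation prefix
    major, minor = pipeline_version.split(".")[:2]  # keep major.minor, drop the patch
    return f"{short_name}-v{major}.{minor}-output"


print(versioned_outdir_name("BCCDC-PHL/mlst-nf", "0.1.4"))  # mlst-nf-v0.1-output
```

Because the patch number is dropped, patch releases would write into the same sub-directory under this scheme.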

This makes it possible to keep the outputs of several different pipelines, or of re-analyses with later versions of this pipeline, side by side:

sample-01
    ├── mlst-nf-v0.1-output
    │   ├── sample-01_20211202154752_provenance.yml
    │   ├── sample-01_alleles.csv
    │   ├── sample-01_mlst.json
    │   └── sample-01_sequence_type.csv
    └── mlst-nf-v0.2-output
        ├── sample-01_20220321113128_provenance.yml
        ├── sample-01_alleles.csv
        ├── sample-01_mlst.json
        └── sample-01_sequence_type.csv

The mlst.json output is generated directly by the mlst tool. It has the following format:

[
   {
      "scheme" : "sepidermidis",
      "alleles" : {
         "mutS" : "1",
         "yqiL" : "1",
         "tpiA" : "1",
         "pyrR" : "2",
         "gtr" : "2",
         "aroE" : "1",
         "arcC" : "16"
      },
      "sequence_type" : "184",
      "filename" : "test/example.gbk.gz",
      "id" : "test/example.gbk.gz"
   }
]

The alleles.csv file is generated from the .json output. It includes two boolean (True/False) fields that indicate whether each allele call is a perfect match or a novel allele, based on the presence of ? or ~ characters in the allele calls, as described here.

The per-locus score field is computed based on the rules described here.

The fields in the alleles.csv output are:

sample_id
scheme
locus
allele
perfect_match
novel_allele
score
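
As a rough illustration of how the boolean fields could be derived from the .json output, the sketch below flattens the list-of-records layout shown earlier into one row per locus. It is not the pipeline's parse_alleles.py: the function name `parse_allele_rows` is hypothetical, treating a `-` call as "not a perfect match" is an assumption, and the score field is left out because its rules are not reproduced in this document.

```python
import json


def parse_allele_rows(mlst_json_text, sample_id):
    """Return one dict per locus with perfect_match and novel_allele flags."""
    rows = []
    for record in json.loads(mlst_json_text):
        for locus, allele in record["alleles"].items():
            rows.append({
                "sample_id": sample_id,
                "scheme": record["scheme"],
                "locus": locus,
                "allele": allele,
                # Assumption: any '?', '~' or '-' in the call means it is not a perfect match.
                "perfect_match": not any(c in allele for c in "?~-"),
                # A '~' marks a novel allele call.
                "novel_allele": "~" in allele,
            })
    return rows
```

For the example record above, every locus would be reported with perfect_match set to True and novel_allele set to False, since none of the calls contain ? or ~.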

The sequence_type.csv file includes an overall sequence type ID based on the allele calls for each locus, and the overall score, which is simply the sum of the per-locus scores for the sample.

sample_id
scheme
sequence_type
score
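
The relationship between the two files can be sketched as a simple aggregation. The example below sums the per-locus score column of an alleles.csv, grouped by sample_id; the function name `total_scores` is hypothetical, and the code only illustrates the "sum of the per-locus scores" rule stated above rather than the pipeline's own implementation.

```python
import csv
import io
from collections import defaultdict


def total_scores(alleles_csv_text):
    """Sum the per-locus 'score' column for each sample_id."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(alleles_csv_text)):
        totals[row["sample_id"]] += float(row["score"])
    return dict(totals)
```

Passing the text of a sample's alleles.csv returns a dictionary mapping the sample ID to its overall score.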

Provenance

Each analysis will create a provenance.yml file for each sample. The filename of the provenance.yml file includes a timestamp with format YYYYMMDDHHMMSS to ensure that a unique file will be produced if a sample is re-analyzed and outputs are stored to the same directory.

- pipeline_name: BCCDC-PHL/mlst-nf
  pipeline_version: 0.1.4
  nextflow_session_id: f18b89aa-06f7-41e4-b016-3519dfd5a5cb
  nextflow_run_name: sharp_bhaskara
  timestamp_analysis_start: 2024-02-20T22:59:37.862710
- input_filename: NC-000913.3.fa
  input_path: /home/runner/work/mlst-nf/mlst-nf/.github/data/assemblies/NC-000913.3.fa
  sha256: 6b195feda4c66140f6762742eb8b30c2652f02b45878b174f5b00ef85ecc95d7
- process_name: mlst
  tools:
    - tool_name: mlst
      tool_version: 2.16.1
      parameters:
      - parameter: minid
        value: 95
      - parameter: mincov
        value: 10
      - parameter: minscore
        value: 50
- process_name: quast
  tools:
    - tool_name: quast
      tool_version: 5.0.2
      parameters:
        - parameter: --space-efficient
          value: null
        - parameter: --fast
          value: null
        - parameter: --min-contig
          value: 0
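
The timestamped filename and the sha256 field can be reproduced with the Python standard library. This is a sketch for checking or recreating those values, assuming the filename pattern shown in the output listings above; `provenance_filename` and `sha256_of_file` are hypothetical helper names, not functions from the pipeline.

```python
import hashlib
from datetime import datetime


def provenance_filename(sample_id, when):
    """Build '<sample_id>_<YYYYMMDDHHMMSS>_provenance.yml'."""
    return f"{sample_id}_{when.strftime('%Y%m%d%H%M%S')}_provenance.yml"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


print(provenance_filename("sample-01", datetime(2021, 12, 2, 15, 47, 52)))
# sample-01_20211202154752_provenance.yml
```

Comparing `sha256_of_file` on an input assembly with the sha256 value recorded in the provenance file is one way to confirm that the same input was used in a re-analysis.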

mlst-nf's People

Contributors

dfornika

mlst-nf's Issues

Pipeline fails on low-quality assembly

Quast will fail when given assemblies with no contig greater than 500bp, which causes the pipeline to fail. One poor-quality sample could crash a full run, so it would make the overall pipeline more robust if we can prevent the pipeline from crashing in the presence of a single low-quality sample.
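
One possible guard, sketched here purely as an illustration and not necessarily the approach taken in the pipeline, is to check the longest contig in each assembly before running QC. The function names and the assumption that assemblies are uncompressed FASTA files are mine; the 500 bp threshold comes from the description above.

```python
def longest_contig_length(fasta_path):
    """Return the length of the longest sequence in an uncompressed FASTA file."""
    longest = 0
    current = 0
    with open(fasta_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record.
                longest = max(longest, current)
                current = 0
            else:
                current += len(line)
    return max(longest, current)


def has_contig_longer_than(fasta_path, min_length=500):
    """True if at least one contig is longer than min_length bases."""
    return longest_contig_length(fasta_path) > min_length
```

Samples for which `has_contig_longer_than` returns False could be skipped or flagged instead of being allowed to stop the run.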

Adopt nf-core conventions

In anticipation of integrating with tools and platforms like Seqera Platform, we'd like to evaluate what would be necessary to adopt the nf-core conventions for our existing pipelines. Since this is a fairly simple pipeline, it's a good candidate for conversion to nf-core.

Add optional versioned output directory

The pipeline currently creates one output directory per sample and publishes all outputs there, e.g.:

publishDir "${params.outdir}/${sample_id}", mode: 'copy', pattern: "${sample_id}_mlst.json"

When combining this pipeline with others, it may be useful to encapsulate the outputs from this pipeline in a sub-directory that is named with the pipeline name and version.

So by default we would create outputs of this structure:

.
├── sample-01
│   ├── sample-01_alleles.csv
│   └── sample-01_sequence_type.csv
├── sample-02
│   ├── sample-02_alleles.csv
│   └── sample-02_sequence_type.csv
└── sample-03
    ├── sample-03_alleles.csv
    └── sample-03_sequence_type.csv

...but when running with a --versioned_outdir flag, we would produce:

.
├── sample-01
│   └── mlst-nf-v0.1-output
│       ├── sample-01_alleles.csv
│       └── sample-01_sequence_type.csv
├── sample-02
│   └── mlst-nf-v0.1-output
│       ├── sample-02_alleles.csv
│       └── sample-02_sequence_type.csv
└── sample-03
    └── mlst-nf-v0.1-output
        ├── sample-03_alleles.csv
        └── sample-03_sequence_type.csv

...then a subsequent analysis could produce similar outputs alongside:

.
├── sample-01
│   ├── mlst-nf-v0.1-output
│   │   └── sample-01_mlst.csv
│   └── routine-assembly-v0.2-output
│       ├── sample-01_bakta.gbk
│       └── sample-01_unicycler.fa
├── sample-02
│   ├── mlst-nf-v0.1-output
│   │   └── sample-02_mlst.csv
│   └── routine-assembly-v0.2-output
│       ├── sample-02_bakta.gbk
│       └── sample-02_unicycler.fa
└── sample-03
    ├── mlst-nf-v0.1-output
    │   └── sample-03_mlst.csv
    └── routine-assembly-v0.2-output
        ├── sample-03_bakta.gbk
        └── sample-03_unicycler.fa

Make input QC optional

There are cases where we run this pipeline on the outputs of another pipeline (generally BCCDC-PHL/routine-assembly). That pipeline may already perform QC on its outputs, so running essentially the same QC on the inputs of this pipeline would be redundant.

Add a --skip_input_qc flag that causes the QUAST analysis on the input assemblies to be skipped.

Add support for `--collect_outputs`

We currently only generate a separate output directory for each sample. But it would be convenient to collect the sequence types for all samples into a single .csv file as well. The user should be able to specify a prefix for the collected outputs, using a --collected_outputs_prefix flag, whose default value is collected.
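
The collection step described here could look roughly like the sketch below, which gathers every per-sample sequence_type.csv under the output directory into one file. The function name `collect_sequence_types` and the output filename pattern `<prefix>_sequence_type.csv` are assumptions for illustration; only the default prefix `collected` comes from the proposal above.

```python
import csv
from pathlib import Path

FIELDNAMES = ["sample_id", "scheme", "sequence_type", "score"]


def collect_sequence_types(outdir, prefix="collected"):
    """Concatenate <outdir>/<sample>/*_sequence_type.csv into a single CSV file."""
    outdir = Path(outdir)
    rows = []
    # Only look one level down, so the collected file itself is never re-read.
    for path in sorted(outdir.glob("*/*_sequence_type.csv")):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    # Assumption: the collected file is written at the top of the output directory.
    collected_path = outdir / f"{prefix}_sequence_type.csv"
    with open(collected_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return collected_path
```

This sketch assumes the default (unversioned) output layout, where each sample directory directly contains its sequence_type.csv file.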

Remove `versioned_outdir` param

The versioned_outdir param hasn't proven to be useful, and it clutters up our publishDir directives.

Remove the versioned_outdir param.

`parse_alleles.py` fails when no alleles included in mlst output

Command error:
  Traceback (most recent call last):
    File "/home/dfornika/.nextflow/assets/BCCDC-PHL/mlst-nf/bin/parse_alleles.py", line 78, in <module>
      main(args)
    File "/home/dfornika/.nextflow/assets/BCCDC-PHL/mlst-nf/bin/parse_alleles.py", line 29, in main
      num_alleles = len(mlst[sample]['alleles'])
  TypeError: object of type 'NoneType' has no len()

The JSON output from mlst was:

{
   "sample-X.fa" : {
      "scheme" : "-",
      "sequence_type" : "-",
      "alleles" : null,
      "filename" : "sample-X.fa"
   }
}
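
The traceback points at calling len() on a null alleles value. A defensive pattern for this case is sketched below; it is an illustration of one possible guard, not the fix actually applied to parse_alleles.py, and the function name `count_alleles` is hypothetical. It assumes the filename-keyed JSON layout shown in this report.

```python
import json


def count_alleles(mlst_json_text):
    """Count alleles per sample, treating a null 'alleles' value as empty."""
    mlst = json.loads(mlst_json_text)
    counts = {}
    for sample, record in mlst.items():
        # JSON null becomes None in Python; fall back to an empty dict.
        alleles = record.get("alleles") or {}
        counts[sample] = len(alleles)
    return counts
```

With the JSON above, this returns a count of 0 for sample-X.fa instead of raising a TypeError.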
