Snipe

Snipe: Highly sensitive pathogen detection from metagenomic sequencing data

Introduction

Snipe (SeNsItive Pathogen dEtection) is a pipeline that improves the ability of existing strain-typing tools to detect common pathogens at low abundance in contaminated food samples. Snipe consists of three core modules:

  • The snipeMap module maps unassembled metagenomic reads against a target library and removes sequences that align to the filter and host libraries.
  • The snipeId module reassigns ambiguous reads, identifies the microbial strains present in the sample, and estimates the proportion of reads from each genome.
  • The snipeRec module aligns raw reads to species-specific regions (SSRs) and generates reports containing the proportion of reads assigned to each genome after rectification by the a posteriori probabilities.

Support and Contact

For any issues or concerns, please contact us at [email protected]

Pathogenic Species Supported

Species name                 Number of SSRs

Escherichia coli             98
Salmonella enterica          387
Staphylococcus aureus        169
Listeria monocytogenes       157
Campylobacter jejuni         132
Vibrio cholerae              261
Vibrio parahaemolyticus      1206
Proteus mirabilis            1122
Yersinia enterocolitica      141
Clostridium perfringens      2377

Software Dependencies

It is recommended to create a new conda environment:

conda create -n python37 python=3.7

# Activate this environment:
conda activate python37

Then install the following dependencies in the activated environment:
   • numpy (v1.15.0)
        conda install -c conda-forge numpy
   • pandas (v0.24.2)
        conda install -c conda-forge pandas
   • bowtie2 (v2.2.5)
        conda install -c bioconda bowtie2
   • pysam (v0.15.3)
        conda install -c bioconda pysam 

Manual

First, change directory (cd) to the snipe folder and call the help for details:

    cd ../snipe
    python snipe.py -h
    

MAP

The MAP module needs a database of strain genomes, which can be downloaded from NCBI. Make sure the index has been built beforehand; otherwise the software will first take a moment to build it.

Call the snipeMap module help for details:

python snipe.py MAP -h

python snipe.py MAP -1 map_inputread1 -2 map_inputread2 -targetRefFiles map_targetRef -filterRefFiles map_filterRef -indexDir map_indexdir -outDir map_outdir -outAlign map_outalign -expTag map_exp_tag -numThreads map_numthreads

Required arguments:

-1,              string                    Input Read Fastq File (Pair 1)

-2,              string                    Input Read Fastq File (Pair 2)

-targetRefFiles, string                    Target Reference Genome Fasta Files Full Path (Comma Separated)

-filterRefFiles, string                    Filter Reference Genome Fasta Files Full Path (Comma Separated)

-outAlign,       string                    Output Alignment File Name (Default=outalign.sam)

-expTag,         string                    Experiment Tag added to files generated for identification



Optional arguments:

-outDir,         string                    Output Directory (Default=. (current directory))

-indexDir,       string                    Index Directory (Default=. (current directory))

-numThreads,     int                       Number of threads to use by aligner (bowtie2) if different from default (8)
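
The reference arguments take several FASTA paths joined by commas. As a minimal sketch of how such a command line could be assembled, the helper below builds the MAP invocation as an argument list. The function name `build_map_command` and all file paths are hypothetical and used for illustration only; the flags are the ones listed above.

```python
import shlex


def build_map_command(read1, read2, target_refs, filter_refs,
                      out_align, exp_tag, index_dir=".", out_dir=".",
                      num_threads=8, script="snipe.py"):
    """Return the MAP invocation as an argument list (hypothetical helper)."""
    return [
        "python", script, "MAP",
        "-1", read1,
        "-2", read2,
        # Several FASTA files are passed as one comma-separated string.
        "-targetRefFiles", ",".join(target_refs),
        "-filterRefFiles", ",".join(filter_refs),
        "-indexDir", index_dir,
        "-outDir", out_dir,
        "-outAlign", out_align,
        "-expTag", exp_tag,
        "-numThreads", str(num_threads),
    ]


# Placeholder paths, not files shipped with Snipe.
cmd = build_map_command(
    "sample_R1.fastq", "sample_R2.fastq",
    target_refs=["/data/ref/target_a.fna", "/data/ref/target_b.fna"],
    filter_refs=["/data/ref/host.fna"],
    out_align="sample.sam", exp_tag="sample",
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be handed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual file names.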

ID

Make sure the MAP module has finished first. The ID module uses the .sam file generated by the MAP module.

Call the snipeId module help for details:

python snipe.py ID -h

python snipe.py ID -outDir id_outdir -alignFile id_ali_file -expTag id_exp_tag

Required arguments:

-alignFile,      string                    Alignment file path

-expTag,         string                    Experiment tag added to output file for easy identification

Optional arguments:

-outDir,         string                    Output Directory (Default=. (current directory))
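
Before running ID, it can help to confirm that the alignment file actually contains mapped reads. The sketch below is not part of Snipe; it only relies on the standard SAM layout (header lines start with `@`, column 2 is the FLAG, column 3 is the reference name, and FLAG bit 0x4 marks an unmapped read). The function name and the demo records are made up for illustration.

```python
from collections import Counter


def count_mapped_reads(sam_lines):
    """Count mapped alignment records per reference name in SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":
            continue  # FLAG bit 0x4 marks an unmapped read
        counts[rname] += 1
    return counts


# Tiny in-memory SAM excerpt (made-up records).
demo_sam = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "@SQ\tSN:genomeA\tLN:1000\n",
    "@SQ\tSN:genomeB\tLN:1000\n",
    "r1\t0\tgenomeA\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "r2\t16\tgenomeA\t50\t42\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "r3\t0\tgenomeB\t7\t30\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n",
]
mapped = count_mapped_reads(demo_sam)
for name, n in sorted(mapped.items()):
    print(name, n)
```

For a real run, pass an open file handle instead of the list, for example `with open("demo.sam") as fh: count_mapped_reads(fh)`.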

REC

Make sure the SSR index has been built before running the REC module.

Call the snipeRec module help for details:

python snipe.py REC -h

python snipe.py REC -ssrRef map_ssrRefDir -1 rec_inputread1 -2 rec_inputread2 -idReport id_ali_file -dictTarget targetInfo_dict -dictTemplate file3 -expTag rec_exp_tag -outDir path2 -numThreads 1

Required arguments:

-ssrRef,         string                    the directory of the species specific regions

-1,              string                    Input Read Fastq File (Pair 1)

-2,              string                    Input Read Fastq File (Pair 2)

-idReport,       string                    report file generated by the ID module

-dictTarget,     string                    the dict which contains accession id to species name

-dictTemplate,   string                    the dict which contains accession id to strain name

-expTag,         string                    Experiment tag added to output file for easy identification

Optional arguments:

-outDir,         string                    Output Directory (Default=.(current directory))

-numThreads,     int                       Number of threads to use default (1)

Step-by-step example

0. [Make sure you have all the ingredients]

bowtie2 --version
python -V
python -c "import pysam, pandas, numpy; print(pysam.__version__, pandas.__version__, numpy.__version__)"
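
The same check can be scripted so that a missing package is reported instead of raising an error. This is a minimal sketch, not part of Snipe: the helper names are hypothetical, and the expected versions are simply the ones listed under Software Dependencies.

```python
import importlib
import shutil

# Versions listed under "Software Dependencies".
EXPECTED = {"numpy": "1.15.0", "pandas": "0.24.2", "pysam": "0.15.3"}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def dependency_report():
    """Build one status line per Python package, plus one for bowtie2."""
    lines = []
    for name, expected in EXPECTED.items():
        found = installed_version(name)
        lines.append(f"{name}: found {found or 'nothing'}, README lists {expected}")
    # shutil.which returns the executable's path, or None if it is not on PATH.
    bowtie2_path = shutil.which("bowtie2")
    lines.append(f"bowtie2: {bowtie2_path or 'not found on PATH'}")
    return lines


for line in dependency_report():
    print(line)
```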

1. [The SnipeMap module]

python ./snipe/snipe.py MAP -1 example/demo_R1.fastp35.fastq -2 example/demo_R2.fastp35.fastq -targetRefFiles ./refDB/target.fna -filterRefFiles ./refDB/filter.fna -indexDir ./refDB/ -outDir ./ -outAlign demo.sam -expTag demo -numThreads 44

2. [The SnipeID module]

python ./snipe/snipe.py ID -alignFile ./demo.sam -fileType sam -outDir ./ -expTag demo

3. [The SnipeRec module]

python ./snipe/snipe.py REC -ssrRef ./core/ -1 ./example/demo_R1.fastp35.fastq -2 ./example/demo_R2.fastp35.fastq -idReport demo-sam-report.tsv -dictTarget ./dict/dict_target -dictTemplate ./dict/dict_template -expTag demo -outDir ./ -numThreads 44
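
The three steps above can also be chained from a small driver script. The sketch below only rebuilds the demo commands shown in steps 1 to 3 and prints them in dry-run mode. It assumes the ID report is named `<expTag>-sam-report.tsv`, a pattern inferred from the file name used in step 3 and not confirmed elsewhere; the function names are hypothetical.

```python
import subprocess


def demo_commands(snipe="./snipe/snipe.py", tag="demo", threads=44):
    """Return the MAP, ID and REC demo commands as argument lists."""
    read1 = "./example/demo_R1.fastp35.fastq"
    read2 = "./example/demo_R2.fastp35.fastq"
    sam_name = f"{tag}.sam"
    # Assumption: report name pattern inferred from the step-3 example.
    report = f"{tag}-sam-report.tsv"
    map_cmd = ["python", snipe, "MAP", "-1", read1, "-2", read2,
               "-targetRefFiles", "./refDB/target.fna",
               "-filterRefFiles", "./refDB/filter.fna",
               "-indexDir", "./refDB/", "-outDir", "./",
               "-outAlign", sam_name, "-expTag", tag,
               "-numThreads", str(threads)]
    id_cmd = ["python", snipe, "ID", "-alignFile", f"./{sam_name}",
              "-fileType", "sam", "-outDir", "./", "-expTag", tag]
    rec_cmd = ["python", snipe, "REC", "-ssrRef", "./core/",
               "-1", read1, "-2", read2, "-idReport", report,
               "-dictTarget", "./dict/dict_target",
               "-dictTemplate", "./dict/dict_template",
               "-expTag", tag, "-outDir", "./",
               "-numThreads", str(threads)]
    return [map_cmd, id_cmd, rec_cmd]


def run_all(commands, dry_run=True):
    """Print each command, or execute it and stop at the first failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)


commands = demo_commands()
run_all(commands)  # dry run: prints the three command lines
```

Calling `run_all(commands, dry_run=False)` executes the steps in order, so ID only starts after MAP has exited successfully.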

Output TSV file format

Columns in the TSV file:

1. Genomes:

The name of the genome found in the alignment file.

2. Accession ID:

The accession ID used by the NCBI GenBank database.

3. Rectified Final Guess:

The percentage of reads mapped to the genome in Column 1 after SSR rectification.

4. Final Guess:

The percentage of reads mapped to the genome in Column 1 after reassignment is performed (reads aligning to multiple genomes are assigned proportionally).

5. Rectified Probability:

The probability after SSR rectification.

6. SSR Aligned Reads:

The number of reads mapped to the SSRs.

7. Rectified Abundance:

The abundance after SSR rectification.

8. Initial Abundance:

The abundance before SSR rectification.

9. Final Best Hit:

The percentage of reads mapped to the genome in Column 1 after each read is assigned uniquely to the genome with the highest score and after PathoScope reassignment is performed.

10. Final Best Hit Read Numbers:

The number of best-hit reads mapped to the genome in Column 1 after PathoScope reassignment is performed. The value may include a fraction when a read aligns to multiple top-hit genomes with the same highest score.
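
To illustrate why a best-hit read number can be fractional, here is a small sketch of tie splitting: each read counts as 1 for its top-scoring genome, and a read whose top score is shared by several genomes is divided equally among them. This is only an illustration of the idea described for the best-hit columns, not Snipe's or PathoScope's actual reassignment code; the function name, scores, and genome names are made up.

```python
from collections import defaultdict


def best_hit_counts(read_scores):
    """Assign each read to its top-scoring genome, splitting ties equally."""
    counts = defaultdict(float)
    for scores in read_scores.values():
        top = max(scores.values())
        winners = [genome for genome, score in scores.items() if score == top]
        for genome in winners:
            counts[genome] += 1.0 / len(winners)
    return dict(counts)


# Made-up alignment scores: read -> {genome: score}.
reads = {
    "read1": {"genomeA": 40, "genomeB": 40},  # tie, split 0.5 / 0.5
    "read2": {"genomeA": 42, "genomeB": 30},  # genomeA wins
    "read3": {"genomeB": 38},                 # only hit is genomeB
    "read4": {"genomeA": 35},                 # only hit is genomeA
}
counts = best_hit_counts(reads)
total = sum(counts.values())
percent = {genome: 100.0 * n / total for genome, n in counts.items()}
print(counts)   # {'genomeA': 2.5, 'genomeB': 1.5}
print(percent)  # {'genomeA': 62.5, 'genomeB': 37.5}
```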

Contributors

duolabuaimeng, zixunzuihao, rongshanyu, lemonhlh
