Snipe

Snipe: Highly sensitive pathogen detection from metagenomic sequencing data

Introduction

Snipe (SeNsItive Pathogen dEtection) is a pipeline that improves the ability of existing strain-typing tools to detect common pathogens at low abundance in contaminated food samples. Snipe consists of three core modules:

The snipeMap module will map unassembled metagenomic reads against a target library and remove sequences that align to the filter and host libraries.
The snipeId module will reassign ambiguous reads, identify microbial strains present in the sample, and estimate proportions of reads from each genome.
The snipeRec module will align raw reads to species-specific regions (SSRs) and generate reports containing the proportion of reads assigned to each genome after rectification by the a posteriori probabilities.

Support and Contact

For any issues or concerns, please contact us at [email protected]

Pathogenic Species Supported

Species name	Number of SSRs
Escherichia coli	98
Salmonella enterica	387
Staphylococcus aureus	169
Listeria monocytogenes	157
Campylobacter jejuni	132
Vibrio cholerae	261
Vibrio parahaemolyticus	1206
Proteus mirabilis	1122
Yersinia enterocolitica	141
Clostridium perfringens	2377

Software Dependencies

It is recommended to create a new conda environment:

conda create -n python37 python=3.7

# Activate this environment:
conda activate python37

   • numpy (v1.15.0)
        conda install -c conda-forge numpy
   • pandas (v0.24.2)
        conda install -c conda-forge pandas
   • bowtie2 (v2.2.5)
        conda install -c bioconda bowtie2
   • pysam (v0.15.3)
        conda install -c bioconda pysam
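
Once the packages are installed, their versions can be compared with the ones listed above. The following is a minimal sketch and not part of Snipe: the helper names `installed_versions` and `mismatches` are hypothetical, and the only thing assumed about each package is the usual `__version__` attribute. bowtie2 is a command-line tool and is not covered here.

```python
import importlib

# Python package versions listed in this README.
REQUIRED = {"numpy": "1.15.0", "pandas": "0.24.2", "pysam": "0.15.3"}


def installed_versions(names):
    """Return {package: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", None)
        except ImportError:
            found[name] = None
    return found


def mismatches(found, required):
    """Return {package: (found, required)} for every version that differs."""
    return {
        name: (found.get(name), wanted)
        for name, wanted in required.items()
        if found.get(name) != wanted
    }


if __name__ == "__main__":
    report = mismatches(installed_versions(REQUIRED), REQUIRED)
    for name, (have, want) in report.items():
        print(f"{name}: found {have}, README lists {want}")
```

A different version is not necessarily a problem; the check only reports where the environment deviates from the versions named in this README.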

Manual

First, change into the snipe directory and display the top-level help:
```
cd ../snipe
python snipe.py -h
```

MAP

The MAP module needs a database of reference strain genomes, which can be downloaded from NCBI. Make sure the index for these references has already been built; otherwise the software will take a moment to build it before mapping.

Call the snipeMap module help for details:

python snipe.py MAP -h

python snipe.py MAP -1 map_inputread1 -2 map_inputread2 -targetRefFiles map_targetRef -filterRefFiles map_filterRef -indexDir map_indexdir -outDir map_outdir -outAlign map_outalign -expTag map_exp_tag -numThreads map_numthreads

Required arguments:

-1,              string                    Input Read Fastq File (Pair 1)

-2,              string                    Input Read Fastq File (Pair 2)

-targetRefFiles, string                    Target Reference Genome Fasta Files Full Path (Comma Separated)

-filterRefFiles, string                    Filter Reference Genome Fasta Files Full Path (Comma Separated)

-outAlign,       string                    Output Alignment File Name (Default=outalign.sam)

-expTag,         string                    Experiment Tag added to files generated for identification



Optional arguments:

-outDir,         string                    Output Directory (Default=. (current directory))

-indexDir,       string                    Index Directory (Default=. (current directory))

-numThreads,     int                       Number of threads to use by aligner (bowtie2) if different from default (8)
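
Because `-targetRefFiles` and `-filterRefFiles` each take a single comma-separated string, it can help to assemble the MAP command from Python lists. This is a sketch under stated assumptions: `build_map_command` is a hypothetical helper, not part of Snipe, and all file paths are placeholders. The flags themselves are the ones documented above.

```python
def build_map_command(read1, read2, target_refs, filter_refs, out_align, exp_tag,
                      out_dir=".", index_dir=".", num_threads=8):
    """Return the argument list for `python snipe.py MAP ...`."""
    return [
        "python", "snipe.py", "MAP",
        "-1", read1,
        "-2", read2,
        # Several FASTA files are passed as one comma-separated string.
        "-targetRefFiles", ",".join(target_refs),
        "-filterRefFiles", ",".join(filter_refs),
        "-indexDir", index_dir,
        "-outDir", out_dir,
        "-outAlign", out_align,
        "-expTag", exp_tag,
        "-numThreads", str(num_threads),
    ]


# Placeholder inputs, used only to show the resulting command line.
cmd = build_map_command(
    "sample_R1.fastq", "sample_R2.fastq",
    target_refs=["/data/ref/target_a.fna", "/data/ref/target_b.fna"],
    filter_refs=["/data/ref/host.fna"],
    out_align="sample.sam", exp_tag="sample",
)
print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the snipe directory.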

ID

Make sure the MAP module has finished first. The ID module uses the .sam file that the MAP module generated.

Call the snipeId module help for details:

python snipe.py ID -h

python snipe.py ID -outDir id_outdir -alignFile id_ali_file -expTag id_exp_tag

Required arguments:

-alignFile,      string                    Alignment file path

-expTag,         string                    Experiment tag added to output file for easy identification

Optional arguments:

-outDir,         string                    Output Directory (Default=. (current directory))
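
To give an intuition for what reassigning ambiguous reads means, here is a toy EM-style iteration. It is an illustration only and not Snipe's implementation: the function name `reassign`, the input format (one dict of genome to alignment weight per read), and the fixed iteration count are all assumptions made for this sketch.

```python
def reassign(read_hits, n_iter=50):
    """Estimate genome proportions from reads that may hit several genomes.

    read_hits: list of dicts mapping genome name -> positive alignment weight.
    """
    genomes = sorted({g for hits in read_hits for g in hits})
    # Start from equal proportions.
    pi = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        totals = dict.fromkeys(genomes, 0.0)
        for hits in read_hits:
            # E-step: split the read among its genomes by proportion * weight.
            weighted = {g: pi[g] * w for g, w in hits.items()}
            norm = sum(weighted.values())
            for g, value in weighted.items():
                totals[g] += value / norm
        # M-step: new proportions are the average share per read.
        pi = {g: totals[g] / len(read_hits) for g in genomes}
    return pi


# Two reads hit genome_a only; one read hits both genomes equally well.
reads = [
    {"genome_a": 1.0},
    {"genome_a": 1.0},
    {"genome_a": 1.0, "genome_b": 1.0},
]
proportions = reassign(reads)
print({g: round(p, 3) for g, p in proportions.items()})
```

In this example the shared read ends up almost entirely with genome_a, because the two unique reads already support it and nothing supports genome_b on its own.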

REC

Make sure the SSR index has been built.

Call the snipeRec module help for details:

python snipe.py REC -h

python snipe.py REC -ssrRef map_ssrRefDir -1 rec_inputread1 -2 rec_inputread2 -idReport id_ali_file -dictTarget targetInfo_dict -dictTemplate file3 -expTag rec_exp_tag -outDir path2 -numThreads 1

Required arguments:

-ssrRef,         string                    Directory of the species-specific regions (SSRs)

-1,              string                    Input Read Fastq File (Pair 1)

-2,              string                    Input Read Fastq File (Pair 2)

-idReport,       string                    Report file generated by the ID module

-dictTarget,     string                    Dictionary file mapping accession IDs to species names

-dictTemplate,   string                    Dictionary file mapping accession IDs to strain names

-expTag,         string                    Experiment tag added to output file for easy identification

Optional arguments:

-outDir,         string                    Output Directory (Default=. (current directory))

-numThreads,     int                       Number of threads to use (Default=1)

Step-by-step example

0. [Make sure you have all the ingredients]

bowtie2 --version
python -V
python -c "import pysam, pandas, numpy; print(pysam.__version__, pandas.__version__, numpy.__version__)"

1. [The SnipeMap module]

python ./snipe/snipe.py MAP -1 example/demo_R1.fastp35.fastq -2 example/demo_R2.fastp35.fastq -targetRefFiles ./refDB/target.fna -filterRefFiles ./refDB/filter.fna -indexDir ./refDB/ -outDir ./ -outAlign demo.sam -expTag demo -numThreads 44

2. [The SnipeID module]

python ./snipe/snipe.py ID -alignFile ./demo.sam -fileType sam -outDir ./ -expTag demo

3. [The SnipeRec module]

python ./snipe/snipe.py REC -ssrRef ./core/ -1 ./example/demo_R1.fastp35.fastq -2 ./example/demo_R2.fastp35.fastq -idReport demo-sam-report.tsv -dictTarget ./dict/dict_target -dictTemplate ./dict/dict_template -expTag demo -outDir ./ -numThreads 44
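
The three steps above can also be chained from a small script so that a failure in one step stops the run. This is a sketch, not part of Snipe: `run_steps` is a hypothetical helper, and the commands are copied from the example above, so the paths only make sense when run from the same working directory.

```python
import subprocess

READ1 = "example/demo_R1.fastp35.fastq"
READ2 = "example/demo_R2.fastp35.fastq"

# The three example commands, as argument lists.
STEPS = [
    ["python", "./snipe/snipe.py", "MAP", "-1", READ1, "-2", READ2,
     "-targetRefFiles", "./refDB/target.fna", "-filterRefFiles", "./refDB/filter.fna",
     "-indexDir", "./refDB/", "-outDir", "./", "-outAlign", "demo.sam",
     "-expTag", "demo", "-numThreads", "44"],
    ["python", "./snipe/snipe.py", "ID", "-alignFile", "./demo.sam",
     "-fileType", "sam", "-outDir", "./", "-expTag", "demo"],
    ["python", "./snipe/snipe.py", "REC", "-ssrRef", "./core/",
     "-1", "./" + READ1, "-2", "./" + READ2,
     "-idReport", "demo-sam-report.tsv",
     "-dictTarget", "./dict/dict_target", "-dictTemplate", "./dict/dict_template",
     "-expTag", "demo", "-outDir", "./", "-numThreads", "44"],
]


def run_steps(steps, dry_run=False):
    """Run each command in order; a non-zero exit code raises and stops the run."""
    executed = []
    for argv in steps:
        executed.append(argv)
        if dry_run:
            # Only show what would be run.
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
    return executed


run_steps(STEPS, dry_run=True)
```

Calling `run_steps(STEPS)` without `dry_run` executes the commands; `check=True` makes `subprocess.run` raise `CalledProcessError` if a step fails, so REC never runs on a missing ID report.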

Output TSV file format

Columns in the TSV file:

1. Genomes:

This is the name of the genome found in the alignment file.

2. Accession ID:

Accession ID used by the NCBI GenBank database.

3. Rectified Final Guess:

This represents the percentage of reads that are mapped to the genome in Column 1 after SSR rectification.

4. Final Guess:

This represents the percentage of reads that are mapped to the genome in Column 1 (reads aligning to multiple genomes are assigned proportionally) after reassignment is performed.

5. Rectified Probability:

This represents the probability after SSR rectification.

6. SSR Aligned Reads:

This represents the number of reads that are mapped to the SSRs.

7. Rectified Abundance:

This represents the abundance after SSR rectification.

8. Initial Abundance:

This represents the abundance before SSR rectification.

9. Final Best Hit:

This represents the percentage of reads that are mapped to the genome in Column 1 after PathoScope reassignment is performed and each read is assigned uniquely to the genome with the highest score.

10. Final Best Hit Read Numbers:

This represents the number of best-hit reads that are mapped to the genome in Column 1 after PathoScope reassignment is performed (may include a fraction when a read is aligned to multiple top-hit genomes with the same highest score).
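
A report in this format can be filtered with a few lines of Python. The sketch below is not part of Snipe and makes two assumptions: the first line of the file is a header row, and the header names match the column titles listed above. If the real report starts with extra lines, they would need to be skipped first. `load_report` is a hypothetical helper and the two data rows are made up.

```python
import csv
import io


def load_report(handle, min_value=0.0, column="Rectified Final Guess"):
    """Return rows whose `column` is at least `min_value`, highest first."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [row for row in reader if float(row[column]) >= min_value]
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows


# Made-up two-row report, used only to exercise the function.
demo = io.StringIO(
    "Genomes\tAccession ID\tRectified Final Guess\n"
    "genome_a\tACC_0001\t0.02\n"
    "genome_b\tACC_0002\t0.91\n"
)
hits = load_report(demo, min_value=0.05)
print([row["Genomes"] for row in hits])
```

For a real file, pass an open handle instead of the `StringIO` object, for example `open("report.tsv", newline="")` with the actual report name.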
