MiCoP: Microbial Community Profiling method for detecting viral and fungal organisms in metagenomic samples

MiCoP is a method for high-accuracy profiling of viral and fungal metagenomic communities. It uses a mapping-based approach that produces more accurate results than state-of-the-art general-purpose metagenomic profilers, which tend to focus mostly on bacteria. MiCoP is compatible with both single-end and paired-end reads. This repository is set up to use BWA to map reads against the full NCBI Virus and Fungi RefSeq databases, and then to filter and profile the results with the compute-abundances.py script. Usage is simple, and details are provided below. For more on MiCoP, see the paper linked below.

Synopsis

git clone https://github.com/smangul1/MiCoP.git
cd MiCoP/
./setup.sh
python run-bwa.py reads.fq [--virus OR --fungi] --output alignments.sam
python compute-abundances.py alignments.sam [--virus OR --fungi]

Here [--virus OR --fungi] means that exactly one of the --virus or --fungi flags must be given (not both, and not neither), depending on whether you want to profile viruses or fungi.

Paper

The MiCoP manuscript is available online at: https://doi.org/10.1186/s12864-019-5699-9

If you use MiCoP, please cite the paper. For instance:

LaPierre, N., Mangul, S., Alser, M. et al. MiCoP: microbial community profiling method for detecting viral and fungal organisms in metagenomic samples. BMC Genomics 20, 423 (2019). https://doi.org/10.1186/s12864-019-5699-9

Download and Setup MiCoP

Simply run:

git clone https://github.com/smangul1/MiCoP.git
cd MiCoP/
./setup.sh

This will clone the repository, change into it, and run the setup script. The setup script downloads about 8.4 GB of data, which uncompresses into files totaling about 13 GB of disk space. These are the pre-built BWA indexes of the NCBI Virus and Fungi RefSeq databases. Using these specific databases is not strictly necessary, but it greatly simplifies running MiCoP and is strongly recommended.

This step should take about 5-30 minutes, depending on your network connection.

Mapping stage

MiCoP is a mapping-based method that expects output from a mapping tool in SAM format. For the mapping tool, we recommend BWA. To eliminate confusion over which databases to use, how to extract the necessary information from them, which alignment tool to use, and how to use it, we include a copy of the BWA alignment tool (see License Info) and a script, run-bwa.py, that runs it with the settings MiCoP expects.

Note: Runtime for the BWA alignment step dominates the total computation time. For large read files, this may take a while.

Basic usage is very simple:

python run-bwa.py reads.fq [--virus OR --fungi] --output alignments.sam

The reads file is required, as is exactly one of --virus or --fungi; the --output flag is optional and defaults to alignments.sam. For paired-end reads in two files, pass both files (e.g. "python run-bwa.py reads1.fq reads2.fq" etc.). For a single interleaved paired-end file, use the single-file format above and also add the "--paired" flag.
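Putting these variants together, the three invocation forms look like this (file names are placeholders; all flags are those described above):

```shell
# Single-end reads
python run-bwa.py reads.fq --fungi --output alignments.sam

# Paired-end reads in two separate files
python run-bwa.py reads1.fq reads2.fq --fungi --output alignments.sam

# Paired-end reads interleaved in one file
python run-bwa.py interleaved.fq --fungi --paired --output alignments.sam
```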

If you would like to run BWA manually for more flexibility, make sure to use the -a flag, and see the BWA manpage: http://bio-bwa.sourceforge.net/bwa.shtml#13

Details on the basic BWA algorithm can be found in the BWA paper cited below. We specifically use the BWA-MEM algorithm, cited second.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754-1760.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.

Abundance Profiling Script

The abundance profiling script is called compute-abundances.py, and is very simple to use:

python compute-abundances.py alignments.sam [--virus OR --fungi] [options]

This script should take a small fraction of the time it takes to run BWA mapping.

By default, results are written to abundances.txt; change this with the --output flag. All other options adjust the filtering criteria used to decide whether a clade counts as present. These options and their descriptions can be viewed with:

python compute-abundances.py -h
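Conceptually, the final profiling step reduces to normalizing the filtered per-taxon mapped-read counts into percentages. The sketch below is purely illustrative (the taxon names and counts are made up, and this is not the actual logic of compute-abundances.py):

```python
# Toy relative-abundance computation: turn filtered per-taxon
# mapped-read counts into percentages, as reported in a profile.
def relative_abundances(read_counts):
    """Map {taxon: read count} to {taxon: percentage of total}."""
    total = sum(read_counts.values())
    return {taxon: 100.0 * n / total for taxon, n in read_counts.items()}

# Hypothetical counts after filtering (not real data)
counts = {
    "Aspergillus fumigatus": 600,
    "Candida albicans": 300,
    "Saccharomyces cerevisiae": 100,
}
profile = relative_abundances(counts)
print(profile["Aspergillus fumigatus"])  # 60.0
```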

Simulated data

The data used for our simulations can be found in the simulated_data directory in compressed form (.bz2). To run BWA or the run-bwa.py wrapper script on this data, you will have to decompress it first. For instance, running "bzip2 -dk grinder-fungi-low-complexity-reads.fa.bz2" generates the file grinder-fungi-low-complexity-reads.fa, which you can then run the scripts on. Use "bzip2 -d" (without the "k") if you do not want to keep the original compressed file.
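The same decompression can also be done from Python with the standard-library bz2 module. This sketch mirrors "bzip2 -dk" (keeping the original file); the file names below are placeholders:

```python
import bz2
import shutil

def bunzip2_keep(path_bz2, path_out):
    """Decompress path_bz2 to path_out, keeping the original (like bzip2 -dk)."""
    with bz2.open(path_bz2, "rb") as src, open(path_out, "wb") as dst:
        shutil.copyfileobj(src, dst)

# e.g. bunzip2_keep("grinder-fungi-low-complexity-reads.fa.bz2",
#                   "grinder-fungi-low-complexity-reads.fa")
```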

The mock community datasets are taken from the following papers:

Viral data: Conceição-Neto, N., Zeller, M., Lefrère, H., De Bruyn, P., Beller, L., Deboutte, W., ... & Matthijnssens, J. (2015). Modular approach to customise sample preparation procedures for viral metagenomics: a reproducible protocol for virome analysis. Scientific Reports, 5, 16532. doi:10.1038/srep16532

Fungi data: Tonge, D. P., Pashley, C. H., & Gant, T. W. (2014). Amplicon-based metagenomic analysis of mixed fungal samples using proton release amplicon sequencing. PLoS ONE, 9(4), e93849. https://doi.org/10.1371/journal.pone.0093849

Finally, the HMP data was downloaded using the instructions in the MetaPhlAn tutorial: https://bitbucket.org/nsegata/metaphlan/wiki/MetaPhlAn_Pipelines_Tutorial

Raw FASTA Database Files

Download links to the raw FASTA databases we used for this paper are below.

Viruses: https://drive.google.com/file/d/1qQGddyD4Zkj-LaT6vLe9kP4uDCBJHH85/view?usp=sharing

Fungi: https://drive.google.com/file/d/14E0G0w_gUJizYsjyLUNzd7J7frrMaeMJ/view?usp=sharing

License Info

For convenience, we include a version of the BWA kit in our repository. This is allowed under BWA's GPLv3 license, provided that we also adopt GPLv3. We also use a code snippet in the setup script adapted from a Stack Overflow answer, which is permitted under Stack Overflow's Creative Commons license.

MiCoP's People

Contributors

nlapier2, smangul1


MiCoP's Issues

Raw read counts

Hi, I am interested in comparing the number of reads assigned to bacteria and fungi in human samples. I have used MetaPhlAn2 for bacterial annotation and I was hoping to use MiCoP for fungal annotation. Does the MiCoP output provide raw sequence counts associated with each taxon, or only relative abundance values?

compute-abundances error

I'm getting the following error. How can I fix it? Thanks!

python compute-abundances.py -h
File "compute-abundances.py", line 282
lev_res = {i:[] for i in range(len(RANKS))}
^
SyntaxError: invalid syntax

Fungal / viral database

Hi,

can you please provide the Entrez terms used to download your databases, or provide the raw fasta files?

Thanks,

T

No fungal hits found using fungal input. Better reporting output needed ?

Hi,

I'm interested in using Micop for detection of fungal reads.

However, it has not reported a single hit (not even a false positive) across many metagenomes so far. So I tried a known fungal input, which also failed to give a hit.

SRR610531.fastq.gz

Command:


cpus=8
# Run

echo $fastq
python3 /mnt/ngsnfs/tools/MiCoP/run-bwa.py --fungi --threads $cpus --output $fastq.micop.sam $fastq
python3 /mnt/ngsnfs/tools/MiCoP/compute-abundances.py --max_ed 2 --pct_id 0.9 --fungi $fastq.micop.sam --output $fastq.micop.abund.csv
#rm $fastq.micop.sam

#usage: compute-abundances.py [-h] [--fungi] [--min_map MIN_MAP]

The SAM files do contain hits:

SRR610531.4029  256     NW_015378169.1  64349   0       40M1I10M        *       0       0       *       *       NM:i:3  MD:Z:0G16G32    AS:i:37
SRR610531.4032  0       NW_001884661.1  2913737 0       51M     *       0       0       TCTCGTGTTCAAGATAGCACACCATTATGTCGATGCTCTTGGGTGTGACAT     DDHHFFFAEH

But abund.csv is always empty:


 cat SRR610531.fastq.gz.micop.abund.csv
@@TAXID RANK    TAXPATH TAXPATHSN       PERCENTAGE

Data directory seems ok

MiCoP$ ls data/

accession2info-fungi.txt  fungi-refseq.fna.amb  fungi-refseq.fna.bwt  fungi-refseq.fna.sa   viral-refseq.fna.ann  viral-refseq.fna.pac
accession2info-viral.txt  fungi-refseq.fna.ann  fungi-refseq.fna.pac  viral-refseq.fna.amb  viral-refseq.fna.bwt  viral-refseq.fna.sa

So this is a feature request: it would make sense to report how many raw hits were found, and at which stage they were filtered out during profiling.

Thanks

Indices.zip impossible to download during setup

Hello,

The tool appears to be outdated. I get this error when executing setup.sh:

Archive: indices.zip
End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of indices.zip or indices.zip.zip, and cannot find indices.zip.ZIP, period.

I tried downloading it manually, but it seems it is no longer on Google Drive (too large?).

Pre-filtering?

Did you want to implement the CMash pre-filtering to enable profiling of bacteria as well?

Unable to setup MiCoP

Hi,
I am unable to install MiCoP as I get the following error message:
" awk: can't open file ./cookie". This seems to link to a file on a Google Drive according to the script. Has the file maybe moved or is differently named?

Many thanks,
Thierry

should I add "--paired" for paired end reads?

Hi, if my input sequences are paired-end reads, should I also add "--paired"?
My code is :
python run-bwa.py A033.R1_kneaddata_paired_1.fastq A033.R1_kneaddata_paired_2.fastq --fungi --output output/A033fungi.sam

Whether or not I add "--paired", the .sam file is very small, and after I run 'python compute-abundances.py' I get a txt file with only one line: "@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE"

Below is the run log printed to the screen.
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 69706 sequences (10000282 bp)...
[M::process] read 69560 sequences (10000089 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 158, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (25, 122, 242)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 676)
[M::mem_pestat] mean and std.dev: (154.61, 129.99)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 893)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[mem_sam_pe] paired reads have different names: "A01066:85:H5G3LDMXY:1:1101:2899:1000:N:0:TAGTGGAAGC+NATAGGTACT#0", "A01066:85:H5G3LDMXY:1:1101:13575:1016:N:0:TAGTGGAAGC+TATAGGTACT#0"

I also want to ask: how can I merge the abundance txt files of many samples after running 'python compute-abundances.py' on each sample?

Thanks!

Failing to open file when using --paired

Hi! I have paired-end files (R1 and R2), and I am trying to run them with the --paired option. I get an error which seems to suggest that the program thinks there is a single input file whose name is both file names concatenated (`F08060-L1_S17_L002_R1_001.fastq F08060-L1_S17_L002_R2_001.fastq`).

Running on a single file works fine.

Any ideas?

python run-bwa.py F08060-L1_S17_L002_R1_001.fastq F08060-L1_S17_L002_R2_001.fastq --fungi --output alignmentstest.sam --paired
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[E::main_mem] fail to open file `F08060-L1_S17_L002_R1_001.fastq F08060-L1_S17_L002_R2_001.fastq'.

Possible to use more cores?

Not sure if it is appropriate to ask things which strictly are not errors, if it is not, I'll delete this at once.

I am simply wondering if there is any way to use more than one core to process each sample faster? I can't run several terminals with several samples in parallel because of memory limitations. Right now each sample takes 3-4 hours with the --fungi option.

-Simen Hyll Hansen
