
metabcc-lr's Introduction


MetaBCC-LR: Metagenomics Binning by Coverage and Composition for Long Reads


Dependencies

MetaBCC-LR is implemented in C++ (compiled with GCC 9) and Python 3.6. To run MetaBCC-LR, you will need to install the following Python and C++ dependencies.

Python dependencies

  • numpy 1.16.4
  • scipy 1.3.0
  • kneed 0.4.2
  • seaborn 0.9.0
  • h5py 2.9.0
  • tabulate 0.8.7
  • umap-learn 0.5.1
  • song-vis (latest version from github)
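
The pinned versions above can be installed with pip; for example (an illustrative command, not an official requirements file; song-vis must be installed from its GitHub repository):

pip install numpy==1.16.4 scipy==1.3.0 kneed==0.4.2 seaborn==0.9.0 h5py==2.9.0 tabulate==0.8.7 umap-learn==0.5.1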

C++ requirements

  • GCC version 9.1.0
  • OpenMP 4.5 for multiprocessing
  • PThreads (any version should work)

Downloading MetaBCC-LR

To download MetaBCC-LR, clone the MetaBCC-LR repository to your machine:

git clone https://github.com/anuradhawick/MetaBCC-LR.git

Compiling the source code

  • Build the binaries
cd MetaBCC-LR
python setup.py build

OR

sh build.sh
  • To install the program
pip install .

OR add the program path to your $PATH variable.

Running MetaBCC-LR

Test run data

Extract the test data from here.

To run MetaBCC-LR, provide the reads in FASTQ or FASTA format.

python mbcclr --resume -r test_data/data/reads.fasta -g test_data/data/ids.txt -o test_output -e umap -c 25000 -bs 10 -bc 10 -k 4

Separating reads into bins

You can use the script reads2bins.py to separate reads into bins. This step is kept in a separate script so that you can experiment with the clustering sensitivity and the number of sampled reads until you obtain a good final binning. Inspect the images generated in the Output/images directory to check whether the reads cluster well, then run reads2bins.py to write the bins.

Inputs:

  • -r path to the reads file used for binning
  • -b output/final.txt (the file containing the bin assignment of each read)
  • -o a destination directory for the final FASTA files
usage: reads2bins.py [-h] --reads READS --bins BINS --output OUTPUT

Separate reads into bins.

optional arguments:
  -h, --help            show this help message and exit
  --reads READS, -r READS
  --bins BINS, -b BINS
  --output OUTPUT, -o OUTPUT
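
For example, following the test run above (the bins destination folder name is illustrative):

python reads2bins.py -r test_data/data/reads.fasta -b test_output/final.txt -o test_output/bins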

Usage and Help

cd MetaBCC-LR
./mbcclr -h

usage: mbcclr [-h] --reads-path READS_PATH [--embedding {tsne,umap,song}]
              [--k-size {3,4,5,6,7}] [--sample-count SAMPLE_COUNT]
              [--sensitivity SENSITIVITY] [--bin-size BIN_SIZE]
              [--bin-count BIN_COUNT] [--threads THREADS]
              [--ground-truth GROUND_TRUTH] [--resume] --output OUTPUT
              [--version]

MetaBCC-LR Help. A tool developed for binning of metagenomics long reads
(PacBio/ONT). The tool utilizes composition and coverage profiles of reads
based on k-mer frequencies to perform dimension reduction. Dimension-reduced
reads are then clustered using DBSCAN. Minimum RAM requirement is 9GB.

optional arguments:
  -h, --help            show this help message and exit
  --reads-path READS_PATH, -r READS_PATH
                        Reads path for binning
  --embedding {tsne,umap,song}, -e {tsne,umap,song}
                        Embedding tool to be used for clustering
  --k-size {3,4,5,6,7}, -k {3,4,5,6,7}
                        Choice of k-mer for oligonucleotide frequency vector.
  --sample-count SAMPLE_COUNT, -c SAMPLE_COUNT
                        Number of reads to sample in order to determine the
                        number of bins. Set to 1% of reads by default.
                        Changing this parameter will affect whether low
                        coverage species are separated or not.
  --sensitivity SENSITIVITY, -s SENSITIVITY
                        Value between 1 and 10; higher values help recover
                        low-abundance species (no. of species > 100)
  --bin-size BIN_SIZE, -bs BIN_SIZE
                        Size of each bin in coverage histogram.
  --bin-count BIN_COUNT, -bc BIN_COUNT
                        Number of bins in the coverage histogram.
  --threads THREADS, -t THREADS
                        Thread count for computation
  --ground-truth GROUND_TRUTH, -g GROUND_TRUTH
                        Ground truth of reads for dry runs and sensitivity
                        tuning
  --resume              Continue from the last step or the binning step
                        (whichever comes first). Can save the time needed to
                        run DSK and obtain k-mers. Ideal for sensitivity tuning
  --output OUTPUT, -o OUTPUT
                        Output directory
  --version, -v         Show version.
  • Output path is the folder in which you wish the results to be placed.
  • Specify the number of threads.
  • The program requires a minimum of 5GB of RAM to run. This is because the coverage histogram generation process is optimized to keep all 15-mers in RAM for faster lookup of counts.
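
A back-of-the-envelope check of that figure (our own arithmetic, assuming one 32-bit counter per possible 15-mer; the actual in-memory layout may differ):

total_15mers = 4 ** 15                  # 1,073,741,824 possible 15-mers
bytes_per_counter = 4                   # assumed 32-bit count per 15-mer
print(total_15mers * bytes_per_counter / 2 ** 30)  # 4.0 GiB, in line with the ~5GB figure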

Citation

@article{10.1093/bioinformatics/btaa441,
    author = {Wickramarachchi, Anuradha and Mallawaarachchi, Vijini and Rajan, Vaibhav and Lin, Yu},
    title = "{MetaBCC-LR: metagenomics binning by coverage and composition for long reads}",
    journal = {Bioinformatics},
    volume = {36},
    number = {Supplement_1},
    pages = {i3-i11},
    year = {2020},
    month = {07},
    abstract = "{Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyse metagenomic data, binning is considered a crucial step to characterize the different species of micro-organisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this article, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition.We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ∼13\\% improvement in F1-score and ∼30\\% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read-based metagenomics analyses to support a wide range of applications.The source code is freely available at: https://github.com/anuradhawick/MetaBCC-LR.Supplementary data are available at Bioinformatics online.}",
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa441},
    url = {https://doi.org/10.1093/bioinformatics/btaa441},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/36/Supplement\_1/i3/33488763/btaa441.pdf},
}

Update

The program can be built with sh build.sh and installed with pip install . We recommend building with sh build.sh and using the program without installing, which makes it easier to fetch future updates and run.

New in v-2.X

  • No need to have DSK; we have implemented a concise k-mer counting strategy using compare-and-swap (CAS).
  • Supports UMAP and SONG embeddings. Please note that UMAP and SONG are still being improved and need more work from our side, but they are usable!
  • Supports any input format: FASTA, FASTQ, or gzipped versions of either. (Thanks to klib by Attractive Chaos.)

metabcc-lr's People

Contributors

anuradhawick, vini2


metabcc-lr's Issues

Fasta support?

Hi, I would like to try your pipeline for the classification of fasta sequences (this is in fact a genome assembly where I want to remove contamination).

Are the read quality scores used for anything in the pipeline?

If not, would it be possible to implement fasta support?

As a first approach I may try to create a dummy fastq (but in the end this would be a waste of time and resources if quality scores are not used).
Thanks

MetaBCC-LR cannot locate the output folder (FileNotFoundError: [Errno 2] No such file or directory: 'metaBCCLR_default_params_output/misc/filtered_reads.fasta')

Hello,

Thank you for this tool, I can't wait to see the result on my Nanopore dataset. It is currently running :)
I just wanted to report a "bug": I got the following error a few minutes after launching the job:
FileNotFoundError: [Errno 2] No such file or directory: 'metaBCCLR_default_params_output/misc/filtered_reads.fasta'

I figured that it was because the folder "metaBCCLR_default_params_output" that I put in the --output argument already existed before I launched MetaBCC-LR.
I think this might be because of this part of the MetaBCC-LR script:
if not os.path.exists(output):
    os.makedirs(output)
    os.makedirs(f"{output}/images")
    os.makedirs(f"{output}/misc")

Since the last two lines are inside the if block, the folders "images" and "misc" are not created when the output folder already exists, hence the error.
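
One possible fix (a hypothetical sketch, not the maintainers' patch) is to create the subfolders unconditionally with exist_ok:

import os

# exist_ok avoids errors when the directories already exist, so "images"
# and "misc" are created even for a pre-existing output folder
os.makedirs(f"{output}/images", exist_ok=True)
os.makedirs(f"{output}/misc", exist_ok=True)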

I hope this can help others!

Hugo

File name too long for plot image (OSError: [Errno 36] File name too long)

Hello,

Thanks again for developing this tool. I am trying metaBCC-LR to bin metagenomic Nanopore reads from a natural community.
With default settings, I got 2360 bins on my dataset, which I think is way too many given what I know of this community (and what I see in the plots tends to confirm that it is overestimated), so I'm trying to tune the sensitivity.

I ran into an OS error while doing so, because of a file name that is too long when writing the image files:
OSError: [Errno 36] File name too long: 'metaBCCLR_default_params_output/images/Separation by composition Root-x-2-c-1-c-1-c-1-c-0-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1-c-1.png

I'm not sure I understand how these files are named. Is there a way to change that, or maybe to skip writing the file if there is an error?

Also, I think it would be interesting to know the estimated coverage for each bin (I guess that would be the mean or median coverage of the reads in the bin). Or maybe this is included in the misc/cluster-stats.txt file?

Thank you for your help,

Best regards

Hugo

output file further analysis

Hi, I am using your pipeline for some PacBio binning. The final output file is final.txt, which gives the bin ID for each read. I want to generate MAGs from this information. Could you provide your scripts for further analysis, e.g. take the reads from each bin and then run Flye assembly to generate a list of MAGs? Thank you!
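
One possible workflow for this (a sketch under assumptions, not the authors' published pipeline; it assumes the bins directory holds one FASTA file per bin and uses standard Flye flags for raw PacBio metagenomes):

python reads2bins.py -r reads.fasta -b output/final.txt -o bins
for f in bins/*.fasta; do
    flye --pacbio-raw "$f" --meta --out-dir "assembly_$(basename "$f" .fasta)"
done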

line 76 error

Hi there,
How do I resolve the following error while running ./MetaBCC-LR?

./MetaBCC-LR
File "./MetaBCC-LR", line 76
checkpoints_path = f"{output}/checkpoints"
^
SyntaxError: invalid syntax

Regards,
Dinesh

Can only find one bin in final.txt

Hello
When I use this command: "python mbcclr --resume -r nanoflit.fasta -o test_output -e umap -c 613187 -k 5 -t 100", final.txt contains only Bin-1. I am not sure whether the parameters are set correctly.

The program output is as follows:
2023-02-22 09:16:59,945 - INFO - Command mbcclr --resume -r PGA-nanoflit.fasta -o test_output2 -e umap -c 613187 -k 5 -t 100
2023-02-22 09:16:59,945 - INFO - Resuming the program from previous checkpoints
2023-02-22 09:16:59,945 - INFO - Counting K-mers
INPUT FILE PGA-nanoflit.fasta
OUTPUT FILE test_output2/profiles/3mers
K_SIZE 5
THREADS 100
Profile Size 512
Total 5-mers 1024
Loaded Reads 6131871
2023-02-22 09:40:24,180 - INFO - Counting K-mers complete
2023-02-22 09:40:24,181 - INFO - Counting 15-mers
INPUT FILE PGA-nanoflit.fasta
OUTPUT FILE test_output2/profiles/15mers-counts
THREADS 100
Loaded Reads 6131871
WRITING TO FILE
COMPLETED : Output at - test_output2/profiles/15mers-counts
2023-02-22 09:48:27,783 - INFO - Counting 15-mers complete
2023-02-22 09:48:27,784 - INFO - Generating 15-mer profiles
K-Mer file test_output2/profiles/15mers-counts
LOADING KMERS TO RAM
FINISHED LOADING KMERS TO RAM
INPUT FILE PGA-nanoflit.fasta
OUTPUT FILE test_output2/profiles/15mers
THREADS 100
BIN WIDTH 10
BINS IN HIST 32
Loaded Reads 6131871
COMPLETED : Output at - test_output2/profiles/15mers
2023-02-22 09:54:35,125 - INFO - Generating 15-mer profiles complete
2023-02-22 09:54:35,126 - INFO - Sampling Reads
2023-02-22 10:07:02,935 - INFO - Sampling reads complete
2023-02-22 10:07:02,936 - INFO - Binning sampled reads

New PacBio HiFi data available for testing

Hello,
I am very interested in using your approach for binning HiFi reads. I plan to test it on some internal datasets, but wanted to make you aware of a new dataset I have made available on NCBI. The sample is ZymoBIOMICS gut microbiome standard D6331, a community containing 21 species in staggered abundances that mimic the human gut microbiome. This mock community also contains 5 strains of E. coli. All species and strains in this sample have reference genomes available, which can be useful for evaluating results. This sample was sequenced three times using different library prep methods (our standard input, low input, and new ultra-low input protocol). The standard and low input are PCR-free preps, whereas Ultra-low has an amplification step in library prep.

Additional information and the three HiFi fastq files are available on the NCBI Project page: http://www.ncbi.nlm.nih.gov/bioproject/680590

The SRA accessions are as follows:

Standard Input: SRX9569057

Low Input: SRX9569058

Ultra-Low Input: SRX9569059

I hope these prove useful for continued development of MetaBCC-LR! I am looking forward to using this tool for my work.

Some questions about MetaBCC-LR that I would like to consult you on

Hello, I have difficulty using this software. When I execute this command: "python mbcclr --resume -r test_data/data/reads.fasta -g test_data/data/ids.txt -o test_output -e umap -c 25000 -bs 10 -bc 10 -k 4", I don't understand how the parameter -g should be set. Is it feasible to use the ids.txt file in the test data? Hoping to get your help!

Article supplementary data is unavailable.

I've tried to obtain the supplementary data from your article published with Oxford Bioinformatics, but there is a problem with the site and no URL is provided.

Are you able to provide it?

Need to mkdir "misc" and "images" before running?

Hi,

Thanks for developing this great tool. I found something a bit strange that I think python-related:

2021-05-21 11:27:18,941 - INFO - Filtering reads
Traceback (most recent call last):
  File "/services/tools/anaconda3/4.4.0/bin/MetaBCC-LR", line 201, in <module>
    main()
  File "/services/tools/anaconda3/4.4.0/bin/MetaBCC-LR", line 144, in main
    runners_utils.run_filter(reads_path, output, ground_truth)
  File "/home/people/dinghe/.local/lib/python3.6/site-packages/mbcclr_utils/runners_utils.py", line 25, in run_filter
    output_fasta_file = open(f"{output}/misc/filtered_reads.fasta", "w+")
FileNotFoundError: [Errno 2] No such file or directory: './misc/filtered_reads.fasta'

I am not sure which Python version would allow open(f"{output}/misc/filtered_reads.fasta", "w+") to create a file along with its non-existing parent folders, but at least Python 3.6 does not. The processes worked fine after creating the two folders "misc" and "images".

Also, I wonder if there's an option to create all the FASTQ bins, i.e. each bin gets its own FASTQ file; that would be great for downstream processes.

Best, Ding

Ground truth format

What should be the format for the ground truth file?

Currently, I am using

 readname  <tab> bin_id
 readname  <tab> bin_id
 readname  <tab> bin_id
....

But I am getting an error
TypeError: unhashable type: 'numpy.ndarray' at line 308 in mbcclr_utils/binner_core.py.

Improvements self-issue

Computations

  • Add a checkpoint between counting 15-mers and generating coverage histogram profiles.
  • Combine the steps and add a check for already-computed 15-mer counts: load, or compute and save. Avoid the saving+writing cost.
  • Use C++ valarray for the assignment step (maybe the compiler will use SSE).
  • Add the compiler optimization flag -O3 (make the last assignment step faster).
  • Add an evaluation step for the final assigned bins (there is a difference between large-enough bins and the classifications.txt file). Fix it!

Binner

  • Organize different embeddings into different classes
  • Provide composition-only and coverage-only options
  • Adjust re-sampling to suit different embedding strategies.
  • Maybe steal ideas from LRBinner as a pre-step for embedding. VAE+UMAP/SONG works extremely well!
  • High-dimensional noise filtering using LRBinner algorithm (should help with noise in ONT reads)

dsk not found

Hi, looking forward to using your tool; just an issue with getting it running. I had an issue with directories not being created (the output/misc problem posted by someone else), though that was resolved.

Traceback error:
2020-11-04 12:39:31,923 - INFO - Filtering reads
Filtering reads longer than 1000bp: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 949/949 [00:00<00:00, 6174.47it/s]
2020-11-04 12:39:32,227 - INFO - Filtering reads complete
2020-11-04 12:39:32,227 - INFO - Running DSK k-mer counting
sh: 1: dsk: not found
2020-11-04 12:39:32,243 - ERROR - Error in step: Running DSK
2020-11-04 12:39:32,244 - ERROR - Failed due to an error. Please check the log. Good Bye!

Log files:

cat MetaBCC-LR/metabcc-lr.log
2020-11-04 12:36:21,521 - INFO - Filtering reads
2020-11-04 12:36:21,667 - DEBUG - Total of 949 reads to filter
2020-11-04 12:37:52,021 - INFO - Filtering reads
2020-11-04 12:37:52,175 - DEBUG - Total of 949 reads to filter
2020-11-04 12:39:31,923 - INFO - Filtering reads
2020-11-04 12:39:32,070 - DEBUG - Total of 949 reads to filter
2020-11-04 12:39:32,227 - INFO - Filtering reads complete
2020-11-04 12:39:32,227 - INFO - Running DSK k-mer counting
2020-11-04 12:39:32,227 - DEBUG - Running DSK
2020-11-04 12:39:32,243 - ERROR - Error in step: Running DSK
2020-11-04 12:39:32,244 - ERROR - Failed due to an error. Please check the log. Good Bye!

Do you think it is an issue with the number of reads < 1000?

After trying with various fastq files containing > 1000 reads, I get a similar error:

cat MetaBCC-LR/metabcc-lr.log
2020-11-04 12:54:16,495 - INFO - Filtering reads
2020-11-04 12:54:17,584 - DEBUG - Total of 2055 reads to filter
2020-11-04 12:54:17,922 - INFO - Filtering reads complete
2020-11-04 12:54:17,923 - INFO - Running DSK k-mer counting
2020-11-04 12:54:17,923 - DEBUG - Running DSK
2020-11-04 12:54:17,938 - ERROR - Error in step: Running DSK
2020-11-04 12:54:17,939 - ERROR - Failed due to an error. Please check the log. Good Bye!
2020-11-04 12:54:51,317 - INFO - Filtering reads
2020-11-04 12:54:53,425 - DEBUG - Total of 13727 reads to filter
2020-11-04 12:54:55,658 - INFO - Filtering reads complete
2020-11-04 12:54:55,660 - INFO - Running DSK k-mer counting
2020-11-04 12:54:55,660 - DEBUG - Running DSK
2020-11-04 12:54:55,676 - ERROR - Error in step: Running DSK
2020-11-04 12:54:55,676 - ERROR - Failed due to an error. Please check the log. Good Bye!
2020-11-04 12:55:47,121 - INFO - Filtering reads
2020-11-04 12:55:57,887 - DEBUG - Total of 71432 reads to filter
2020-11-04 12:56:09,098 - INFO - Filtering reads complete
2020-11-04 12:56:09,098 - INFO - Running DSK k-mer counting
2020-11-04 12:56:09,099 - DEBUG - Running DSK
2020-11-04 12:56:09,116 - ERROR - Error in step: Running DSK
2020-11-04 12:56:09,117 - ERROR - Failed due to an error. Please check the log. Good Bye!

What does "dsk: not found" mean? A Google search did not return anything obvious.

reads2bins.py – missing 'format' issue

Dear Anuradha,
I am very interested in the approach used by MetaBCC-LR. I've successfully generated bins etc., but I'm not good with Python and hit an issue when running the bin-splitting script, reads2bins.py:

Traceback (most recent call last):
  File "/opt/miniconda/envs/pacbio_mag/MetaBCC-LR/reads2bins.py", line 29, in <module>
    for record, bin_id in zip(SeqIO.parse(readsPath), open(readBinsPath)):
TypeError: parse() missing 1 required positional argument: 'format'

Does 'format' here refer to the input data type, FASTA or FASTQ (my data is FASTQ)? The variable 'readsType' is defined above (lines 17-20) but is not fed into the SeqIO.parse call. Is something missing from the script?
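
A minimal sketch of the likely fix (assuming readsType holds "fasta" or "fastq" as detected earlier in the script; this is a guess, not a confirmed patch):

from Bio import SeqIO

# pass the detected format through to SeqIO.parse
for record, bin_id in zip(SeqIO.parse(readsPath, readsType), open(readBinsPath)):
    ...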

Regards and thank you for developing,
David
