rasmussenlab / phamb Goto Github PK

Downstream processing of VAMB binning for Viral Elucidation

License: MIT License

Python 100.00%

phamb's Introduction

phamb

A Phage from metagenomic bins (phamb) discovery approach used to isolate metagenome derived viromes and High-quality viral genomes. phamb is now published in Nature Communications, have a look and let us know if you have any questions.

The repository contains scripts and workflows used in our viral follow up study on the binning tool VAMB where we have benchmarked not only the quality and quantity of viral MAGs but also the viral overlap with metaviromes.

We have applied this approach to 3 different datasets and recovered up to 6,077 High-quality genomes from 1,024 viral populations, this is 200% more compared to only evaluation single-contigs. Similar to what we have observed for Bacterial bins, VAMB achieves high intra-VAMB-cluster ANI (>97.5%) also for viral bins, our best example here is accurate clustering of crAss-like bins found in the IBD Human Microbiome Project 2 dataset.

Our (recommended) workflow is to isolate the virome search space prior to running viral evaluation/prediction tools. For this, we have trained a Random Forest model on viral bins established using paired metagenomic and metavirome datasets. This massively helps in reducing computational time especially on larger datasets.
We strongly advise to only use Medium-quality and High-quality viral bins evaluated using the AAI-model in CheckV. We found the HMM-model is not currently well-suited for viral MAGs. Low-quality viral bins may likely represent fragmented/incomplete viruses or general contamination, we advice caution with using these.

In our analysis, CheckV has been important for assessing the actual gain of using viral MAGs relative to single-contig evaluation, a big kudos to Nayfach et al. for this great tool.

[Prerequisites & Installation]

In order to run parallel annotations of contigs and running the Random Forest model you need snakemake and scikit-learn v. 1.0.2. The snakemake workflows comes with conda-environments, thus dependencies and programmes are automatically installed. Phamb can now be installed via bioconda thanks to @jayramr!

### New dependencies *Recommended*
conda install -c conda-forge mamba
mamba create -n phamb python=3.9
conda activate phamb 
mamba install -c conda-forge -c bioconda snakemake
mamba install -c conda-forge -c bioconda cython
mamba install -c conda-forge -c bioconda pygraphviz
mamba install -c conda-forge -c bioconda phamb

### Clone repository
git clone the repository https://github.com/RasmussenLab/phamb.git

### Alternative to bioconda - Quick install with pip
pip install -e .

### Test installation
mkdir -p testout 
run_RF.py test/contigs.fna.gz test/clusters.tsv test testout

1. MAG annotation for isolating Metagenomic derived viromes

Database and file requirements

VAMB clusters and concatenated assemblies.

contigs.fna.gz #Concatenated assembly 
vamb/clusters.tsv   #Clustered contigs based on the above contigs.fna.gz file

Furthermore.

VOGdb - untar vog.hmm.tar.gz to get all hmm files. Concatenate all the hmm-files these into an AllVOG.hmm file [File path needs to be specified in config.yaml]
Micomplete Bacterial HMMs [File path needs to be specified in config.yaml]
Clone DeepVirFinder git clone https://github.com/jessieren/DeepVirFinder

How to Run - Parallel annotation

Copy the phamb repository, extract the mag_annotation workflow and split contigs (using the provided script) to allow annotation to be run in parallel. If you have relatively few contigs or have the patience to annotate all contigs in one batch you can skip the Snakemake part.

mkdir -p projectdir 
cd projectdir 
git clone the repository https://github.com/RasmussenLab/phamb.git
cp -r phamb/workflows/mag_annotation .
python split_contigs.py -c contigs.fna.gz

Now the contigs.fna.gz is splitted into individual assemblies i.e. assembly/{sample}/{sample}.fna
In addition, a sample_table.txt file is created with a line for each sample. Check that sample_table.txt contains sample identifiers corresponding to the ones you expect. The number of lines should correspond to the number of samples used to make the concatenated assembly (contigs.fna.gz).
Now, specify paths for databases, vamb directory, location of assembly and computational resources in mag_annotation/config.yaml.

If everything is good and set, you can run the snakemake pipeline.

# Local 
snakemake -s mag_annotation/Snakefile --use-conda -j <threads>

#Aggregate results
mkdir annotations
cat sample_annotation/*/*hmmMiComplete105.tbl > annotations/all.hmmMiComplete105.tbl
cat sample_annotation/*/*hmmVOG.tbl > annotations/all.hmmVOG.tbl
cat sample_annotation/*/*_dvf/*dvfpred.txt > annotations/DVF.predictions.txt

# Clean the DVF files for multiple headers.
head -n1 annotations/DVF.predictions.txt > DVF.header # get first header
grep -v 'pvalue' annotations/DVF.predictions.txt > DVF.predictions # get predictions 
cat DVF.header DVF.predictions > annotations/all.DVF.predictions.txt # combine

Dependent on the number of samples, it may be relevant to run the Snake-flow on a High performance computing (HPC) server.

# HPC - this won't work unless you specify a legit group on your HPC in `config.yaml`
snakemake -s Snakefile --cluster qsub -j <threads> --use-conda

How to Run - not in parallel - quick and dirty

Make sure to have Prodigal, hmmer and DeepVirFinder depedencies installed. Check under mag_annotation/envs for relevant conda environments.

mkdir annotations
gunzip contigs.fna.gz
python3 /user/DeepVirFinder/dvf.py -i contigs.fna -o DVF -l 2000 -c 1
mv DVF/contigs.fna_gt2000bp_dvfpred.txt annotations/all.DVF.predictions.txt
prodigal -i contigs.fna -d genes.fna -a proteins.faa -p meta -g 11
hmmsearch --cpu {threads} -E 1.0e-05 -o output.txt --tblout annotations/all.hmmMiComplete105.tbl <micompleteDB> proteins.faa
hmmsearch --cpu {threads} -E 1.0e-05 -o output.txt --tblout annotations/all.hmmVOG.tbl <VOGDB> proteins.faa
gzip contigs.fna

Run the RF model

Running the provided script, the virome bins are written to a fasta file and bin-annotations are summarised in vambbins_aggregated_annotation.txt.

run_RF.py contigs.fna.gz vamb/clusters.tsv annotations resultdir

ls resultsidr
resultdir/vambbins_aggregated_annotation.txt
resultdir/vambbins_RF_predictions.txt
resultsdir/vamb_bins #Concatenated predicted viral bins - writes bins in chunks to files so there might be several!

We recommend VAMB bins to be evaluated with a dedicated Viral evaluation tool like CheckV or VIBRANT to identify HQ viruses.

checkv end_to_end resultsdir/vamb_bins/vamb_bins.1.fna checkv_vamb_bins

Further information

The RF model automates filtering of VAMB bins that are most likely bacterial and therefore provides a space of plausible viral entities for further validation. The RF-model has been trained on paired Metaviromes and Metagenomes to make precisde decisions based on simple parameteres as the ones below. Compared to a single contig viral prediction model, the RF approach is very accurate. The increased performance is likely explained by the RF model evaluating on bin-level where one sequence with a low viral score does NOT lead to a misprediction of the whole bin. Aggregated information (assuming the binning is really good!) from multiple-contigs improves prediction compared to single-contigs.

The RF model take few variables to make an accurate distinction.

binsize (bp)	nhallm	distinct_VOGs_factor	cluster_DVF_score
2.000.000	100	0.2	0.3
60.000	3	1.3	0.7

Bacterial MAGs and viral MAGs from the same metagenome can be efficiently associated using crispr-spacer approaches and sequence alignment (recommended cutoffs can be found in the article). From this, Host-viral abundance dynamics and bacterial pangenome modulation can be studied. Downstream viral proteome analysis should be based on the viral regions found in the contamination.tsv file produced by CheckV to prevent contaminating bacterial genes to influence the analysis.

phamb's People

Contributors

Stargazers

Watchers

Forkers

kenkeni-zju alienzj anyihu shvartsmanirina pythseq linxingchen suleimanaminu

phamb's Issues

The predicted 'viral' number in 'vambbins_RF_predictions.txt' is inconsistent with the actual number in 'vamb_bins.1.fna'?

Hi !
why 'resultdir/vambbins_RF_predictions.txt' label is ‘viral’ and 'resultsdir/vamb_bins/vamb_bins.1.fna' sequence nums is unequal?

thanks !

Can PHAMB be used directly for Virome analysis (enrichment of viral particles followed by sequencing)

It's great to have virus binning software, you guys have made an excellent contribution. I would like to ask about the relationship of PHAMB with VirSorter2, VirFinder.
Should I use VirFinder etc. to identify virus contigs first or should I use PHAMB to binning first?

Binning question, how to use vamb?

Hi !
Before operating phamb, i use vamb process binning,
Vamb rum mode: vamb --outdir output63
--fasta R63.contigs.fa.gz
--bamfiles R63_sort.bam
-o C

report err.log : Traceback (most recent call last):
File "/public/home/bioinfo_wang/00_software/miniconda3/envs/avamb/bin/vamb", line 33, in
sys.exit(load_entry_point('vamb', 'console_scripts', 'vamb')())
File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 1395, in main
run(
File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 834, in run
cluster(
File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 665, in cluster
clusternumber, ncontigs = vamb.vambtools.write_clusters(
File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 440, in write_clusters
for clustername, contigs in clusters:
File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 701, in binsplit
for newbinname, splitheaders in _split_bin(binname, headers, separator):
File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 676, in _split_bin
raise KeyError(f"Separator '{separator}' not in sequence label: '{header}'")
KeyError: "Separator 'C' not in sequence label: 'k141_84347'"

But, the reuslt contain ‘k141_84347 ’ ：
‘ less contignames |grep "k141_84347" -A2 -B2 ' --> 'k141_512747
k141_170723
k141_84347
k141_170724
k141_512748'

the vamb operation result file contain :
'0 Oct 9 23:52 vae_clusters.tsv # why the file is empty？
7.7M Oct 9 23:52 contignames
2.6M Oct 9 23:52 lengths.npz
41K Oct 9 23:52 log.txt
77M Oct 9 23:52 latent.npz
815K Oct 9 23:51 model.pt
894 Oct 9 14:40 mask.npz
2.3M Oct 9 14:40 abundance.npz
252M Oct 9 14:38 composition.npz'

Thanks!

how to get the file 'clusters.tsv' ?

Dear @joacjo
how to get the file 'clusters.tsv' ? #Clustered contigs based on the above contigs.fna.gz file

Which steps do I need to run to get this file ?

Looking forward to you reply ,thanks a lot !

What are the criteria of RF model

Hi ！
What score, or threshold of RF model is used to classify bacteria from viruses?

thanks!

category of viruses identified by PHAMB ?

Dear@joacjo

Whether the viral sequence identified by PHAMB can be used to distinguish whether it is a phage or another virus type?
The viral sequences identified by PHAMB , are they all DNA viruses ? Can it distinguish between RNA viruses?
Is it possible to tell whether they are free virus types or integrated virus types ?

Looking forward to your reply !

contig length

Hi,

If I use contig with more than 1000 lengths, the only I need to do is change the parameter in RF.py and RF_modules.py, right?

Thanks!

MAGs?

I just had a brief look at the README while explaining what you do. You did not introduce MAGs and I had to look it up:)

It is metagenomic assembeled genome (MAG)?

Hope you are okay:)

Running Phamb on long reads polished with short reads

Hello,

I have long read nanopore data assembled with flye and polished with illumina reads. Is phamb compatible with a long read assembler like flye?

how to evaluate the bin-annotations?

Hi !
the results vambbins_aggregated_annotation.txt. and vambbins_RF_predictions.txt, which one to use to evaluate???

thanks!

interpret the results of RF model

Hi @joacjo

run_RF.py contigs.fna.gz vamb/clusters.tsv annotations resultdir

ls resultsidr
resultdir/vambbins_aggregated_annotation.txt
resultdir/vambbins_RF_predictions.txt
resultsdir/vamb_bins #Concatenated predicted viral bins

the result 'vamb_bins/vamb_bins.1.fna' is viral bins , could I consider it as a viral contig ?
because in some research paper, viral detection comes directly from assembly result, without binning of assembly contigs.

viral bins and viral contig ,could I consider these two concepts as the same?

Look forward your reply , thanks a lot !

Update shebang lines in phamb python scripts

Thanks for making this helpful viral binning tool! I have a minor request to improve the usability of the python scripts in the phamb directory, based on an issue I ran into while running the test data for phamb. See below:

Problem description

Using phamb 1.0.1, installed via conda on a linux server, I got the following error when running the test command:

Command:

phamb/run_RF.py test/contigs.fna.gz test/clusters.tsv test testout

Error:

phamb/run_RF.py: /usr/bin/python: bad interpreter: No such file or directory

Possible causes

I think the issue is caused by the shebang line in phamb/run_RF.py:

#!/usr/bin/python

Because I am running phamb in a conda env, my python is located in ${CONDA_PREFIX}/bin/python. I'm running a very clean system and don't have python installed globally.

Proposed solution

Change the shebang line of phamb/run_RF.py to a more universal shebang (discussed here):

#!/usr/bin/env python

Similarly, phamb/run_RF_modules.py and phamb/split_contigs.py could be changed to use the same shebang (they are currently using #!/bin/python).

I'm happy to make a PR for this minor change, if it helps. Thanks again!

Adapt for more recent Python versions.

Re-train the Random Forest model with a more recent version of Scikit-learn that is compatible with Python v. >3.8.
Update the documentation and dependencies accordingly.

Issues about the installation of dependence

Hi developer,

A very exciting work to develop this software to bin the phage genomes!
Unfortunately, I meet some problems in starting to install the software.
It seems like the Prerequisites you provided are conflicting and can not be installed simultaneously.

The errors log is as follows:

 (base) [mcs@mcs1 soft]$ conda create -n phamb snakemake pygraphviz python=3.8 cython scikit-learn==0.21.3
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: - 
Found conflicts! Looking for incompatible packages.                                                                                                                                                              failed                                                                                                                                                                                                             / 

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package pygraphviz conflicts for:
pygraphviz
snakemake -> pygraphviz[version='>=1.5']

Package system conflicts for:
pygraphviz -> python=3.4 -> system==5.8
snakemake -> python=3.4 -> system==5.8
cython -> python[version='>=2.7,<2.8.0a0'] -> system==5.8

Package _libgcc_mutex conflicts for:
scikit-learn==0.21.3 -> libgcc-ng[version='>=7.3.0'] -> _libgcc_mutex[version='*|0.1',build=main]
cython -> libgcc-ng[version='>=7.5.0'] -> _libgcc_mutex[version='*|0.1',build=main]
pygraphviz -> libgcc-ng[version='>=7.5.0'] -> _libgcc_mutex[version='*|0.1',build=main]
python=3.8 -> libgcc-ng[version='>=7.5.0'] -> _libgcc_mutex[version='*|0.1',build=main]

Package setuptools conflicts for:
scikit-learn==0.21.3 -> joblib[version='>=0.11'] -> setuptools
snakemake -> dropbox[version='>=7.2.1'] -> setuptools
cython -> setuptools
python=3.8 -> pip -> setuptools

Package ca-certificates conflicts for:
cython -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates
python=3.8 -> openssl[version='>=1.1.1l,<1.1.2a'] -> ca-certificates
pygraphviz -> python=2.7 -> ca-certificates

Package numpy conflicts for:
snakemake -> networkx[version='>=2.0'] -> numpy[version='1.10.*|1.11.*|1.12.*|1.13.*|>=1.11.3,<2.0a0|>=1.12.1,<2.0a0|>=1.13.3,<2.0a0|>=1.14.6,<2.0a0|>=1.15.4,<2.0a0|>=1.16.6,<2.0a0|>=1.19|>=1.19.2,<2.0a0|>=1.21.2,<2.0a0|>=1.20.3,<2.0a0|>=1.20.2,<2.0a0|>=1.9.3,<2.0a0|>=1.9|>=1.12|1.9.*|1.8.*|1.7.*|1.6.*']
scikit-learn==0.21.3 -> numpy[version='>=1.11.3,<2.0a0']
scikit-learn==0.21.3 -> scipy -> numpy[version='1.10.*|1.11.*|1.12.*|1.13.*|>=1.14.6,<2.0a0|>=1.16.6,<2.0a0|>=1.21.2,<2.0a0|>=1.15.1,<2.0a0|>=1.9.3,<2.0a0|1.9.*|1.8.*|1.7.*|1.6.*|1.5.*']

Package python conflicts for:
snakemake -> python[version='3.4.*|3.5.*|3.6.*|>=3.5,<3.6.0a0|>=3.6,<3.7.0a0']
scikit-learn==0.21.3 -> python[version='>=3.6,<3.7.0a0|>=3.7,<3.8.0a0']
python=3.8
scikit-learn==0.21.3 -> joblib[version='>=0.11'] -> python[version='2.6.*|2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.6|>=3.7|>=3.5,<3.6.0a0|>=3.10,<3.11.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0|3.4.*|3.3.*']
snakemake -> boto3 -> python[version='2.6.*|2.7.*|>=2.7,<2.8.0a0|>=3.6|>=3.7,<3.8.0a0|>=3.10,<3.11.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0|3.3.*|>=3.7|>=3.5|>=3.7.1,<3.8.0a0|>=3.3|>=3']
cython -> python[version='2.6.*|2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0|>=3.6,<3.7.0a0|>=3.5,<3.6.0a0|3.4.*|3.3.*']

Package certifi conflicts for:
snakemake -> requests[version='>=2.8.1'] -> certifi[version='>=2017.4.17']
cython -> setuptools -> certifi[version='>=2016.09|>=2016.9.26']

Package bzip2 conflicts for:
pygraphviz -> python[version='>=3.10,<3.11.0a0'] -> bzip2[version='>=1.0.8,<2.0a0']
cython -> python[version='>=3.10,<3.11.0a0'] -> bzip2[version='>=1.0.8,<2.0a0']The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.17=0
  - feature:|@/linux-64::__glibc==2.17=0
  - cython -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - python=3.8 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - scikit-learn==0.21.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']

Your installed version is: 2.17

I also tried to install these packages one at a time and I also failed.

So, maybe the versions of these packages you provided are wrong?

Hope you can help me solve this issue!

Thank you so much!

Looking forward to your reply!

Jiulong

Empty or outcommented file

Hi,

I'm trying to run PHAMB with the following:

python /work/projects/nomis/assemblies/viromes/submodules/phamb/workflows/mag_annotation/scripts/run_RF.py /work/projects/nomis/assemblies/virome_results/annotations/goodQual_final.fna.gz /work/projects/nomis/assemblies/virome_results/vamb_output/clusters.tsv /work/projects/nomis/assemblies/virome_results/dbs/phamb /work/projects/nomis/assemblies/virome_results/phamb_output

I have the below error, so could you please let me know how to fix it?

Traceback (most recent call last):
  File "/work/projects/nomis/assemblies/viromes/submodules/phamb/workflows/mag_annotation/scripts/run_RF.py", line 217, in <module>
    fastadict = _vambtools.loadfasta(infile,compress=False)
  File "/mnt/irisgpfs/projects/nomis/assemblies/viromes/submodules/phamb/workflows/mag_annotation/scripts/vambtools.py", line 383, in loadfasta
    for entry in byte_iterfasta(byte_iterator, comment=comment):
  File "/mnt/irisgpfs/projects/nomis/assemblies/viromes/submodules/phamb/workflows/mag_annotation/scripts/vambtools.py", line 264, in byte_iterfasta
    raise ValueError('Empty or outcommented file')
ValueError: Empty or outcommented file

Thank you,
Susheel

pbs vs sh

Hey Joachim,

so I sadly have to admit that I am stuck with snakemake and so I am considering "just" running a shell script. Therefore I am checking your best practice on executing jobs;)

What is the difference between .pbs and .sh scripts?

Suggesting changes to dvf.yaml

I hope this might help someone.

I ran phamb with the following command but encountered an error in DeepVirFinder.

snakemake -s /lustre7/home/bhimbiswa/MAGs/phamb/mag_annotation/Snakefile --use-conda -j 20

The first error was

Traceback (most recent call last):
  File "/lustre7/home/bhimbiswa/MAGs/phamb/DeepVirFinder/dvf.py", line 49, in <module>
    import h5py, multiprocessing
ModuleNotFoundError: No module named 'h5py'

So I removed the DeepVirFinder conda environment and added 'h5py' to dvf.yaml. But got the following error.

Traceback (most recent call last):
  File "/lustre7/home/bhimbiswa/MAGs/phamb/DeepVirFinder/dvf.py", line 53, in <module>
    import keras
  File "/lustre7/home/bhimbiswa/MAGs/phamb/.snakemake/conda/ea387140a96735e61bfb5c0b2ea20190/lib/python3.6/site-packages/keras/__init__.py", line 21, in <module>
    from tensorflow.python import tf2
ModuleNotFoundError: No module named 'tensorflow'

I found this suggestion from jessieren/DeepVirFinder#18 (comment). So I again removed the DeepVirFinder conda environment and added 'h5py=2.10.0' to dvf.yaml, but got a new error.

Using Theano backend.
WARNING (theano.configdefaults): install mkl with `conda install mkl-service`: No module named 'mkl'
Traceback (most recent call last):
  File "/lustre7/home/bhimbiswa/MAGs/phamb/DeepVirFinder/dvf.py", line 131, in <module>
    modDict[contigLengthk] = load_model(os.path.join(modDir, modName))
  File "/lustre7/home/bhimbiswa/MAGs/phamb/.snakemake/conda/c23317aff605e94c122c50b24af4b0a2/lib/python3.6/site-packages/keras/engine/saving.py", line 419, in load_model
    model = _deserialize_model(f, custom_objects, compile)
  File "/lustre7/home/bhimbiswa/MAGs/phamb/.snakemake/conda/c23317aff605e94c122c50b24af4b0a2/lib/python3.6/site-packages/keras/engine/saving.py", line 224, in _deserialize_model
    model_config = json.loads(model_config.decode('utf-8'))
AttributeError: 'str' object has no attribute 'decode'

Finally, when I changed the dvf.yaml to the following and reinstalled the DeepVirFinder conda environment I was able to run phamb without an error.

name: dvf
channels:
  - bioconda
  - conda-forge
dependencies:
  - python=3.6
  - numpy
  - theano=1.0.3
  - keras=2.2.4
  - scikit-learn
  - Biopython
  - h5py=2.10.0
  - mkl-service=2.3.0

'run_RF.py' operation problem

Dear @joacjo

Question: Which part of the source code can I change, to fit my contigs file and Run the 'run_RF.py ' to get the final result ??

R63.contigs.fa.gz header format : >k141_84347 flag=1 multi=9.0000 len=118435
CCATAAATCTGATTTTAGTCAAAAAAATATGCAGTTTTTCAAAAAGGGTGTATAATTCTTTCGTTAC

'vae_clusters.tsv' format : ‘vae_1 k141_84347
vae_2 k141_92682
vae_2 k141_551576
vae_2 k141_358295’

run mode: python run_RF.py
R63.contigs.fa.gz
vae_clusters.tsv
annotations
resultdir

err report :
Traceback (most recent call last):
File "/public/home/bioinfo_wang/00_software/phamb-v.1.0.1/phamb/run_RF.py", line 223, in
reference = run_RF_modules.Reference.from_clusters(clusters = clusters, fastadict=fastadict, minimum_contig_len=2000)
File "/public/home/bioinfo_wang/00_software/miniconda3/envs/phamb/lib/python3.9/site-packages/phamb/run_RF_modules.py", line 272, in from_clust
genomes = cls._parse_clusters(clusters,fastadict,minimum_contig_len=minimum_contig_len)
File "/public/home/bioinfo_wang/00_software/miniconda3/envs/phamb/lib/python3.9/site-packages/phamb/run_RF_modules.py", line 286, in _parse_clu
contig_len = fastadict[contig].len()
KeyError: 'k141_84347'

Thanks !

Question about the workflow

Hi,

I just finished reading the paper. I want to know the workflow of your method more preciously.

As circled in the picture: Do the binned metagenomes come from VAMB? Is the basic idea of your work to separate the viral bins from all bins and assign the virus to its host?

Thank you very much if you could give me an example to illustrate the workflow!

split_contigs.py produces empty files

HI,
I am trying to run phamb in parallel-annotation mode, and when I used split_contigs.py as the code below
python split_contigs.py -c contigs.fna.gz
the results ´assembly´ folder and ´sample_table.txt´ file are empty.
What should I do?

thanks !

Versioned release package for Phamb

I'm working to make Phamb available on Bioconda.

Please refer. -> https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository#creating-a-release

I need your help creating a versioned release to use for the Bioconda recipe. Once Phamb is added to Bioconda, it'll also be made available as a Docker container from Biocontainers, and as a Singularity image from the Galaxy Project. The Bioconda bot will also recognize future releases and automatically update the recipe.

Please let me know

Thanks
Jay

Running vamb/phamb using only Vibrant contigs

Hi,
I was wondering if it could be appropriate in your opinion to assemble reads into contigs, get the putative viral ones with Vibrant (or another equivalent software), concatenate them and then run vamb and then phamb?
Best
Greg

Error while running RF model

Hi. I am getting an error when I run the RF model.

I used the following command to start the run (I am using the latest version of phamb)

python /Phamb_new/mag_annotation/scripts/run_RF.py /Vamb/contigs.flt.fna.gz /Vamb/vamb/clusters.tsv /lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/annotations /Phamb_new/Result_dir

I got the following error.

Traceback (most recent call last):
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF.py", line 227, in <module>
    viral_annotation = run_RF_modules.Viral_annotation(annotation_files=viral_annotation_files,genomes=reference)
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF_modules.py", line 358, in __init__
    self._parse_viralannotation_file(filetype.lower(),file)
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF_modules.py", line 386, in _parse_viralannotation_file
    annotation_tuple = parse_function(line)
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF_modules.py", line 513, in _parse_dvf_row
    score =round(float(score),2)
ValueError: could not convert string to float: 'score'

This is how my "all.DVF.predictions.txt" file looks like.

name    len     score   pvalue
S10CNODE_1_length_374305_cov_118.066653 374305  0.4933076500892639      0.06760329330009819
S10CNODE_2_length_331174_cov_150.761282 331174  0.5215792059898376      0.05410151824155903
S10CNODE_3_length_327615_cov_134.196242 327615  0.6207031011581421      0.03997658433416421
S10CNODE_4_length_275508_cov_107.113522 275508  0.3987869620323181      0.09687287559483344
S10CNODE_5_length_273839_cov_39.234849  273839  0.37943029403686523     0.10166931037087393
S10CNODE_6_length_265257_cov_21.606357  265257  0.7501952648162842      0.029231815091774305
S10CNODE_7_length_254430_cov_27.129502  254430  0.6598391532897949      0.036350932849913135
S10CNODE_8_length_239244_cov_15.625518  239244  0.5251834392547607      0.05332729058085958
S10CNODE_9_length_235224_cov_151.910707 235224  0.4213518500328064      0.09149104917289826

Can you please help me in solving this?

Bhim

Running Phamb without reads

Hi! Is it possible to run PHAMB on a set of contigs that lack any contributing short reads (eg just a fasta of viral contigs). From the documentation this doesn't seem possible since VAMB requires the coabundance/coverage information upstream of PHAMB- but if so would be interested in how to run correctly- thanks!

Vcheck on predicted viral bins

I got a single Concatenated predicted viral bin from RF model in resultsdir/vamb_bins. Do I input this single file in checkV? How would checkV know which contigs belong to the same bin (or does that now matter at this stage?)

Thank you

Barbara

Random Forest Feature Names

Hello developers!
Thank you very much for putting this tool out into the world!

I ran the random forest model with the new recommended phamb dependencies like this:

python mag_annotation/scripts/run_RF.py ../contgs.fna clusters.tsv annotations resultdir

and was given this error messsage

"/home/user/miniconda3/envs/phamb/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but RandomForestClassifier was fitted with feature names"

the binning output was still produced, but I'm wondering if the model ran correctly? Why might this error be occurring?

Thank you very much!
Carrie

Result assessment!

Hi developers!
Thanks for your contribution to the study field of viral ecology!
Recently, I used the PHAMB tool to identify viral bins from my bulk metagenomes, and I had some questions about the output results.

The result files contained vambbins_RF_predictions.txt and vamb_bins/vamb_bins.1.fna. In the file of vambbins_RF_predictions.txt, there were hundreds of thousands of bins labeled as "viral", whereas, in the file of vamb_bins/vamb_bins.1.fna, there were only tens of thousands of viral bin sequences recorded. So, which one is the real result?
We have known that the input files were assembled contig sequences and the cluster information resulted from VAMB. So, why are there no gaps in the bin sequences in the file of vamb_bins/vamb_bins.1.fna? For each viral bin, were multiple contig sequences connected without gaps? How to determine their sequential order?
After I got the viral bins, CheckV was used to evaluate the viral genome quality with the file vamb_bins/vamb_bins.1.fna as input. As a result, more than 500 viral bins were considered High-quality viral genomes. That's great! However, there are more than ten viral bins with a genome length of larger than 400 bp, and the longest one is more than 600 bp. So were these viral bins potentially belonging to giant viruses? And why did CheckV consider them as high-quality viral genomes when these viral bins contained high-proportioned host genes and low-proportioned viral genes (see figure below).

Thanks for your attention and reply in patience!
Looking forward to your reply!
Jiulong

How to Run - not in parallel - quick and dirty

Dear developers
What does "dirty" mean?

Can PHAMB output comparable performance on environmental metagenome compared to gut metagenome

Hi, given the Random Forest (RF) model to distinguish viral-like from bacterial-like genome bins in PHAMB was based on gut metagenome, I am wondering that can phamb output comparable performance on environmental metagenome compared to non-gut metagenome? Have you tested the performance of PHAMB on environmental metagenome?

Thank you in advance!

modified header names in PHAMB

Hi,

My assembled contigs have headers as

c_000003956504
c_000004841845
c_000004821562

which matches with the VAMB bin headers. But when I run PHAMB, I get bin headers as :

1470111
816445
3021234
1094390

How to get the PHAMB contig headers in the initial VAMB bin headers format?.

Including micomplete Arch131 and/or VIBRANT annotations?

Hi there,

Is it possible to modify the RF script to incorporate both the micomplete Bact105.hmm and the micomplete Arch131.hmm as input annotations?

Similarly, I'd prefer to use VIBRANT annotations over DeepVirFinder. Is this also possible to do?

Thank you!

How can I control the total number of threads used by phamb?

Hi,

Thank you for a great tool!

How can I control the total number of threads used by phamb?

I'm trying running phamb on my local Ubuntu server with a total of 80 cores, and I'd like to make phamb use not more than 60 cores.

When I set "threads_ppn" value to 3 in config.yaml, and set "-j" value to 20 in snakemake command, it seems that nearly all 80 cores are used during DeepVirFinder step.

Are there any ways to limit the total number of threads used by phamb?

Thanks.

Parsing deepvirfinder line 512, in _parse_dvf_row contig_name, length, score, pvalue = line[:-1].split()

Hello,

I am really happy to be trying the PHAMB pipeline on my data. I am running it on small co assemblies, I do not have a concatenated assembly but I am running the pipeline separately for each coassembly. Is this a wrong approach?

When I run the RF model I have the following error, given by python:

Parsing deepvirfinder
Traceback (most recent call last):
  ...
  File "path/to/phamb/workflows/mag_annotation/scripts/run_RF_modules.py", line 512, in _parse_dvf_row
    contig_name, length, score, pvalue = line[:-1].split()
ValueError: too many values to unpack (expected 4)`

The head of my clusters.tsv

1	k141_169383 flag=1 multi=4.0000 len=2138
2	k141_566141 flag=1 multi=5.0000 len=1337
3	k141_562874 flag=1 multi=3.0000 len=2128
4	k141_174278 flag=1 multi=3.0000 len=1243
5	k141_155879 flag=1 multi=4.0000 len=1035
6	k141_981516 flag=0 multi=7.5058 len=1355
7	k141_615867 flag=1 multi=3.0000 len=1068
8	k141_749989 flag=1 multi=4.0000 len=1960
9	k141_945068 flag=0 multi=15.6210 len=2455
10	k141_1091919 flag=0 multi=5.9626 len=1318

the head of my all.DVF.predictions.txt

name	len	score	pvalue
k141_344865 flag=1 multi=4.0000 len=1127	1127	6.64381843762385e-07	0.8834881788654733
k141_620757 flag=0 multi=3.7828 len=1260	1260	0.061418987810611725	0.2213724601556009
k141_298883 flag=1 multi=3.0000 len=1290	1290	0.013160040602087975	0.3235138605634867
k141_390848 flag=1 multi=2.0790 len=1179	1179	0.6529936790466309	0.036823022886924996
k141_206919 flag=0 multi=10.9103 len=1479	1479	1.0	0.0
k141_505802 flag=1 multi=25.0000 len=1881	1881	0.08912927657365799	0.196616058614699
k141_1057576 flag=1 multi=3.0000 len=1049	1049	0.635226845741272	0.038635848629050534
k141_896644 flag=0 multi=200.6066 len=1872	1872	0.9405460357666016	0.01478585995921142
k141_1034585 flag=0 multi=3.0000 len=1245	1245	0.9999510645866394	0.0011518996903089357

Is it due to the 4 columns composing the name of the contigs? Any suggestions?

Thanks again for the great pipeline!

Suggestion : add checkv step beforehand to remove complete genomes

Hi,
I compared the results that I obtained using vibrant contigs before and after running phamb using checkv

Before

checkv_quality	n	mean	sum	max
Complete	557	46179.5	25721993	373392
High-quality	413	44008.8	18175622	275626

After

checkv_quality	n	mean	sum	max
Complete	351	54600	19164596	197996
High-quality	975	72413.1	70602775	622104

Where mean is the mean of contig length, sum is the total length of all contigs and max is the maximum size of the biggest “virus”.

This suggests to me that phamb wants to combine contigs even if they are considered "complete".
Maybe adding a checkv step beforehand to remove the complete ones from the phamb analysis could be useful?

High number of bacterial genes in phamb assembled bins

HI,

I used phamb with recommended workflow(not in parallel) with the default settings on my assembled metagenomic contigs (mixed of all microbial contigs). Later, I used CheckV ( with prodigal -m option enabled) on the concatenated fasta file. Strangely, CheckV analysis revealed that a large number of the bins contained a high number of host (bacterial) genes, accounting for more than 50% (many contigs with more than 70%) of the total number of genes. Surprisingly, CheckV indicates that many of these bins are complete and without contamination. However, the presence of such a large number of host genes will interfere in the downstream analysis. I have attached my checkv results for your reference.
quality_summary.txt

VAE or AAE?

Hi all,
I'd like to learn which cluster file generated by vamb is suitable for feeding to phamb, that via VAE or AAE. Many thanks!

split_contigs.py produces empty files

Dears,
I am trying to run phamb (cloned repo) in parallel and when I used split_contigs.py as the code below

python split_contigs.py -c contigs.fna.gz

it is produces and empty ´assembly´ folder and an empty ´sample_table.txt´ file. I tried with an gunziped contig file and is the same result.
What should I do?

thanks in advance!,
Sandro

rasmussenlab / phamb Goto Github PK

phamb's Introduction

phamb

[Prerequisites & Installation]

1. MAG annotation for isolating Metagenomic derived viromes

Database and file requirements

How to Run - Parallel annotation

How to Run - not in parallel - quick and dirty

Run the RF model

Further information

phamb's People

Contributors

Stargazers

Watchers

Forkers

phamb's Issues

Problem description

Possible causes

Proposed solution

Recommend Projects

Recommend Topics

Recommend Org