
Comments (5)

joacjo commented on June 21, 2024

Hi Jiulong

Thank you for the kind words! I will address your questions one by one.

  1. If you check out this script: https://github.com/RasmussenLab/phamb/blob/master/workflows/mag_annotation/scripts/run_RF.py
    The function that concatenates the contigs of a viral-like bin (write_concat_bins) takes a minimum bin-size argument, which is set to 5000 bp by default. Only bins with a total size of >= 5000 bp are written to the .fna files, which explains the discrepancy in numbers. You can change this argument to, e.g., 2000 to have smaller bins written to the .fna files if you are looking for micro viruses.

  2. As with bacterial MAGs, the sequential order of the contigs in a viral bin/viral MAG is not known, unless you find a reference genome to guide how the contigs fit together, and even then mosaicism in viruses may hinder this effort. The contigs are, by choice, not joined by gap characters like "XXX" or some other accepted DNA character in FASTA files, as this might interfere with CheckV's viral evaluation machinery.

  3. I am glad you were also surprised to find some viral bins annotated as "High-quality" even though they contain numerous host genes. So were we when we evaluated viral MAGs with CheckV. If you look closely at your table, those viral bins were evaluated by CheckV's HMM-based model, which is only benchmarked and evaluated on single-contig viruses and is not tailored for viral MAGs. In the manuscript we address these predictions and recommend that they not be taken at face value and be discarded; instead, researchers should focus on AAI-based predictions, which are closer to a whole-genome alignment. That does not mean that all HMM-based predictions are wrong, though. They are just based on viral markers that work best with single-contig viruses.

The last two rows in your table look very much like giant viruses and were predicted by the AAI-based model.
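
To make the size cutoff in point 1 concrete, here is a minimal sketch of a minimum bin-size filter. It is not the code from run_RF.py: the function name `filter_bins_by_size`, the dictionary layout, and the contig names and lengths are all made up for illustration. Only the 5000 bp default and the 2000 bp alternative come from the explanation above.

```python
from typing import Dict, List


def filter_bins_by_size(
    bins: Dict[str, List[str]],
    contig_lengths: Dict[str, int],
    min_bin_size: int = 5000,
) -> Dict[str, List[str]]:
    """Keep only bins whose summed contig length is >= min_bin_size (bp)."""
    kept = {}
    for bin_name, contigs in bins.items():
        total = sum(contig_lengths[c] for c in contigs)
        if total >= min_bin_size:
            kept[bin_name] = contigs
    return kept


# Toy data (hypothetical bin names, contig names and lengths).
bins = {"bin_1": ["c1", "c2"], "bin_2": ["c3"], "bin_3": ["c4"]}
contig_lengths = {"c1": 4000, "c2": 3000, "c3": 2500, "c4": 1200}

default_kept = filter_bins_by_size(bins, contig_lengths)
relaxed_kept = filter_bins_by_size(bins, contig_lengths, min_bin_size=2000)
print(sorted(default_kept))  # ['bin_1']
print(sorted(relaxed_kept))  # ['bin_1', 'bin_2']
```

With the default cutoff only the 7000 bp bin survives; lowering the cutoff to 2000 bp also keeps the 2500 bp bin, which is the kind of difference that shows up as fewer entries in the .fna files than in the table.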

I hope you find this information helpful.

Best,
Joachim

from phamb.

Jiulong-Zhao commented on June 21, 2024

Hi @joacjo,
I really appreciate your kind and patient reply!

  1. As you said, I checked some viral bin sequences and found that all contigs were joined without any gaps. So I wonder whether these contigs were joined in a random order to form the bin sequences. I have some other viral bins (actually NCLDV MAGs) obtained through other methods, and I want to merge all of these viral bins and then cluster them at the species level. Can I join the contigs of my NCLDV MAGs in a random order to generate the viral bins? Additionally, could joining contigs in a random order affect downstream analyses of these viral bins, such as gene annotation?
  2. Thank you for pointing out my mistake in selecting the HQ viral bins. Should I select only the HQ viral MAGs evaluated by CheckV's AAI-based high-confidence model, or those from both the high-confidence and medium-confidence models? Do you think Medium-quality viral MAGs should also be selected for downstream analyses?
  3. I wonder if this tool is suitable for binning NCLDV MAGs or only for binning phage MAGs?

Thanks for your reply!
Best,
Jiulong


joacjo commented on June 21, 2024

Hi Jiulong

  1. If you obtained your NCLDV MAGs with either MetaBAT2 or VAMB, there is no order to the contigs; they were simply grouped together. If you want to dereplicate them together with your new viral bins, you could do so at the MAG level with something like https://github.com/MrOlm/drep. I believe it takes FASTA files as input, that is, one FASTA file per bin with the contigs as separate FASTA entries.

Gene annotation could potentially be affected by the random order, although I think it is quite rare and I have not seen a benchmark paper on this. For example, if the last gene on Contig 1 is partial and has no stop codon, it can get joined to the first gene on Contig 2, which only has a stop codon, thus creating an artificial complete gene. So in the worst case, if you are very unlucky, you might create one artificial gene for every two contigs in your bin.

  2. I would include both high-confidence and medium-confidence estimates for High-quality AAI-predicted bins. Regarding Medium-quality viral bins, I would probably get a third-party prediction on those to increase the confidence in their "viralness", for example by running VirFinder or VIBRANT as well.

  3. That is a good question. I am sure the VAMB binner also bins NCLDV MAGs, since we have found several huge viruses in our benchmarks. The more relevant question is how those can be confidently identified.
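
As a rough sketch of the selection described in point 2, the snippet below keeps High-quality bins whose completeness estimate is AAI-based (high- or medium-confidence) and drops the HMM-based ones. The column names (`contig_id`, `checkv_quality`, `completeness_method`) and the value strings are assumed to match CheckV's quality_summary.tsv; please check them against the header of your own file. The rows are toy data, not real results.

```python
import csv
import io

# Toy stand-in for a CheckV quality summary table, trimmed to three columns.
# Column names and value strings are assumptions; verify against your file.
toy_tsv = (
    "contig_id\tcheckv_quality\tcompleteness_method\n"
    "bin_1\tHigh-quality\tAAI-based (high-confidence)\n"
    "bin_2\tHigh-quality\tAAI-based (medium-confidence)\n"
    "bin_3\tHigh-quality\tHMM-based (lower-bound)\n"
    "bin_4\tMedium-quality\tAAI-based (high-confidence)\n"
)


def select_hq_aai(handle):
    """Return IDs of High-quality rows whose completeness estimate is AAI-based."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["contig_id"]
        for row in reader
        if row["checkv_quality"] == "High-quality"
        and row["completeness_method"].startswith("AAI-based")
    ]


selected = select_hq_aai(io.StringIO(toy_tsv))
print(selected)  # ['bin_1', 'bin_2']
```

For a real table, open the file and pass the file handle instead of the `io.StringIO` object. Medium-quality bins such as `bin_4` are deliberately left out here, since those would first need a second opinion from another tool.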

Best,
Joachim


Jiulong-Zhao commented on June 21, 2024

Hi Joachim,

  1. Regarding the worst-case scenario you mentioned, I totally agree with you! So, how about splitting the viral bins back into their original contigs? This would give one FASTA file per bin with the contigs as separate FASTA entries, which is convenient for dereplication with dRep.
  2. You are right that it is difficult to confidently identify NCLDV or phage MAGs with this tool. In any case, it is great that you have developed such a powerful tool!

Best,
Jiulong


joacjo commented on June 21, 2024

Hi Jiulong

Regarding (1): yes, writing the contigs of your viral bins of interest as separate FASTA entries, with one file per bin, is the most straightforward and safest way to do gene annotation. Plus, as you say, your viral bins are then in a format suitable for dereplication with tools like dRep 👍
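
A minimal sketch of that layout, assuming you already have a mapping from bin names to contig names and the contig sequences in memory. The function name `write_bin_fastas`, the `.fna` naming, and the toy sequences are illustrative only and are not part of phamb or dRep.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def write_bin_fastas(
    bins: Dict[str, List[str]],
    sequences: Dict[str, str],
    out_dir: Path,
    line_width: int = 60,
) -> List[Path]:
    """Write one FASTA file per bin, one entry per contig (no concatenation)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for bin_name, contigs in bins.items():
        path = out_dir / f"{bin_name}.fna"
        with path.open("w") as fh:
            for contig in contigs:
                seq = sequences[contig]
                fh.write(f">{contig}\n")
                # Wrap the sequence at a fixed line width.
                for i in range(0, len(seq), line_width):
                    fh.write(seq[i:i + line_width] + "\n")
        written.append(path)
    return written


# Toy data (hypothetical bin and contig names).
bins = {"bin_1": ["c1", "c2"], "bin_2": ["c3"]}
sequences = {"c1": "ATGC" * 20, "c2": "GGCCTTAA", "c3": "ATATAT"}

out_dir = Path(tempfile.mkdtemp())
paths = write_bin_fastas(bins, sequences, out_dir)
print([p.name for p in paths])  # ['bin_1.fna', 'bin_2.fna']
```

Because every contig keeps its own header, no gene can span a junction between two contigs, and the resulting directory of per-bin files is the kind of input a MAG-level dereplication tool expects.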

I hope the information was helpful to you. I will close the issue; feel free to open another one if other questions or suggestions arise.

Best,
Joachim
