Comments (5)
Hi Jiulong
Thank you for the kind words! I will address your questions one by one.
- If you check out this script: https://github.com/RasmussenLab/phamb/blob/master/workflows/mag_annotation/scripts/run_RF.py
There is a minimum bin-size argument to the function (write_concat_bins) that concatenates the contigs of a viral-like bin; by default it is set to 5000 bp. Only bins with a size of >= 5000 bp are written to the .fna files, which explains the discrepancy in numbers. You can change this argument to, e.g., 2000 to have smaller bins written to the .fna files if you are looking for micro viruses.
- Like with bacterial MAGs, the sequential order of the contigs in a viral bin/viral MAG is not known unless you find a reference genome to guide how to put the contig puzzle together, and even then mosaicism in viruses may hinder the effort. The contigs are, by choice, not connected by gaps like "XXX" or some other accepted DNA character in FASTA files, as that might interfere with the viral evaluation machinery of CheckV.
- I am glad you were also surprised to find some viral bins annotated as "High-quality" even though they contain numerous host genes. So were we when we evaluated viral MAGs with CheckV. If you look closely at your table, those viral bins were evaluated by CheckV's HMM model, which is only benchmarked and evaluated on single-contig viruses and is not tailored to viral MAGs. In the manuscript we address these predictions and recommend that they not be trusted and be discarded; instead, researchers should focus on AAI-based predictions, which are closer to a whole-genome alignment. That does not mean that all HMM-based predictions are wrong; they are simply based on viral markers that work best with single-contig viruses.
The last two rows in your table look very much like giant viruses and were predicted by the AAI model.
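To make the minimum bin-size behaviour from the first point concrete, here is a minimal sketch of the idea. It is not the actual write_concat_bins code from run_RF.py: the function names `concat_bins` and `to_fasta` and the `min_size` parameter are hypothetical and only mirror the behaviour described above (contigs joined without gap characters, bins below the threshold skipped).

```python
from typing import Dict, List


def concat_bins(bins: Dict[str, List[str]], min_size: int = 5000) -> Dict[str, str]:
    """Join each bin's contigs without gap characters and keep bins >= min_size bp.

    Hypothetical helper for illustration; not the phamb implementation.
    """
    kept = {}
    for bin_name, contigs in bins.items():
        sequence = "".join(contigs)  # no "XXX"/"NNN" spacer between contigs
        if len(sequence) >= min_size:
            kept[bin_name] = sequence
    return kept


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Format name -> sequence pairs as FASTA text with fixed line width."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)
```

With the default threshold a 2500 bp bin would be dropped, while calling `concat_bins(bins, min_size=2000)` would keep it, which is the kind of change suggested above.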
I hope you find this information helpful.
Best,
Joachim
from phamb.
Hi @joacjo,
I really appreciate your kind and patient reply!
- As you said, I checked some viral bin sequences and found that all contigs were connected without any gaps. So I wonder whether these contigs were joined in a random order to form the bin sequences. I have some other viral bins (actually NCLDV MAGs) obtained through other methods, and I want to merge all these viral bins and then cluster them at the species level. Can I connect the contigs of my NCLDV MAGs in a random order to generate the viral bins? Additionally, could joining contigs in a random order affect downstream analyses of these viral bins, such as gene annotation?
- Thanks for pointing out my mistake in selecting the HQ viral bins. Should I select only the HQ viral MAGs evaluated by CheckV's AAI-based high-confidence model, or those from both the high-confidence and medium-confidence models? Do you think the Medium-quality viral MAGs should also be selected for the downstream analyses?
- I wonder if this tool is suitable for binning NCLDV MAGs or only for binning phage MAGs?
Thanks for your reply!
Best,
Jiulong
Hi Jiulong
- If you obtained your NCLDV MAGs with either MetaBAT2 or VAMB, there is no order to the contigs; they were simply grouped together. If you want to dereplicate them together with your new viral bins, you could dereplicate at the MAG level with something like https://github.com/MrOlm/drep. I believe it takes FASTA files as input, that is, one FASTA file per bin with the contigs as separate FASTA entries.
Gene annotation could potentially be affected by the random order, although I think it is quite rare and I have not seen a benchmark paper on this. For example, if the last gene on contig 1 is partial and has no stop codon, it can get connected to the first gene on contig 2 that only has a stop codon, thus creating an artificial complete gene. So in the worst-case scenario, if you are very unlucky, you might create one artificial gene at every junction between two contigs in your bin.
- I would include both high-confidence and medium-confidence for High-quality AAI-predicted bins. Regarding Medium-quality viral bins, I would probably get a third-party prediction on those to increase the confidence in their "viralness", for example by also running VirFinder or VIBRANT.
- That is a good question. I am sure the VAMB binner also bins NCLDV MAGs, since we have found several huge viruses in our benchmarks. The more relevant question is how those can be confidently identified.
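The selection suggested in the second point could be sketched as a small filter over CheckV's summary table. This is only a sketch under assumptions: the column names (`contig_id`, `checkv_quality`, `completeness_method`) and the method labels are what I would expect in CheckV's quality_summary.tsv, so check them against your own output before relying on it. The function name `select_hq_aai` is hypothetical.

```python
import csv

# Assumed labels in the completeness_method column; verify against your CheckV version.
AAI_METHODS = frozenset({
    "AAI-based (high-confidence)",
    "AAI-based (medium-confidence)",
})


def select_hq_aai(tsv_handle, methods=AAI_METHODS):
    """Return IDs of bins rated High-quality by an AAI-based completeness estimate."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [
        row["contig_id"]
        for row in reader
        if row["checkv_quality"] == "High-quality"
        and row["completeness_method"] in methods
    ]
```

Called as `select_hq_aai(open("quality_summary.tsv"))`, it skips HMM-based High-quality calls; Medium-quality bins are left out here on purpose, since those would need the extra third-party check mentioned above.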
Best,
Joachim
Hi Joachim,
- Regarding the worst-case scenario you mentioned, I totally agree with you! So, how about splitting the viral bins back into their original contigs? This would give one FASTA file per bin with the contigs as separate FASTA entries, which is also convenient for dereplication with dRep.
- You are right that it is difficult to confidently identify NCLDV or phage MAGs with this tool. In any case, developing such a powerful tool is already a great achievement!
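The worst-case scenario discussed above can be illustrated with a toy example. The ORF finder below is a deliberately naive sketch (forward strand only, ATG to the first in-frame stop codon), not a real gene caller, and the sequences are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def complete_orfs(seq, min_codons=3):
    """Return (start, end) of forward-strand ORFs with both an ATG and an in-frame stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


contig_1 = "ATGAAACCCGGG"  # partial gene: start codon but no stop codon
contig_2 = "TTTTAAGC"      # tail of another gene: stop codon but no start codon
joined = contig_1 + contig_2

print(complete_orfs(contig_1))  # no complete ORF on its own
print(complete_orfs(contig_2))  # no complete ORF on its own
print(complete_orfs(joined))    # one ORF spanning the junction at position 12
```

Neither contig contains a complete ORF alone, but the gapless concatenation yields one that starts in contig 1 and ends in contig 2, which is exactly the artificial gene described above.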
Best,
Jiulong
Hi Jiulong
Regarding (1): yes, writing the contigs of each viral bin of interest as separate entries in its own FASTA file is the most straightforward and safest way to do gene annotation. As you say, it also leaves your viral bins in a format suitable for dereplication with tools like dRep.
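A minimal sketch of that per-bin layout is shown below. It assumes a two-column, tab-separated cluster file (bin name, then contig name, no header) and a FASTA file with all contigs; both the file layout and the helper names (`read_fasta`, `read_clusters`, `write_bins`) are assumptions for illustration, not part of phamb.

```python
from collections import defaultdict
from pathlib import Path


def read_fasta(path):
    """Read a FASTA file into {contig_name: sequence}; the name is the first header word."""
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def read_clusters(path):
    """Read 'bin<TAB>contig' lines into {bin_name: [contig_name, ...]}."""
    bins = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            if line.strip():
                bin_name, contig = line.rstrip("\n").split("\t")[:2]
                bins[bin_name].append(contig)
    return bins


def write_bins(contigs_fasta, clusters_tsv, outdir, wanted=None):
    """Write one .fna per bin, each contig as its own FASTA entry (no concatenation)."""
    contigs = read_fasta(contigs_fasta)
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for bin_name, names in read_clusters(clusters_tsv).items():
        if wanted is not None and bin_name not in wanted:
            continue
        out_path = outdir / f"{bin_name}.fna"
        with open(out_path, "w") as out:
            for contig in names:
                out.write(f">{contig}\n{contigs[contig]}\n")
        written.append(out_path)
    return written
```

Passing the set of viral bin names as `wanted` restricts the output to the bins of interest; the resulting directory holds one multi-entry FASTA file per bin.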
I hope the information was helpful to you. I will close the issue; feel free to open another if other questions or suggestions arise.
Best,
Joachim