I just mapped 0.5mil 150bp HiSeq reads from a human gut metagenome (previously profile

Thanks for the explanation! So in the <a href="https://www.biorxiv.org/content/10.1101

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

nothing is mapping to RefSeq-bacteria about strainpro HOT 19 CLOSED

nick-youngblut commented on July 17, 2024

nothing is mapping to RefSeq-bacteria

from strainpro.

Comments (19)

hsinnan75 commented on July 17, 2024

StrainPro is more accurate when the read depth is deep enough. The default sensitivity is at least 20X. StrainPro only reports a taxid if its read depth is above the threshold. You may try to use "-depth" to specify the read depth. For example, if you run StrainPro with "-depth 10", StrainPro will be more sensitive to low coverage data. Another option, "-freq" is used to set the minimal read count for reporting a candidate taxid (the default is 50). The benchmark data sets we used in our manuscript were generated with 50X of read coverage.

I forgot to put the two options in the help message. I've updated a new version. I also corrected the typo. Thank you very much!

from strainpro.

hsinnan75 commented on July 17, 2024

The reason that you did not get any hits at the kingdom and phylum level when using 0.5 mil read sample is that the read counts are based on the subspecies or species level. If StrainPro cannot identify any taxid at subspecies/species, it will not output any result. StrainPro was designed for identifying taxids at subspecies/species level.

from strainpro.

nick-youngblut commented on July 17, 2024

Thanks for the explanation! So in the StrainPro biorxiv paper, the recall is ~1 for all simulated datasets. I guess that means that all taxa in each simulation had >= 20X coverage, correct? I might have missed it in the paper: is their a plot/table showing StrainPro precision/recall as a function of coverage, including coverages below 20X?

from strainpro.

hsinnan75 commented on July 17, 2024

Yes, all taxa in each simulation had 50X coverage. Sorry, we did not evaluate the performance in terms of read coverage. We'll generate simulated data with different depths and evaluate the precision/recall accordingly. Thank you for the suggestion.

from strainpro.

hsinnan75 commented on July 17, 2024

@nick-youngblut I generated another 4 datasets with different read depths. For dataset with N-fold, I ran GSAlign with arguments of "-depth N -freq N". Then I evaluated the performance of each dataset. Here is the result. The performance on each dataset is all identical.

Coverage	Subspecies_Precision	Subspecies_Recall	Species_Precision	Species_Recall	Genus_Precision	Genus_Recall
10	0.9756	1.0000	1.0000	1.0000	1.0000	1.0000
20	0.9756	1.0000	1.0000	1.0000	1.0000	1.0000
30	0.9756	1.0000	1.0000	1.0000	1.0000	1.0000
40	0.9756	1.0000	1.0000	1.0000	1.0000	1.0000
50	0.9756	1.0000	1.0000	1.0000	1.0000	1.0000

from strainpro.

nick-youngblut commented on July 17, 2024

That's great! Thanks for running the test! Any idea where accuracy drops off and how quickly? For instance, does accuracy drop sharply at coverages below 10X or is accuracy still OK at 5X or even lower? I'm interested in the abundances of relatively rare species in human gut metagenomes, so it will be hard to get even 10X coverage.

from strainpro.

hsinnan75 commented on July 17, 2024

Since StrainPro identifies taxa based on the mapping density in each representative sequence, it is very efficient to ignore false alignments. Suppose G1 is present and G2 is absent, then we can expect to see a certain number of reads will be mapped to each representative sequence of G1. And we can expect to see only a few reads mapped to representative sequences of G2.

True alignments:

False alignments:

from strainpro.

hsinnan75 commented on July 17, 2024

@nick-youngblut I generated a dataset of depth 5X and ran StrainPro again. Its performance is identical to those on other datasets. However, I ran StrainPro with "-depth 5 -freq 5" on the dataset of depth 50X. The performance is shown below. It implies that if you use a low threshold for depth and read count on a dataset of deep depth, the precision on subspecies level detection would be not as good as high threshold.

Coverage	Subspecies_Precision	Subspecies_Recall	Species_Precision	Species_Recall	Genus_Precision	Genus_Recall
50X (-depth 5 -freq 5)	0.6452	1.0000	1.0000	1.0000	1.0000	1.0000

from strainpro.

hsinnan75 commented on July 17, 2024

I forgot to mention that the database I used is all bacterial genomes in NCBI RefSeq database, and the testset is SimData_1 in our manuscript.

from strainpro.

nick-youngblut commented on July 17, 2024

Thanks for running the additional test! So sub-species accuracy declines between 5-10X? What about the accuracy in estimating abundances (like the last table in your biorxiv paper)?

from strainpro.

hsinnan75 commented on July 17, 2024

The taxonomic prediction is more reliable if the read coverage is higher. The previous experiment suggests that if you lower the threshold of minimal read depth and read count, it would probably introduce more false predictions. Regarding the estimating abundances, we have shown StrainPro is able to produce better estimation than existing tools since we estimate the abundances based on the mapping density rather than read counts.

It is noteworthy that we did not test StrainPro on any real datasets at strain level since we could not find any real datasets with highly confident annotation. We did test on some real datasets at phylum level and the prediction is identical to other method's. If you have suitable datasets, please let me know. Thank you!

from strainpro.

nick-youngblut commented on July 17, 2024

Thanks for clarifying. We have previously run krakenuniq on some simulated datasets containing varying levels of intra-species diversity. The main goal of these simulations was to determine whether krakenuniq can accurately predict genome-level abundances for all genomes in a species/genus (eg., all pseudomonas genomes). Our simulations suggest that krakenuniq can predict genome-level abundances with pretty good accuracy, but it depends on the amount of intra-species diversity. We will compare StrainPro to krakenuniq.

In regards to the taxIDs, StrainPro can handle duplicate taxIDs in the reference dataset, correct? We can only provide taxIDs for our genomes at the species level, so multiple genomes have the same taxIDs. krakenuniq deals with this issue by creating pseudo-taxIDs.

from strainpro.

hsinnan75 commented on July 17, 2024

Yes, StrainPro can handle duplicate taxIDs in the reference dataset. However, they will be considered as one entity. The abundance estimation at species level or above is based on the read count instead of mapping density.

from strainpro.

nick-youngblut commented on July 17, 2024

So if we want genome-level abundances, then we would have to create pseudo-taxIDs and add them to the taxdump?

from strainpro.

hsinnan75 commented on July 17, 2024

If two or more genomes belong to the same species in the sample community and are labelled with the same taxID, their abundance levels cannot be estimated separately since they are measured by taxIDs. Technically StrainPro can estimate genome-level abundances if we consider sequence ID rather than taxID. If you can create pseudo-taxIDs for each genome, then StrainPro can produce genome-level abundances. However, if the database is modified, it is necessary to re-build representative sequences.

from strainpro.

nick-youngblut commented on July 17, 2024

Will StrainPro then ignore those pseudo-taxIDs when aggregating abundances at the species, genus, ..., kingdom levels?

from strainpro.

hsinnan75 commented on July 17, 2024

When estimating abundances at species-level or above, we use all read counts that belong to its sub-ranks. Since all pseudo-taxIDs are considered to be at subspecies-level, their abundances will only appear at subspecies-level.

from strainpro.

nick-youngblut commented on July 17, 2024

Since all pseudo-taxIDs are considered to be at subspecies-level, their abundances will only appear at subspecies-level.

Why not aggregate sub-species to the species level (and courser levels)?

Given the lack of support for genome-level abundances, I'm currently inclined to think that krakenuniq is more flexible than StrainPro.

from strainpro.

hsinnan75 commented on July 17, 2024

StrainPro predicts all-level abundances. Since pseudo-taxIDs are only assigned to sub-species, the abundance for pseudo-taxIDs will combine together to estimate the abundance of upper-levels.

from strainpro.

nothing is mapping to RefSeq-bacteria about strainpro HOT 19 CLOSED

Comments (19)

Related Issues (6)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent