Comments (4)
Hi Quan,
Thank you for writing. I think there is a bit of confusion over the
variable names and how they relate to each other. The first thing to note
is that λ and kcov refer to the same value; we just use λ in the written
document and kcov in the code. The modeling tries to identify 4 peaks
centered at λ, 2λ, 3λ, and 4λ. These 4 peaks correspond to the mean
coverage levels of the unique heterozygous, unique homozygous, repetitive
heterozygous and repetitive homozygous sequences, respectively. So when it
estimates the haploid genome size, it divides by 2λ, which is the
average homozygous coverage, not "2 times the estimated coverage for
homozygous k-mers" as you write.
The other confusing aspect is what is meant by haploid genome size versus
diploid genome size. We consider the haploid genome size to mean the span
of one complete set of haploid chromosomes and the diploid genome size to
be the span of both haploid copies (total DNA content in one diploid cell).
In particular, in a human cell, the haploid genome size is about 3Gbp and
the diploid genome size is about 6Gbp. If you sequence a total of 300Gbp
for a human genome, that would be about 150Gbp (50x coverage) of the
maternal haplotype and about 150Gbp (50x coverage) of the paternal
haplotype. But since the heterozygosity rate in humans is so low, the main
peak in the distribution would be centered around 100x. However,
GenomeScope will still try to fit the 4 peaks, so it should set the
heterozygous kmer coverage λ equal to 50x, and thus the homozygous coverage
to 2*λ = 100x. From this GenomeScope will compute the haploid genome size
as the total amount of sequence data (300Gbp) divided by the homozygous
coverage (100x) to report 3Gbp as expected. Kmers with higher coverage are
naturally scaled as well: kmers that occur 200 or 300 times in the kmer
profile (and thus are 2 or 3 copy repeats in the diploid genome, 4 or 6
times in the haploid sequences) are still scaled by 100x to contribute 2 or
3 copies to the estimate. Finally, note that if the two haplotypes have
significantly different lengths, then the reported haploid genome size will
be the average of the two.
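
A minimal sketch of the arithmetic above, not GenomeScope's actual code. The function name `haploid_genome_size`, the dictionary histogram layout, and the fixed `error_cutoff` are assumptions made for illustration; GenomeScope identifies error kmers from the model fit rather than from a fixed threshold.

```python
def haploid_genome_size(histogram, kcov, error_cutoff):
    """Estimate haploid genome size from a k-mer histogram.

    histogram    : dict mapping coverage -> number of distinct kmers
                   seen at that coverage (hypothetical layout)
    kcov         : heterozygous kmer coverage (lambda in the written notes)
    error_cutoff : coverage at or below which kmers are treated as errors
                   (a simplification assumed for this sketch)
    """
    # Total number of kmer occurrences, excluding presumptive errors.
    total_kmers = sum(cov * n for cov, n in histogram.items() if cov > error_cutoff)
    # Divide by the homozygous coverage, 2 * lambda.
    return total_kmers / (2 * kcov)


# Toy histogram: 1000 error kmers, 10 heterozygous kmers at 50x,
# 100 homozygous kmers at 100x, and 5 two-copy repeat kmers at 200x.
toy_histogram = {1: 1000, 50: 10, 100: 100, 200: 5}
toy_size = haploid_genome_size(toy_histogram, kcov=50, error_cutoff=5)

# The human example from the text: 300 Gbp of sequence, lambda = 50x.
human_size = 300e9 / (2 * 50)
```

In the toy histogram, each 200x kmer contributes 2 copies and each 50x heterozygous kmer contributes half a copy, which matches the scaling and averaging described above.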
Hope this helps!
Mike
On Thu, Oct 13, 2016 at 1:04 AM, danshu wrote:
Hi,
As mentioned in the Supplementary Notes and Figures 1.3.2 Genome Size
Estimation, the haploid genome size is estimated by: "This estimate is
revised by summing the total number of k-mers, except presumptive
sequencing errors identified as in section 1.3.1, and dividing by the 2*λ,
the estimated coverage for homozygous k-mers".
If I understand it correctly, λ is the mean of a distribution, the
estimated coverage for homozygous k-mers; and in the genomescope profile,
kcov is the estimated coverage for heterozygous kmers.
Could you explain how GenomeScope estimates haploid genome size,
specifically why it divides by 2 times the estimated coverage for
homozygous k-mers?
Best,
Quan
Hi Mike,
Thanks for your explanation. The reason I asked this question is that the genome size estimated by GenomeScope is about half of what I get by summing all kmers and dividing by 2*λ.
I used dsk to count kmers, whereas GenomeScope uses jellyfish to count kmers. This should be the cause of the difference, since jellyfish by default will not count kmers with frequencies higher than 10000.
Thanks,
Quan
Yes, the genome size calculation is pretty straightforward. The hardest
part is deciding which kmers are caused by sequencing errors. At
reasonable coverage levels this is usually pretty easy for GenomeScope to
figure out, as they are kmers with low coverage that don't fit the expected model. Oh,
and be careful with kmers that have extremely high coverage: we see that
these are often artifacts (like phiX spike-in sequence) or, in the case of
plant genomes, mitochondrial sequences. This inflated our original
Arabidopsis estimate by several tens of megabases until we corrected for it.
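
To see how much extremely high coverage kmers can move the estimate, here is a rough sketch that computes the size with and without an upper coverage cap. The function name, the `max_cov` parameter, and the choice to drop kmers above the cap are assumptions for illustration only, and do not describe how GenomeScope or any kmer counter handles them.

```python
def genome_size_with_cap(histogram, kcov, error_cutoff, max_cov=None):
    """Haploid genome size estimate with an optional upper coverage cap.

    histogram : dict mapping coverage -> number of distinct kmers
                (hypothetical layout)
    max_cov   : if given, kmers with coverage above this value are
                ignored (an assumption made for this sketch)
    """
    total_kmers = 0
    for cov, n in histogram.items():
        if cov <= error_cutoff:
            continue  # presumptive sequencing errors
        if max_cov is not None and cov > max_cov:
            continue  # extremely high coverage kmers, possibly artifacts
        total_kmers += cov * n
    return total_kmers / (2 * kcov)


# Toy data: 100 homozygous kmers at 100x plus 2 kmers at 20000x.
toy = {1: 1000, 100: 100, 20000: 2}
uncapped = genome_size_with_cap(toy, kcov=50, error_cutoff=5)
capped = genome_size_with_cap(toy, kcov=50, error_cutoff=5, max_cov=10000)
```

In this toy case just two very high coverage kmers account for most of the uncapped estimate, so it is worth checking where such kmers come from before trusting either number.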
Cheers,
Mike
Dear Mike,
Thanks for your advice. In my case, these extremely high coverage kmers seem to come from tandem repeats.
Best,
Quan