Genome Size Estimation about genomescope HOT 4 CLOSED

schatzlab commented on August 13, 2024

Genome Size Estimation

from genomescope.

Comments (4)

mschatz commented on August 13, 2024

Hi Quan,

Thank you for writing. I think there is a bit of confusion over the
variable names and how they relate to each other. The first thing to note
is that λ and kcov refer to the same value: we use λ in the written
document and kcov in the code. The model tries to identify 4 peaks
centered at λ, 2*λ, 3*λ, and 4*λ. These 4 peaks correspond to the mean
coverage levels of the unique heterozygous, unique homozygous, repetitive
heterozygous, and repetitive homozygous sequences, respectively. So when it
estimates the haploid genome size, it divides by 2*λ, which is the
average homozygous coverage, not "2 times the estimated coverage for
homozygous k-mers" as you write.
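
As a rough illustration of how the four peak centers relate to kcov, here is a minimal Python sketch. The names PEAK_LABELS and peak_centers are hypothetical and chosen for this example; they are not taken from the GenomeScope code.

```python
# Illustrative only: these names are assumptions, not GenomeScope identifiers.
PEAK_LABELS = (
    "unique heterozygous",
    "unique homozygous",
    "repetitive heterozygous",
    "repetitive homozygous",
)


def peak_centers(kcov):
    """Map each modeled peak to its mean coverage: 1, 2, 3 and 4 times kcov."""
    return {label: (i + 1) * kcov for i, label in enumerate(PEAK_LABELS)}


centers = peak_centers(50)
for label, coverage in centers.items():
    print(f"{label}: {coverage}x")
```

With kcov = 50, the unique homozygous peak sits at 2 * kcov = 100x, which is the value used as the divisor for the haploid genome size.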

The other confusing aspect is what is meant by haploid genome size versus
diploid genome size. We consider the haploid genome size to mean the span
of one complete set of haploid chromosomes and the diploid genome size to
be the span of both haploid copies (total DNA content in one diploid cell).
In particular, in a human cell, the haploid genome size is about 3Gbp and
the diploid genome size is about 6Gbp. If you sequence a total of 300Gbp
for a human genome, that would be about 150Gbp (50x coverage) of the
maternal haplotype and about 150Gbp (50x coverage) of the paternal
haplotype. But since the heterozygosity rate in humans is so low, the main
peak in the distribution would be centered around 100x. However,
GenomeScope will still try to fit the 4 peaks, so it should set the
heterozygous kmer coverage λ equal to 50x, and thus the homozygous coverage
to 2*λ = 100x. From this GenomeScope will compute the haploid genome size
as the total amount of sequence data (300Gbp) divided by the homozygous
coverage (100x) to report 3Gbp as expected. Kmers with higher coverage are
naturally scaled as well: kmers that occur 200 or 300 times in the kmer
profile (and thus are 2 or 3 copy repeats in the diploid genome, 4 or 6
times in the haploid sequences) are still scaled by 100x to contribute 2 or
3 copies to the estimate. Finally, note that if the two haplotypes have
significantly different lengths, then the reported haploid genome size will
be the average of the two.
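
The arithmetic of the human example can be sketched in a few lines of Python. This is a simplified illustration that follows the numbers above (total sequence divided by homozygous coverage); the function names are hypothetical, and the actual tool works from the kmer profile rather than from raw base totals.

```python
def haploid_genome_size(total_bases, kcov):
    """Divide the total sequence by the homozygous coverage, 2 * kcov."""
    homozygous_cov = 2 * kcov
    return total_bases / homozygous_cov


def copies_contributed(kmer_coverage, kcov):
    """Number of haploid-genome copies credited to a kmer seen at this coverage."""
    return kmer_coverage / (2 * kcov)


total_bases = 300_000_000_000  # 300 Gbp of sequence, as in the example
kcov = 50                      # heterozygous kmer coverage (lambda)

size = haploid_genome_size(total_bases, kcov)
print(f"haploid genome size: {size / 1e9:.1f} Gbp")
print(f"copies for a 200x kmer: {copies_contributed(200, kcov)}")
print(f"copies for a 300x kmer: {copies_contributed(300, kcov)}")
```

Dividing by 2 * kcov rather than by kcov is what keeps the result a haploid size: dividing by 50x instead would give the 6 Gbp diploid size.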

Hope this helps!

Mike

On Thu, Oct 13, 2016 at 1:04 AM, danshu wrote:

Hi,

As mentioned in the Supplementary Notes and Figures 1.3.2 Genome Size
Estimation, the haploid genome size is estimated by: "This estimate is
revised by summing the total number of k-mers, except presumptive
sequencing errors identified as in section 1.3.1, and dividing by the 2*λ,
the estimated coverage for homozygous k-mers".
If I understand it correctly, λ is the mean of a distribution, the
estimated coverage for homozygous k-mers; and in the genomescope profile,
kcov is the estimated coverage for heterozygous kmers.
Could you explain how GenomeScope estimates the haploid genome size, and
specifically why it divides by 2 times the estimated coverage for
homozygous k-mers?

Best,
Quan

danshu commented on August 13, 2024

Hi Mike,

Thanks for your explanation. The reason I asked is that the genome size estimated by GenomeScope is about half of what I get by summing all kmers and dividing by 2*λ.

I counted kmers with dsk, whereas GenomeScope uses kmer counts from jellyfish. This is likely the cause of the difference, since jellyfish by default will not count kmers with frequencies higher than 10000.

Thanks,
Quan

mschatz commented on August 13, 2024

Yes, the genome size calculation is pretty straightforward. The hardest
part is deciding which kmers are caused by sequencing errors - at
reasonable coverage levels this is usually pretty easy for GenomeScope to
figure out, as they are kmers with low coverage that don't fit the expected
model. Oh, and be careful with kmers that have extremely high coverage - we
see that these are often artifacts (like phiX spike-in sequence) or, in the
case of plant genomes, mitochondrial sequences. This inflated our original
Arabidopsis estimate by several tens of megabases until we corrected for it.
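
To show how much a handful of extremely high coverage kmers can shift the estimate, here is a minimal Python sketch that sums a kmer histogram with optional cutoffs. It is a simplification under stated assumptions: the hard min_cov threshold stands in for the model-based error identification described above, and the function name, parameters, and toy histogram are all made up for illustration.

```python
def estimate_haploid_size(hist, kcov, min_cov, max_cov=None):
    """Estimate haploid genome size from (coverage, distinct_kmer_count) pairs.

    min_cov is a crude stand-in for error detection: bins below it are
    treated as sequencing errors. max_cov, if given, drops very high
    coverage bins (possible spike-in or organelle sequence).
    """
    total_kmers = 0
    for coverage, n_kmers in hist:
        if coverage < min_cov:
            continue  # presumptive sequencing errors
        if max_cov is not None and coverage > max_cov:
            continue  # presumptive high-coverage artifacts
        total_kmers += coverage * n_kmers
    return total_kmers / (2 * kcov)


# Toy histogram, invented for this example.
toy_hist = [(1, 1000), (50, 10), (100, 90), (100000, 1)]

uncapped = estimate_haploid_size(toy_hist, kcov=50, min_cov=5)
capped = estimate_haploid_size(toy_hist, kcov=50, min_cov=5, max_cov=10000)
print(uncapped, capped)
```

In this toy case a single kmer at 100000x dominates the uncapped total, so whether such kmers are kept or excluded changes the estimate substantially. Whether they should be excluded depends on whether they are artifacts or genuine repeats.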

Cheers,

Mike

danshu commented on August 13, 2024

Dear Mike,

Thanks for your advice. In my case, these extremely high coverage kmers seem to come from tandem repeats.

Best,
Quan
