schatzlab / genomescope Goto Github PK

View Code? Open in Web Editor NEW

246.0 246.0 56.0 313.43 MB

Fast genome analysis from unassembled short reads

License: Apache License 2.0

PHP 11.99% HTML 3.89% R 18.61% CSS 10.43% JavaScript 45.38% Shell 1.96% Python 2.67% Perl 5.07%

genomescope's People

Contributors

Stargazers

Watchers

genomescope's Issues

Question about heterozygosity

Hello,

Sorry for the question that may seem obvious to everyone else. The software reports 2.5% heterozygosity for my sample. Does this mean I have 1 base every 40 that is heterozygous?

It seems high, we are at back-cross generation 7.

Can it be overestimated because genome size is underestimated (probably because of repeated sequences).

Thanks for you help.

Reasons for model not converging

Hi,

I ran genomescope on a k-mer count histogram from 130x Illumina coverage and the model did not converge.
Since coverage here is not insufficient, can you think of other issues that could be preventing the model from converging?

Best,
Gui

Length error

Hi,

I'm getting this error below about the read length (maybe) when I run GenomeScope. Can you help me resolve this issue? At the bottom I show the top of my .hist file.
Thanks,

Pat

ERROR:
GenomeScope analyzing HFTHJCCXY_2_reads.hist k=21 readlen=150 outdir=output_dir
Error in 1:error_xcutoff_ind : argument of length 0
Calls: estimate_Genome_4peak2 -> eval_model -> score_model
Execution halted

HIST FILE:
2 2428477947
4 94227510
6 14722302
8 5524970
10 2869024
12 1776189
14 1197518
16 862478
18 649496
20 504531

About Read length parameter for quality trimmed reads

Deal all,
I am using genomescope with quality trimmed reads.
Is there any rules for selecting read length parameter for trimmed reads?
Should I have to use average read length or read length before trimmed?
Best,

Versioning

Hi, would it be possible to tag the latest stable version for GenomeScope?

Genomescope2.0 release

Hi,

Do you see any possibility to release genomescope2.0 so it can be added to bioconda?

significant differences of genome size estimates when varying the Max kmer coverage

Hello,

I was wondering if you could help me to identify biological reasons or issues in the sequencing explaining the lack of genomescope model convergence.

To give you some background we have done illumina PE 150 bp sequencing on a bivalve species, The QC sequencing stats as reported by the fastqc tool are good. The is no previous genome size estimate apart from our own ONT draft assembly done with wtdbg2 which rendered a 1.5Gb assembly.

I have calculated 17, 19, 21, 31 and 41 kmer length count histograms. For instance, in order to calculate the 21 kmer length count histogram I did:

jellyfish count -t 6 -C -m 21 -s 1000000000 -o spec1_21mer --min-quality=20 --quality-start=33 18_193F_1.fastq 18_193F_2.fastq

jellyfish histo -o spec1_21mer.histo spec1_21mer

Then I uploaded this 21 kmer length count histogram to genomescope, which failed to converge with the default settings:
http://genomescope.org/analysis.php?code=smYeyQkykENUyhkkcLHd

Then I found out that lowering down the Max kmer coverage to 80 the algorithm converges and estimates the genome size around 3.5Gb:
http://genomescope.org/analysis.php?code=VvDSyL7EoHmr6G0yaVDe

Then if I lower down even further the Max kmer coverage to 60 I found out that the algorithm converges and estimates the genome size around 1.3Gb:
http://genomescope.org/analysis.php?code=Hm8hUSk0okQFaR4n9ykR

The genome size estimates I got using Max kmer coverage values of 60 (1.3Gb) and 80 (3.5Gb) are so different that I am unclear on how to set this parameter. I was wondering If you could please give me some hints to appropriately set Max kmer coverage or alternatively to diagnose to the lack of initial convergence due to issues in the sequencing.

Many thanks
Jorge

PacBio Sequel II data

Hi,
I am trying to use genomescope to estimate heterozygosity and genome size of a fairly large plant genome, ~ 3.65Gbp. I am using the histogram of canu-corrected reads. Here is what I am getting:

GenomeScope version 1.0
k = 22

property min max
Heterozygosity 0.775102% 0.795864%
Genome Haploid Length NA bp NA bp
Genome Repeat Length NA bp NA bp
Genome Unique Length 974,154,308 bp 977,407,537 bp
Model Fit 86.4268% 93.3631%
Read Error Rate NA% NA%

I am not sure why it is not estimating the genome size? Also, given that pacbio reads are of variable lengths, what would be the ideal read length to give to genomescope?

Many thanks,

Fail to converge (not due to coverage)

Hi there,
Thanks for this tool! After several other successful runs, I'm having issue running GenomeScope for a specific sample, it fails to converge. I saw in other issues that this can happen when coverage is low and there is no peak, but I do have a peak at coverage ~60. Just as a note, we expect it to be a very large genome. Here is the output and the .histo file.
Thanks

http://genomescope.org/analysis.php?code=GVXPjPCe62BZr4pXQOgD
out21kmer-sample.histo.zip

Bacterial reads

Hi there,
I am running genomescope for a set of bacterial reads, and I get meaningful results. But in what about heterozygosity? Since the organisms I am working with only have one copy of each gene (more or less), what does the value for heterozygosity mean with bacteria?

Genome Size Estimation

Hi,

As mentioned in the Supplementary Notes and Figures 1.3.2 Genome Size Estimation, the haploid genome size is estimated by: "This estimate is revised by summing the total number of k-mers, except presumptive sequencing errors identified as in section 1.3.1, and dividing by the 2*λ, the estimated coverage for homozygous k-mers".
If I understand it correctly, λ is the mean of a distribution, the estimated coverage for homozygous k-mers; and in the genomescope profile, kcov is the estimated coverage for heterozygous kmers.
Could you explain how genomescope estimates haploid genome size, specifically why dividing by 2 times of the estimated coverage for homozygous k-mers?

Best,
Quan

Genomescope2 dramatically underestimate genome size and overestimate heterozygosity

I have run Genomescope2 online with the default parameters
My starting material is : 16 fastq files from 16 SMRTcells (PacBio RSII)
I have run the following command lines :

jellyfish count -C -m 21 -s 12000000000 -t 20 ./*.fastq -o myreads.jf
jellyfish histo -t 20 myreads.jf > myreads.histo

The genome size is around 220 Mb and very homozygous

But in GenomeScope2 the results are :
Genome Haploid Length 16,296,648 bp 16,748,017 bp
Homozygous (aa) 95.0821% 95.7105%
Heterozygous (ab) 4.28954% 4.91792%

Full results are here : http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=RAe5oDd95jhgUdK9zZ8f

How can it be possible ?

Thanks for your help

Genomescope website doesn't work

Dear Schatzlab,
I tried to upload my .histo file to Genomescope but it shows the below message. Also, I tested the example file it showed the same error.

Thanks.

Won

kmer_max_cov value setting

Hi there!

We used Jellyfish and GenomeScope on pair-end NGS data to perform k-mer analysis on a new sequenced species of unknown genome size and unknown ploidy nature.

When I've run Genomescope with default parameters(kmer_max_cov=1000) ,I got a genome size of 100 Mbp.
http://qb.cshl.edu/genomescope/analysis.php?code=TARKVJSNnwnCrXHXRS29

Whereas for a kmer_max_cov=1000000 , the genome size was predicted to be ~243 Mbp.
http://qb.cshl.edu/genomescope/analysis.php?code=9i6yLDWTB7ZO2u73Lof2

I've already seen the discussion in this link #22 and I've seen that for high repetitive genome it is recommended to use 1M for kmer_max_cov. However, I was wondering if you have any clue on how to set this parameter.

Best regards,
Giulia Lopatriello

Improving figure legends

Hi,
thank you for a nice tool. I find the description of figures incomplete. I am newbie in this type of genome analysis but I think the Y-axis should probably read "Observed k-mer frequency". There is no description what is kcov, err, uniq, het, dup, len. The vertical dashed lines (supposedly the identified k-mer peaks) should be labeled within the figure with X and Y values. The X-axis is maybe Absolute coverage (to give an idea how many sites/loci are haploid, diploid, tetraploid)?
The figures should retain the information what Read length user provided during analysis.
Finally, the minimum where Error and Full model curves cross each other should be labeled as well. That should be my target minimum requirement for a contig coverage for an upcoming assembly, right? Then please depict the X and Y- value of this valley in the figure, and include them in a table below.

genomeScope result with HiC data

Hi Mike,
I am running Genomescope to get an estimate of genome size using paired-end HiC data. The expectation is that error profile is broader because HiC can capture ligation junction which is supposed to resemble "errors". But nevertheless, we tried running GenomeScope because it is believed that HiC is also genome wide sequencing.

The result is attached.
I am a bit confused about how to interpret the data. There is just one peak instead of 2 peaks, however, the plot still draws the peak at kcov. I am just wondering whether the plots are reliable to predict the estimated genome size.
Could you comment on this plot? According to this plot, the rate of heterozygosity is 1.66%, which is not low, yet why does it fail to produce a peak?

Also, I am wondering why the frequency plot from the 1st page differs from the one on the 3rd page, is it because the X axis is log-transformed?
Also, I noticed from a previous run with HIFI reads that the peak profile obtained from genome scope using coveragefrequency is different from using Frequency only, with the number of peaks more and being more apparent than using frequency only. So by default the coveragefrequency fits the model better?

Looking forward to your response.
thanks,
zhenzhen

Human genome

I used HG002 Illumina data to evaluate the genome size and the repeat rate by using your method. I got the unexpectant result. the estimation genom size is 2932210511,which lower than the expect genome size(expect genome size 310000000).in addition,the repeat rate is up to 29%，I don't think it's possible. when i caculate repeat rate by Genome Repeat Length /Genome Haploid Length ,i dont know if this done is right? can you help me?

Model Fit vs k-mer size

Hi Mike,
Hope all is well with you
I'm revisiting our GenomeScope analyses with cleaner reads and was curious about a trend I see in many papers. As well as looking at other k-mer based assembly evaluation tools such as Merqury

It seems that in lieu of knowing what size k-mer to use, many people (self included) run Jellyfish with different k's and after using GenomeScope to estimate their genome parameters use the Model fit reported by GS to select the optimal value for k.

The fit values usually range within the 90's so I wonder how useful/accurate this approach is...

Furthermore, it seems that in some cases people report moderately different results from GS using different k's.

In my case its rather marginal to the final result (704 vs 711Mb size; 1.4 vs 1.33% het) , but using different ks results in visibly different histograms.
k=21 - http://qb.cshl.edu/genomescope/analysis.php?code=IKX3YGi4tZUIurkadcZO
k=31 - http://qb.cshl.edu/genomescope/analysis.php?code=Dzsv8ydHbwFiRgfkmyOe

Which makes me wonder if there is a "correct" value for k? I've done some searching but answers are confusing.

The tool I mention above (Merqury) has bash script to calculate best k which seems a rather simple equation based on Fofanov et al (2004).

Software like kmergenie estimate the best kmer length for assembly by finding the max distinct kmers. Is this appropriate for size and heterzygosity estimation?
In any case any advice would be welcome. How should one find the best k for running GS?
Thanks in advance,
Ben

In general do you have best practice re

Please provide download of results from website

Hi,
I lack an option to download my results from http://qb.cshl.edu/genomescope website. I also tried from firex "Save as" and wget --mirror -l2 but they both failed.

Ideally if one could group results and get animated GIF figures (I ran k-mer counting at various k-mer sizes).

Thank you

model... unconverged fail

Hi,
I want to generate k-mer distribution using a the histo file, but the model is unconverged.
starting
round 0 trimming to 15 trying 4peak model... unconverged
round 1 trimming to 20 trying 4peak model... unconverged
round 2 trimming to 25 trying 4peak model... unconverged
round 3 trimming to 30 trying 4peak model... unconverged
fail
But I can see two peaks in the data.
Can you give me some help?
Thanks!
19mer.histo.txt

Difference in genome size estimation between GS/GS2

I am working with a very heterozygous, diploid species whose genome is around 150-200Mbp. I have a lot of shotgun coverage for this species, and the k-mer spectra are nicely separated. Yes, the heterozygous peak really is that big, and yes - the "1X" coverage peak is around 96.

However, when I run the same spectrum on GS1 vs GS2, the genome size estimates are very different. I have made a plot using two individuals (Labelled Bf201606 and Bf201311), and compared the results of the minimum genome size from GS1 and GS2. As you can see, the genome size estimate for GS2 is much larger than GS.

Do you have any idea why the estimates are so different? Thank you for writing and maintaining GS/GS2!!!

using genomescope result to determine additional sequencing needs for size estimation

Thanks for the GenomeScope tool, we have used it successfully for several genomes we have assembled. Now we are trying to determine the genome size of a spider. Spider genomes can be as small as 0.8G and up to 5.5G with most in the 1.5G-2.5G range.

We have total R1 + R2 reads of 98,272,048 and total bases of 14,481,220,250.

The k21 and k17 GenomeScope 2.0 runs show:
Sel_R1R2_k21_reads.hist 37,359,898 bp http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=lF9U3QU9sf9PKDwfrPfW
Sel_R1R2_k17_reads.hist 77,184,108 bp http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=8tXxoeraIllggYPpHfX3

The size estimation is almost certainly for a subset of the genome. Is there a way to use the GenomeScope results to estimate how much additional sequencing we might need to get a better result. Our back of the envelope guess assuming 2.5G was somewhere around 80G bases total, however that is with very little foundation.

Any insight appreciated, thanks so much

Jim Henderson
California Academy of Sciences
Sel_R1R2_k17_reads.histo.txt
Sel_R1R2_k21_reads.histo.txt

21mer estimated genome size is the twice as the 17mer

Hi,

I use the jellyfish (47 Gb Illumina PE150) to count different mer, but I see the 21mer result is 874 Mb, the 17mer result is 437Mb. It is very strange that three peaks in the 17mer plot, but the first peaks sees fake, due to the second and third peaks are 1:2. Do you have suggestions for my speceis? How can I get a normal result.

question about genome size

Hi,
The estimated genome size of the coastal redwood output by GenomeScope is 27.0 Gbp and 2.293 Gbp for Gossypium barbadense in this paper <GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes>. But I'm confused about how to get these results from http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=example7 and http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=example8 website?

Only 1 Peak in genomescope plot

Hi,

I used Genomescope 2 on a many genomes and it workswell for most of my data.
I am working on diploid species and in most of the cases I get two peaks in the plot.
However, for a few I only get one peak. For example here:
What does this actually mean? Shouldn't there be two? Might this indicate that the species is not diploid?
Also, the assembly length based on this data is 1,233,588,871 bp.
What could explain the difference? Might the smaller estimate may be due to the additional parameter introduced in GenomeScope, set to exclude extremely high-frequency kmers as these likely represent organelle sequences or other contaminants that can inflate the genome size?
However, I used the following parameters:
k-mers were counted with JELLYFISH v2.2.10 jellyfish count -C -s 25556999998 -F 3 and a k-mer length of 21 (-m 21). A histogram of k-mer frequencies was produced with jellyfish histo. GenomeScope 2.0 was run with the following parameters:
Kmer length = 21, Max kmer coverage = 10000.

Thanks for any input on this question!

Inconsistent estimates across k-mer sizes

Hello,

I am running genomescope on two fungal genomes with unknown sizes. One genome is definitely haploid, the other could be haploid or diploid (we don't know). I was hoping that k-mer analysis would give me some hints as to whether the second genome is haploid or diploid.

I've run the analysis with k-mer sizes 17,19,21,23,25,29,35 and 39.

Genome 1 (haploid) behaves as expected. All k-mers give me c. 20 Mbp haploid genome size (which matches my assemblies), and heterozygosity estimates are very low (0.08-0.18%). Here is a plot for illustration:

Genome 2 behaves strangely across k-mer sizes.
K-mers 17,19 and 21 give haploid genome size of 30 Mbp (which matches my assemblies), and low heterozygosity (c. 0.20 %):

However, k-mers 23 upwards give haploid genome size of 15 Mbp and high heterozygosity (3.6-13.8 %). What confuses me is that I cannot see the two typical coverage peaks in the presence of heterozygosity (plots attached). Instead, the coverage range seems a lot wider:

I haven't got much experience with interpreting kmer analysis results. I am not convinced that genome 2 is diploid given the plots. Is there something weird going on with my data, or do I need to tweak parameters in genomescope? I'm attaching the historgram data for k-mer size 23, if that helps.

k23.hist.txt

Many thanks for your help.

Best wishes,
Marius

How does genomescope exclude the impact form chrX/chrY

Hi.

We get 'Heterozygosity 0.115156%' for NA1287 data.
How does genomescope exclude the impact form chrX/chrY？

property                      min               max
Heterozygosity                0.115156%         0.116636%

Best Regards

Genomescope 2 vs. 1

Hello,

I observed the following:
I ran Genomescope 1 and 2 on the same Illumina reads and the results look like this:
Genomescope2: Ploidy 2
http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=G1hKSxiXYsf1G01H9GJC
Genomescope1:
http://qb.cshl.edu/genomescope/analysis.php?code=xzUtugQSH2tCADSJ3tMg

If I set ploidy to 1 in Genomescope 2 I get the following:
Genomescope2: Ploidy 1
http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=0iaKscWqJTRuhvGxuBEE

Do you have any explanation for this?
Would you recommend using Genomescope 1 instead of Genomescope 2?
The assembly size fits the size obtained with Genomescope 1.

Thanks,
Jacqueline

High rate of errors, real Illumina sequencing error?

Dear all,

I have Illumina data (150 *2 bp), in total, 110 Gb. I cleaned it with Trimmomatics. I used Genomescope to estimate the genome size with kmer = 21. The estimated genome size is about 1.1G.
However, the proportion of errors are high as 2.2 percent. The het is high as 4.65 percent.
When I just use 60G data, Genomescope 'Failed to converge'
I have used Genomescope for many species. I never found this problem before.

Why does it have a high proportion of errors?

I appreciate your help.

Best,

Chen

HiFi data supported ?

Hi, I have HiFi data from the low input protocol sequencing.

Ran genomescope v2 online after:

KMC/bin/kmc -k21 -t5 -m64 -ci1 -cs10000 @cat.files cat.kmcdb tmp

KMC/bin/kmc_tools transform cat.kmcdb histogram cat.kmcdb_k21.hist -cx10000

But got a weird plot:

Any clue about is happening ?

http://genomescope.org/genomescope2.0/analysis.php?code=2sNtSnC7MenouKVq7ieI

PCR duplication

Hi,
Should I remove the PCR duplicated reads before running jellyfish?
Thank you!

Again, after remove PCR duplicated reads, I run jellyfish and GS2, and get the results,

The estimated genome size is about 1.15Gb.
On the other hand, I plot the jellyfish histo result, and stat the total number of Kmers, I got a genome size=Total number of Kmers / Pkdepth=2.1Gb.

Which was right?
Thanks a lot!

Genomescope failed to converge with -C option in Jellyfish

Hi, I'm using Jellyfish+Genomescope on pair-end NGS data. It seems like that Genomescope complains about failing to converge and the result seems not correct. I have tried changing kmer from 21 to 19, but it didn't work. I'm pretty sure the sequencing data is enough. When I turned off the -C option in Jellyfish, Genomescope seems to work good. I wonder what the reason is. So can anyone help me? The histo file is attached here. thanks!

Here are the commands I used:

jellyfish count -C -m 19 -t 16 -s 16G <(zcat CL100122025_L01_read_1.fq.gz) <(zcat CL100122025_L01_read_2.fq.gz) -o reads.jf
jellyfish histo -t 16 reads.jf > reads.histo
Rscript genomescope.R reads.histo 19 150 result
reads.histo.txt

underestimation of genome size

Hi,

Thank you for providing this handy tool!
I am interested in estimating the genome size of a fish using kmer analysis.

I did it "by hand" using this tutorial:
https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/

and I used your web browser, but I got quite different genome size estimates:

by hand:
total kmer in the distribution: 17274199130
highest coverage: 18
genome size estimate: 959677729 (~960Mb) [the first three bins were removed for the calculation]

web browser (kmer length 25; read length 150; no max kmer cov filtering)
genome size estimate: 842099351 (~842Mb)

The assembly size of a closely related species is 957Mb so I guess the calculation "by hand" is more accurate? Do you know the reason of this discrepancy?

I attached the hist file from jelltfish & the genomescope outfiles

Thank you in advance,

Alexandra

KBC7_k25.histo.txt

underestimated genome size

This is more a usage question than an issue. I have a converged run, based on ~30x 100bp PE sequencing. The genome size estimate I get is ~527MB. From related spp and flow cytometry I know the real genome size is closer to ~650MB.

In the plot I notice a little coverage hump right before my cutoff at 1000. But changing the coverage cutoff does little to bring the genome size estimate closer to what I'm 99% sure is right.

Any thoughts on:

Settings I might alter to make the analysis better
Whether I might be limited by coverage here (seems 30x is near the boundary of what you suggest)
If genome size is lower than expected, whether this suggests anything about other estimates being a little off too.

(I can add more data if necessary...)

Please add support from ntCard k-mer histograms

Hi,
would you mind adding an auto-detection of data input to convert on the fly hist files from ntCard? See bcgsc/ntCard#8 (comment)

Why didnt the model converge? continued

Hi Genomescope,

Thanks for making the tool, this sounds like a great tool!! 👍 . I want to use it as I really want to know the predicted genome size and heterozygosity of my beast! I tried reducing the kmers from 31 down to 15 to try and increase the coverage:

$ jellyfish count -C -t 16 -m 15 -C -s 200000000 -o pe_data.jf15 both.fq
$ jellyfish histo -t 16 -o pe_data.jf15_0.jellyfish.histo pe_data.jf15_0

But the data (all of the kmer histogram, k=15 attached) always fails to converge using the online web server. http://genomescope.org/analysis.php?code=KhsSDR8CcZ8i6umW160j

Is there anything that can be done to try and fix/ help this? The mean coverage of the reads when mapping back to the Illumina assembly is 110X.

A little bit of background:
We believe this sample (nematode) to have high amounts of heterozygosity. I have performed a ploidy analysis using nQuire which statistically thinks it could be tetraploid, but the histogram does not agree (ploidy analysis could be affected due to the sample being a population). Literature has always said it is diploid. The sequencing material was thousands of nematodes, as they are very small). The Illumina assembled genome size = 120Mbp (SPAdes). PacBio data = 320Mbp (Canu default, running again trying to force haplotypes together).

pe_data.jf15_0.jellyfish.histo.txt

Could the sequencing depth affect the accuracy about the desicion of allo or auto-tetraploid?

Hi there. First of all, thank for your great work of the genomescope2.
I've encounter a problem when using the genomescope2 to distinguish the allotetraploid and the autotetraploid. My analysis pipeline:

fastp (filtering)
using kmc to generate the kmer frequency spectrum
using kmc_tools to get the histogram
using genomescope2 to get the final results.

When I do the sample one by one, the results show that most of them are allotetraploid, but when I merge the samples to each corresponding clade, then do the analysis, the results show that nearly all of the clades are autotetraploid.
Each raw data (paired-end) is about 3.5G, the merged data is about 15G~20G. The estimated genome size is about 300M. So is the problem of sequencing depth?
I'll really appreciate it, if somebody could give me a hint.

Incorrect peak identification in TE-rich genome

I'm working with a very repeat-rich fungal genome, with expected haploid genome size around 70 - 80 Mb. Genoscope seems to be locking onto the second (repeat-associated) peak and is producing a much lower than expected genome size estimate.

Here are the kmer coverage counts from the jellyfish histo file, trimmed at 4 and with the manually identified peak at 8. A second peak occurs between counts ~30-110.
Jellyfish_K25.pdf

Genoscope is lumping the primary peak in with the low coverge "error" segment.

Histo file: 25mer_out.txt

Would it be possible to force Genoscope to fit a model around a user-specified peak?

Completely different result between Genomescope v1 and v2

Hi,

I used genomescope from this repo:

https://github.com/schatzlab/genomescope
and
https://github.com/tbenavi1/genomescope2.0

And got weird results:

From v1:

https://ibb.co/FXPYPgV

From v2:

https://ibb.co/XC7Y5Nx

What could explain those big differences ?

hist have been generated with:

kmc -k21 -t5 -m64 -ci1 -cs10000 @cat.files cat.kmcdb tmp

kmc_tools transform cat.kmcdb histogram cat.kmcdb_k21.hist -cx10000

Removing PCR duplicates before running GenomeScope2/Smudgeplot

Is it advisable to remove PCR duplicates in an aligment-free way before running GenomeScope2 or Smudeplot? I guess this taken into consideration by the model, but want to make sure.

If yes, are there any points too keep in mind while doing this?

Fail to converge

Hi,

This tool looks quite useful and I'm trying to use it to my own data. I run the Jellyfish with the following command to obtain reads.histo:

zcat *.fq.gz | jellyfish count -C -m 21 -s 1G -t 10 /dev/fd/0 -o reads.jf
jellyfish histo -t 10 reads.jf > reads.histo

After uploading the file, the plot looks nice but GenomeScope fail to converge. What should I do to get things work for this data?

reads.histo.txt

Best,
Yang

model... unconverged

I am trying to generate k-mer distribution using a histo file but I don't think the resulting distribution is not making sense. Also, the model is unconverged in round 0 and round 2. Does it mean the model doesn't fit the data? Furthermore, the heterozygosity is coming 2.7%. With this rate, I expect a bimodal graph?
beast_19_2.histo.txt

Any suggestions?

Appropriate value for the Max kmer coverage

Hi!
I'm working with illumina data PE reads of 151 bp of a repeat-rich maize genome (diploid).
The genome size estimate is 2.3 Gb (maize B73v4).

I used the following command in Jellyfish:

> jellyfish count -m 21 -s 100M -t 10 -C <(zcat ZM1_R2.fastq.gz) <(zcat ZM1_R1.fastq.gz)
> jellyfish histo mer_counts.jf -o histo.txt

Then I uploaded this 21 kmer length count histogram to genomescope. I changed the default parameters of Read length to 151 bp, but I left the Max kmer coverage parameter with its default value (1000). The results I got were this:

https://ibb.co/fpzpT0P

https://ibb.co/yXmpC3n

And

Results
k = 21

property min -- max
Heterozygosity: 1.22337% -- 1.26647%
Genome Haploid Length: 1,116,998,872 bp -- 1,123,182,422 bp
Genome Repeat Length: 626,415,684 bp -- 629,883,434 bp
Genome Unique Length: 490,583,188 bp -- 493,298,988 bp
Model Fit: 87.1894% -- 98.1013%
Read Error Rate: 0.831279% -- 0.831279%

The genome size estimate is 2.3 Gb (maize B73v4) but in genomescope I get 1,123,182,422 bp.

I was wondering if someone could help me set a more appropriate value for the Max kmer coverage parameter because like the the creators warn:

High copy-number DNA such as chloroplasts can confuse the model. Set a max kmer coverage to avoid this. Default is -1 meaning no filter.

And I suppose I will have this problem of High copy-number DNA because my reads are from maize (with lots of chloroplasts DNA).

Thank you a lot for your kind help in advance :)

Failed to converge

Hello everyone,
This tool looks quite useful since I was about to start writing some R code with the same purpose.
To get familiar with the program I simulated illumina data from a small plant diploid genome (genome size 13Mbp) and produced histograms for 17, 19, 21, 23 and 25 mers with kmc3 and jellyfish. The read simulation uses Illumina error profile to generate the reads and produces around 40X coverage of the haploid genome.
All kmer sizes histograms fail to converge even though I can clearly see two peaks in my data (see data below).

The command line being used is the following:

for kmer in {17,19,21,23,25}
do
  Rscript genoscope.R ${kmer}.hg ${kmer} 100 Genscope_${kmer}
done

I would like to test this tool in real data, but I am afraid that I won be using it if it cannot handle simple simulated data. Any help would be more than welcome.

BTW, KMC3 histogram requires replacement of "\t" for spaces in order to be compatible with current version of Genoscope.

Kind regards,

Juan D. Montenegro

longer kmers

as you point out in the docs, longer kmers are more prone to sequencing errors. But what do you think about ignoring unique kmers, since these are likely to be due to sequencing error. MASH does something like this through the -m command (mash.readthedocs.io) but I don't know if this would mess with your downstream calculations too much.

Webserver error

Hi,
I was trying to upload a histogram file (e.g. http://qb.cshl.edu/genomescope/genomescope2.0/tests/inputk21.hist), and it seems to upload, but after clicking SUBMIT, it throws the error "no file uploaded".

how to ensure k-mer value

Hi,
I want to assembly a small region of human chromosome Y, what should I do to choose a reasonable k-mer? You known that we often use 21-mer to assembly human genomics, but how I just want to de novo a small region of human chromosome Y, do you think the 21-mer is ok?or I need to calculate new k-mer as formula：k = log4 (G(1−p)/p)（https://www.ncbi.nlm.nih.gov/pubmed/15087315?dopt=Abstract&holding=npg）
Thanks for your patient.
Best Duo

"Read length" changes does not affect the result

Hello,

I tried online version at the following website.
http://qb.cshl.edu/genomescope/

Our Hi-Seq data had 151 bases of "Read length", so I input 151 in "Read length" slot.
But, when I just tried using 251 in "Read length" slot instead of 151, the genomescope result was exactly the same with Read length 151.

As far as I know, Read length minus k-mer size is used to estimate genome size.
Could you please explain why various Read lengths do not affect the result?

Also, with increased k-mer, Repeat was decreased and genome size was increased. Is it possible to keep these values somehow constant using some adjustment parameters?
When I tried k-mer 17, the repeat result was similar to Repeatmasker result, but genome size was only about 70% of assembled genome. When k-mer was about 31 or higher, the genome size was ok, but repeats were dramatically reduced.

Thank you.
Hyun Jo Koo

Unable to understand the Results of genome scope

Hello Sir,

I hope you are doing well.

I am trying to use genome scope in order to know the percentage repeat content of the genome. I am having a hard time understanding the output. Can you please please help me out with this. The results I obtained are attached below. How can the interpret further that this much percentage of my genome is having repeat content by looking at results?

I ran the following command in order to get the results.

download jellyfish:
wget https://github.com/gmarcais/Jellyfish/releases/download/v2.3.0/jellyfish-2.3.0.tar.gz
untar the file
tar -xvzf jellyfish-2.3.0.tar.gz
download the linux one too.
wget https://github.com/gmarcais/Jellyfish/releases/download/v2.3.0/jellyfish-linux
went to the jellyfish directory.
cd jellyfish-2.3.0/
ran count to find kmers:
./bin/jellyfish count - C -m 21 -s 1000000000 -t 10 *.sra.fasta -o reads.jf

6.make histo file and dragged it to the site.
./jellyfish-2.3.0/bin/jellyfish histo -t 10 reads.jf > reads.histo

Your suggestion will really helpful.

question about results

Hi，

I use genomescope2.0 to estimate my diploid genome, and I use canu trimmed reads ( third generation reads ) as input.
It is my results:

p = 2
k = 21
max_kmercov = 100

property                      min               max
Homozygous (aa)               99.0226%          99.0952%
Heterozygous (ab)             0.904765%         0.977417%
Genome Haploid Length         260,404,300 bp    261,589,824 bp
Genome Repeat Length          22,044,856 bp     22,145,218 bp
Genome Unique Length          238,359,444 bp    239,444,606 bp
Model Fit                     96.6116%          96.6116%
Read Error Rate               1.35969%          1.35969%

It shows Genome Haploid Length : 260M ，should I double it as my final genome size ?
and in some report ,my interest genome size is about 450M.
read error rate is 1.36% shows my trimmed reads have good quality ?

Best wishes,
Bai

schatzlab / genomescope Goto Github PK

genomescope's People

Contributors

Stargazers

Watchers

Forkers

genomescope's Issues

Recommend Projects

Recommend Topics

Recommend Org