vtsyvina / cliquesnv Goto Github PK

20.0 3.0 5.0 71.68 MB

License: MIT License

Java 100.00%

haplotypes illumina quasispecies pacbio ngs

cliquesnv's People

sergey-knyazev rsuchecki schaudge genostack dannovikov

cliquesnv's Issues

Potential bug when specifying -threads with 1?

Thanks for the great tool! I'm testing it out on some datasets, and I wanted to ensure it runs single-threaded via -threads 1 (I wasn't sure what the default option would be). It loaded the reads successfully, but then hung at 0% CPU usage. When I tried -threads 2, it worked fine.

[Suggestion] Change haplotype numbering to 1-based

I know that for a Java programmer this might seem strange 😉, but I have a hunch that the majority of biologists and bioinformaticians are more familiar with 1-based numbering than 0-based.
Also, all of the other haplotype reconstruction programs use 1-based numbering, so when I'm comparing outputs I can't just say "Okay, let's now compare strain 2 from CliqueSNV with strain 2 from aBayesQR" unless I change the CliqueSNV output files first or remember to use strain 1 instead of 2.
Cheers!
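
Until the numbering changes, the conversion described above can be done as a post-processing step. The following is a minimal sketch, assuming the FASTA headers have the shape `>0_fr_0.2847...` (inferred from the haplotype names quoted in another report on this page); the function name `renumber_one_based` is hypothetical and not part of CliqueSNV.

```python
import re

# Assumed header shape: ">" + zero-based index + "_fr_" + frequency,
# e.g. ">0_fr_0.2847895994964371". This is an inference, not a documented format.
HEADER = re.compile(r"^>(\d+)_fr_(.+)$")


def renumber_one_based(fasta_text):
    """Return FASTA text with each haplotype index shifted from 0-based to 1-based."""
    out = []
    for line in fasta_text.splitlines():
        match = HEADER.match(line)
        if match:
            index, freq = int(match.group(1)), match.group(2)
            line = f">{index + 1}_fr_{freq}"
        # Sequence lines and unrecognised headers pass through unchanged.
        out.append(line)
    return "\n".join(out)


example = ">0_fr_0.6\nACGT\n>1_fr_0.4\nACGA"
print(renumber_one_based(example))
```

Headers that do not match the assumed pattern are left untouched, so the sketch should be safe to run on files from other tools as well.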

Haplotypes don't add to 100%

Hello,
Your tool is amazing and intuitive to use. I am, however, having an issue with the tool's frequency outputs. On our data, the CliqueSNV assembly reports 3 haplotypes, and their frequencies do not add up to 100%. The dataset is a mock community of known composition. What could be the reason for this?

I appreciate your tool and your help!
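
To make such a report concrete, it can help to state how large the shortfall is. Below is a minimal sketch that sums the frequencies embedded in haplotype names, assuming names shaped like `0_fr_0.28...` (the form quoted in another report on this page). The names in the example are made up, and the function names are hypothetical. The sketch only measures the gap; it does not explain its cause.

```python
import math


def haplotype_frequencies(names):
    """Extract the frequency from names shaped like '<index>_fr_<frequency>'."""
    return [float(name.split("_fr_")[1]) for name in names]


def frequency_report(names, tolerance=1e-6):
    """Summarise how far the reported frequencies are from summing to 1."""
    freqs = haplotype_frequencies(names)
    total = math.fsum(freqs)
    return {
        "n": len(freqs),
        "total": total,
        "missing": 1.0 - total,
        "sums_to_one": abs(1.0 - total) <= tolerance,
    }


# Made-up names for illustration only.
names = ["0_fr_0.5", "1_fr_0.3", "2_fr_0.15"]
report = frequency_report(names)
print(report)
```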

Haplotype from assembled reads

Hi,
Is there any way to generate haplotypes from assembled reads using CliqueSNV?

Thanks

-t and -tf parameters

Hello, my name is Andrea and I am a bioinformatician currently working with SARS-CoV-2 sequences to identify quasispecies within given samples. I have been trying to use CliqueSNV to achieve this goal, but I've been having some issues. I do not completely understand how the -tf parameter relates to the -t parameter. As I understand it, -tf is related to coverage, while -t is related to frequency. I have been running the tool with -t set to 0.01 and -tf set to different values (from 20 to 150), but the results do not improve much when I change -tf. The sequences obtained do not seem to differ much between experiments, which has been a problem for my study.

Could someone help me?

Thanks in advance!!
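
One way to reason about two such thresholds is sketched below. It rests on assumptions that are not confirmed anywhere on this page: that -t is an absolute minimum on the O22 count (as another report here describes it), that -tf is a minimum on O22 divided by the coverage, and that a pair of SNVs must pass both. This is not CliqueSNV's code, and the function names are hypothetical.

```python
import math


def effective_min_o22(coverage, t, tf):
    """Smallest O22 count that passes both thresholds, under the stated assumptions."""
    return math.ceil(max(t, tf * coverage))


def passes(o22, coverage, t, tf):
    """True if an observed O22 count meets the count and the frequency threshold."""
    return o22 >= t and o22 / coverage >= tf


# At high coverage the frequency threshold dominates; at low coverage the count does.
print(effective_min_o22(coverage=30000, t=100, tf=0.01))
print(effective_min_o22(coverage=1000, t=100, tf=0.01))
```

Under these assumptions, changing one threshold has no visible effect whenever the other one is the stricter of the two at the sample's coverage.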

output dir cannot be changed

Hi,

I am trying to output analysis results to a directory other than the "snv_output" in the same directory as the jar file, but it is not working.

My command line is
java -jar clique-snv.jar -log -threads 20 -t 100 -tf 0.01 -m snv-illumina-vc -in $dataDIR/aln_trimmed_sorted.sam -outDIR $outDIR/

Can you please let me know what is wrong with my command line?

Thanks

snv with nanopore input

Hello
Is it possible to run CliqueSNV for consensus and variant calling with nanopore input?
Would snv-pacbio and consensus-pacbio work?

thank you!

suggested addon

amazing job!
One nice improvement to the output would be to include the number of reads used for each haplotype and put it in the sequence name. Is that something you could add?
A summary table with statistics would also be cool (total reads, retained, freq, etc.).

congrats again. the algorithm gives me really interesting results so far.

consensus-illumina

Hello
I am using CliqueSNV for both haplotype inference and consensus (Illumina). For the latter, is there a way to specify the minimum frequency required for a base to be incorporated into the mixed-base consensus sequence?

thank you!

No variants called in any haplotype when SNV is not in linkage with other SNVs?

Dear CliqueSNV team,

I've been experimenting with your tool and think perhaps I have found a bug. If there is a single, isolated SNV with no other SNVs in linkage within the mapping reads, i.e. distance to other SNVs is greater than read length, that SNV is never called in any of the haplotypes.

I'm trying to understand the algorithm described in your paper, and I guess this makes sense, because these SNVs are not in cliques with any other SNVs(?) But for some types of data it will mean that common haplotypes will not be present in the results. In one of my examples, there is a clear 45/55% distribution between C and T at a particular site, and the total read depth is around 30,000.

Graphic presentation of my problem.

I can provide bam files for testing if you'd like.
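
The condition described above (no other SNV within one read length) can be checked before running the tool. The sketch below is a simplified illustration: it assumes single-end reads of a fixed length and ignores any linkage that paired-end mates could provide. The function name `isolated_snvs` and the example positions are hypothetical.

```python
def isolated_snvs(positions, read_length):
    """Return SNV positions that cannot share a read with any other SNV.

    Two positions can be covered by one read of length L only if they are
    fewer than L bases apart, so a gap >= L on both sides means isolation.
    """
    ordered = sorted(positions)
    isolated = []
    for i, pos in enumerate(ordered):
        gaps = []
        if i > 0:
            gaps.append(pos - ordered[i - 1])
        if i + 1 < len(ordered):
            gaps.append(ordered[i + 1] - pos)
        # A lone SNV (no neighbours at all) is isolated by definition.
        if all(gap >= read_length for gap in gaps):
            isolated.append(pos)
    return isolated


print(isolated_snvs([100, 150, 900], read_length=250))
```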

Error with cliqueSNV

Hi, I am getting this error when I run cliqueSNV with some samples. Could you let me know what this means?

Compute SNV data structure
java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 352320
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at edu.gsu.start.Start.pacBioSNV(Start.java:211)
	at edu.gsu.start.Start.main(Start.java:69)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 352320
	at edu.gsu.util.builders.SNVStructureBuilder.fillRowsAndCols(SNVStructureBuilder.java:91)
	at edu.gsu.util.builders.SNVStructureBuilder.buildPacBio(SNVStructureBuilder.java:41)
	at edu.gsu.algorithm.SNVPacBioMethod.getHaplotypes(SNVPacBioMethod.java:102)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
SNV got 1 haplotypes

T parameter help

Hi,

I am trying to understand how to properly choose the T parameter, which I understand is the minimum threshold for the O22 value (i.e. the observed count of the 2-haplotype).

I am running one SARS-CoV-2 sample (amplicon sequencing on Illumina) that I am interested in.
I ran it with Tf=0.05 and T=1,10,100. Summary results are below.

Could you please explain to me how T affects the number of SNVs and haplotypes?

SNVs numbers:
T=100 2412
T=10 3871
T=1 2491

Haplotypes N and freq:
T100:
0_fr_0.2847895994964371, 1_fr_0.19215312905857715, 2_fr_0.19203620837052798, 3_fr_0.1624256254050896, 4_fr_0.06115178494543045, 5_fr_0.05772617687199445
T10:
0_fr_0.2676417679182119, 1_fr_0.26451591468463653, 2_fr_0.15307681215147081, 3_fr_0.08339144250951734, 4_fr_0.08243902008443806, 5_fr_0.08039214779779189, 6_fr_0.06854289485393335
T1:
0_fr_1.0

Thanks

Memory heap getting bigger

Hi,
I am running CliqueSNV on a BAM file from HIV (about 140 MB).

java -Xms50000M -Xmx50000M -jar /clique-snv.jar -m snv-illumina -in TestHIV.trimmed_fastp.mapped_bowtie2.bam -threads 30 -outDir $TEMP_PATH
The BAM is already cleaned.
Memory usage keeps growing from run to run:
-loop1 20GB failed
-loop2 30GB failed
-loop3 50 GB failed
-loop4 100GB still going

Is there any way to decrease the memory requirement?
Idea 1> removing duplicates from the BAM
Idea 2> splitting the BAM

how can I estimate the amount of memory required?
something like: the number of reads * number of core + cosmos diameter / IO speed

Thanks

UMI Nanopore Amplicons

Hello! Thanks for putting together CliqueSNV - the ability to leverage long reads should be really powerful across different fields I think. I was curious if it would be possible to use long reads that have already gone through error correction with CliqueSNV? Specifically, I am referring to amplicons sequenced with Oxford Nanopore with unique molecular identifier sequences (UMI - see: https://www.biorxiv.org/content/10.1101/645903v3.full). In short, the attachment of the UMI to amplicons allows for extensive error correction of the sequences. Would it be possible to effectively use these sequences with CliqueSNV? If so, which pipeline (Illumina or PacBio) would you suggest for the amplicon sequences? I believe the PacBio pipeline filters ~10% of reads based on quality, and since these reads have been error corrected, perhaps there is a way for the user to skip the filter step? Any insight on how to leverage the unique properties of these reads with CliqueSNV would be greatly appreciated.

Thanks again!

Mappers and settings

Hi,
Have you experimented with different mappers and settings to generate the BAM file? How important is accurate mapping for CliqueSNV? Do you have any recommendations for Illumina reads?
Thanks, Jon

Tagged release and bioconda package?

I'm interested in making a conda recipe for CliqueSNV so a package for it can be made available for easy installation via the bioconda channel. A couple questions: 1) would that be OK? (it's under open license, so I'm primarily asking if it is ready for distribution), and 2) Are there plans to tag a release on this repository? Doing so would be helpful for versioning of the conda package.

bioconda is an open community, and the original authors of software are also welcome to submit packages (more info on how to do so here: https://bioconda.github.io/contributor/index.html )

Prevent -oe from exceeding the bounds of the haplotype

Hi there

If you provide the -oe option and its value exceeds the length of the haplotype, an exception occurs. I'm using CliqueSNV in a pipeline, and sometimes it's difficult to detect whether the -os and -oe values fall within the bounds of the haplotype. My knowledge of Java is minimal and I've implemented a fix (but not in an elegant way) by using an intermediate "end" variable and constraining it to the length. The solution works, but I fear a pull request would be ugly :-)
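
The idea behind that fix, constraining the end coordinate to the haplotype length, can be sketched in a few lines. The version below is an illustration in Python rather than the actual Java change; the function name `clamp_end` is hypothetical, and the coordinate convention of -os and -oe (0-based or 1-based, inclusive or exclusive) is not stated here, so the sketch simply treats the length as the largest allowed end value.

```python
def clamp_end(start, end, haplotype_length):
    """Constrain the requested end coordinate to the haplotype length."""
    end = min(end, haplotype_length)
    if start > end:
        # A start beyond the haplotype cannot be repaired by clamping.
        raise ValueError(f"start {start} lies beyond the clamped end {end}")
    return start, end


print(clamp_end(10, 500, 300))
```

A pipeline could apply the same check before invoking the tool, which avoids the exception without patching the source.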

Padding operator not between real operators in CIGAR

Hello
Such an amazing tool (and great improvement from the first version). congrats!
working great on HIV Illumina data.
I am trying to get variants from SARS-CoV-2 data. Pretty large input SAM files. I got this error for one of my samples.
Any chance you can help with this? Any suggestion for a fix? Could some specific reads be filtered out (automatically)?

thank you!

ERROR: Read name M03251:179:000000000-K2J5L:1:1106:10161:20380_1:N:0:CAAGGTAC, Padding operator not between real operators in CIGAR

@HD VN:1.0 SO:coordinate
@SQ SN:MN908947.3 LN:29903
@RG ID:c0906184-f6a9-4be3-9fb3-268bd391c8d0 PI:185 SM:MBN033222-4_S4_L001 PL:ILLUMINA
@PG ID:0 VN:21.0 PN:clcgenomicswb

M03251:179:000000000-K2J5L:1:1106:10161:20380_1:N:0:CAAGGTAC 99 MN908947.3 28185 60 10M2D5M53D178M = 28251 280 CTTGTGGATCTGTTCTCTAAACGAACAAACTAAAATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGGGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGFGGGGGGGGGGGGGGGGEGGGGGGGFGGFEGEC*=DGCGBAGDEFGFGF5F5<>;56=CE>5>:?FFF48BF0<<:?746<>7),486A<<<2 MD:Z:5A1TG1^GT5^GTTCTATGAAGACTTTTTAGAGTATCATGACGTTCGTGTTGTTTTAGATTTCA117T60 RG:Z:c0906184-f6a9-4be3-9fb3-268bd391c8d0 NH:i:1 NM:i:59
M03251:179:000000000-K2J5L:1:1106:10161:20380_1:N:0:CAAGGTAC 147 MN908947.3 28251 60 1S8P6I3M1D188M22S = 28185 -280 GGATCTGTTCTCTAAACGAACAAACTAAAATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACCGCTCTCACTCAACATGGCAAGGAAGACCT F@;3;3)167))89B950BCCEF515CF@5)CA9129FAA?4FEEFCF=<5C57;@?GG?:9CGDGF@=CGFGGGDGGE8EGEGGGFD8GFFGFCGGEGGFGGFGGGGGGGGGGFGGDGGFGGGGGGGGFGGGGGGGGGGGFGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC MD:Z:3^A188 RG:Z:c0906184-f6a9-4be3-9fb3-268bd391c8d0 NH:i:1 NM:i:7
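
In the second record above, the CIGAR string `1S8P6I3M1D188M22S` has a padding operator (P) directly after a soft clip (S). As a possible workaround, such reads could be filtered out before running the tool. The sketch below works on tab-delimited SAM text and drops records whose padding operator is not flanked by "real" operators. Treating M, I, D, N, = and X as the real operators is an assumption about what the validator means, and the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Assumption: clipping (S, H) and padding (P) do not count as "real" operators.
REAL_OPS = set("MIDN=X")


def padding_is_misplaced(cigar):
    """True if any P operator lacks a real operator on either side."""
    ops = [op for _, op in CIGAR_OP.findall(cigar)]
    for i, op in enumerate(ops):
        if op != "P":
            continue
        before = ops[i - 1] if i > 0 else None
        after = ops[i + 1] if i + 1 < len(ops) else None
        if before not in REAL_OPS or after not in REAL_OPS:
            return True
    return False


def filter_sam_lines(lines):
    """Yield header lines and alignment records without misplaced padding."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # CIGAR is the sixth mandatory SAM column.
        if len(fields) > 5 and padding_is_misplaced(fields[5]):
            continue
        yield line


print(padding_is_misplaced("1S8P6I3M1D188M22S"))
```

Note that dropping one record of a pair leaves its mate orphaned, so a real pipeline may prefer to remove both reads that share the offending read name.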

A small doubt

I'm using your program and it is great software. I've tested it and I would like to suggest that the attribute data file (with freq, detected SNP, allele, etc.) be in JSON format, which is easily parseable.
Regarding this file, I have the following question. In the output, I have one haplotype that has no SNPs, and I assumed that this haplotype is the master/reference sequence. Under this assumption, I expected this haplotype to be the same sequence that was used in the read alignment step. I checked this in my samples by aligning the haplotypes against my reference sequence, and I found that the haplotype reported by clique-snv with no detected SNVs is different from my reference sequence. What does this mean? Has the virus population fixed these changes? Could you add a processing step to detect this kind of change?

Thank you in advance
Pedro Seoane
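
The comparison described above, a reported haplotype against the alignment reference, can be sketched as a simple position-by-position diff. This is a minimal illustration that assumes the two sequences are already aligned and of equal length (gaps, if any, written as `-`). The function name `differences` and the example sequences are hypothetical.

```python
def differences(reference, haplotype):
    """List (1-based position, reference base, haplotype base) for each mismatch."""
    if len(reference) != len(haplotype):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, hap_base)
        for position, (ref_base, hap_base) in enumerate(zip(reference, haplotype), start=1)
        if ref_base != hap_base
    ]


print(differences("ACGTAC", "ACCTAT"))
```

Positions that differ in every reported haplotype would be candidates for changes that are fixed in the population relative to the reference.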

Circular genomes

Hi all,

Thank you for this amazing tool! I was wondering if you have any recommendations on how best to use CliqueSNV with circular viral genomes. Are there any pitfalls we need to avoid in this case?

Thank you for your help,

Alise
