hulk's Introduction

hulk's People

Contributors

dependabot[bot], luizirber, will-rowe

hulk's Issues

Sort "smash" sketches for distance matrix

Often my input files are named/numbered so that there is some meaning when they are sorted. The output from "smash" does not order the columns, so I get:

Dolphin_8.fa.sketch,Dolphin_1.fa.sketch,Dolphin_2.fa.sketch,Dolphin_3.fa.sketch,Dolphin_4.fa.sketch,Dolphin_6.fa.sketch,Dolphin_7.fa.sketch

It would help downstream analysis if the columns were ordered lexicographically. Can you add a "sort" to the output before printing?
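As a rough illustration of the request, the column names could be sorted before the header is printed. This is a minimal sketch and not hulk's code: `sortedSketchNames` is a hypothetical helper, and the names are taken from the example above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortedSketchNames (hypothetical helper) returns a lexicographically
// sorted copy of names, leaving the caller's slice untouched.
func sortedSketchNames(names []string) []string {
	out := make([]string, len(names))
	copy(out, names)
	sort.Strings(out)
	return out
}

func main() {
	header := []string{
		"Dolphin_8.fa.sketch", "Dolphin_1.fa.sketch", "Dolphin_2.fa.sketch",
	}
	fmt.Println(strings.Join(sortedSketchNames(header), ","))
}
```

In a real distance matrix the rows and values would need the same permutation as the header. Note also that a plain lexicographic sort places "Dolphin_10" before "Dolphin_2", so zero-padded numbers sort more predictably.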

Allow "fa" extension for FASTA input

In "cmd/sketch.go":

    if *fasta == true {
        suffix1, suffix2 = "fasta", "fna"
    }

I commonly use/see "fa" as a file extension and this is being rejected here. Maybe use a regex like this?

^f(ast|n)?a$

This allows fa, fna, and fasta.
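A minimal sketch of how the proposed pattern could be applied to a file name in Go. The helper `hasFastaExt` is hypothetical and not part of hulk, and stripping a trailing ".gz" first is an assumption added for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// fastaExt is the pattern proposed in the issue: it matches fa, fna and fasta.
var fastaExt = regexp.MustCompile(`^f(ast|n)?a$`)

// hasFastaExt (hypothetical helper) reports whether name ends in a FASTA
// extension. Assumption: a trailing ".gz" is ignored before checking.
func hasFastaExt(name string) bool {
	name = strings.TrimSuffix(name, ".gz")
	ext := strings.TrimPrefix(filepath.Ext(name), ".")
	return fastaExt.MatchString(ext)
}

func main() {
	for _, f := range []string{"genome.fa", "genome.fna", "genome.fasta.gz", "reads.fastq"} {
		fmt.Printf("%s: %v\n", f, hasFastaExt(f))
	}
}
```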

panic when sketching

Dear Will,

With the newest Go version, 1.16, I get the following error:

(base) [jianshu@c391 fasta]$ hulk sketch --fasta -f S_Baltica_OS675.fna -o ./S_Baltica_OS675.hulk.sketch
2021/12/26 18:24:00 this is hulk (version 1.0.0)
2021/12/26 18:24:00 please cite Rowe et al. 2019, doi: https://doi.org/10.1186/s40168-019-0653-2
2021/12/26 18:24:00 starting the sketch subcommand
2021/12/26 18:24:00 checking parameters...
2021/12/26 18:24:00 mode: FASTA
2021/12/26 18:24:00 no. processors: 1
2021/12/26 18:24:00 minimizer k-mer size: 21
2021/12/26 18:24:00 minimizer window size: 9
2021/12/26 18:24:00 sketch size: 50
2021/12/26 18:24:00 streaming: disabled
2021/12/26 18:24:00 concept drift: disabled
2021/12/26 18:24:00 number of bins in k-mer spectrum: 194481
2021/12/26 18:24:00 adding KHF sketch: false
2021/12/26 18:24:00 adding KMV sketch: false
2021/12/26 18:24:00 initialising sketching pipeline...
2021/12/26 18:24:00 initialising the processes
2021/12/26 18:24:00 connecting data streams
2021/12/26 18:24:00 number of processes added to the sketching pipeline: 4
2021/12/26 18:24:00 number of minions in the sketching pool: 1
2021/12/26 18:24:00 finding minimizers...
2021/12/26 18:24:00 generating final histosketch of k-mer spectra...
panic: send on closed channel

goroutine 8 [running]:
github.com/will-rowe/hulk/src/pipeline.(*Minion).Start.func1(0xc00010c0f0)
/opt/conda/conda-bld/hulk_1563274818218/work/src/github.com/will-rowe/hulk/src/pipeline/minion.go:56 +0x5e
created by github.com/will-rowe/hulk/src/pipeline.(*Minion).Start
/opt/conda/conda-bld/hulk_1563274818218/work/src/github.com/will-rowe/hulk/src/pipeline/minion.go:35 +0x3f

Is that because of the new version of Go?

Thanks,

Jianshu
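For context, the Go runtime raises "panic: send on closed channel" whenever a goroutine sends on a channel that has already been closed. The sketch below is a general illustration of one common way to avoid that, not a diagnosis of hulk's minion.go: the output channel is closed only after every sending goroutine has finished. The function `sumSquares` and its workload are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares (hypothetical example) fans inputs out to a pool of workers
// and sums their results. It assumes workers >= 1.
func sumSquares(inputs []int, workers int) int {
	jobs := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := range jobs {
				results <- v * v // safe: results is still open here
			}
		}()
	}

	// The only sender on jobs is also the one that closes it.
	go func() {
		for _, v := range inputs {
			jobs <- v
		}
		close(jobs)
	}()

	// Close results only once all workers (the senders) have returned.
	go func() {
		wg.Wait()
		close(results)
	}()

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3, 4}, 2))
}
```

The design choice here is that a channel is closed by its sender side, and when there are several senders a sync.WaitGroup decides when the last one is done.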

Option to print "smash" to STDOUT?

I need to massage the output from "smash" to create a distance matrix in the same form as that from Mash so I can reuse downstream visualization code. Right now, I specify a prefix for the output file, run "smash", look for the expected file name, read that file, and create a new one. If there were a "--stdout" type flag for "smash", I could capture the text directly and forgo all of that. Maybe this is a separate issue, but it would be nice if the default smash output were closer to the kind of matrix that Mash and other tools create (e.g., add the sample names to the rows, allow commas or tabs as separators).

add minimizer sketch to the minimizer package

At the moment, the minimizer package collects minimizers but doesn't retain their ordering from the sequence they were derived from. This is because I am using a set implementation based on a map.

If I can't find a Go set implementation that maintains set order, I will have to roll my own.
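One common way to roll an insertion-ordered set in Go is to pair a map (for membership tests) with a slice (for order). The sketch below is an illustration under that assumption; the type and method names are hypothetical and not taken from hulk's minimizer package.

```go
package main

import "fmt"

// orderedSet (hypothetical type) keeps unique uint64 values in the order
// they were first added.
type orderedSet struct {
	seen  map[uint64]struct{}
	items []uint64
}

func newOrderedSet() *orderedSet {
	return &orderedSet{seen: make(map[uint64]struct{})}
}

// Add inserts v if it is not already present and reports whether it was added.
func (s *orderedSet) Add(v uint64) bool {
	if _, ok := s.seen[v]; ok {
		return false
	}
	s.seen[v] = struct{}{}
	s.items = append(s.items, v)
	return true
}

// Contains reports whether v is in the set.
func (s *orderedSet) Contains(v uint64) bool {
	_, ok := s.seen[v]
	return ok
}

// Items returns a copy of the values in insertion order.
func (s *orderedSet) Items() []uint64 {
	out := make([]uint64, len(s.items))
	copy(out, s.items)
	return out
}

func main() {
	s := newOrderedSet()
	for _, v := range []uint64{9, 3, 9, 7} {
		s.Add(v)
	}
	fmt.Println(s.Items())
}
```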

Possible bug in weighted Jaccard distance calculation

I was looking at the code for the weighted Jaccard calculation and I noticed that weightA and weightB are set to the same value. They are compared subsequently, but never reassigned or altered. This looks like a bug--a copy/paste error perhaps?

    weightA := math.Max(math.Max(SketchStore.SketchWeights[i], 0), math.Max(-SketchStore.SketchWeights[i], 0))
    weightB := math.Max(math.Max(SketchStore.SketchWeights[i], 0), math.Max(-SketchStore.SketchWeights[i], 0))
    // get the intersection and union values
    if SketchStore.Sketch[i] == SketchStore2.Sketch[i] {
        if weightA < weightB {
            intersect += weightA
            union += weightB
        } else {
            intersect += weightB
            union += weightA
        }

Support for Fasta input

Greetings!

I enjoyed reading the Hulk preprint, and I have successfully installed the conda package and am taking it for a test drive. Great work!

One of the first things I tried was to sketch a reference genome assembly in gzip-compressed FASTA format. It doesn't appear that compression is an issue, but hulk won't sketch sequences in FASTA format. Is there any plan to support FASTA data in the future?

compress homopolymer runs

The minimizer algorithm is fairly simple at the moment. If using HULK with long reads, it is probably a good idea to compress the homopolymer runs when collecting minimizers.
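Homopolymer compression collapses each run of identical bases into a single base before k-mers are extracted. A minimal sketch, assuming the sequence is an uppercase byte slice; `compressHomopolymers` is a hypothetical name and not a hulk function.

```go
package main

import "fmt"

// compressHomopolymers (hypothetical helper) returns a copy of seq in which
// every run of identical bytes is reduced to one byte.
func compressHomopolymers(seq []byte) []byte {
	if len(seq) == 0 {
		return nil
	}
	out := make([]byte, 0, len(seq))
	out = append(out, seq[0])
	for i := 1; i < len(seq); i++ {
		// Keep a base only when it differs from the previous one.
		if seq[i] != seq[i-1] {
			out = append(out, seq[i])
		}
	}
	return out
}

func main() {
	fmt.Println(string(compressHomopolymers([]byte("AAACCGTTTTA"))))
}
```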

Paired-end FASTQ input: which file to choose (forward or reverse), and why are the scores not consistent?

Dear author,

First of all, thank you for providing this novel and interesting analysis tool. I do have some questions, however.

  1. Paired-end FASTQ data
    I have read the paper and the tutorial, but neither mentions which file of a pair to use (or I missed it). Do you have any suggestions?
    I tested both the forward and the reverse reads, and the results are not exactly the same. Since the quality of forward reads is generally higher, I used the forward FASTQ file. Would you recommend something else?

  2. The similarity matrix problem

I tested the software on my own MGS data and found that the matrix looks like this:

For example:
A    B    C    D
100  12   15   25
10   100  20   43
13   17   100  32
23   45   31   100

As I understand it, each value is a pairwise similarity, so the row names should also be A B C D. However, the values on either side of the diagonal do not match exactly; they are similar but not identical. For example, in row A the A-to-B value is 12, but in column A it is 10.

I would much appreciate any help with these issues.

Thank you very much
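The asymmetry described above can be checked programmatically. The sketch below lists every pair (i, j) whose two entries differ by more than a tolerance; `asymmetricPairs` is a hypothetical helper, it assumes a square matrix, and the values are copied from the example in this issue.

```go
package main

import (
	"fmt"
	"math"
)

// asymmetricPairs (hypothetical helper) returns the index pairs (i, j),
// with i < j, for which m[i][j] and m[j][i] differ by more than tol.
// Assumption: m is square.
func asymmetricPairs(m [][]float64, tol float64) [][2]int {
	var pairs [][2]int
	for i := range m {
		for j := i + 1; j < len(m); j++ {
			if math.Abs(m[i][j]-m[j][i]) > tol {
				pairs = append(pairs, [2]int{i, j})
			}
		}
	}
	return pairs
}

func main() {
	m := [][]float64{
		{100, 12, 15, 25},
		{10, 100, 20, 43},
		{13, 17, 100, 32},
		{23, 45, 31, 100},
	}
	for _, p := range asymmetricPairs(m, 0.5) {
		fmt.Printf("rows %d and %d: %v vs %v\n", p[0], p[1], m[p[0]][p[1]], m[p[1]][p[0]])
	}
}
```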

question about k-mer frequency

Dear will-rowe,
As described in your paper, HULK takes k-mer frequency into account. I find that a minimizer hash value cannot be added to the minimizerSketch when it is already contained in the sketch (minimizer.go, line 195), so the hash values in the minimizerSketch are unique. I understand that the same minimizer hash value should not be added twice within one window, but if two different windows share the same minimizer k-mer, shouldn't its frequency be counted?
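To make the question concrete, the sketch below counts how many distinct sequence positions are selected as a window minimizer for each hash value, so a k-mer that recurs at a different position is counted again while the same position shared by overlapping windows is counted once. This is an illustration only and not how hulk's minimizer.go works: `countMinimizers` is hypothetical, and the input is assumed to be a precomputed slice with one hash per k-mer position.

```go
package main

import "fmt"

// countMinimizers (hypothetical helper) slides a window of w consecutive
// k-mer hashes over hashes and counts, per hash value, how many distinct
// positions were chosen as a window minimum (leftmost on ties).
func countMinimizers(hashes []uint64, w int) map[uint64]int {
	counts := make(map[uint64]int)
	if w <= 0 || len(hashes) < w {
		return counts
	}
	lastPos := -1
	for start := 0; start+w <= len(hashes); start++ {
		minPos := start
		for i := start + 1; i < start+w; i++ {
			if hashes[i] < hashes[minPos] {
				minPos = i
			}
		}
		// Overlapping windows often pick the same position; count it once.
		if minPos != lastPos {
			counts[hashes[minPos]]++
			lastPos = minPos
		}
	}
	return counts
}

func main() {
	// The value 3 occurs at two different positions (1 and 3).
	fmt.Println(countMinimizers([]uint64{5, 3, 9, 3, 7}, 3))
}
```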

int conversions

There are a lot of conversions between integer types (uint64/32/8 and int). These are unnecessary and make the code harder to follow. I'll work on reducing them.

EncodeSeqFromPreviousKmer would speed up the encoding

Hi Will,

How about adding a function EncodeSeqFromPreviousKmer alongside EncodeSeq? It would save some time by encoding only the last base.

I've just tested it in my repo:

BenchmarkEncodeK32-16                           20000000                52.7 ns/op
BenchmarkEncodeFromPreviousKmerK32-16           200000000               9.37 ns/op
BenchmarkMustEncodeFromPreviousKmerK32-16       1000000000              1.98 ns/op

Wei
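The idea can be sketched with a 2-bit encoding: once a k-mer has been encoded, the next k-mer's code is the previous code shifted left by two bits, with the new base added and the oldest base masked off. This is a minimal illustration under that assumption and is neither Wei's implementation nor hulk's; the names `encodeBase`, `encodeKmer` and `encodeFromPrevious` and the A=0, C=1, G=2, T=3 mapping are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidBase = errors.New("invalid base")

// encodeBase maps a nucleotide to 2 bits (assumed mapping: A=0, C=1, G=2, T=3).
func encodeBase(b byte) (uint64, error) {
	switch b {
	case 'A', 'a':
		return 0, nil
	case 'C', 'c':
		return 1, nil
	case 'G', 'g':
		return 2, nil
	case 'T', 't':
		return 3, nil
	}
	return 0, errInvalidBase
}

// encodeKmer encodes a whole k-mer (1 <= k <= 32) base by base.
func encodeKmer(kmer []byte) (uint64, error) {
	if len(kmer) == 0 || len(kmer) > 32 {
		return 0, errors.New("k must be between 1 and 32")
	}
	var code uint64
	for _, b := range kmer {
		v, err := encodeBase(b)
		if err != nil {
			return 0, err
		}
		code = code<<2 | v
	}
	return code, nil
}

// encodeFromPrevious derives the next k-mer's code from the previous one
// by encoding only the newly added base.
func encodeFromPrevious(prev uint64, k int, next byte) (uint64, error) {
	if k < 1 || k > 32 {
		return 0, errors.New("k must be between 1 and 32")
	}
	v, err := encodeBase(next)
	if err != nil {
		return 0, err
	}
	// Keep the low 2k bits; for k = 32 the shift yields 0 and the mask is all ones.
	mask := uint64(1)<<(2*uint(k)) - 1
	return (prev<<2 | v) & mask, nil
}

func main() {
	seq := []byte("ACGTAC")
	k := 4
	code, err := encodeKmer(seq[:k])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i := k; i < len(seq); i++ {
		if code, err = encodeFromPrevious(code, k, seq[i]); err != nil {
			fmt.Println("error:", err)
			return
		}
		full, _ := encodeKmer(seq[i-k+1 : i+1])
		fmt.Printf("%s rolling=%d full=%d\n", seq[i-k+1:i+1], code, full)
	}
}
```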

Error reading sketches: file is corrupted (mismatched file names)

I've sketched a few read sets and want to try some of the other operations, namely print and distance. Whenever I attempt to invoke either of these operations, I get the following error message.

encountered error: file is corrupted (mismatched file names): sample1-k21.sketch

If I invoke the command without the .sketch extension then I just get a file not found error, so that doesn't seem to be the problem.

Does anything about the following commands reveal a mistake on my part? Or is this potentially a software issue?

hulk sketch --fastq seq/sample1.fastq -k 21 -o sketches/sample1-k21 -p 4
hulk sketch --fastq seq/sample2.fastq -k 21 -o sketches/sample2-k21 -p 4
hulk print -f sketches/sample1-k21.sketch
hulk distance -1 sketches/sample1-k21.sketch -2 sketches/sample2-k21.sketch

I'm running hulk 0.0.2 installed via conda on CentOS 6.
