Git Product home page Git Product logo

rust-debruijn's People

Contributors

adam-azarchs avatar daniel-liu-c0deb0t avatar dcroote avatar evolvedmicrobe avatar jgarthur avatar k3yavi avatar pmarks avatar pryvkin10x avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

rust-debruijn's Issues

DnaStringSlice should abstract away the backing DnaString rc status

The DnaStringSlice implementation is inconsistent when the slice is reverse complemented. Mer::get() is implemented (what I consider to be) correctly, but other functions only work if the slice has is_rc == false.

#[test]
fn dnastringslice_get_kmer() {
    let seq = DnaString::from_dna_string("ACGGTAC");
    let seqrc = DnaString::from_dna_string("GTACCGT");
    let rcslice = seq.slice(0, 7).rc();
    let slice = seqrc.slice(0, 7);
    for i in 0..=3 {
        // The kmer in a slice should be the kmer of the sequencing represented
        // by that slice, regardless of whether the backing DnaString is RC or not.
        assert_eq!(slice.get_kmer::<Kmer4>(i), rcslice.get_kmer::<Kmer4>(i));
    }
}
#[test]
fn dnastringslice_slice() {
    let seq = DnaString::from_dna_string("ACGGTAC");
    let seqrc = DnaString::from_dna_string("GTACCGT");
    let rcslice = seq.slice(0, 7).rc();
    let slice = seqrc.slice(0, 7);
    // The fact that a slice is backed by a DnaStringSlice that is
    // the rc of the slice sequence shouldn't matter.
    assert_eq!(rcslice.slice(1, 4), slice.slice(1, 4));
}

On a related topic, would it make sense to split the 2-bit encoding of a DNA string and associated kmer logic into their own crate? Two bit encoding and kmer counting is generically useful outside of de bruijn graph construction and is used for everything from kmer counting, error correction, mimimizer hash tables, to reference genome storage (e.g. sequence interval lookups in a memory maped 2bit encoded (http://genome.ucsc.edu/FAQ/FAQformat.html#format7) reference genome would be very efficient but unfortunately, you've chosen a different packed encoding^).

^ The UCSC encoding uses a bit encoding in which the MSB indicate a purine base, so complementing the sequence is XORing with 0xAAAAAAAA instead of flipping all bits which is the approach used here.

Documentation clarification

I'm new to rust and I have some API clarfications that I am unsure if it's a missing feature in this library, or something I've overlooked. Clarification or examples on the following points would be greatly appreciated:

  • is there an efficient way to calculate hamming distance on DnaStringSlice?

hamming_distance() requires a DnaString which means I need to convert to Vec<u8> then DnaString to perform what could be an only slightly more expensive operation than the current hamming_distance implementation.

  • How do I iterate over the kmers in a DnaStringSlice/DnaString?

I feel like I'm missing something really obvious here. I'm trying to do something like Kmer16::string_iter(&slice) but can't seem to find the right function or an example of this in the docs.

  • Is there a way to efficiently load a DnaString from a rust-htslib Seq object?

htslib uses a 4bit encoding and it seems the only approach is to decode back to the raw 8-bit u8 encoding then back 2bit format.

Thanks

Should we slice when continuous shards are going to the same bucket ?

Hi guys,

I was thinking of another optimization but wasn't sure of its impact on overall pipeline. The thought is the following:

If you look at this part of the msp bucketing, I propose to replace this with the following:

    let mut min_positions = Vec::with_capacity(16);
    let mut min_pos = find_min(0, k - p);
    let mut min_pos_history = min_pos;
    min_positions.push((0, min_pos));

    for i in 0..(m - k + 1) {
        if i > min_pos {
            min_pos = find_min(i, i + k - p);

            if pval(min_pos) != pval(min_pos_history) {
                min_pos_history = min_pos;
                min_positions.push((i, min_pos));
            }
        } else {
            let test_min = pmin(min_pos, i + k - p);
            if test_min != min_pos {
                unreachable!();
                //min_pos = test_min;
                //min_positions.push((i, min_pos));
            }
        }
    }

My thought was if continuous shards with different min_pos yet going to same bucket, do we have to splice the super sequence ? What I did here was maintain the first min_pos and if the next min_pos is going to the same bucket I don't add it as a splice point.
Please let me know your thoughts if it can create any corner case.

Problem with the repetitive sequence

Hi guys,

May be I misunderstood something, but I think there could be a potential bug in the debruijn graph generation with super repetitive sequence. Imagine the following case:

AAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAA

It's a 45 length sequence with the following sequence A + AAAAT* 8 + AAAA.

I am curious what should be the right debruijn graph for this with 31 kmer size and strandedness as false (there is a potential problem with that too as msp and filter kmer assumes complementary definition of strandedness flag but assume we correct for it) i.e. I do want to canonicalize the kmers,

If I ran it correctly then this library gives the following gfa:

H       VN:Z:debruijn-rs
S       0       AAAAATAAAATAAAATAAAATAAAATAAAAT
L       0       +       1       +       30M
S       1       AAAATAAAATAAAATAAAATAAAATAAAATAAAAT
P       ENST00000375105.7|ENSG00000117215.14|OTTHUMG00000002701.1|OTTHUMT00000007683.1|PLA2G2D-001|PLA2G2D|2672|UTR5:1-59|CDS:60-497|UTR3:498-2672|     0+,1+,1+        *

NOTE: I added the P flag and it was not available by default although L and S are available by default.
However twopaco, another library gives the following entries:

H       VN:Z:1.0
S       15      ATTTTATTTTATTTTATTTTATTTTATTTTT
S       8       AAAATAAAATAAAATAAAATAAAATAAAATAAA
S       12      ATTTTATTTTATTTTATTTTATTTTATTTTAT
P       ENST00000375105.7|ENSG00000117215.14|OTTHUMG00000002701.1|OTTHUMT00000007683.1|PLA2G2D-001|PLA2G2D|2672|UTR5:1-59|CDS:60-497|UTR3:498-2672|     15-,8+,12-,8+,12-,8+    *

If I follow the P values from the second gfa I can recreate the input fasta sequence. However, the size of the recreated sequence from the first gfa would be shortened by 4 bases, I think because of the repetitive sequences. Not sure how twopaco handles it (my hunch is by maintaining the reference sequence info ?) but I think this is a problem and mot sure how can we tackle it.

Let me know if you guys have any thoughts.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.