10xgenomics / rust-debruijn

De Bruijn graphs in Rust

License: MIT License

Rust 100.00%

rust-debruijn's Introduction

rust-debruijn

De Bruijn graph construction & path compression libraries.

Key features

2-bit packed fixed-length (Kmer) and variable-length (DnaString) sequence containers
Statically compiled code paths for different K values
Ability to track arbitrary auxiliary data through the DeBruijn graph
Customizable kmer counting & filtering schemes supporting a variety of use cases
DeBruijn graph compression
Minimum-substring partitioning to shard kmers for memory efficient counting and DeBruijn graph compression
Configurable for stranded and non-stranded input sequence
Extensive unit test suite
In production use in Supernova, Long Ranger, Cell Ranger, and Cell Ranger VDJ pipelines from 10x Genomics.

rust-debruijn's Issues

Documentation clarification

I'm new to Rust and have some questions about the API. For each one, I'm unsure whether it's a missing feature in this library or something I've overlooked. Clarification or examples on the following points would be greatly appreciated:

Is there an efficient way to calculate Hamming distance on a DnaStringSlice?

hamming_distance() requires a DnaString, which means I need to convert to Vec<u8> and then to DnaString to perform what could be an operation only slightly more expensive than the current hamming_distance implementation.
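
For illustration, here is a minimal sketch of how Hamming distance can be computed directly on 2-bit packed data with XOR and a popcount, without unpacking to bytes. It is not the crate's implementation: the helper names pack_2bit and hamming_2bit, the A=0, C=1, G=2, T=3 code, and the bit layout (base i in bits 2*i and 2*i+1 of a u64) are all assumptions made for this example.

```rust
/// Pack up to 32 bases into a u64, 2 bits per base (assumed code: A=0, C=1, G=2, T=3).
/// Base i occupies bits 2*i and 2*i+1. Returns None for longer input or other bytes.
fn pack_2bit(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut word = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        let code: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        word |= code << (2 * i);
    }
    Some(word)
}

/// Hamming distance between two packed words that hold the same number of bases.
fn hamming_2bit(a: u64, b: u64) -> u32 {
    let x = a ^ b;
    // A base differs if either of its two bits differs: fold the high bit of
    // each pair onto the low bit, then keep only the low bit of every pair.
    let differing = (x | (x >> 1)) & 0x5555_5555_5555_5555;
    differing.count_ones()
}
```

A slice-aware version would additionally have to mask off bases outside the slice bounds in the first and last word before counting.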

How do I iterate over the kmers in a DnaStringSlice/DnaString?

I feel like I'm missing something really obvious here. I'm trying to do something like Kmer16::string_iter(&slice) but can't seem to find the right function or an example of this in the docs.
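
With the calls that appear elsewhere in these issues, one option is to call get_kmer::<K>(i) for each valid start position i. The underlying idea, a rolling 2-bit window, can be sketched in self-contained form as below. The function names, the A=0, C=1, G=2, T=3 code, and the choice to put the first base in the most significant bits are assumptions for this example, not the crate's API.

```rust
/// Assumed 2-bit code: A=0, C=1, G=2, T=3. Any other byte yields None.
fn base_code(b: u8) -> Option<u64> {
    match b {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// All k-mers of `seq` as packed u64 values, first base in the highest bits.
/// Returns None if k is outside 1..=32 or the sequence holds a non-ACGT byte.
fn kmers_2bit(seq: &[u8], k: usize) -> Option<Vec<u64>> {
    if k == 0 || k > 32 {
        return None;
    }
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut out = Vec::with_capacity(seq.len().saturating_sub(k - 1));
    let mut cur = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        // Shift the previous window left by one base and append the new base.
        cur = ((cur << 2) | base_code(b)?) & mask;
        if i + 1 >= k {
            out.push(cur);
        }
    }
    Some(out)
}
```

Each step reuses the previous window, so the cost per k-mer is constant regardless of k.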

Is there a way to efficiently load a DnaString from a rust-htslib Seq object?

htslib uses a 4-bit encoding, and it seems the only approach is to decode back to the raw 8-bit u8 encoding and then re-encode to the 2-bit format.
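
A direct conversion is possible in principle, because the four unambiguous bases in the BAM 4-bit code are single-bit values (A=1, C=2, G=4, T=8, two bases per byte with the first base in the high nibble). The sketch below works on a raw byte slice rather than a rust-htslib type. The function names and the 2-bit target code A=0, C=1, G=2, T=3 are assumptions for this example, and ambiguity codes such as N are rejected rather than mapped.

```rust
/// Map one BAM 4-bit base code to an assumed 2-bit code (A=0, C=1, G=2, T=3).
/// Ambiguity codes (N, R, Y, ...) and '=' have no 2-bit equivalent: None.
fn nibble_to_2bit(n: u8) -> Option<u8> {
    match n {
        1 => Some(0), // A
        2 => Some(1), // C
        4 => Some(2), // G
        8 => Some(3), // T
        _ => None,
    }
}

/// Convert `len` bases from 4-bit packed bytes into one 2-bit code per element.
/// Returns None if `len` exceeds the packed data or any base is ambiguous.
fn decode_4bit_to_2bit(packed: &[u8], len: usize) -> Option<Vec<u8>> {
    if len > packed.len() * 2 {
        return None;
    }
    (0..len)
        .map(|i| {
            let byte = packed[i / 2];
            // Even positions live in the high nibble, odd ones in the low nibble.
            let nibble = if i % 2 == 0 { byte >> 4 } else { byte & 0x0F };
            nibble_to_2bit(nibble)
        })
        .collect()
}
```

How ambiguous bases should be handled (reject, skip, or substitute) is a policy decision that this sketch leaves to the caller.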

Thanks

Should we slice when contiguous shards are going to the same bucket?

Hi guys,

I was thinking of another optimization but wasn't sure of its impact on the overall pipeline. The idea is the following:

Looking at this part of the msp bucketing, I propose replacing it with the following:

    let mut min_positions = Vec::with_capacity(16);
    let mut min_pos = find_min(0, k - p);
    let mut min_pos_history = min_pos;
    min_positions.push((0, min_pos));

    for i in 0..(m - k + 1) {
        if i > min_pos {
            min_pos = find_min(i, i + k - p);

            if pval(min_pos) != pval(min_pos_history) {
                min_pos_history = min_pos;
                min_positions.push((i, min_pos));
            }
        } else {
            let test_min = pmin(min_pos, i + k - p);
            if test_min != min_pos {
                unreachable!();
                //min_pos = test_min;
                //min_positions.push((i, min_pos));
            }
        }
    }

My thought was: if contiguous shards have different min_pos values yet go to the same bucket, do we have to slice the super sequence? What I did here was keep the first min_pos, and if the next min_pos goes to the same bucket I don't add it as a slice point.
Please let me know your thoughts on whether this can create any corner cases.
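
To make the proposal concrete, here is a simplified, self-contained sketch of the grouping idea: consecutive k-mers are kept in one run as long as their minimum p-mer has the same value, even when it sits at a different position. It uses plain lexicographic order instead of the library's pval ordering and a quadratic scan instead of the incremental find_min logic; both simplifications, and the names min_pmer and msp_runs, are assumptions for this example.

```rust
/// Smallest p-mer (lexicographic order) inside the k-mer starting at `start`.
/// Callers must ensure 1 <= p <= k and start + k <= seq.len().
fn min_pmer(seq: &[u8], start: usize, k: usize, p: usize) -> &[u8] {
    (start..=start + k - p)
        .map(|i| &seq[i..i + p])
        .min()
        .expect("window holds at least one p-mer")
}

/// Group consecutive k-mers whose minimum p-mer has the same value.
/// Each entry is (index of first k-mer, number of k-mers, minimum p-mer).
fn msp_runs(seq: &[u8], k: usize, p: usize) -> Vec<(usize, usize, String)> {
    let mut runs: Vec<(usize, usize, String)> = Vec::new();
    if p == 0 || p > k || seq.len() < k {
        return runs;
    }
    for i in 0..=seq.len() - k {
        let m = String::from_utf8_lossy(min_pmer(seq, i, k, p)).into_owned();
        if let Some(last) = runs.last_mut() {
            if last.2 == m {
                // Same minimizer value, possibly at a new position: extend the run.
                last.1 += 1;
                continue;
            }
        }
        runs.push((i, 1, m));
    }
    runs
}
```

For the input CACCAC with k = 4 and p = 2, the minimum p-mer AC moves from position 1 to position 4, yet all three k-mers stay in a single run.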

max_path ignores path validity when choosing next node

In the method graph::DebruijnGraph::max_path, the code only uses the passed solid_path function to check if there is only one valid option, but ignores which one it is when choosing the next node.

let mut solid_paths = 0;
for (id, dir, _) in edges {
    let cand = Some((id, dir));
    if osolid_path(cand) {
        solid_paths += 1;
    }

    if oscore(cand) > oscore(next) {
        next = cand;
    }
}

if solid_paths > 1 {
    break;
}

If none of the edges are solid paths, it will still choose the neighbouring node with the highest score as the next one. If solid_path returns true for only one node, the next node is still chosen solely based on the result of score.
This could probably be fixed by moving the second if block inside the first, unless, of course, I misunderstood the purpose of the solid_path function.
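
The suggested change can be sketched in isolation as follows. This is one reading of the proposal, not the library's code: choose_next, the local Dir enum, the i64 score type, and the plain closures standing in for score and solid_path are all assumptions for this example.

```rust
/// Stand-in for the library's edge direction type (hypothetical here).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Dir {
    Left,
    Right,
}

/// Pick the next node among solid candidates only.
/// Returns None when no candidate is solid or when more than one is
/// (the case in which the original loop breaks).
fn choose_next<S, P>(edges: &[(usize, Dir)], score: S, solid_path: P) -> Option<(usize, Dir)>
where
    S: Fn((usize, Dir)) -> i64,
    P: Fn((usize, Dir)) -> bool,
{
    let mut solid_paths = 0;
    let mut next: Option<(usize, Dir)> = None;
    for &cand in edges {
        if solid_path(cand) {
            solid_paths += 1;
            // The score comparison now sits inside the solid-path check,
            // so a non-solid candidate can never become `next`.
            if next.map_or(true, |n| score(cand) > score(n)) {
                next = Some(cand);
            }
        }
    }
    if solid_paths > 1 {
        None
    } else {
        next
    }
}
```

With this shape, a high-scoring candidate that is not a solid path is ignored, and a walk with no solid candidate stops instead of continuing along the best-scoring edge.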

Problem with the repetitive sequence

Hi guys,

Maybe I misunderstood something, but I think there could be a bug in the de Bruijn graph generation for highly repetitive sequences. Consider the following case:

AAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAA

It's a sequence of length 45 with the structure A + AAAAT * 8 + AAAA.

I am curious what the correct de Bruijn graph for this sequence should be with a kmer size of 31 and strandedness set to false, i.e. I do want to canonicalize the kmers. (There is a potential problem with that too, as msp and filter kmer assume complementary definitions of the strandedness flag, but assume we correct for it.)

If I ran it correctly then this library gives the following gfa:

H       VN:Z:debruijn-rs
S       0       AAAAATAAAATAAAATAAAATAAAATAAAAT
L       0       +       1       +       30M
S       1       AAAATAAAATAAAATAAAATAAAATAAAATAAAAT
P       ENST00000375105.7|ENSG00000117215.14|OTTHUMG00000002701.1|OTTHUMT00000007683.1|PLA2G2D-001|PLA2G2D|2672|UTR5:1-59|CDS:60-497|UTR3:498-2672|     0+,1+,1+        *

NOTE: I added the P line myself; it is not emitted by default, although the L and S lines are.
However, twopaco, another library, gives the following entries:

H       VN:Z:1.0
S       15      ATTTTATTTTATTTTATTTTATTTTATTTTT
S       8       AAAATAAAATAAAATAAAATAAAATAAAATAAA
S       12      ATTTTATTTTATTTTATTTTATTTTATTTTAT
P       ENST00000375105.7|ENSG00000117215.14|OTTHUMG00000002701.1|OTTHUMT00000007683.1|PLA2G2D-001|PLA2G2D|2672|UTR5:1-59|CDS:60-497|UTR3:498-2672|     15-,8+,12-,8+,12-,8+    *

If I follow the P values from the second gfa, I can recreate the input fasta sequence. However, the sequence recreated from the first gfa would be 4 bases shorter, I think because of the repetitive sequence. I'm not sure how twopaco handles it (my hunch is by maintaining the reference sequence info?), but I think this is a problem and I'm not sure how we can tackle it.

Let me know if you guys have any thoughts.
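
The shortfall for the first gfa can be checked by spelling out its path 0+,1+,1+ with the 30-base overlap given on its L line. The sketch below does this for forward-oriented segments only. spell_path is a hypothetical helper written for this example, not part of either library, and it assumes that every consecutive pair on the path overlaps by the same fixed length.

```rust
/// Spell the sequence of a path over forward-oriented segments, where each
/// consecutive pair overlaps by exactly `overlap` bases (ASCII input only).
/// Returns None for an unknown segment id or a mismatching overlap.
fn spell_path(segments: &[&str], path: &[usize], overlap: usize) -> Option<String> {
    let mut out = String::new();
    for (step, &id) in path.iter().enumerate() {
        let seg = *segments.get(id)?;
        if step == 0 {
            out.push_str(seg);
            continue;
        }
        if seg.len() < overlap || out.len() < overlap {
            return None;
        }
        // The link is only consistent if the tail spelled so far equals the
        // head of the next segment.
        if out[out.len() - overlap..] != seg[..overlap] {
            return None;
        }
        out.push_str(&seg[overlap..]);
    }
    Some(out)
}
```

Called with segment 0 (A followed by six copies of AAAAT), segment 1 (seven copies of AAAAT), the path [0, 1, 1] and an overlap of 30, it yields A followed by eight copies of AAAAT: 41 bases, which is the 45-base input without its trailing AAAA.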

DnaStringSlice should abstract away the backing DnaString rc status

The DnaStringSlice implementation is inconsistent when the slice is reverse complemented. Mer::get() is implemented (what I consider to be) correctly, but other functions only work if the slice has is_rc == false.

#[test]
fn dnastringslice_get_kmer() {
    let seq = DnaString::from_dna_string("ACGGTAC");
    let seqrc = DnaString::from_dna_string("GTACCGT");
    let rcslice = seq.slice(0, 7).rc();
    let slice = seqrc.slice(0, 7);
    for i in 0..=3 {
        // The kmer in a slice should be the kmer of the sequence represented
        // by that slice, regardless of whether the backing DnaString is RC or not.
        assert_eq!(slice.get_kmer::<Kmer4>(i), rcslice.get_kmer::<Kmer4>(i));
    }
}
#[test]
fn dnastringslice_slice() {
    let seq = DnaString::from_dna_string("ACGGTAC");
    let seqrc = DnaString::from_dna_string("GTACCGT");
    let rcslice = seq.slice(0, 7).rc();
    let slice = seqrc.slice(0, 7);
    // The fact that a slice is backed by a DnaString that is
    // the rc of the slice sequence shouldn't matter.
    assert_eq!(rcslice.slice(1, 4), slice.slice(1, 4));
}
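
One way to get the behaviour these tests ask for is to translate every logical coordinate into backing-store coordinates in a single place. The toy type below does that over ASCII bytes. RcSlice and its methods are hypothetical and written for this example; they do not reflect the crate's internals or its 2-bit storage.

```rust
fn complement(b: u8) -> u8 {
    match b {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' => b'A',
        other => other,
    }
}

/// A view over `backing` that may be reverse complemented.
/// All public indices refer to the sequence the view represents.
#[derive(Clone, Copy, Debug)]
struct RcSlice<'a> {
    backing: &'a [u8],
    start: usize,
    len: usize,
    is_rc: bool,
}

impl<'a> RcSlice<'a> {
    fn new(backing: &'a [u8]) -> Self {
        RcSlice { backing, start: 0, len: backing.len(), is_rc: false }
    }

    fn rc(self) -> Self {
        RcSlice { is_rc: !self.is_rc, ..self }
    }

    fn get(&self, i: usize) -> u8 {
        assert!(i < self.len);
        if self.is_rc {
            // Logical position i is read from the far end and complemented.
            complement(self.backing[self.start + self.len - 1 - i])
        } else {
            self.backing[self.start + i]
        }
    }

    /// Sub-slice [start, end) in logical coordinates.
    fn slice(&self, start: usize, end: usize) -> Self {
        assert!(start <= end && end <= self.len);
        let new_start = if self.is_rc {
            self.start + self.len - end
        } else {
            self.start + start
        };
        RcSlice { backing: self.backing, start: new_start, len: end - start, is_rc: self.is_rc }
    }

    fn to_vec(&self) -> Vec<u8> {
        (0..self.len).map(|i| self.get(i)).collect()
    }
}

/// Two views are equal when they represent the same sequence.
impl PartialEq for RcSlice<'_> {
    fn eq(&self, other: &Self) -> bool {
        self.len == other.len && (0..self.len).all(|i| self.get(i) == other.get(i))
    }
}
```

Because equality compares the represented sequence, a reverse-complemented view of ACGGTAC and a forward view of GTACCGT compare equal, and so do their sub-slices.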

On a related topic, would it make sense to split the 2-bit encoding of a DNA string and the associated kmer logic into their own crate? Two-bit encoding and kmer handling are generically useful outside of de Bruijn graph construction and are used for everything from kmer counting and error correction to minimizer hash tables and reference genome storage (e.g. sequence interval lookups in a memory-mapped 2bit-encoded (http://genome.ucsc.edu/FAQ/FAQformat.html#format7) reference genome would be very efficient, but unfortunately you've chosen a different packed encoding^).

^ The UCSC encoding uses a bit encoding in which the MSB indicates a purine base, so complementing the sequence means XORing with 0xAAAAAAAA instead of flipping all bits, which is the approach used here.
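
The difference between the two complement operations can be sketched as follows. The UCSC code is T=0, C=1, A=2, G=3; the code A=0, C=1, G=2, T=3 used for the "flip all bits" case is an assumption consistent with the description above. The function names and the 16-bases-per-u32 packing are chosen for this example only.

```rust
/// Assumed code for the "flip all bits" scheme: A=0, C=1, G=2, T=3.
fn code_acgt(b: u8) -> Option<u32> {
    match b {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// UCSC .2bit code: T=0, C=1, A=2, G=3 (high bit set for purines).
fn code_ucsc(b: u8) -> Option<u32> {
    match b {
        b'T' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

/// Pack exactly 16 bases into a u32 using the given per-base code.
fn pack16(seq: &[u8], code: fn(u8) -> Option<u32>) -> Option<u32> {
    if seq.len() != 16 {
        return None;
    }
    let mut word = 0u32;
    for &b in seq {
        word = (word << 2) | code(b)?;
    }
    Some(word)
}

/// Complement under A=0, C=1, G=2, T=3: every bit is inverted.
fn complement_flip(word: u32) -> u32 {
    !word
}

/// Complement under the UCSC code: only the high bit of each pair is toggled.
fn complement_ucsc(word: u32) -> u32 {
    word ^ 0xAAAA_AAAA
}
```

Both are single-instruction operations on a full word; the encodings differ only in which bit pattern represents which base, so converting between them requires a per-base remap.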
