10xgenomics / rust-debruijn Goto Github PK
View Code? Open in Web Editor NEWDe Bruijn graphs in Rust
License: MIT License
De Bruijn graphs in Rust
License: MIT License
The DnaStringSlice
implementation is inconsistent when the slice is reverse complemented. Mer::get()
is implemented (what I consider to be) correctly, but other functions only work if the slice has is_rc == false
.
#[test]
fn dnastringslice_get_kmer() {
let seq = DnaString::from_dna_string("ACGGTAC");
let seqrc = DnaString::from_dna_string("GTACCGT");
let rcslice = seq.slice(0, 7).rc();
let slice = seqrc.slice(0, 7);
for i in 0..=3 {
// The kmer in a slice should be the kmer of the sequencing represented
// by that slice, regardless of whether the backing DnaString is RC or not.
assert_eq!(slice.get_kmer::<Kmer4>(i), rcslice.get_kmer::<Kmer4>(i));
}
}
#[test]
fn dnastringslice_slice() {
let seq = DnaString::from_dna_string("ACGGTAC");
let seqrc = DnaString::from_dna_string("GTACCGT");
let rcslice = seq.slice(0, 7).rc();
let slice = seqrc.slice(0, 7);
// The fact that a slice is backed by a DnaStringSlice that is
// the rc of the slice sequence shouldn't matter.
assert_eq!(rcslice.slice(1, 4), slice.slice(1, 4));
}
On a related topic, would it make sense to split the 2-bit encoding of a DNA string and associated kmer logic into their own crate? Two bit encoding and kmer counting is generically useful outside of de bruijn graph construction and is used for everything from kmer counting, error correction, mimimizer hash tables, to reference genome storage (e.g. sequence interval lookups in a memory maped 2bit encoded (http://genome.ucsc.edu/FAQ/FAQformat.html#format7) reference genome would be very efficient but unfortunately, you've chosen a different packed encoding^).
^ The UCSC encoding uses a bit encoding in which the MSB indicate a purine base, so complementing the sequence is XORing with 0xAAAAAAAA instead of flipping all bits which is the approach used here.
I'm new to rust and I have some API clarfications that I am unsure if it's a missing feature in this library, or something I've overlooked. Clarification or examples on the following points would be greatly appreciated:
hamming_distance()
requires a DnaString
which means I need to convert to Vec<u8>
then DnaString
to perform what could be an only slightly more expensive operation than the current hamming_distance
implementation.
I feel like I'm missing something really obvious here. I'm trying to do something like Kmer16::string_iter(&slice)
but can't seem to find the right function or an example of this in the docs.
DnaString
from a rust-htslib
Seq
object?htslib uses a 4bit encoding and it seems the only approach is to decode back to the raw 8-bit u8 encoding then back 2bit format.
Thanks
Hi guys,
I was thinking of another optimization but wasn't sure of its impact on overall pipeline. The thought is the following:
If you look at this part of the msp bucketing, I propose to replace this with the following:
let mut min_positions = Vec::with_capacity(16);
let mut min_pos = find_min(0, k - p);
let mut min_pos_history = min_pos;
min_positions.push((0, min_pos));
for i in 0..(m - k + 1) {
if i > min_pos {
min_pos = find_min(i, i + k - p);
if pval(min_pos) != pval(min_pos_history) {
min_pos_history = min_pos;
min_positions.push((i, min_pos));
}
} else {
let test_min = pmin(min_pos, i + k - p);
if test_min != min_pos {
unreachable!();
//min_pos = test_min;
//min_positions.push((i, min_pos));
}
}
}
My thought was if continuous shards with different min_pos yet going to same bucket, do we have to splice the super sequence ? What I did here was maintain the first min_pos and if the next min_pos is going to the same bucket I don't add it as a splice point.
Please let me know your thoughts if it can create any corner case.
Hi guys,
May be I misunderstood something, but I think there could be a potential bug in the debruijn graph generation with super repetitive sequence. Imagine the following case:
AAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAA
It's a 45 length sequence with the following sequence A
+ AAAAT
* 8 + AAAA
.
I am curious what should be the right debruijn graph for this with 31 kmer size and strandedness as false (there is a potential problem with that too as msp and filter kmer assumes complementary definition of strandedness flag but assume we correct for it) i.e. I do want to canonicalize the kmers,
If I ran it correctly then this library gives the following gfa:
H VN:Z:debruijn-rs
S 0 AAAAATAAAATAAAATAAAATAAAATAAAAT
L 0 + 1 + 30M
S 1 AAAATAAAATAAAATAAAATAAAATAAAATAAAAT
P ENST00000375105.7|ENSG00000117215.14|OTTHUMG00000002701.1|OTTHUMT00000007683.1|PLA2G2D-001|PLA2G2D|2672|UTR5:1-59|CDS:60-497|UTR3:498-2672| 0+,1+,1+ *
NOTE: I added the P
flag and it was not available by default although L
and S
are available by default.
However twopaco, another library gives the following entries:
H VN:Z:1.0
S 15 ATTTTATTTTATTTTATTTTATTTTATTTTT
S 8 AAAATAAAATAAAATAAAATAAAATAAAATAAA
S 12 ATTTTATTTTATTTTATTTTATTTTATTTTAT
P ENST00000375105.7|ENSG00000117215.14|OTTHUMG00000002701.1|OTTHUMT00000007683.1|PLA2G2D-001|PLA2G2D|2672|UTR5:1-59|CDS:60-497|UTR3:498-2672| 15-,8+,12-,8+,12-,8+ *
If I follow the P values from the second gfa I can recreate the input fasta sequence. However, the size of the recreated sequence from the first gfa would be shortened by 4 bases, I think because of the repetitive sequences. Not sure how twopaco handles it (my hunch is by maintaining the reference sequence info ?) but I think this is a problem and mot sure how can we tackle it.
Let me know if you guys have any thoughts.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.