biojulia / genomegraphs.jl
A modern genomics framework for Julia
Home Page: https://biojulia.dev
License: Other
We need to be able to tell whether a genome assembly has correctly incorporated all of the sampled motifs into the graph. We do this by comparing k-mer spectra.
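As a rough illustration of the idea, the sketch below counts k-mers in a set of reads and in an assembly, builds the spectrum of each (how many distinct k-mers occur once, twice, and so on), and lists the k-mers that the reads contain but the assembly lacks. It works on plain strings and does not use the package's own types; the names `kmer_counts`, `spectrum`, and `missing_kmers` are hypothetical and chosen only for this example.

```julia
# Count every k-mer (substring of length k) in a collection of sequences.
# Plain ASCII strings stand in for real sequence types (an assumption).
function kmer_counts(seqs::Vector{String}, k::Int)
    counts = Dict{String,Int}()
    for s in seqs
        for i in 1:(length(s) - k + 1)
            kmer = s[i:i+k-1]
            counts[kmer] = get(counts, kmer, 0) + 1
        end
    end
    return counts
end

# The spectrum maps a multiplicity to the number of distinct k-mers
# that occur with that multiplicity.
function spectrum(counts::Dict{String,Int})
    hist = Dict{Int,Int}()
    for c in values(counts)
        hist[c] = get(hist, c, 0) + 1
    end
    return hist
end

# K-mers present in the reads but absent from the assembly.
function missing_kmers(read_counts::Dict{String,Int}, asm_counts::Dict{String,Int})
    return sort([kmer for kmer in keys(read_counts) if !haskey(asm_counts, kmer)])
end

reads = ["ACGTAC", "GTACGG"]
assembly = ["ACGTACG"]
rc = kmer_counts(reads, 3)
ac = kmer_counts(assembly, 3)
println(missing_kmers(rc, ac))   # k-mers the assembly failed to incorporate
```

A real comparison would use canonical k-mers and much larger k; this toy version ignores strand.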
We have to implement functions for merging nodes and collapsing edges along a simple path in a DeBruijnGraph-type graph.
To do this, the following steps are important:
We have to make sure that the implemented query functions work properly. In particular, we have to check that the functions are compatible with the different primitive types (dnasequence and kmers).
More testing is necessary for the following functions:
We need a process to map short reads to the GenomeGraph.
This should be fairly simple to achieve.
Paired reads are generally very accurate, so it should be possible to map them reasonably well using an index of the unique kmers in the graph.
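One way to picture the unique-kmer index is the sketch below. It assumes node sequences are plain strings, ignores reverse complements, and assigns a read to the node that collects the most unique-kmer hits. The names `unique_kmer_index` and `map_read` are hypothetical and not part of the package.

```julia
# Build an index of k-mers that occur exactly once across all node sequences.
# Each surviving k-mer maps to (node id, offset within the node).
function unique_kmer_index(nodes::Vector{String}, k::Int)
    index = Dict{String,Tuple{Int,Int}}()
    repeated = Set{String}()
    for (id, s) in enumerate(nodes)
        for i in 1:(length(s) - k + 1)
            kmer = s[i:i+k-1]
            if haskey(index, kmer) || kmer in repeated
                delete!(index, kmer)      # seen before, so it is not unique
                push!(repeated, kmer)
            else
                index[kmer] = (id, i)
            end
        end
    end
    return index
end

# Map a read to the node with the most unique k-mer hits; 0 means unmapped.
function map_read(read::String, index::Dict{String,Tuple{Int,Int}}, k::Int)
    votes = Dict{Int,Int}()
    for i in 1:(length(read) - k + 1)
        hit = get(index, read[i:i+k-1], nothing)
        hit === nothing && continue
        votes[hit[1]] = get(votes, hit[1], 0) + 1
    end
    best, bestvotes = 0, 0
    for (id, v) in votes
        if v > bestvotes
            best, bestvotes = id, v
        end
    end
    return best
end

nodes = ["ACGTTGCA", "GGATCCAT"]
idx = unique_kmer_index(nodes, 4)
println(map_read("GATCCA", idx, 4))
```

For paired reads, each mate could be mapped this way and the pair accepted only if the two placements are consistent; that step is left out here.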
We need the ability to save WorkSpaces to a file and to load them again.
We need to implement a basic de Bruijn graph type, along with constructors and basic manipulation methods.
I am not sure whether this is a design decision, but I think there is a typo in the add_node! function in SequenceGraph.jl:
function add_node!(sg::SequenceGraph, n::SequenceGraphNode)
newlen = length(push!(nodes(sg)))
resize!(links(sg), newlen)
return newlen
end
I think the push! call should also include n, as follows:
newlen = length(push!(nodes(sg),n))
I am pushing an updated issue to the gsoc branch
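To make the proposed fix concrete, here is a self-contained toy version. `ToyNode` and `ToyGraph` are hypothetical stand-ins, not the package's `SequenceGraphNode` and `SequenceGraph`, and the link list is grown with `push!` of an empty vector rather than `resize!` purely to keep the toy safe to index.

```julia
# Hypothetical stand-in types, used only for this illustration.
struct ToyNode
    seq::String
end

struct ToyGraph
    nodes::Vector{ToyNode}
    links::Vector{Vector{Int}}
end

ToyGraph() = ToyGraph(ToyNode[], Vector{Int}[])

nodes(g::ToyGraph) = g.nodes
links(g::ToyGraph) = g.links

# The node n is passed to push!, so it is actually stored in the graph.
function add_node!(g::ToyGraph, n::ToyNode)
    newlen = length(push!(nodes(g), n))
    push!(links(g), Int[])    # toy choice: every new node starts with no links
    return newlen             # the new node's ID
end

g = ToyGraph()
first_id = add_node!(g, ToyNode("ACGT"))
second_id = add_node!(g, ToyNode("GGCA"))
println((first_id, second_id))
```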
The method currently used for initial graph construction from kmers is now much more efficient.
However, it still suffers from the circle problem: kmers that have themselves as a de Bruijn neighbour, forwards and backwards, are not included in the unitigs.
This is a fairly well-known and simple problem to solve.
Since it is very rare in real data, the algorithm currently just warns you. We should fix it later.
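The simplest instance of such a kmer can be detected with the sketch below. It is one interpretation of the circle problem, not the package's check: a kmer counts as its own neighbour when one of its four possible extensions has the same canonical form as the kmer itself. Plain strings over A, C, G, T are assumed, and `self_neighbours` is a hypothetical name.

```julia
# Complement of a base; input is assumed to contain only A, C, G, T.
complement(c::Char) = c == 'A' ? 'T' : c == 'C' ? 'G' : c == 'G' ? 'C' : 'A'
revcomp(s::String) = join(complement(c) for c in reverse(s))
canonical(s::String) = min(s, revcomp(s))

# Returns (forward, backward): whether some forward or backward extension
# of the kmer lands on the same canonical kmer, i.e. on the same graph node.
function self_neighbours(kmer::String)
    node = canonical(kmer)
    fw = any(canonical(string(kmer[2:end], b)) == node for b in "ACGT")
    bw = any(canonical(string(b, kmer[1:end-1])) == node for b in "ACGT")
    return (fw, bw)
end

println(self_neighbours("AAAA"))   # homopolymer: its own neighbour both ways
println(self_neighbours("CAT"))    # forward neighbour ATG is its reverse complement
println(self_neighbours("ACGA"))   # ordinary kmer
```

A construction routine could run a check like this up front and handle flagged kmers separately instead of only warning.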
Generating Contigs using UG and Reads
We would like to generate contigs using the unitigs on the dbg and the reads given as the initial input for the dbg construction. This step of the project is more challenging than the previous two steps: DBG construction and DBG-to-UG.
We should decide on the core functionalities required during the contig generation and start implementing them. Right now, the dbg constructor takes as input a set of kmers in their canonical form.
First, we need to generalize the dbg constructor by having a preprocessing step for the raw reads to be represented as a set of canonical kmers. The preprocessing step should enumerate all unique canonical kmers, from a set of reads.
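The preprocessing step could look roughly like the sketch below, which works on plain strings rather than the package's sequence types. The helper names are hypothetical, and skipping kmers that contain N is an assumption of this example rather than something decided in the issue.

```julia
# Complement of a base; anything other than A, C, G, T maps to N.
complement(c::Char) = c == 'A' ? 'T' : c == 'C' ? 'G' : c == 'G' ? 'C' : c == 'T' ? 'A' : 'N'
revcomp(s::String) = join(complement(c) for c in reverse(s))

# The canonical form is the lexicographically smaller of a kmer
# and its reverse complement.
canonical(s::String) = min(s, revcomp(s))

# Enumerate all unique canonical kmers found in a set of reads.
function canonical_kmers(reads::Vector{String}, k::Int)
    kmers = Set{String}()
    for r in reads
        for i in 1:(length(r) - k + 1)
            kmer = r[i:i+k-1]
            occursin('N', kmer) && continue   # assumption: drop ambiguous kmers
            push!(kmers, canonical(kmer))
        end
    end
    return sort!(collect(kmers))
end

reads = ["ACGTA", "TACGT"]
println(canonical_kmers(reads, 3))
```

The two reads here are reverse complements of each other, so they collapse to the same small set of canonical kmers, which is the behaviour the dbg constructor relies on.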
We should add some adversarial testing datasets to BioSequenceGraphs.
They may take up some space and so we'll have to think about how to do this with bigger datasets.
But for now, I propose adding a PhiX dataset: We can use the PhiX reference genome sequence. Use Pseudoseq.jl to generate paired-end reads. We will need to decide on a read length and average insert size. Once we have the read files we can include the reference and the reads, and use that data to test how our graph functions are working.
I think the is_forward_link function has a typo:
is_forward_link(l::SequenceGraphLink, n::NodeID) = source(l) == -n
I think '-n' should be replaced with 'n'.
I thought there might be a special reason behind this, but I could not figure it out.
I will push a new branch along with some other minor updates to the package.
I think it would be nice to include some error correction functionality before generating contigs. This would both enable us to work with real (error-containing) data and serve researchers who only want to do error correction. Below I list some of the error correction functions I am planning to implement to simplify the de Bruijn graph:
Trimming dead-end tips: We remove all tips. A tip is an edge from a node with multiple outgoing edges to a destination node that has no outgoing edges. Such nodes are treated as errors that occurred at the end of a read.
Popping bubbles: A bubble is two paths that diverge from a single node and then merge into another node. In such a case, one of the paths is removed from the graph. The removed path usually has low coverage (depth) and is treated as an error that occurred in the middle of a read.
Removing chimeric edges: These are edges that cross between two simple paths. Such edges usually have low coverage and are removed from the graph.
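As a starting point for the first item, tip trimming can be sketched on a toy adjacency map from node ID to outgoing neighbours. This is not the package's graph type; `trim_tips!` is a hypothetical name, and the rule that at least one non-tip successor must remain is an added assumption so that a node is never cut off entirely.

```julia
# out maps each node ID to the IDs reached by its outgoing edges.
# Removes edges u => v where u has several outgoing edges and v has none.
# Returns the list of removed edges.
function trim_tips!(out::Dict{Int,Vector{Int}})
    removed = Tuple{Int,Int}[]
    for u in sort(collect(keys(out)))
        length(out[u]) > 1 || continue
        tips = [v for v in out[u] if isempty(get(out, v, Int[]))]
        # Assumption: keep the edges if every successor is a dead end.
        (isempty(tips) || length(tips) == length(out[u])) && continue
        filter!(v -> !(v in tips), out[u])
        for v in tips
            push!(removed, (u, v))
        end
    end
    return removed
end

# Node 1 branches to 2 and 3; node 3 is a dead end, node 2 continues to 4.
graph = Dict(1 => [2, 3], 2 => [4], 3 => Int[], 4 => Int[])
removed = trim_tips!(graph)
println(removed)
```

A practical version would also weigh tip length and coverage before removing anything; bubbles and chimeric edges need coverage information that this toy structure does not carry.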
@re_str not defined error
I have cloned the repository to my local machine.
When including the package with include("BioSequenceGraphs.jl"), I get the following error:
LoadError: LoadError: UndefVarError: @re_str not defined
For now, I have just removed the line
include("GFA1/GFA1.jl")
from BioSequenceGraphs.jl to be able to run the package smoothly.