De-Novo-Assembly

A Python- and R-based implementation of de novo transcriptome assembly

The entire workflow is illustrated by the following image:

Figure: Workflow

First, the input chromosome string is split into randomly overlapping substrings of length K. We do not fix the coverage during this process, so the number of substrings generated in this step varies with the randomly generated starting indices.

The generated substrings are then fed into the component that constructs a De Bruijn graph. This module outputs a list of directed edges, each carrying the labels of its source and destination nodes. Eulerian walks over this edge list then generate one or more substrings, with the number of substrings depending on the number of components in the graph.

These substrings are then stitched together to produce the reconstructed string. During stitching, each substring is merged with the previous substring in the list at the position where its prefix best matches the suffix of the previous substring.

Depending on factors such as repetitions in the substrings and externally introduced errors, the reconstructed string differs from the original string to some degree. We count the number of mismatched base pairs between the original and reconstructed strings. To calculate the error percentage, we use the length of the shorter of the two strings. An alternative strategy would penalize the characters beyond the length of the shorter string when the reconstructed and original strings differ in length.

Lastly, to test the effect of read errors during sequencing, we introduce 1% error into the original string and analyze the string reconstructed from it.
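The steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: all function names and parameters (`sample_reads`, `de_bruijn_edges`, `eulerian_walk`, `stitch`, `error_percentage`, `inject_errors`, `read_len`, `k`, `rate`) are hypothetical, the read length and k-mer size are treated as separate parameters, and the Eulerian walk assumes a single connected component that admits an Eulerian path. A graph with several components would need one walk per component, followed by stitching.

```python
import random
from collections import defaultdict


def sample_reads(genome, read_len, n_reads, rng):
    """Draw reads at random start positions; coverage is not fixed."""
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def de_bruijn_edges(reads, k):
    """Return distinct directed edges (k-1 prefix -> k-1 suffix) of each k-mer."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return sorted(edges)


def eulerian_walk(edges):
    """Hierholzer's algorithm. Assumes a non-empty edge list forming one
    connected component that has an Eulerian path."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for src, dst in edges:
        graph[src].append(dst)
        balance[src] += 1
        balance[dst] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Spell the walk: first node label, then the last character of each next node.
    return path[0] + "".join(node[-1] for node in path[1:])


def stitch(prev, nxt):
    """Merge nxt onto prev at the longest suffix/prefix overlap."""
    for size in range(min(len(prev), len(nxt)), 0, -1):
        if prev.endswith(nxt[:size]):
            return prev + nxt[size:]
    return prev + nxt  # no overlap found: plain concatenation


def error_percentage(original, reconstructed):
    """Mismatch percentage over the length of the shorter string."""
    n = min(len(original), len(reconstructed))
    mismatches = sum(a != b for a, b in zip(original, reconstructed))
    return 100.0 * mismatches / n


def inject_errors(genome, rate, rng, alphabet="ACGT"):
    """Substitute each base with a different one with probability `rate`."""
    out = []
    for base in genome:
        if rng.random() < rate:
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    toy = "ACGTTGCA"  # toy input whose 3-mers are all distinct
    rebuilt = eulerian_walk(de_bruijn_edges([toy], k=4))
    print(rebuilt, error_percentage(toy, rebuilt))
```

In a fuller run, `sample_reads` would supply the reads (optionally drawn from the output of `inject_errors(genome, 0.01, rng)`), and the substrings produced per component would be merged pairwise with `stitch` before calling `error_percentage`.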

Results

  • The error rate is relatively low when the input sequence is long enough to disambiguate repetitions.

Figure: Error rates for read length of size 6400

Figure: Error rates for read length of size 400 (left), 800 (right)

Figure: Error rates for read length of size 1600 (left), 6400 (right)



  • Runtime performance worsens after error injection.

Figure: Instance of runtime performance with varying k-mer size before (blue) and after (red) error injection



  • Error injection led to a slight increase in the error rates, which nonetheless remained very highly correlated with the pre-injection rates.

Figure: Error performance with varying read length before (blue) and after error injection (red)
