minions's Introduction

SeqAn - The Library for Sequence Analysis

NOTE
SeqAn3 is out and hosted in a different repository.
We recommend using SeqAn3 for new applications.

What Is SeqAn?

SeqAn is an open source C++ library of efficient algorithms and data structures for the analysis of sequences with the focus on biological data. Our library applies a unique generic design that guarantees high performance, generality, extensibility, and integration with other libraries. SeqAn is easy to use and simplifies the development of new software tools with a minimal loss of performance.

License

The SeqAn library itself, the tests and demos are licensed under the very permissive 3-clause BSD License. The licenses for the applications themselves can be found in the LICENSE files.

Prerequisites

Older compiler versions might work but are neither supported nor tested.

Linux, macOS, FreeBSD

  • GCC ≥ 11
  • Clang/LLVM ≥ 15
  • Intel oneAPI C++ Compiler 2024.0.2 (IntelLLVM)

Windows

  • Visual C++ ≥ 17.0 / Visual Studio ≥ 2022

Architecture support

  • Intel/AMD platforms, including optimisations for modern instruction sets (POPCNT, SSE4, AVX2, AVX512)
  • All Debian release architectures supported, including most ARM and all PowerPC platforms.

Build system

  • To build tests, demos, and official SeqAn applications you also need CMake ≥ 3.12.

Some official applications might have additional requirements or only work on a subset of platforms.

minions's People

Contributors

hosseinem, mitradarja

minions's Issues

Get positions of subsequences

To analyze how well the subsequences cover a sequence, it is helpful to know their positions. There are two ways to get them:
a) For every function, write a second function that returns the positions instead of the subsequences.
b) Write one function that takes the sequence and the subsequences and aligns them to obtain the positions.

The drawback of b) is that it takes more time to compute, because the initial function has to be called first and an alignment is applied afterwards. Its advantage is that only one function needs to be implemented, which should then work with every method.

For now, I think b) is the easier option. Should it become necessary to get the positions more efficiently, a) can still be implemented.
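
Option b) could be sketched roughly as follows. This is a minimal illustration, not code from this repository: the function name `locate_subsequences` is hypothetical, an exact left-to-right search stands in for the alignment, and it assumes that the subsequences are contiguous and reported in order of strictly increasing start position.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of minions): locate each subsequence in the
// sequence by exact search. Assumes contiguous subsequences that are given in
// order of strictly increasing start position.
std::vector<std::size_t> locate_subsequences(std::string const & sequence,
                                             std::vector<std::string> const & subsequences)
{
    std::vector<std::size_t> positions;
    positions.reserve(subsequences.size());

    std::size_t search_from = 0;
    for (std::string const & sub : subsequences)
    {
        assert(!sub.empty());
        // std::string::npos marks a subsequence that could not be located.
        std::size_t const pos = sequence.find(sub, search_from);
        positions.push_back(pos);
        if (pos != std::string::npos)
            search_from = pos + 1; // the next subsequence must start further right
    }
    return positions;
}
```

Methods that produce non-contiguous subsequences would need a real alignment instead of the exact search used here.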

Including both strands in syncmer_hash using minimum hash

It seems that including the reverse strand by taking the minimum of the forward and reverse hash values is not feasible, because syncmers are chosen based on their s-mers. A syncmer is a k-mer that has its smallest s-mer at its start or end, so the s-mers must correspond to the k-mer they are taken from. Hence, the minimum hashes of s-mers and k-mers cannot be selected independently. It would be possible to select the minimum kmer_hash first and then check its corresponding s-mers, but this would be time-consuming and would need twice as many vectors, ranges, and iterators.
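
The alternative mentioned last (pick the smaller of the two strands first, then inspect the s-mers of that strand) might look like the sketch below. All names are hypothetical, lexicographic order stands in for the hash values, and ties between equal s-mers are resolved by taking the first occurrence; none of this is taken from the syncmer_hash implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Reverse complement of a DNA string; characters other than A, C, G, T are kept.
std::string reverse_complement(std::string_view seq)
{
    std::string rc(seq.rbegin(), seq.rend());
    for (char & c : rc)
    {
        switch (c)
        {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: break;
        }
    }
    return rc;
}

// True if the first occurrence of the smallest s-mer is at the start or end of the k-mer.
bool is_closed_syncmer(std::string_view kmer, std::size_t s)
{
    assert(s > 0 && s <= kmer.size());
    std::size_t const last = kmer.size() - s;
    std::size_t min_pos = 0;
    for (std::size_t i = 1; i <= last; ++i)
        if (kmer.substr(i, s) < kmer.substr(min_pos, s))
            min_pos = i;
    return min_pos == 0 || min_pos == last;
}

// Choose the smaller strand first, then test the s-mers of that strand only.
bool is_canonical_closed_syncmer(std::string_view kmer, std::size_t s)
{
    std::string const rc = reverse_complement(kmer);
    std::string_view const rc_view{rc};
    std::string_view const canonical = (rc_view < kmer) ? rc_view : kmer;
    return is_closed_syncmer(canonical, s);
}
```

Because the strand is chosen before the s-mers are examined, a k-mer and its reverse complement always get the same answer. The price is that the s-mers of the chosen strand have to be recomputed for every k-mer, which is the extra work described above.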

Add Raptor as a submodule

Add Raptor as a submodule to make the same comparison as in the Raptor paper.
Note: The search probably needs to be adapted to the different methods.

Gapped k-mers and syncmers

The syncmer implementation is not easily compatible with gapped k-mers, because we would need the actual shape that is used, unless the views of the s-mers could be adapted accordingly beforehand.

Syncmers with same minimum

It is not quite clear what to do if the minimum appears multiple times in a submer: which one is then the minimum, the first occurrence or all occurrences?
If I understand GetMinSubkmerPos in https://github.com/rcedgar/syncmer/blob/master/kmer.cpp correctly, only the first occurrence is treated as the minimum. Therefore, for an offset of 1 and k=4, s=2, the sequence AAAA would not be a syncmer.
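
The two interpretations can be compared with a small sketch. It is not the code from the linked repository: the names are hypothetical, s-mers are compared lexicographically instead of by hash, and the offset is assumed to be 0-based.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// How to treat several s-mers that share the minimal value.
enum class tie_policy
{
    first_occurrence, // only the leftmost minimal s-mer counts as the minimum
    any_occurrence    // every minimal s-mer counts as a minimum
};

// Hypothetical check: is the minimal s-mer of the k-mer located at `offset` (0-based)?
bool has_minimum_at_offset(std::string_view kmer, std::size_t s, std::size_t offset, tie_policy policy)
{
    assert(s > 0 && s <= kmer.size());
    std::size_t const last = kmer.size() - s;
    assert(offset <= last);

    std::string_view const at_offset = kmer.substr(offset, s);
    for (std::size_t i = 0; i <= last; ++i)
    {
        std::string_view const smer = kmer.substr(i, s);
        if (smer < at_offset)
            return false; // a strictly smaller s-mer exists elsewhere
        if (policy == tie_policy::first_occurrence && i < offset && smer == at_offset)
            return false; // an equal s-mer occurs earlier, so the first minimum is not at offset
    }
    return true;
}
```

With k=4, s=2 and offset 1, AAAA is rejected under `first_occurrence` (the first minimal s-mer is at position 0) and accepted under `any_occurrence`.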

How to compare

The methods are compared by their capability to represent the sequence data as well as their performance.

  • distribution
  • minimal, average and maximal gap between elements
  • speed
  • RAM usage
  • number of elements created (compression factor? number of elements divided by number of k-mers)
  • capability to find transcripts (True Positives, False Positives, True Negatives, False Negatives)
  • conservation (how many minimizers stay the same when the sequence is slightly mutated; conservation should not be based solely on the number of minimizers, but should count the number of nucleotides covered in order to account for overlapping minimizers, see syncmer paper)
  • distance on the mutated sequence compared to the non-mutated sequence
  • handling of repetitive and erroneous k-mers (simple cutoffs vs. none vs. weighting)
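
Two of the criteria above, the gap statistics and the compression factor, can be computed directly from the start positions of the selected elements. The sketch below is one possible reading, not an agreed definition: it assumes that a gap is the difference between consecutive start positions and that the number of k-mers is the sequence length minus k plus 1. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct gap_statistics
{
    std::size_t minimal;
    std::size_t maximal;
    double average;
};

// Gap = difference between the start positions of two consecutive elements (assumption).
gap_statistics compute_gap_statistics(std::vector<std::size_t> const & positions)
{
    assert(positions.size() >= 2);
    assert(std::is_sorted(positions.begin(), positions.end()));

    std::size_t const first_gap = positions[1] - positions[0];
    gap_statistics stats{first_gap, first_gap, 0.0};
    for (std::size_t i = 1; i < positions.size(); ++i)
    {
        std::size_t const gap = positions[i] - positions[i - 1];
        stats.minimal = std::min(stats.minimal, gap);
        stats.maximal = std::max(stats.maximal, gap);
    }
    // The sum of all gaps telescopes to (last position - first position).
    stats.average = static_cast<double>(positions.back() - positions.front())
                  / static_cast<double>(positions.size() - 1);
    return stats;
}

// Number of selected elements divided by the number of k-mers in the sequence.
double compression_factor(std::size_t number_of_elements, std::size_t sequence_length, std::size_t k)
{
    assert(k > 0 && k <= sequence_length);
    return static_cast<double>(number_of_elements)
         / static_cast<double>(sequence_length - k + 1);
}
```

For example, the positions 0, 3, 4, 10 give gaps of 3, 1 and 6, and four elements selected from a sequence of length 13 with k=4 give a compression factor of 4/10.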

In the strobemer paper, they use two simulated data sets and define:

  • the number of matches, where a match between sequence A and mutated sequence A' is an identical subsequence at position i

  • the positions covered by the subsequences

  • an island: a maximal interval of consecutive positions that are not covered

  • an analysis of how many submers are unique in a given genome and how similar these submers are to each other (edit distance)

They also check how many unique subsequences there are in the five largest human chromosomes to measure the precision of a method.
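
The covered positions and the islands defined above could be computed as in the following sketch. It is only an illustration under simplifying assumptions: every subsequence is taken to be contiguous with the same length k, so non-contiguous strobemers would need their own coverage rule. The names are hypothetical, and intervals are half-open.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open interval [begin, end) of sequence positions.
struct interval
{
    std::size_t begin;
    std::size_t end;
};

// Mark every position covered by a contiguous subsequence of length k.
std::vector<bool> covered_positions(std::size_t sequence_length,
                                    std::vector<std::size_t> const & starts,
                                    std::size_t k)
{
    std::vector<bool> covered(sequence_length, false);
    for (std::size_t const start : starts)
    {
        assert(start + k <= sequence_length);
        for (std::size_t i = start; i < start + k; ++i)
            covered[i] = true;
    }
    return covered;
}

// An island is a maximal run of consecutive positions that are not covered.
std::vector<interval> find_islands(std::vector<bool> const & covered)
{
    std::vector<interval> islands;
    std::size_t i = 0;
    while (i < covered.size())
    {
        if (covered[i])
        {
            ++i;
            continue;
        }
        std::size_t const begin = i;
        while (i < covered.size() && !covered[i])
            ++i;
        islands.push_back({begin, i});
    }
    return islands;
}
```

With a sequence of length 10, k=2 and start positions 1, 2 and 7, positions 1, 2, 3, 7 and 8 are covered, which leaves the islands [0, 1), [4, 7) and [9, 10).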
