seqan / minions

Comparison between different methods to simplify sequence data

License: BSD 3-Clause "New" or "Revised" License

CMake 2.00% C++ 71.40% Python 26.59%

k-mer minimizer set-membership

minions's Introduction

SeqAn - The Library for Sequence Analysis

NOTE: SeqAn3 is out and hosted in a different repository.
We recommend using SeqAn3 for new applications.

What Is SeqAn?

SeqAn is an open source C++ library of efficient algorithms and data structures for the analysis of sequences with the focus on biological data. Our library applies a unique generic design that guarantees high performance, generality, extensibility, and integration with other libraries. SeqAn is easy to use and simplifies the development of new software tools with a minimal loss of performance.

License

The SeqAn library itself, the tests and demos are licensed under the very permissive 3-clause BSD License. The licenses for the applications themselves can be found in the LICENSE files.

Prerequisites

Older compiler versions might work but are neither supported nor tested.

Linux, macOS, FreeBSD

GCC ≥ 11
Clang/LLVM ≥ 15
Intel oneAPI C++ Compiler 2024.0.2 (IntelLLVM)

Windows

Visual C++ ≥ 17.0 / Visual Studio ≥ 2022

Architecture support

Intel/AMD platforms, including optimisations for modern instruction sets (POPCNT, SSE4, AVX2, AVX512)
All Debian release architectures supported, including most ARM and all PowerPC platforms.

Build system

To build tests, demos, and official SeqAn applications you also need CMake ≥ 3.12.

Some official applications might have additional requirements or only work on a subset of platforms.

minions's People

Contributors

Stargazers

Watchers

Forkers

hosseinem matanatmammadli

minions's Issues

Get positions of subsequences

To analyze how the subsequences cover a sequence, it is helpful to know their positions. There are two ways to get them:
a) For every function, write a second function that returns the positions instead of the subsequences.
b) Write one function that takes the sequence and the subsequences and aligns them to obtain the positions.

The drawback of b) is that it takes more time to compute, because the initial function needs to be called first and an alignment is applied afterwards. Its advantage is that only one function has to be implemented, which should then work with every method.

For now, I think b) is the easier option. Should it become necessary to get the positions more efficiently, a) can still be implemented.
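
A minimal sketch of option b), using exact string search as a stand-in for the alignment step. The function name locate_subsequences is hypothetical and not part of minions, and the sketch assumes that the subsequences are given in the order in which they were emitted, with strictly increasing start positions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of minions): report, for each subsequence, the
// leftmost exact occurrence in `sequence` that starts after the previous hit.
// Assumption: subsequences are listed in emission order and their start
// positions are strictly increasing.
std::vector<std::size_t> locate_subsequences(std::string const & sequence,
                                             std::vector<std::string> const & subsequences)
{
    std::vector<std::size_t> positions;
    positions.reserve(subsequences.size());
    std::size_t search_from = 0;

    for (std::string const & sub : subsequences)
    {
        std::size_t const hit = sequence.find(sub, search_from);
        assert(hit != std::string::npos && "every subsequence must occur in the sequence");
        positions.push_back(hit);
        search_from = hit + 1; // the next subsequence may overlap, but starts later
    }
    return positions;
}
```

In repetitive sequences the leftmost match is not necessarily the position the subsequence was originally taken from, which is one more reason why a) may be needed later.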

Including both strands in syncmer_hash using minimum hash

It seems that including the reverse strand by selecting the minimum of the two hash values is not feasible, because syncmers are chosen based on their s-mers. A syncmer is a k-mer whose smallest s-mer is at its start or end, so the s-mers have to come from the same strand as the k-mer. Hence, the minimum hashes of s-mers and k-mers cannot be selected independently. It would be possible to select the minimum kmer_hash first and then check its corresponding s-mers, but this would be time-consuming and would need twice the number of vectors, ranges, and iterators.
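
The "select the k-mer first, then check its s-mers" variant could look roughly like the sketch below. It is only an illustration under assumptions: lexicographic order stands in for the hash values, the syncmer test is the closed variant described above with ties resolved towards the first occurrence, and all function names are hypothetical rather than taken from minions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Reverse complement of a DNA string; characters other than A, C, G, T are
// left unchanged.
std::string reverse_complement(std::string const & seq)
{
    std::string rc(seq.rbegin(), seq.rend());
    for (char & c : rc)
    {
        switch (c)
        {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: break;
        }
    }
    return rc;
}

// Closed-syncmer test: the smallest s-mer (first occurrence on ties) is the
// first or the last s-mer of the k-mer. Lexicographic order replaces the hash.
bool is_closed_syncmer(std::string const & kmer, std::size_t s)
{
    assert(s > 0 && s <= kmer.size());
    std::size_t const last = kmer.size() - s;
    std::size_t min_pos = 0;
    for (std::size_t pos = 1; pos <= last; ++pos)
        if (kmer.compare(pos, s, kmer, min_pos, s) < 0)
            min_pos = pos;
    return min_pos == 0 || min_pos == last;
}

// Pick the smaller of the k-mer and its reverse complement first, then
// evaluate the s-mers of that strand only.
bool is_canonical_syncmer(std::string const & kmer, std::size_t s)
{
    std::string const rc = reverse_complement(kmer);
    std::string const & canonical = std::min(kmer, rc);
    return is_closed_syncmer(canonical, s);
}
```

With this order of operations a k-mer and its reverse complement always get the same decision, at the cost of building both strands for every k-mer.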

Add Raptor as a submodule

Add Raptor as a submodule to make the same comparison as in the Raptor paper.
Note: The search probably needs to be adapted to the different methods.

Gapped k-mers and syncmers

The syncmer implementation is not easily compatible with gapped k-mers, because we would need the actual shape that is used. This could only work if the views of the s-mers were adapted accordingly beforehand.

Syncmers with same minimum

It is not quite clear what to do if the minimum appears multiple times among the s-mers of a k-mer. Which one is then the minimum: the first occurrence or all occurrences?
If I understand GetMinSubkmerPos in https://github.com/rcedgar/syncmer/blob/master/kmer.cpp correctly, only the first occurrence is treated as the minimum. Therefore, for an offset of 1 and k=4, s=2, the sequence AAAA would not be a syncmer.
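
The first-occurrence reading can be sketched as follows. This is not the code from the linked repository; it assumes a 0-based offset, uses lexicographic order in place of a hash, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Position of the smallest s-mer inside `kmer`. Only a strictly smaller s-mer
// replaces the current minimum, so ties resolve to the first occurrence.
// Assumption: lexicographic order stands in for the hash order.
std::size_t first_min_smer_pos(std::string_view kmer, std::size_t s)
{
    assert(s > 0 && s <= kmer.size());
    std::size_t min_pos = 0;
    for (std::size_t pos = 1; pos + s <= kmer.size(); ++pos)
        if (kmer.substr(pos, s) < kmer.substr(min_pos, s))
            min_pos = pos;
    return min_pos;
}

// Syncmer test with a 0-based offset (assumption): the k-mer is selected if
// the first occurrence of its smallest s-mer starts exactly at `offset`.
bool is_syncmer(std::string_view kmer, std::size_t s, std::size_t offset)
{
    return first_min_smer_pos(kmer, s) == offset;
}
```

Under these assumptions is_syncmer("AAAA", 2, 1) is false, because the first occurrence of the minimum AA is at position 0, while is_syncmer("AAAA", 2, 0) is true.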

Add the following methods

syncmers
strobemers (they have their own library; can they simply be added through it?)
weighted minimizers

Finding a good and representative set of k-mers

Minstrobe_hash does not work with iterator test

The error relates to debug_stream, which cannot print the range although it should be able to.

How to compare

The methods are compared by their capability to represent the sequence data as well as their performance.

distribution
minimal, average and maximal gap between elements
speed
RAM usage
number of elements created (compression factor? number of elements divided by number of k-mers)
capability to find transcripts (True Positives, False Positives, True Negatives, False Negatives)
conservation (how many minimizers stay the same when the sequence is slightly mutated; conservation should not be based solely on the number of minimizers, but on the number of nucleotides covered, in order to account for overlapping minimizers, see the syncmer paper)
distance on the mutated sequence compared to the unmutated sequence
handling of repetitive and erroneous k-mers (simple cutoffs vs. none vs. weighting)
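
Two of the criteria above, the gap statistics and the compression factor, could be computed as in the sketch below. It assumes that a gap is the distance between the start positions of two consecutive selected elements and that the compression factor is the number of selected elements divided by the number of k-mers in the sequence; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct gap_stats
{
    std::size_t min_gap;
    double avg_gap;
    std::size_t max_gap;
};

// Gap statistics over sorted start positions of the selected elements.
// Assumption: a gap is the difference between consecutive start positions.
gap_stats gaps_between(std::vector<std::size_t> const & positions)
{
    assert(positions.size() >= 2);
    assert(std::is_sorted(positions.begin(), positions.end()));

    std::size_t const first_gap = positions[1] - positions[0];
    gap_stats stats{first_gap, 0.0, first_gap};
    std::size_t total = 0;

    for (std::size_t i = 1; i < positions.size(); ++i)
    {
        std::size_t const gap = positions[i] - positions[i - 1];
        stats.min_gap = std::min(stats.min_gap, gap);
        stats.max_gap = std::max(stats.max_gap, gap);
        total += gap;
    }
    stats.avg_gap = static_cast<double>(total) / static_cast<double>(positions.size() - 1);
    return stats;
}

// Number of selected elements divided by the number of k-mers in a sequence
// of the given length (sequence_length - k + 1).
double compression_factor(std::size_t selected, std::size_t sequence_length, std::size_t k)
{
    assert(k > 0 && sequence_length >= k);
    return static_cast<double>(selected) / static_cast<double>(sequence_length - k + 1);
}
```

For start positions 0, 3, 4, 10 the gaps are 3, 1 and 6; with a sequence of length 13 and k=4 there are 10 k-mers, so four selected elements give a factor of 0.4.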

In the strobemer paper, they use two simulated data sets and define:

the number of matches, where a match between sequence A and mutated sequence A' is an identical subsequence at position i
the positions covered by the subsequences
an island: a maximal interval of consecutive positions that are not covered
an analysis of how many submers are unique in a given genome and how similar these submers are to each other (edit distance)
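
Covered positions and islands as defined above can be sketched like this. The sketch assumes contiguous subsequences of one fixed length, which does not hold for every method (strobemers, for example, consist of several strobes), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Mark every position that is covered by at least one subsequence.
// Assumption: all subsequences are contiguous and have the same length.
std::vector<bool> covered_positions(std::size_t sequence_length,
                                    std::vector<std::size_t> const & starts,
                                    std::size_t length)
{
    std::vector<bool> covered(sequence_length, false);
    for (std::size_t const start : starts)
    {
        assert(start + length <= sequence_length);
        for (std::size_t i = start; i < start + length; ++i)
            covered[i] = true;
    }
    return covered;
}

// Islands: maximal runs of uncovered positions, reported as half-open
// intervals [begin, end).
std::vector<std::pair<std::size_t, std::size_t>> islands(std::vector<bool> const & covered)
{
    std::vector<std::pair<std::size_t, std::size_t>> result;
    std::size_t i = 0;
    while (i < covered.size())
    {
        if (covered[i])
        {
            ++i;
            continue;
        }
        std::size_t const begin = i;
        while (i < covered.size() && !covered[i])
            ++i;
        result.emplace_back(begin, i);
    }
    return result;
}
```

For a sequence of length 10 with subsequences of length 3 starting at positions 1 and 6, positions 1 to 3 and 6 to 8 are covered, which leaves the islands [0, 1), [4, 6) and [9, 10).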

They also check how many unique subsequences there are in the five largest human chromosomes to measure the precision of a method.
