Git Product home page Git Product logo

tetrex's Introduction

Search for Regular Expressions in large datasets

Despite the efficiency of modern day tools for Regular Expression search, their runtime is often dominated by the size of the text. We present TetRex, a novel algorithm for regular expression matching that leverages the (Hierarchical) Interleaved Bloom Filter as an index. Regular Expressions are given as input in the command line and support the following operations:

  1. | - Or
  2. * - Zero or more repetitions
  3. + - One or More repetitions
  4. ? - Optional Character

Installation

  1. Clone the repository with git clone --recurse-submodules [email protected]:remyschwab/TetRex.git
  2. Descend into the home directory and input: mkdir build && cd build
  3. Configure with cmake cmake -DCMAKE_CXX_COMPILER=/path/to/g++-11 ..
  4. Build with make make

Usage

Tetrix offers two main commands [index & query] and one utility command [inspect]:

## Construct an HIBF index of nucleic acids, with kmer size = 3, 3 hash functions, & a FPR of 0.05, each input file represents a bin
tetrex index -m na -k 3 -o test data/dna_example_split/*.fa
## Query RegEx
tetrex query test.ibf "A(C+|G+)T" 
#TetRex/data/dna_example_split/sequence1.fa >Sequence1      ACT
#TetRex/data/dna_example_split/sequence1.fa >Sequence1      ACT
#TetRex/data/dna_example_split/sequence1.fa >Sequence1      ACT
#TetRex/data/dna_example_split/sequence2.fa >Sequence2      ACT
#TetRex/data/dna_example_split/sequence2.fa >Sequence2      AGT
#TetRex/data/dna_example_split/sequence4.fa >Sequence4      ACCCT
## Report some info about a constructed index
tetrex inspect test.ibf
# Reading Index from Disk... DONE in 3.1e-05s
# INDEX TYPE: HIBF
# FALSE POSITIVE RATE: 0.05
# HASH COUNT (hash functions): 3
# KMER LENGTH (bases): 3
# MOLECULE TYPE (alphabet): Nucleic Acid [REDUCTION=NONE]
# ACID LIBRARY (filepaths):
#         - /Users/rschwab/repos/TetRex/./data/dna_example_split/sequence1.fa
#         - /Users/rschwab/repos/TetRex/./data/dna_example_split/sequence2.fa
#         - /Users/rschwab/repos/TetRex/./data/dna_example_split/sequence3.fa
#         - /Users/rschwab/repos/TetRex/./data/dna_example_split/sequence4.fa
#         - /Users/rschwab/repos/TetRex/./data/dna_example_split/sequence5.fa
# DONE

Notes

This app was generated from the SeqAn App Template and makes heavy use of the SeqAn library.

tetrex's People

Contributors

remyschwab avatar vmusch avatar sgssgene avatar

Watchers

 avatar  avatar

tetrex's Issues

Stop using SeqAn alphabets

Many subroutines spend too much team converting between SeqAn3 alphabets and std::string... Just use std::string

This will require encodings for dna, reduced amino acid alphabets, and expanded amino acid alphabets into uint64_t so that it can be hashed and indexed. Maybe something like this

Multiple Queries, One Command

Imagine a case where one peptide should have multiple small distal motifs. Need to intersect which peptides contain all of the queries.

Wildcards

Can we use auxiliary BFs to avoid exponential number of kmers? Can reduced amino acid alphabets help?

Report coordinates of regex match

In order to be consistent with info reported by prosite we should at least report start coordinates of a regex match. I'm sure this is possible with re2 but I just haven't figured it out yet

This may slightly increase runtime!

IBF FPR

As of now, the IBF is constructed by giving the bin size of a Bloom Filter whereas for the HIBF the user inputs the FPR... Is this ok?

Plasmid Search

Plasmids and bacterial genomes can be circular. We should enable a flag during indexing that will treat fasta sequences as circular.

Multithreading

Paths through the kNFA should be easy to query in parallel. Same for verification of a bin!

Profile word search

Generate all kmers that reach a threshold from a profile and search for them

Internal Query Preprocessing

After the user inputs a query, we should internally split it between what is used to generate a kNFA and what is used during verification by RE2.

Some motifs contain anchors like < and > ie. ^ and $ for motifs that occur at the N and C terminus. This can be ignored by the IBF but not RE2.

Also the entire regex should be in a capture group (REGEX) so that the match can be reported as a whole

Add new residue to kmer on push instead of on pop?

Right now, a new amino acid or nucleotide is added to the kmer when a CollectionItem is popped from the queue. I think this is causing absorption to happen slightly later than necessary. Maybe it would be smarter to shift the kmer over before pushing? And extract the subhash then? Idk....

Match Forward and Reverse Strand during DNA verification

The kmer filtering step works on canonical kmers but the verification step doesn't check both strands for a match. So, right now, it's possible to get a hit and not find it during verification! What's the best way to match 2 regex at the same time with RE2?

Probably this

Warnings and Checks

Like if someone writes a query that's too short etc... uses a wrong character idk

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.