Git Product home page Git Product logo

match's Introduction

match

Rabin–Karp algorithm in C++ to find recurring substrings in a string and use them to construct an edit list.

Overview

This is a single header C++ implementation of the Rabin–Karp algorithm to find recurring substrings in a string. This code finds self-matching substrings using a rolling hash known as a Rabin–Karp signature, which is a hash sum product, composed and recorded in a hashtable incrementally using multiple string length combinations offset from an increasing mark. These signatures are then looked up in the hashtable at each step to find matching substrings.

Rabin–Karp algorithm illustration

The algorithm records offsets of substring occurances in a hash table ('head') indexed by the Rabin–Karp signature and uses the hash table information to compose a chain of occurances in a second table ('prev'), indexed by offset, which leads from the current match to the last match. Each position in the previous match table contains an offset to the last match for the same signature recorded at a prior offset.

The search algorithm, after updating the hash table and previous match table, checks for a hit, and if there is a hit, it follows the chains of past offsets leading through all prior occurances of a particular Rabin–Karp signature, all the way back to the earliest match. The algorithm then verifies these approximate matches, and if valid, creates copy instructions for the best match, or, alternatively emits literal instructions for new text. The matches are approximate because there is a collision probability, however, there are multiple offsets that can be used to match a prior occurance, which aids with the probabalistic nature of the algorithm.

The output is a list of instructions referencing new data or copies of previous data.

  • Vector<Match>
    • Type type (* Literal or Copy *)
    • Size offset
    • Size length

Example

The following is an example invocation of the included test program:

Test command:

$ ./build/match -v -t TGGGCGTGCGCTTGAAAAGAGCCTAAGAAGAGGGGGCGTCTGGAAGGAACCGCAACGCCAAGGGAGGGTG

Expected output:

OriginalText: TGGGCGTGCGCTTGAAAAGAGCCTAAGAAGAGGGGGCGTCTGGAAGGAACCGCAACGCCAAGGGAGGGTG
[  0] : Literal [   0,  7 )   # "TGGGCGT"
[  1] :    Copy [   4,  3 )   # "GCG"
[  2] : Literal [   0, 14 )   # "CTTGAAAAGAGCCT"
[  3] :    Copy [   8,  4 )   # "AAGA"
[  4] :    Copy [  11,  4 )   # "AGAG"
[  5] :    Copy [  31,  3 )   # "GGG"
[  6] :    Copy [  32,  4 )   # "GCGT"
[  7] : Literal [   0,  1 )   # "C"
[  8] :    Copy [  40,  3 )   # "TGG"
[  9] :    Copy [  27,  3 )   # "AAG"
[ 10] :    Copy [  33,  3 )   # "GAA"
[ 11] : Literal [   0,  6 )   # "CCGCAA"
[ 12] :    Copy [   5,  3 )   # "CGC"
[ 13] :    Copy [   6,  3 )   # "CAA"
[ 14] :    Copy [  60,  3 )   # "GGG"
[ 15] : Literal [   0,  1 )   # "A"
[ 16] :    Copy [  64,  3 )   # "GGG"
[ 17] : Literal [   0,  2 )   # "TG"
DataSize/Literals/Copies: 70/31/39
OuterIterations/InnerIterations: 1018/750

match's People

Contributors

michaeljclark avatar

Watchers

 avatar James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.