Git Product home page Git Product logo

stringdistances.jl's Introduction

Build status

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

String distances act over any pair of iterators that define length (e.g. AbstractStrings, GraphemeIterators, or AbstractVectors)

The available distances are:

Syntax

Following the Distances.jl package, string distances can inherit from two abstract types: StringSemiMetric <: SemiMetric or StringMetric <: Metric.

Computing the distance between two strings (or iterators)

You can always compute a certain distance between two strings using the following syntax

r = evaluate(dist, x, y)
r = dist(x, y)

Here, dist is an instance of a distance type: for example, the type for the Levenshtein distance is Levenshtein. You can compute the Levenshtein distance between x and y as

r = evaluate(Levenshtein(), x, y)
r = Levenshtein()(x, y)

The function compare returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar.

Levenshtein()("martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0

Computing the distance between two AbstractVectors of strings (or iterators)

Consider X and Y two AbstractVectors of iterators. You can compute the matrix of distances across elements, dist(X[i], Y[j]), as follows:

pairwise(dist, X, Y)

For instance,

pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])

pairwise is optimized in various ways (e.g., for the case of QGram-distances, dictionary of qgrams are pre-computed)

Find closest string

The package also adds convenience functions to find elements in a iterator of strings closest to a given string

  • findnearest returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:

     findnearest(s, itr, dist)
  • findall returns the indices of all elements in itr with a similarity score with s higher than a minimum score. Its syntax is:

     findall(s, itr, dist; min_score = 0.8)

The functions findnearest and findall are particularly optimized for the Levenshtein and OptimalStringAlignment distances, as these algorithm can stop early if the distance becomes higher than a certain threshold.

fuzzywuzzy

The package also defines Distance "modifiers" that are inspired by the Python package - fuzzywuzzy. These modifiers are particularly helpful to match strings composed of multiple words (e.g. addresses, company names).

  • Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
  • TokenSort adjusts for differences in word orders by returning the distance of the two strings, after re-ordering words alphabetically.
  • TokenSet adjusts for differences in word orders and word numbers by returning the distance between the intersection of two strings with each string.
  • TokenMax normalizes the distance, and combine the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy
Levenshtein()("this string", "this string is longer") = 10
Partial(Levenshtein())("this string", "this string is longer") = 0

Notes

  • All string distances are case sensitive.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.