
strsim-rs's Introduction

strsim-rs


Rust implementations of string similarity metrics:

  • Hamming
  • Levenshtein (plus a normalized version)
  • Optimal string alignment (OSA)
  • Damerau-Levenshtein (plus a normalized version)
  • Jaro and Jaro-Winkler
  • Sørensen-Dice

The normalized versions return values between 0.0 and 1.0, where 1.0 means an exact match.

There are also generic versions of the functions for non-string inputs.

Installation

strsim is available on crates.io. Add it to your project:

cargo add strsim

Usage

Go to Docs.rs for the full documentation. You can also clone the repo, and run $ cargo doc --open.

Examples

extern crate strsim;

use strsim::{hamming, levenshtein, normalized_levenshtein, osa_distance,
             damerau_levenshtein, normalized_damerau_levenshtein, jaro,
             jaro_winkler, sorensen_dice};

fn main() {
    match hamming("hamming", "hammers") {
        Ok(distance) => assert_eq!(3, distance),
        Err(why) => panic!("{:?}", why)
    }

    assert_eq!(levenshtein("kitten", "sitting"), 3);

    assert!((normalized_levenshtein("kitten", "sitting") - 0.571).abs() < 0.001);

    assert_eq!(osa_distance("ac", "cba"), 3);

    assert_eq!(damerau_levenshtein("ac", "cba"), 2);

    assert!((normalized_damerau_levenshtein("levenshtein", "löwenbräu") - 0.272).abs() <
            0.001);

    assert!((jaro("Friedrich Nietzsche", "Jean-Paul Sartre") - 0.392).abs() <
            0.001);

    assert!((jaro_winkler("cheeseburger", "cheese fries") - 0.911).abs() <
            0.001);

    assert_eq!(sorensen_dice("web applications", "applications of the web"),
        0.7878787878787878);
}

Using the generic versions of the functions:

extern crate strsim;

use strsim::generic_levenshtein;

fn main() {
    assert_eq!(2, generic_levenshtein(&[1, 2, 3], &[0, 2, 5]));
}

Contributing

If you don't want to install Rust itself, you can run $ ./dev for a development CLI if you have Docker installed.

Benchmarks require a Nightly toolchain. Run $ cargo +nightly bench.

License

MIT

strsim-rs's People

Contributors

ameobea, dguo, felixonmars, gentoid, gnomeddev, ibuzpe9, ignatenkobrain, lengyijun, llogiq, lovasoa, maxbachmann, mgeisler, nwagner84, ov700, robjtede, shnatsel, sstangl, wanzenbug, wdv4758h


strsim-rs's Issues

Consider following crates.io's own semver flavour

I see in the changelog file that this crate attempts to follow semver.

AFAIK, 0.10.0 was released with no breaking changes, only a new feature.

Filing this only in case you are not aware of a couple of facts.

Semantic Versioning itself says that 0.y.z has no rules at all, and that you should not expect any 0.y.z version to be compatible with another 0.y.z version.

On the other hand, the crates.io ecosystem and cargo both treat the leftmost non-zero number as the major version. So 0.1.1 is compatible with 0.1.0, but 0.2.0 is not compatible with 0.1.1.

So for example, AFAIK 0.10.0 should have been 0.9.4, if the changelog file is correct.
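Cargo's default caret requirements make this concrete. A sketch of what a downstream Cargo.toml implies (the version numbers here are illustrative, not a recommendation):

```toml
[dependencies]
# Cargo treats the leftmost non-zero number as "major", so this requirement
# accepts 0.9.x upgrades (e.g. a hypothetical 0.9.4) but never pulls in 0.10.0.
strsim = "0.9.3"
```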

Please don't include the generated docs in the repository

strsim includes a complete copy of the generated documentation in the repository, and thus in the crate source. Please consider dropping that copy in favor of letting cargo build it (cargo doc) or using docs.rs.

Having a generated copy of the documentation in the source makes distribution packaging more difficult, as doing so embeds various files that distributions prefer to maintain a single copy of (such as the various javascript files, css files, and fonts).

Project Status

Hello! Is this crate still maintained? I haven't seen any updates since 2021.

examples in README fail assertion

Not sure whether this was intentional but two assertions fail in the README example. These thresholds, however, work:

    assert!((jaro("Friedrich Nietzsche", "Jean-Paul Sartre") - 0.392).abs() < 0.00012);
    assert!((jaro_winkler("cheeseburger", "cheese fries") - 0.911).abs() < 0.00012);

Performance of Damerau Levenshtein is low

I am unsure as to the cause of this. I know the DL algorithm is significantly more complex than the OSA algorithm but some crude benchmarks show the results below when simply comparing "alcohol" and "acloholism".

I just wanted to post this to ensure the algorithm has the correct implementation given this package's large usage rate.

Times below are average benchmarks in nanoseconds.

(Screenshot of benchmark results not reproduced here.)

Incremental calculation of Levenshtein distance

I have some code for calculating the Levenshtein distance between a given string and a second string given character-by-character by reusing previous calculations, effectively generalizing strsim::levenshtein. Is this of interest for this project?
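The idea can be sketched like this (a hypothetical API, not part of strsim): keep the last row of the standard dynamic-programming table against a fixed target string, and extend it one typed character at a time, reusing all previous work.

```rust
// Hypothetical sketch: maintain the last DP row against a fixed `target`
// and extend the computation one typed character at a time.
struct IncrementalLev {
    target: Vec<char>,
    row: Vec<usize>, // row[j] = levenshtein(typed_so_far, target[..j])
    typed_len: usize,
}

impl IncrementalLev {
    fn new(target: &str) -> Self {
        let target: Vec<char> = target.chars().collect();
        let row: Vec<usize> = (0..=target.len()).collect();
        IncrementalLev { target, row, typed_len: 0 }
    }

    // Append one character and return the updated distance in O(|target|).
    fn push(&mut self, c: char) -> usize {
        self.typed_len += 1;
        let mut prev_diag = self.row[0];
        self.row[0] = self.typed_len;
        for j in 1..self.row.len() {
            let cost = if self.target[j - 1] == c { 0 } else { 1 };
            let val = (self.row[j] + 1)   // deletion
                .min(self.row[j - 1] + 1) // insertion
                .min(prev_diag + cost);   // substitution or match
            prev_diag = self.row[j];
            self.row[j] = val;
        }
        *self.row.last().unwrap()
    }
}

fn main() {
    let mut lev = IncrementalLev::new("kitten");
    let mut d = 0;
    for c in "sitting".chars() {
        d = lev.push(c);
    }
    assert_eq!(d, 3); // same result as strsim::levenshtein("kitten", "sitting")
}
```

Each push costs O(|target|) instead of recomputing the full O(|typed| * |target|) table from scratch.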

BUG: Jaro-Winkler implementation is wrong for matching prefixes longer than 4

It seems that the algorithm implemented here and called "Jaro-Winkler" is in fact not the usual Jaro-Winkler metric, due either to a deliberate design decision or to a misunderstanding of the algorithm's original intent.

The original Jaro-Winkler limits the prefix to <=4 - for a good reason: otherwise strings with identical common prefixes longer than 10 go above 1 (if one uses 0.1 for p).

The implementation here uses an arbitrary prefix length and instead hard-clips similarities to 1 when they are above.

Here is wikipedia on this topic: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance#Jaro%E2%80%93Winkler_similarity

This causes issues like the one where AlignmentScorr is supposedly equal in distance to AlignmentScore and AlignmentStart, which it obviously isn't; see clap-rs/clap#4660.

I feel that an algorithm should be implemented the way it was intended by the author. It could be called Jaro-Winkler-Guo after the author of this crate @dguo but there should definitely be a normal Jaro-Winkler without modification.
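For reference, the standard Winkler boost caps the shared prefix at 4 characters, which (for p = 0.1) keeps the result bounded by 1.0 without any clipping. An illustrative sketch, not strsim's actual code:

```rust
// Illustrative sketch of the standard Winkler boost on top of a
// precomputed Jaro similarity. Not strsim's implementation.
fn winkler_boost(jaro: f64, s1: &str, s2: &str) -> f64 {
    let p = 0.1; // standard scaling factor
    let prefix = s1.chars()
        .zip(s2.chars())
        .take(4) // the cap the original definition requires
        .take_while(|(a, b)| a == b)
        .count() as f64;
    // Since prefix * p <= 0.4, the result can never exceed 1.0.
    jaro + prefix * p * (1.0 - jaro)
}

fn main() {
    // With the cap, even a 10-character shared prefix contributes at most 4 * 0.1:
    let s = winkler_boost(0.5, "alignmentscore", "alignmentstart");
    assert!((s - 0.7).abs() < 1e-12); // 0.5 + 4 * 0.1 * (1.0 - 0.5)
    // An exact match stays exactly 1.0:
    assert!((winkler_boost(1.0, "abc", "abc") - 1.0).abs() < 1e-12);
}
```

Without the cap, a shared prefix of length 10 with p = 0.1 would push the score to 1.0 or beyond for non-identical strings, which is exactly the problem the issue describes.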

Unfortunately this crate, despite being very heavily used, seems to be unmaintained. It would be nice if it could be maintained again, given that it's relied on by many important crates in the Rust ecosystem. Maybe @dguo can start a call for a co-maintainer?

please reconsider the addition of ndarray and hashbrown as dependencies

I left a comment here on the commit in which these deps were introduced, but I'm not sure if it was seen: d6717db#r33186236

So I'd like to open an issue to increase its visibility. In particular, it would be great if strsim could remain fairly lightweight since it is depended on by a bunch of crates via clap. I'm hoping that we, as an ecosystem, can be a bit more judicious in adding dependencies to crates.

For hashbrown, which is awesome, it is going to be included in the standard library very soon. Folks upgrading their Rust version will automatically get this optimization soon enough. I don't think it's worth adding it as an explicit dependency for a small gain for users on older versions of Rust, because it impacts compilation times for everyone.

For ndarray, it looks like it was just used for some small convenience. I don't think it's worth bringing ndarray in (as excellent as it is) along with all of its dependencies just for that.

evaluate inclusive ranges

In a couple of places, ranges are written like 0..(end + 1), which can be expressed as 0..=end. This is generally more readable. However, I noticed this can lead to larger code gen, since 0..=end has to perform a saturating add, while 0..(end + 1) can use a wrapping add.

This can negatively affect code size. E.g. when updating it in generic_levenshtein_distance I get the following binary sizes:
With 0..(end + 1)

File  .text     Size Crate
5.6%  97.7% 255.5KiB std
0.0%   0.4%     977B strsim
0.0%   0.0%     119B strsim_test
0.0%   0.0%     102B [Unknown]
5.8% 100.0% 261.4KiB .text section size, the file size is 4.4MiB

With 0..=end

File  .text     Size Crate
5.6%  97.5% 255.5KiB std
0.0%   0.6%   1.7KiB strsim
0.0%   0.0%     119B strsim_test
0.0%   0.0%     102B [Unknown]
5.8% 100.0% 262.1KiB .text section size, the file size is 4.4MiB

This should be evaluated on a case-by-case basis, and we should probably add a comment for cases where we keep the old syntax, so it is not rewritten in a future cleanup without rechecking the code gen.
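Both spellings iterate exactly the same values; the difference is only in the generated code, because 0..=end must handle end == u32::MAX without overflowing, while 0..(end + 1) simply adds first. A quick sanity check:

```rust
fn main() {
    let end = 5u32;
    // Exclusive form: the add may overflow when end == u32::MAX.
    let a: u32 = (0..(end + 1)).sum();
    // Inclusive form: no overflow at the boundary, at the cost of extra codegen.
    let b: u32 = (0..=end).sum();
    assert_eq!(a, b);
    assert_eq!(a, 15); // 0 + 1 + ... + 5
}
```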

Jaro Winkler transpositions handled incorrectly.

When comparing two strings that have a lot of transposed characters, the score comes out slightly wrong.

For example: the strings "a jke" and "jane a k" produce a result of 0.6833, which corresponds to 4 matches and 1 transposition. However, there are actually 2 transpositions, because the number of out-of-sequence characters is 4: "a", " ", "j", "e" are all out of order (and transpositions = 4 / 2). The correct score should come out to be 0.6.
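A reference implementation following the textbook definition (a self-contained sketch for checking, not strsim's code) reproduces the 0.6 the report expects, counting transpositions as half the matched positions whose characters disagree:

```rust
// Reference Jaro similarity, per the standard definition.
fn jaro_ref(s1: &str, s2: &str) -> f64 {
    let a: Vec<char> = s1.chars().collect();
    let b: Vec<char> = s2.chars().collect();
    if a.is_empty() && b.is_empty() { return 1.0; }
    if a.is_empty() || b.is_empty() { return 0.0; }
    // Characters match if equal and within the standard window.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut b_used = vec![false; b.len()];
    let mut a_matches: Vec<char> = Vec::new();
    for (i, &ca) in a.iter().enumerate() {
        let lo = i.saturating_sub(window);
        let hi = (i + window).min(b.len() - 1);
        for j in lo..=hi {
            if !b_used[j] && b[j] == ca {
                b_used[j] = true;
                a_matches.push(ca);
                break;
            }
        }
    }
    if a_matches.is_empty() { return 0.0; }
    let b_matches: Vec<char> = (0..b.len()).filter(|&j| b_used[j]).map(|j| b[j]).collect();
    // Transpositions: half the matched positions whose characters disagree.
    let t = a_matches.iter().zip(b_matches.iter()).filter(|(x, y)| x != y).count() / 2;
    let m = a_matches.len() as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t as f64) / m) / 3.0
}

fn main() {
    // 4 matches; matched sequences "a je " vs "jae " disagree in 4 places,
    // giving 2 transpositions and (4/5 + 4/8 + 2/4) / 3 = 0.6.
    assert!((jaro_ref("a jke", "jane a k") - 0.6).abs() < 1e-9);
    assert!((jaro_ref("abc", "abc") - 1.0).abs() < 1e-12);
}
```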

please add a pinned Rust version to CI

The latest release of strsim (0.8) increased its minimum required Rust version to build. There is no problem with that, but in doing so, I'd now like to increase my minimum Rust version in my CI config to accommodate the strsim update. In order to do that, I need to know what Rust version strsim compiles with. In the ecosystem, one convention used to communicate this is to add a specific Rust version in the CI config. Here is an example: https://github.com/BurntSushi/byteorder/blob/bdcc6bf676a1ed17eae68257bfa4726a1f0ec068/.travis.yml#L4

Could you please add a pinned version to your CI config? (Note that I am not asking for any specific semver policy around the minimum Rust version supported, but rather, just a somewhat more structured means of communicating what version of Rust strsim compiles with.)

Add a `levenshtein_limit` function

A lot of times when checking whether two strings are similar, they wind up being less similar than you actually care about. For these cases it can be a significant performance boost to give up calculating the Levenshtein distance once it passes a specific limit (rather than calculating the complete distance and then clamping).

Would it be possible to add this feature to this crate?

I have a significantly less popular crate with these, I would be okay with the implementation being stolen :) https://docs.rs/stringmetrics/latest/stringmetrics/#limited-levenshtein-algorithm / https://docs.rs/stringmetrics/latest/stringmetrics/fn.try_levenshtein.html. (A missed optimization here is that you don't actually need the full-length vector if a limit is provided. Could actually be a const generic).
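A minimal sketch of the idea (a hypothetical function loosely following the Option-returning shape of try_levenshtein, not strsim code): run the usual single-row DP, but bail out as soon as every entry in the current row exceeds the limit, since row minima never decrease and the final distance can only be larger.

```rust
// Hypothetical limited Levenshtein: returns None once the distance is
// guaranteed to exceed `limit`, avoiding the remaining DP work.
fn levenshtein_limit(a: &str, b: &str, limit: usize) -> Option<usize> {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut row: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut prev_diag = row[0];
        row[0] = i + 1;
        let mut row_min = row[0];
        for j in 1..=b.len() {
            let cost = if b[j - 1] == ca { 0 } else { 1 };
            let val = (row[j] + 1).min(row[j - 1] + 1).min(prev_diag + cost);
            prev_diag = row[j];
            row[j] = val;
            row_min = row_min.min(val);
        }
        // Row minima are nondecreasing, so once the whole row exceeds
        // the limit the final distance must as well.
        if row_min > limit {
            return None;
        }
    }
    let d = row[b.len()];
    if d <= limit { Some(d) } else { None }
}

fn main() {
    assert_eq!(levenshtein_limit("kitten", "sitting", 3), Some(3));
    assert_eq!(levenshtein_limit("kitten", "sitting", 2), None);
    assert_eq!(levenshtein_limit("abc", "abc", 0), Some(0));
}
```

As the issue notes, a further optimization (not shown here) is shrinking the row to a band of width 2 * limit + 1 around the diagonal.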

Document Algorithms and Functions better

I am having a hard time understanding what all those functions of this library are used for and which one I should use.

It would be good to have a broad explanation of each algorithm, its uses, and maybe a link to some resources that explain it in further detail. Maybe also list the advantages and disadvantages?

You could also include some real-world examples.

For example, how could I improve this simple search function with strsim, to not only find exact matches, but also very similar matches?

fn search<'a>(list: &[&'a str], value: &str) -> Option<&'a str> {
    for &x in list {
        if x == value {
            return Some(x);
        }
    }
    None
}

Maybe this could be useful https://medium.com/@appaloosastore/string-similarity-algorithms-compared-3f7b4d12f0ff
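One shape such a documented example could take (everything here is a made-up illustration: `fuzzy_search`, the cutoff of 2, and the inline `lev` helper, which stands in for strsim::levenshtein so the sketch is self-contained): rank candidates by edit distance and accept the closest one under a cutoff.

```rust
// Stand-in for strsim::levenshtein so this sketch compiles on its own.
fn lev(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut row: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut prev_diag = row[0];
        row[0] = i + 1;
        for j in 1..=b.len() {
            let cost = if b[j - 1] == ca { 0 } else { 1 };
            let val = (row[j] + 1).min(row[j - 1] + 1).min(prev_diag + cost);
            prev_diag = row[j];
            row[j] = val;
        }
    }
    row[b.len()]
}

// Return the closest candidate within `max_dist` edits, if any.
fn fuzzy_search<'a>(list: &[&'a str], value: &str, max_dist: usize) -> Option<&'a str> {
    list.iter()
        .copied()
        .map(|cand| (lev(cand, value), cand))
        .filter(|&(d, _)| d <= max_dist)
        .min_by_key(|&(d, _)| d)
        .map(|(_, cand)| cand)
}

fn main() {
    let list = ["apple", "banana", "cherry"];
    assert_eq!(fuzzy_search(&list, "appel", 2), Some("apple")); // 2 edits away
    assert_eq!(fuzzy_search(&list, "appel", 0), None); // no exact match
}
```

A doc section like this would also be a natural place to say when jaro_winkler (prefix-weighted) is a better fit than plain Levenshtein.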

Normalized Levenshtein score

Hello!

While Levenshtein calculates the minimum number of string "mutations", there are situations where it's useful to get a normalized score, where the returned value is calculated as the following:

1.0 - levenshtein(a, b) / max(a.len(), b.len())

What do you think, is it worth adding such a function here? I could add that. At the same time, it could be a practice for me in writing Rust code 😊

breaking change of Jaro-winkler boost

First off, thanks for updating the library after a long time @dguo @maxbachmann!

However, shouldn't the latest release be a minor version release and not a patch release given the breaking change introduced by changing the Jaro Winkler boost?

That is, shouldn't the latest release be 0.11.0 and not 0.10.1?

Maintaining this crate

Hi @dguo !
Would you mind adding me (or someone else if they volunteer) as a maintainer of this crate on github and crates.io ?
This crate seems to have several active users, and you do not seem to have a lot of time for it. It would be a pity if the crate died.

add Pass-Join for large lists of strings

I'd like to see other algorithms added that help support Big Data scenarios and large lists of strings that typically are handled ungracefully with Map Reduce or other ill-suited methods for string comparison.

One of those algorithms that I would like to see added would be Pass-Join.
