
strsim-rs's Introduction

strsim-rs


Rust implementations of string similarity metrics:

  • Hamming
  • Levenshtein (plus a normalized version)
  • Optimal string alignment (OSA)
  • Damerau-Levenshtein (plus a normalized version)
  • Jaro and Jaro-Winkler
  • Sørensen-Dice

The normalized versions return values between 0.0 and 1.0, where 1.0 means an exact match.

There are also generic versions of the functions for non-string inputs.

Installation

strsim is available on crates.io. Add it to your project:

cargo add strsim

Usage

Go to Docs.rs for the full documentation. You can also clone the repo, and run $ cargo doc --open.

Examples

extern crate strsim;

use strsim::{hamming, levenshtein, normalized_levenshtein, osa_distance,
             damerau_levenshtein, normalized_damerau_levenshtein, jaro,
             jaro_winkler, sorensen_dice};

fn main() {
    match hamming("hamming", "hammers") {
        Ok(distance) => assert_eq!(3, distance),
        Err(why) => panic!("{:?}", why)
    }

    assert_eq!(levenshtein("kitten", "sitting"), 3);

    assert!((normalized_levenshtein("kitten", "sitting") - 0.571).abs() < 0.001);

    assert_eq!(osa_distance("ac", "cba"), 3);

    assert_eq!(damerau_levenshtein("ac", "cba"), 2);

    assert!((normalized_damerau_levenshtein("levenshtein", "löwenbräu") - 0.272).abs() <
            0.001);

    assert!((jaro("Friedrich Nietzsche", "Jean-Paul Sartre") - 0.392).abs() <
            0.001);

    assert!((jaro_winkler("cheeseburger", "cheese fries") - 0.911).abs() <
            0.001);

    assert_eq!(sorensen_dice("web applications", "applications of the web"),
        0.7878787878787878);
}

Using the generic versions of the functions:

extern crate strsim;

use strsim::generic_levenshtein;

fn main() {
    assert_eq!(2, generic_levenshtein(&[1, 2, 3], &[0, 2, 5]));
}

Contributing

If you don't want to install Rust itself, you can run $ ./dev for a development CLI if you have Docker installed.

Benchmarks require a Nightly toolchain. Run $ cargo +nightly bench.

License

MIT

strsim-rs's People

Contributors

ameobea, dguo, felixonmars, gentoid, gnomeddev, ibuzpe9, ignatenkobrain, lengyijun, llogiq, lovasoa, maxbachmann, mgeisler, nwagner84, ov700, robjtede, shnatsel, sstangl, wanzenbug, wdv4758h


strsim-rs's Issues

Consider following crates.io's own semver flavour

I see in the changelog file that this crate attempts to follow semver.

AFAIK, 0.10.0 was released with no breaking changes, only a new feature.

Filing this only in case you are not aware of a couple of facts.

Semantic Versioning itself says that 0.y.z has no rules at all, and that you should not expect any 0.y.z version to be compatible with another 0.y.z version.

On the other hand, the crates.io ecosystem and cargo both treat the leftmost non-zero number as the major version. So 0.1.1 is compatible with 0.1.0, but 0.2.0 is not compatible with 0.1.1.

So for example, AFAIK 0.10.0 should have been 0.9.4, if the changelog file is correct.
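Cargo's default caret requirements make this concrete. A sketch of what a downstream Cargo.toml implies (the version numbers here are illustrative, not a recommendation):

```toml
[dependencies]
# Cargo treats the leftmost non-zero number as "major", so this requirement
# accepts 0.9.x upgrades (e.g. a hypothetical 0.9.4) but never pulls in 0.10.0.
strsim = "0.9.3"
```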

Please don't include the generated docs in the repository

strsim includes a complete copy of the generated documentation in the repository, and thus in the crate source. Please consider dropping that copy in favor of letting cargo build it (cargo doc) or using docs.rs.

Having a generated copy of the documentation in the source makes distribution packaging more difficult, as doing so embeds various files that distributions prefer to maintain a single copy of (such as the various javascript files, css files, and fonts).

Project Status

Hello! Is this crate still maintained? I haven't seen any updates since 2021.

examples in README fail assertion

Not sure whether this was intentional but two assertions fail in the README example. These thresholds, however, work:

    assert!((jaro("Friedrich Nietzsche", "Jean-Paul Sartre") - 0.392).abs() < 0.00012);
    assert!((jaro_winkler("cheeseburger", "cheese fries") - 0.911).abs() < 0.00012);

Performance of Damerau Levenshtein is low

I am unsure as to the cause of this. I know the DL algorithm is significantly more complex than the OSA algorithm but some crude benchmarks show the results below when simply comparing "alcohol" and "acloholism".

I just wanted to post this to ensure the algorithm has the correct implementation given this package's large usage rate.

Times below are average benchmarks in nanoseconds.

(Screenshot of benchmark results not reproduced here.)

Incremental calculation of Levenshtein distance

I have some code for calculating the Levenshtein distance between a given string and a second string given character-by-character by reusing previous calculations, effectively generalizing strsim::levenshtein. Is this of interest for this project?
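The idea can be sketched like this (a hypothetical API, not part of strsim): keep the last row of the standard dynamic-programming table against a fixed target string, and extend it one typed character at a time, reusing all previous work.

```rust
// Hypothetical sketch: maintain the last DP row against a fixed `target`
// and extend the computation one typed character at a time.
struct IncrementalLev {
    target: Vec<char>,
    row: Vec<usize>, // row[j] = levenshtein(typed_so_far, target[..j])
    typed_len: usize,
}

impl IncrementalLev {
    fn new(target: &str) -> Self {
        let target: Vec<char> = target.chars().collect();
        let row: Vec<usize> = (0..=target.len()).collect();
        IncrementalLev { target, row, typed_len: 0 }
    }

    // Append one character and return the updated distance in O(|target|).
    fn push(&mut self, c: char) -> usize {
        self.typed_len += 1;
        let mut prev_diag = self.row[0];
        self.row[0] = self.typed_len;
        for j in 1..self.row.len() {
            let cost = if self.target[j - 1] == c { 0 } else { 1 };
            let val = (self.row[j] + 1)   // deletion
                .min(self.row[j - 1] + 1) // insertion
                .min(prev_diag + cost);   // substitution or match
            prev_diag = self.row[j];
            self.row[j] = val;
        }
        *self.row.last().unwrap()
    }
}

fn main() {
    let mut lev = IncrementalLev::new("kitten");
    let mut d = 0;
    for c in "sitting".chars() {
        d = lev.push(c);
    }
    assert_eq!(d, 3); // same result as strsim::levenshtein("kitten", "sitting")
}
```

Each push costs O(|target|) instead of recomputing the full O(|typed| * |target|) table from scratch.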

BUG: Jaro-Winkler implementation is wrong for matching prefixes longer than 4

It seems that the algorithm implemented here and called "Jaro-Winkler" is in fact not the usual Jaro-Winkler metric, due either to a deliberate design decision or to a misunderstanding of the algorithm's original intent.

The original Jaro-Winkler limits the prefix to <=4 - for a good reason: otherwise strings with identical common prefixes longer than 10 go above 1 (if one uses 0.1 for p).

The implementation here uses an arbitrary prefix length and instead hard-clips similarities to 1 when they are above.

Here is wikipedia on this topic: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance#Jaro%E2%80%93Winkler_similarity

This causes issues like the one where AlignmentScorr is supposedly equal in distance to AlignmentScore and AlignmentStart, which it obviously isn't; see clap-rs/clap#4660.

I feel that an algorithm should be implemented the way it was intended by the author. It could be called Jaro-Winkler-Guo after the author of this crate @dguo but there should definitely be a normal Jaro-Winkler without modification.
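For reference, the standard Winkler boost caps the shared prefix at 4 characters, which (for p = 0.1) keeps the result bounded by 1.0 without any clipping. An illustrative sketch, not strsim's actual code:

```rust
// Illustrative sketch of the standard Winkler boost on top of a
// precomputed Jaro similarity. Not strsim's implementation.
fn winkler_boost(jaro: f64, s1: &str, s2: &str) -> f64 {
    let p = 0.1; // standard scaling factor
    let prefix = s1.chars()
        .zip(s2.chars())
        .take(4) // the cap the original definition requires
        .take_while(|(a, b)| a == b)
        .count() as f64;
    // Since prefix * p <= 0.4, the result can never exceed 1.0.
    jaro + prefix * p * (1.0 - jaro)
}

fn main() {
    // With the cap, even a 10-character shared prefix contributes at most 4 * 0.1:
    let s = winkler_boost(0.5, "alignmentscore", "alignmentstart");
    assert!((s - 0.7).abs() < 1e-12); // 0.5 + 4 * 0.1 * (1.0 - 0.5)
    // An exact match stays exactly 1.0:
    assert!((winkler_boost(1.0, "abc", "abc") - 1.0).abs() < 1e-12);
}
```

Without the cap, a shared prefix of length 10 with p = 0.1 would push the score to 1.0 or beyond for non-identical strings, which is exactly the problem the issue describes.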

Unfortunately this crate, despite being very heavily used, seems to be unmaintained. It would be nice if it could be maintained again, given that it's relied on by many important crates in the Rust ecosystem. Maybe @dguo can start a call for a co-maintainer?

please reconsider the addition of ndarray and hashbrown as dependencies

I left a comment here on the commit in which these deps were introduced, but I'm not sure if it was seen: d6717db#r33186236

So I'd like to open an issue to increase its visibility. In particular, it would be great if strsim could remain fairly lightweight since it is depended on by a bunch of crates via clap. I'm hoping that we, as an ecosystem, can be a bit more judicious in adding dependencies to crates.

For hashbrown, which is awesome, it is going to be included in the standard library very soon. Folks upgrading their Rust version will automatically get this optimization soon enough. I don't think it's worth adding it as an explicit dependency for a small gain for users on older versions of Rust, because it impacts compilation times for everyone.

For ndarray, it looks like it was just used for some small convenience. I don't think it's worth bringing ndarray in (as excellent as it is) along with all of its dependencies just for that.

evaluate inclusive ranges

In a couple of places, ranges are written like 0..(end + 1), which can be expressed as 0..=end. This is generally more readable. However, I noticed this can lead to larger code gen, since 0..=end has to perform a saturating add, while 0..(end + 1) can use a wrapping add.

This can negatively affect code size. E.g. when updating it in generic_levenshtein_distance I get the following binary sizes:
With 0..(end + 1)

File  .text     Size Crate
5.6%  97.7% 255.5KiB std
0.0%   0.4%     977B strsim
0.0%   0.0%     119B strsim_test
0.0%   0.0%     102B [Unknown]
5.8% 100.0% 261.4KiB .text section size, the file size is 4.4MiB

With 0..=end

File  .text     Size Crate
5.6%  97.5% 255.5KiB std
0.0%   0.6%   1.7KiB strsim
0.0%   0.0%     119B strsim_test
0.0%   0.0%     102B [Unknown]
5.8% 100.0% 262.1KiB .text section size, the file size is 4.4MiB

This should be evaluated on a case-by-case basis, and we should probably add a comment for cases where we keep the old syntax, so it is not rewritten in a future cleanup without rechecking the code gen.
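Both spellings iterate exactly the same values; the difference is only in the generated code, because 0..=end must handle end == u32::MAX without overflowing, while 0..(end + 1) simply adds first. A quick sanity check:

```rust
fn main() {
    let end = 5u32;
    // Exclusive form: the add may overflow when end == u32::MAX.
    let a: u32 = (0..(end + 1)).sum();
    // Inclusive form: no overflow at the boundary, at the cost of extra codegen.
    let b: u32 = (0..=end).sum();
    assert_eq!(a, b);
    assert_eq!(a, 15); // 0 + 1 + ... + 5
}
```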

Jaro Winkler transpositions handled incorrectly.

When comparing two strings that have a lot of transposed characters, the score comes out slightly wrong.

For example: the strings "a jke" and "jane a k" produce a result of 0.6833, which corresponds to 4 matches and 1 transposition. However, there are actually 2 transpositions, because the number of out-of-sequence characters is 4: "a", " ", "j", "e" are all out of order (and transpositions = 4 / 2). The correct score should come out to be 0.6.
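A reference implementation following the textbook definition (a self-contained sketch for checking, not strsim's code) reproduces the 0.6 the report expects, counting transpositions as half the matched positions whose characters disagree:

```rust
// Reference Jaro similarity, per the standard definition.
fn jaro_ref(s1: &str, s2: &str) -> f64 {
    let a: Vec<char> = s1.chars().collect();
    let b: Vec<char> = s2.chars().collect();
    if a.is_empty() && b.is_empty() { return 1.0; }
    if a.is_empty() || b.is_empty() { return 0.0; }
    // Characters match if equal and within the standard window.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut b_used = vec![false; b.len()];
    let mut a_matches: Vec<char> = Vec::new();
    for (i, &ca) in a.iter().enumerate() {
        let lo = i.saturating_sub(window);
        let hi = (i + window).min(b.len() - 1);
        for j in lo..=hi {
            if !b_used[j] && b[j] == ca {
                b_used[j] = true;
                a_matches.push(ca);
                break;
            }
        }
    }
    if a_matches.is_empty() { return 0.0; }
    let b_matches: Vec<char> = (0..b.len()).filter(|&j| b_used[j]).map(|j| b[j]).collect();
    // Transpositions: half the matched positions whose characters disagree.
    let t = a_matches.iter().zip(b_matches.iter()).filter(|(x, y)| x != y).count() / 2;
    let m = a_matches.len() as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t as f64) / m) / 3.0
}

fn main() {
    // 4 matches; matched sequences "a je " vs "jae " disagree in 4 places,
    // giving 2 transpositions and (4/5 + 4/8 + 2/4) / 3 = 0.6.
    assert!((jaro_ref("a jke", "jane a k") - 0.6).abs() < 1e-9);
    assert!((jaro_ref("abc", "abc") - 1.0).abs() < 1e-12);
}
```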

please add a pinned Rust version to CI

The latest release of strsim (0.8) increased its minimum required Rust version to build. There is no problem with that, but in doing so, I'd now like to increase my minimum Rust version in my CI config to accommodate the strsim update. In order to do that, I need to know what Rust version strsim compiles with. In the ecosystem, one convention used to communicate this is to add a specific Rust version in the CI config. Here is an example: https://github.com/BurntSushi/byteorder/blob/bdcc6bf676a1ed17eae68257bfa4726a1f0ec068/.travis.yml#L4

Could you please add a pinned version to your CI config? (Note that I am not asking for any specific semver policy around the minimum Rust version supported, but rather, just a somewhat more structured means of communicating what version of Rust strsim compiles with.)

Add a `levenshtein_limit` function

A lot of times when checking whether two strings are similar, they wind up being less similar than you actually care about. For these cases it can be a significant performance boost to give up calculating the Levenshtein distance once it passes a specific limit (rather than calculating the complete distance and then clamping).

Would it be possible to add this feature to this crate?

I have a significantly less popular crate with these, I would be okay with the implementation being stolen :) https://docs.rs/stringmetrics/latest/stringmetrics/#limited-levenshtein-algorithm / https://docs.rs/stringmetrics/latest/stringmetrics/fn.try_levenshtein.html. (A missed optimization here is that you don't actually need the full-length vector if a limit is provided. Could actually be a const generic).
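A minimal sketch of the idea (a hypothetical function loosely following the Option-returning shape of try_levenshtein, not strsim code): run the usual single-row DP, but bail out as soon as every entry in the current row exceeds the limit, since row minima never decrease and the final distance can only be larger.

```rust
// Hypothetical limited Levenshtein: returns None once the distance is
// guaranteed to exceed `limit`, avoiding the remaining DP work.
fn levenshtein_limit(a: &str, b: &str, limit: usize) -> Option<usize> {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut row: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut prev_diag = row[0];
        row[0] = i + 1;
        let mut row_min = row[0];
        for j in 1..=b.len() {
            let cost = if b[j - 1] == ca { 0 } else { 1 };
            let val = (row[j] + 1).min(row[j - 1] + 1).min(prev_diag + cost);
            prev_diag = row[j];
            row[j] = val;
            row_min = row_min.min(val);
        }
        // Row minima are nondecreasing, so once the whole row exceeds
        // the limit the final distance must as well.
        if row_min > limit {
            return None;
        }
    }
    let d = row[b.len()];
    if d <= limit { Some(d) } else { None }
}

fn main() {
    assert_eq!(levenshtein_limit("kitten", "sitting", 3), Some(3));
    assert_eq!(levenshtein_limit("kitten", "sitting", 2), None);
    assert_eq!(levenshtein_limit("abc", "abc", 0), Some(0));
}
```

As the issue notes, a further optimization (not shown here) is shrinking the row to a band of width 2 * limit + 1 around the diagonal.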

Document Algorithms and Functions better

I am having a hard time understanding what all those functions of this library are used for and which one I should use.

It would be good to have a broad explanation of each algorithm, its uses, and maybe a link to some resources that explain it in further detail. Maybe also list the advantages and disadvantages?

You could also include some real-world examples.

For example, how could I improve this simple search function with strsim, to not only find exact matches, but also very similar matches?

fn search<'a>(list: &[&'a str], value: &str) -> Option<&'a str> {
    for &x in list {
        if x == value {
            return Some(x);
        }
    }
    None
}

Maybe this could be useful https://medium.com/@appaloosastore/string-similarity-algorithms-compared-3f7b4d12f0ff
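One shape such a documented example could take (everything here is a made-up illustration: `fuzzy_search`, the cutoff of 2, and the inline `lev` helper, which stands in for strsim::levenshtein so the sketch is self-contained): rank candidates by edit distance and accept the closest one under a cutoff.

```rust
// Stand-in for strsim::levenshtein so this sketch compiles on its own.
fn lev(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut row: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut prev_diag = row[0];
        row[0] = i + 1;
        for j in 1..=b.len() {
            let cost = if b[j - 1] == ca { 0 } else { 1 };
            let val = (row[j] + 1).min(row[j - 1] + 1).min(prev_diag + cost);
            prev_diag = row[j];
            row[j] = val;
        }
    }
    row[b.len()]
}

// Return the closest candidate within `max_dist` edits, if any.
fn fuzzy_search<'a>(list: &[&'a str], value: &str, max_dist: usize) -> Option<&'a str> {
    list.iter()
        .copied()
        .map(|cand| (lev(cand, value), cand))
        .filter(|&(d, _)| d <= max_dist)
        .min_by_key(|&(d, _)| d)
        .map(|(_, cand)| cand)
}

fn main() {
    let list = ["apple", "banana", "cherry"];
    assert_eq!(fuzzy_search(&list, "appel", 2), Some("apple")); // 2 edits away
    assert_eq!(fuzzy_search(&list, "appel", 0), None); // no exact match
}
```

A doc section like this would also be a natural place to say when jaro_winkler (prefix-weighted) is a better fit than plain Levenshtein.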

Normalized Levenshtein score

Hello!

While Levenshtein calculates the minimum number of string "mutations", there are situations where it's useful to get a normalized score, where the returned value is calculated as the following:

1.0 - levenshtein(a, b) / max(a.len(), b.len())

What do you think, is it worth adding such a function here? I could add that. At the same time, it could be a practice for me in writing Rust code 😊

breaking change of Jaro-winkler boost

First off, thanks for updating the library after a long time @dguo @maxbachmann!

However, shouldn't the latest release be a minor version release and not a patch release given the breaking change introduced by changing the Jaro Winkler boost?

That is, shouldn't the latest release be 0.11.0 and not 0.10.1?

Maintaining this crate

Hi @dguo !
Would you mind adding me (or someone else if they volunteer) as a maintainer of this crate on github and crates.io ?
This crate seems to have several active users, and you do not seem to have a lot of time for it. It would be a pity if the crate died.

add Pass-Join for large lists of strings

I'd like to see other algorithms added that help support Big Data scenarios and large lists of strings that typically are handled ungracefully with Map Reduce or other ill-suited methods for string comparison.

One of those algorithms that I would like to see added would be Pass-Join.
