rapidfuzz / rapidfuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

License: MIT License

C++ 39.27% Python 38.73% Cython 19.72% CMake 1.47% Shell 0.07% C 0.74%

string-matching string-similarity string-comparison levenshtein python cpp levenshtein-distance

rapidfuzz's Introduction

Rapid fuzzy string matching in Python and C++ using the Levenshtein Distance

Description • Installation • Usage • License

Description

RapidFuzz is a fast string matching library for Python and C++ that uses the string similarity calculations from FuzzyWuzzy. However, there are a couple of aspects that set RapidFuzz apart from FuzzyWuzzy:

It is MIT licensed, so it can be used with whichever license you choose for your project, whereas using FuzzyWuzzy forces you to adopt the GPL license
It provides many string_metrics like hamming or jaro_winkler, which are not included in FuzzyWuzzy
It is mostly written in C++ and on top of this comes with a lot of algorithmic improvements that make string matching even faster, while still providing the same results. For detailed benchmarks check the documentation
It fixes multiple bugs in the partial_ratio implementation
It can largely be used as a drop-in replacement for fuzzywuzzy. However, there are a couple of API differences described here

Requirements

Python 3.8 or later
On Windows the Visual C++ 2019 redistributable is required

Installation

There are several ways to install RapidFuzz. The recommended methods are to use either pip (the Python package manager) or conda (an open-source, cross-platform package manager).

with pip

RapidFuzz can be installed with pip the following way:

pip install rapidfuzz

There are pre-built binaries (wheels) of RapidFuzz for macOS (10.9 and later), Linux x86_64 and Windows. Wheels for armv6l (Raspberry Pi Zero) and armv7l (Raspberry Pi) are available on piwheels.

✖️ failure "ImportError: DLL load failed"

If you run into this error on Windows, the most likely reason is that the Visual C++ 2019 redistributable is not installed. It is required to find the C++ libraries (the C++ 2019 version includes the 2015, 2017 and 2019 versions).

with conda

RapidFuzz can be installed with conda:

conda install -c conda-forge rapidfuzz

from git

RapidFuzz can be installed directly from the source distribution by cloning the repository. This requires a C++17 capable compiler.

git clone --recursive https://github.com/rapidfuzz/rapidfuzz.git
cd rapidfuzz
pip install .

Usage

Some simple functions are shown below. Complete documentation of all functions can be found here.
Note that since RapidFuzz 3.0.0, strings are not preprocessed (removing all non-alphanumeric characters, trimming whitespace, converting all characters to lower case) by default. This means that two strings with the same characters but different cases ("this is a word", "THIS IS A WORD") may no longer receive a perfect score, so scores for such strings can differ from those of previous versions. Some examples of string matching with preprocessing can be found here.
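To make the three preprocessing steps concrete, here is a minimal pure-Python sketch. The name simple_preprocess is hypothetical, and replacing each run of non-alphanumeric characters with a single space is an assumption; this is not RapidFuzz's own implementation of utils.default_process.

```python
import re

# Runs of characters that are neither letters nor digits (underscore included).
_NON_ALNUM = re.compile(r"[\W_]+")


def simple_preprocess(text: str) -> str:
    """Hypothetical stand-in for a default processor, not RapidFuzz's code."""
    # Replace non-alphanumeric runs with one space, trim, then lowercase.
    return _NON_ALNUM.sub(" ", text).strip().lower()


print(simple_preprocess("this is a new test!!!"))  # this is a new test
print(simple_preprocess("THIS IS A WORD"))         # this is a word
```

With a function like this applied to both inputs, "this is a word" and "THIS IS A WORD" become identical before scoring.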

Scorers

Scorers in RapidFuzz can be found in the modules fuzz and distance.

Simple Ratio

> from rapidfuzz import fuzz
> fuzz.ratio("this is a test", "this is a test!")
96.55172413793103
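The score above can be reproduced by hand with a normalized insertion/deletion similarity. The sketch below is a slow pure-Python illustration under that assumption; the helper names lcs_length and indel_ratio are hypothetical and the library itself uses much faster algorithms.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100] based on insertions and deletions only."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    # Characters outside the common subsequence must be inserted or deleted.
    indel_distance = total - 2 * lcs_length(a, b)
    return 100.0 * (1 - indel_distance / total)


# One extra "!" out of 14 + 15 = 29 characters gives roughly 96.55.
print(indel_ratio("this is a test", "this is a test!"))
```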

Partial Ratio

> from rapidfuzz import fuzz
> fuzz.partial_ratio("this is a test", "this is a test!")
100.0

Token Sort Ratio

> from rapidfuzz import fuzz
> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
90.9090909090909
> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100.0
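The jump from about 90.9 to 100.0 comes from sorting the words before comparing. A minimal sketch of that normalization step (sort_tokens is a hypothetical helper, and splitting on whitespace is an assumption):

```python
def sort_tokens(text: str) -> str:
    """Split on whitespace, sort the words, and join them again."""
    return " ".join(sorted(text.split()))


a = sort_tokens("fuzzy wuzzy was a bear")
b = sort_tokens("wuzzy fuzzy was a bear")
print(a)       # a bear fuzzy was wuzzy
print(a == b)  # True
```

Once both inputs are normalized this way, word order no longer affects the comparison.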

Token Set Ratio

> from rapidfuzz import fuzz
> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84.21052631578947
> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100.0
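The token set scorer additionally ignores repeated words. The sketch below only shows the decomposition into shared and unshared words, as I understand the FuzzyWuzzy-style approach; token_set_parts is a hypothetical helper and the final scoring step is left out.

```python
def token_set_parts(s1: str, s2: str) -> tuple[str, str, str]:
    """Return (shared words, words only in s1, words only in s2)."""
    t1, t2 = set(s1.split()), set(s2.split())
    common = " ".join(sorted(t1 & t2))
    only1 = " ".join(sorted(t1 - t2))
    only2 = " ".join(sorted(t2 - t1))
    return common, only1, only2


print(token_set_parts("fuzzy was a bear", "fuzzy fuzzy was a bear"))
# ('a bear fuzzy was', '', '')
```

Both leftover parts are empty here because the duplicate "fuzzy" disappears in the set, which is consistent with the 100.0 shown above.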

Weighted Ratio

> from rapidfuzz import fuzz
> fuzz.WRatio("this is a test", "this is a new test!!!")
85.5

> from rapidfuzz import fuzz, utils
> # Removing non alpha numeric characters("!") from the string
> fuzz.WRatio("this is a test", "this is a new test!!!", processor=utils.default_process) # here "this is a new test!!!" is converted to "this is a new test"
95.0
> fuzz.WRatio("this is a test", "this is a new test")
95.0

> # Converting string to lower case
> fuzz.WRatio("this is a word", "THIS IS A WORD")
21.42857142857143
> fuzz.WRatio("this is a word", "THIS IS A WORD", processor=utils.default_process) # here "THIS IS A WORD" is converted to "this is a word"
100.0

Quick Ratio

> from rapidfuzz import fuzz
> fuzz.QRatio("this is a test", "this is a new test!!!")
80.0

> from rapidfuzz import fuzz, utils
> # Removing non alpha numeric characters("!") from the string
> fuzz.QRatio("this is a test", "this is a new test!!!", processor=utils.default_process)
87.5
> fuzz.QRatio("this is a test", "this is a new test")
87.5

> # Converting string to lower case
> fuzz.QRatio("this is a word", "THIS IS A WORD")
21.42857142857143
> fuzz.QRatio("this is a word", "THIS IS A WORD", processor=utils.default_process)
100.0

Process

The process module makes it possible to compare strings to lists of strings. This is generally more performant than using the scorers directly from Python. Here are some examples of the usage of processors in RapidFuzz:

> from rapidfuzz import process, fuzz
> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
> process.extract("new york jets", choices, scorer=fuzz.WRatio, limit=2)
[('New York Jets', 76.92307692307692, 1), ('New York Giants', 64.28571428571428, 2)]
> process.extractOne("cowboys", choices, scorer=fuzz.WRatio)
('Dallas Cowboys', 83.07692307692308, 3)

> # With preprocessing
> from rapidfuzz import process, fuzz, utils
> process.extract("new york jets", choices, scorer=fuzz.WRatio, limit=2, processor=utils.default_process)
[('New York Jets', 100.0, 1), ('New York Giants', 78.57142857142857, 2)]
> process.extractOne("cowboys", choices, scorer=fuzz.WRatio, processor=utils.default_process)
('Dallas Cowboys', 90.0, 3)

The full documentation of processors can be found here
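Conceptually, extractOne scores every choice and returns the best one together with its score and position. The following pure-Python sketch illustrates that shape only; extract_one is a hypothetical function, and the difflib-based stand_in_scorer is a substitute for fuzz.WRatio, so its scores will not match the ones above.

```python
from difflib import SequenceMatcher


def stand_in_scorer(a: str, b: str) -> float:
    """Stand-in similarity in [0, 100]; not one of RapidFuzz's scorers."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_one(query, choices, scorer=stand_in_scorer, processor=None,
                score_cutoff=0.0):
    """Return (choice, score, index) of the best match, or None."""
    if processor is not None:
        query = processor(query)
    best = None
    for index, choice in enumerate(choices):
        candidate = processor(choice) if processor is not None else choice
        score = scorer(query, candidate)
        # Keep the first choice that reaches the highest score.
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score, index)
    return best


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(extract_one("cowboys", choices, processor=str.lower))
```

The original, unprocessed choice is returned even when a processor is used, which mirrors the tuples shown in the examples above.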

Benchmark

The following benchmark gives a quick performance comparison between RapidFuzz and FuzzyWuzzy. More detailed benchmarks for the string metrics can be found in the documentation. For this simple comparison I generated a list of 10,000 strings of length 10, which is compared to a sample of 100 elements from this list:

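import random
import string

# 10,000 random alphanumeric strings of length 10
words = [
    "".join(random.choice(string.ascii_letters + string.digits) for _ in range(10))
    for _ in range(10_000)
]
# every 100th word, i.e. 100 samples
samples = words[:: len(words) // 100]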

The first benchmark compares the performance of the scorers in FuzzyWuzzy and RapidFuzz when they are used directly from Python in the following way:

for sample in samples:
    for word in words:
        scorer(sample, word)

The following graph shows how many elements are processed per second with each of the scorers. There are big performance differences between the different scorers. However each of the scorers is faster in RapidFuzz
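A throughput figure like the one plotted can be measured with a small timing harness. The sketch below is an assumption about how such a measurement could look, not the script used for the graphs; pairs_per_second is a hypothetical helper, the difflib scorer is a stand-in, and the data set is shrunk to keep the run short.

```python
import random
import string
import time
from difflib import SequenceMatcher


def stand_in_scorer(a: str, b: str) -> float:
    """Stand-in scorer so the harness runs without any third-party package."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def pairs_per_second(scorer, samples, words) -> float:
    """Time the nested loop and return scored pairs per second."""
    start = time.perf_counter()
    for sample in samples:
        for word in words:
            scorer(sample, word)
    elapsed = time.perf_counter() - start
    return len(samples) * len(words) / elapsed if elapsed > 0 else float("inf")


random.seed(0)  # fixed seed so repeated runs use the same data
alphabet = string.ascii_letters + string.digits
words = ["".join(random.choice(alphabet) for _ in range(10)) for _ in range(500)]
samples = words[:: len(words) // 10]
print(f"{pairs_per_second(stand_in_scorer, samples, words):,.0f} pairs/s")
```

Passing a different scorer callable as the first argument lets the same loop be reused for each library under test.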

The second benchmark compares the performance when the scorers are used in combination with cdist in the following way:

cdist(samples, words, scorer=scorer)

The following graph shows how many elements are processed per second with each of the scorers. In RapidFuzz the usage of scorers through processors like cdist is a lot faster than directly using it. That's why they should be used whenever possible.
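In terms of results, cdist produces one score for every (sample, word) pair. The pure-Python sketch below shows only that shape; score_matrix and exact_match_scorer are hypothetical names, and as far as I know the real function returns a NumPy array and does the work in compiled code.

```python
def exact_match_scorer(a: str, b: str) -> float:
    """Trivial stand-in scorer: 100 for identical strings, otherwise 0."""
    return 100.0 if a == b else 0.0


def score_matrix(queries, choices, scorer):
    """One row per query and one column per choice."""
    return [[scorer(query, choice) for choice in choices] for query in queries]


matrix = score_matrix(["a", "b"], ["a", "b", "c"], exact_match_scorer)
print(matrix)  # [[100.0, 0.0, 0.0], [0.0, 100.0, 0.0]]
```

The speed advantage described above comes from running this double loop outside the Python interpreter, which the sketch does not attempt to reproduce.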

Support the project

If you are using RapidFuzz for your work and feel like giving a bit of your own benefit back to support the project, consider sending us money through GitHub Sponsors or PayPal that we can use to buy us free time for the maintenance of this great library, to fix bugs in the software, review and integrate code contributions, to improve its features and documentation, or to just take a deep breath and have a cup of tea every once in a while. Thank you for your support.

License

RapidFuzz is licensed under the MIT license since I believe that everyone should be able to use it without being forced to adopt the GPL license. That's why the library is based on an older version of fuzzywuzzy that was MIT licensed as well. This old version of fuzzywuzzy can be found here.

rapidfuzz's Issues

ImportError: DLL load failed: The specified module could not be found

Hi, I use rapidfuzz for my project and one of my users posted the following error:

from rapidfuzz._fuzz import *
ImportError: DLL load failed: The specified module could not be found.

Just curious if anything has changed in your PyPI uploads between 0.7.3 (what I have locally) vs the current 0.9.1?

Appreciate this module very much... at least 5x faster than fuzzywuzzy

Thanks for your contribution

rapidfuzz doesn't behave exactly like fuzzywuzzy

From what I can see, rapidfuzz doesn't work with all iterables.
Here's a piece of code to demonstrate it:

from fuzzywuzzy import process as fuzzywuzzy # version 0.18.0
from rapidfuzz import process as rapidfuzz # version 0.7.7

d = {"cow": "", "sheep": ""}

# works in fuzzywuzzy:
fw = fuzzywuzzy.extractOne("cow", d.keys())
print(fw)
# ('cow', 100)

# doesn't work in rapidfuzz:
rf = rapidfuzz.extractOne("cow", d.keys())
print(rf)
# None

NOTE: there is an easy fix - converting the iterable to a list. But doing so is undesired as it would slow the algorithm down.
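One way to accept any iterable without first copying it into a list is to iterate lazily and number the items on the fly. The sketch below illustrates that idea in pure Python; iter_choices is a hypothetical helper, not part of either library.

```python
from collections.abc import Mapping


def iter_choices(choices):
    """Yield (key, value) pairs for mappings and (index, value) otherwise."""
    if isinstance(choices, Mapping):
        yield from choices.items()
    else:
        # Works for lists, dict views, generators, and other iterables.
        yield from enumerate(choices)


d = {"cow": "", "sheep": ""}
print(list(iter_choices(d.keys())))          # [(0, 'cow'), (1, 'sheep')]
print(list(iter_choices(k for k in d)))      # [(0, 'cow'), (1, 'sheep')]
```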

Other string matching algorithms?

Have you considered other algorithms such as Smith-Waterman, Jaro-Winkler, or even the new/improved Levenshtein-Damerau?

How to get index value for matched string

Incompatible with Python 2.7

Can't install the package seemingly because of the encoding parameter being supplied to open() in the setup.py file:

    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/build/rapidfuzz/setup.py", line 7, in <module>
        with open(path.join(this_dir, "VERSION"), encoding='utf-8') as version_file:
    TypeError: 'encoding' is an invalid keyword argument for this function

I also tried to install from source but it seems I don't have the necessary levenshtein.hpp header files

reduce unnecessary copies

Since partial_ratio currently only works with similar char types, while python strings might be 1byte, 2byte or 4byte per character, they are currently all copied into a new string that is using 4byte per character. However this copy is very expensive (e.g. fuzz.ratio got over 2 times faster by getting rid of this copy).
As of now it would be possible to check the two strings and get rid of the copy when they are using the same char type and copy only one of the strings when it is not similar (right now both are copied), which should already be a big improvement. However it would be best to update partial_ratio to accept different char types, to get rid of all the copies (this would clean up the code as well, since it is not required to have different implementations for functions that work with different char types and functions that don't).

Since many algorithms are using partial_ratio (partial_token_sort_ratio, partial_token_set_ratio, partial_token_ratio, WRatio), this applies for all of them.

extractOne crashes on Mac

Hi!

rapidfuzz-0.13.3 crashes Python process or Jupyter kernel when running simple example on Mac OS Catalina 10.15.7 on a 2014 MacBook Pro 2,8 GHz Dual-Core Intel Core i5.

Example:

Python 3.7.3 (default, May 13 2020, 14:01:50) 
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from rapidfuzz import fuzz, process
>>> process.extract("hello", ["abc", "hello", "gbd", "hello"])
[('hello', 100.0, 1), ('hello', 100.0, 3), ('abc', 0.0, 0), ('gbd', 0.0, 2)]
>>> process.extractOne("hello", ["abc", "hello", "gbd", "hello"])
[1]    36550 illegal hardware instruction  python

Seems like a hardware support issue, older versions (before 0.13) work fine.

String matching algorithms

Good day! I am trying to use your module in a license plate recognition system. Because the recognized strings are not always stable, I solve a clustering problem with your method rapidfuzz::extractOne. On my test data I see the following results and can't explain them:

choices = ['7', 'X946PT78', 'X10', 'T']
print( process.extractOne('X94678', choices) )

[('7', 90.0, 0), ('X946PT78', 85.71428571428571, 3), ('X10', 29.999999999999996, 2), ('T', 0.0, 1)]

Please help me to understand results
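A likely explanation for the high score of '7' is partial matching: a short string that occurs inside a longer one aligns perfectly with one window of it. The sketch below is a naive illustration of that idea, not the library's partial_ratio; naive_partial_ratio is a hypothetical function and the difflib ratio is a stand-in. The reported 90.0 would be consistent with the default scorer down-weighting a perfect partial match, but that weighting is an assumption here.

```python
from difflib import SequenceMatcher


def stand_in_ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 100]; not RapidFuzz's fuzz.ratio."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def naive_partial_ratio(s1: str, s2: str) -> float:
    """Best score of the shorter string against equally long windows of the longer."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        # Arbitrary choice for this sketch: only two empty strings match.
        return 100.0 if not longer else 0.0
    width = len(shorter)
    return max(
        stand_in_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


print(naive_partial_ratio("7", "X94678"))   # 100.0
print(naive_partial_ratio("X10", "X94678"))
```

Under this view, single-character choices such as '7' or 'T' are risky candidates, since any query containing that character matches them completely.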

Add String Preprocessing

add functions for string preprocessing (e.g. lowercase and remove everything but characters and numbers)

Improve fuzz.partial_ratio

fuzz.partial_ratio has an implementation based on difflib. In some cases this is slower than the implementation used by fuzzywuzzy, which is based on the Levenshtein distance. This is done because the Levenshtein distance causes problems when searching for matching blocks in some edge cases. One example of this is the following issue of python-Levenshtein: ztane/python-Levenshtein#16.

An alternative would be to use an implementation based on Smith-Waterman, which might be faster. As a downside, results would probably differ from FuzzyWuzzy. As long as the results are not at least very similar, this can't be used for algorithms ported over from FuzzyWuzzy.

Jupyter Notebook Kernel keeps dying

from rapidfuzz import fuzz
from rapidfuzz import process

for i in cat_no:
    final = process.extractBests(i,sent,score_cutoff=80)
    if len(final) != 0:
        print(i)
        print(final)

I am running the above code in my Jupyter notebook. When I loop over process.extractBests more than 1000 times, the Jupyter kernel keeps reporting that it is dead, and a popup appears saying that the Python program stopped working.

Can you please help me out?

ImportError: DLL load failed while importing _process: The specified module could not be found.

Hello Max,

Greetings for the day!

While trying to import the process and utils modules from rapidfuzz, I encountered the error below.
ImportError: DLL load failed while importing _process: The specified module could not be found.

I have C++ 2019 redistributable installed on my laptop.
I also can see the process.py in my anaconda package folder.

I have installed both "0.11.1" and "0.10.0" versions but no luck in both cases.

Please help me in fixing this issue, Thanking you!

lowercasing only works for ascii

The default_processor should have a similar behavior to the Python implementation of FuzzyWuzzy. However it currently only lowercases ASCII, while the Python Lowercase function lowercases different characters as well. The implementation of Python should be reimplemented for RapidFuzz.
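The difference between the two behaviours is easy to see in pure Python. In this sketch, ascii_lower is a hypothetical helper that mimics an ASCII-only lowercase step, compared with Python's built-in str.lower:

```python
import string

# Translation table that maps only A-Z to a-z.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(text: str) -> str:
    """Lowercase ASCII letters only and leave every other character untouched."""
    return text.translate(_ASCII_LOWER)


print(ascii_lower("ÄPFEL"))  # Äpfel
print("ÄPFEL".lower())       # äpfel
```

Two strings that differ only in the case of a non-ASCII letter would therefore still compare as different after ASCII-only lowercasing.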

process.extractOne, too many values to unpack

This is a note for others who encounter this error.

Short: if you were using extractOne in v0.12.*, you may need to change your usage going forward to allow for unpacking of an additional value i (the index or key of the value chosen).

I was doing something like this:

choice, score = process.extractOne(s, choices)

This worked under v0.12.* but was changed in v0.13.*.
To have similar behavior in v0.13.*:

choice, score, i = process.extractOne(s, choices)

Where i is the index of the chosen value from the choices provided.

This has been noted in the README and documentation so I am simply logging this issue for any others who may have run into it as well. So either version pin to 0.12.5 or upgrade and unpack the i value.
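Code that has to run against both result shapes can unpack defensively instead of pinning a version. A minimal sketch, where unpack_match is a hypothetical helper and the tuples are plain Python values rather than real library output:

```python
def unpack_match(result):
    """Normalize a 2-tuple or 3-tuple match to (choice, score, key_or_None)."""
    if result is None:
        return None
    # Star-unpacking tolerates results with or without the trailing index/key.
    choice, score, *rest = result
    key = rest[0] if rest else None
    return choice, score, key


print(unpack_match(("New York Jets", 100.0)))     # ('New York Jets', 100.0, None)
print(unpack_match(("New York Jets", 100.0, 1)))  # ('New York Jets', 100.0, 1)
```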

Note to @maxbachmann,

First off, let me say that I greatly appreciate your work on this library, makes life a lot easier on my end 🙂
Do you plan to release a v1 of this package or is this package currently "stable". If you consider this package "stable" I would recommend bumping the version to v1.0.0 and following semver so that we don't hit these issues in the future.

Bug when using pandas series as fuzzy matching list

Hello,
I have a very weird and specific bug happening with rapidfuzz 1.1.1 (doesn't happen in 0.13.0)
Rapidfuzz crashes in that precise situation:

import rapidfuzz
import pandas as pd
matches = rapidfuzz.process.extract(
    "test",
    pd.Series(['test color brightness', 'test lemon', 'test lavender'], index=[67478, 67479, 67480]),
)

If I convert the pandas Series to a list, it works.
If I use default Series indexing, it works.

The bug seems to happen only in that very specific case.
Also if I use "test test" as input my notebook kernel directly dies.

Do you know where it comes from?
See below the traceback when I get to see it:

KeyError Traceback (most recent call last)
~/data-perrierh/src/asu-ds/venv-ds/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: <pandas._libs.hashtable.Int64HashTable object at 0x7f28586fe2f0>

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
in
3 matches = rapidfuzz.process.extract(
4 "test",
----> 5 pd.Series(['test color brightness', 'test lemon', 'test lavender'], index=[67478, 67479, 67480]),
6 )
7 matches

src/cpp_process.pyx in cpp_process.extract()

src/cpp_process.pyx in cpp_process.extract_dict()

~/data-perrierh/src/asu-ds/venv-ds/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
904 return self._get_values(key)
905
--> 906 return self._get_with(key)
907
908 def _get_with(self, key):

~/data-perrierh/src/asu-ds/venv-ds/lib/python3.6/site-packages/pandas/core/series.py in _get_with(self, key)
923 elif not is_list_like(key):
924 # e.g. scalars that aren't recognized by lib.is_scalar, GH#32684
--> 925 return self.loc[key]
926
927 if not isinstance(key, (list, np.ndarray, ExtensionArray, Series, Index)):

~/data-perrierh/src/asu-ds/venv-ds/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
877
878 maybe_callable = com.apply_if_callable(key, self.obj)
--> 879 return self._getitem_axis(maybe_callable, axis=axis)
880
881 def _is_scalar_access(self, key: Tuple):

~/data-perrierh/src/asu-ds/venv-ds/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1108 # fall thru to straight lookup
1109 self._validate_key(key, axis)
-> 1110 return self._get_label(key, axis=axis)
1111
1112 def _get_slice_axis(self, slice_obj: slice, axis: int):

~/data-perrierh/src/asu-ds/venv-ds/lib/python3.6/site-packages/pandas/core/indexing.py in _get_label(self, label, axis)
1057 def _get_label(self, label, axis: int):
1058 # GH#5667 this will fail if the label is not present in the axis.
-> 1059 return self.obj.xs(label, axis=axis)
1060
1061 def _handle_lowerdim_multi_index_axis0(self, tup: Tuple):

~/data-perrierh/src/asu-ds/venv-ds/lib/python3.6/site-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
3491 loc, new_index = self.index.get_loc_level(key, drop_level=drop_level)
3492 else:
-> 3493 loc = self.index.get_loc(key)
3494
3495 if isinstance(loc, np.ndarray):

~/data-perrierh/src/asu-ds/venv-ds/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:

KeyError: <pandas._libs.hashtable.Int64HashTable object at 0x7f28586fe2f0>

Strange behavior in partial_ratio()

Hi @maxbachmann!

v0.12.2

from rapidfuzz import fuzz
fuzz.partial_ratio('no', 'bnonco')
50.0

Anyway, thank you for the really rapid library!=)

Token set scorers behave weirdly

Is this behavior correct? Why is "study physics physics 2 video" not scored higher than "study physics physics 2" in all scorers except partial_ratio?

In [11]: fuzz.partial_token_sort_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.partial_token_sort_ratio("physics 2 vid", "study physics physics 2 video")
    ...: fuzz.partial_token_set_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.partial_token_set_ratio("physics 2 vid", "study physics physics 2 video")
    ...: fuzz.token_set_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.token_set_ratio("physics 2 vid", "study physics physics 2 video")
    ...: fuzz.token_sort_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.token_sort_ratio("physics 2 vid", "study physics physics 2 video")
    ...: fuzz.partial_ratio("physics 2 vid", "study physics physics 2")
    ...: fuzz.partial_ratio("physics 2 vid", "study physics physics 2 video")
Out[11]: 76.92307692307692
Out[11]: 76.92307692307692
Out[11]: 100.0
Out[11]: 100.0
Out[11]: 81.81818181818181
Out[11]: 81.81818181818181
Out[11]: 66.66666666666666
Out[11]: 61.904761904761905
Out[11]: 78.26086956521739
Out[11]: 92.3076923076923

Persistent "dead Jupyter kernel" issue with the latest version (v0.13.1) hosted on conda-forge

Hi there,

First of all, thanks so much for this useful library. Despite having installed the latest version, I keep seeing the "dead Jupyter kernel" issue when applying process.extractOne to a Pandas dataframe column containing more than 1000 rows, as documented here. The issue went away after I downgraded to v0.7.6.

Different scores between fuzzywuzzy and rapidfuzz

Based on this Stackoverflow message, I expected rapidfuzz and fuzzywuzzy to return identical scores (apart from rapidfuzz providing a higher precision). However, I notice different results in certain situations. For an example, see the code below.

Are my expectations incorrect and is this difference in behavior as expected?

from fuzzywuzzy import fuzz as fuzzy_fuzz
from rapidfuzz import fuzz as rapid_fuzz  # 0.7.10

string_1 = "abc defgh"
string_2 = "gelieve deze nota binnen 14 dagen te voldoen op iban tnv abcdefgh"

fuzzy_fuzz.partial_ratio(string_1, string_2)  # -> 89
rapid_fuzz.partial_ratio(string_1, string_2)  # -> 94.11764705882352

Core dump error

When I was using grayskull for flake8-isort, rapidfuzz returned a core dump error.

Command to reproduce it using grayskull:

grayskull pypi flake8-isort

rapidfuzz version: 0.7.5
with version 0.6.7 it was working fine

terminate called after throwing an instance of 'boost::wrapexcept<std::out_of_range>'
  what():  string_view::substr
[1]    181960 abort (core dumped)  grayskull pypi flake8-isort

A minimal example to reproduce this problem is:

from rapidfuzz import process

all_choices = ['GPL-1.0-only', 'GNU General Public License v1.0 or later', 'GPL-1.0-or-later', 'GPL-2.0-only', 'GNU General Public License v2.0 only', 'GNU General Public License v2.0 or later', 'GPL-2.0-or-later', 'GPL-3.0-only', 'GNU General Public License v3.0 only', 'GNU General Public License v3.0 or later', 'GPL-3.0-or-later', 'Giftware License', 'GNU Library General Public License v2 only', 'LGPL-2.0-only', 'GNU Library General Public License v2 or later', 'LGPL-2.0-or-later', 'LGPL-2.1-only', 'GNU Lesser General Public License v2.1 only', 'LGPL-2.1-or-later', 'GNU Lesser General Public License v2.1 or later', 'GNU Lesser General Public License v3.0 only', 'LGPL-3.0-only', 'LGPL-3.0-or-later', 'LGPLLR']

process.extract("GPL version 2", all_choices)

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 7) > this->size() (which is 6)
[1]    185021 abort (core dumped)  python

improve documentation

The current documentation is mainly copy-pasted from FuzzyWuzzy, since the functions can be used in the same way. However, many features are not properly documented, which leads to people using FuzzyWuzzy in a suboptimal way.

Add CI Integration

build python wheels using manylinux container in CI
add more Unit tests and run them in CI
add more Benchmarks and run them in CI
release to conda-forge
add benchmark to compare with fuzzywuzzy (based on the benchmark fuzzywuzzy is using)

import can't find rapidfuzz.cpp_impl

I use RapidFuzz in Voco, a privacy-focused voice control addon for the WebThings Gateway smart home controller.

I'm running into an issue loading rapidfuzz (import can't find rapidfuzz.cpp_impl), and I'm not sure what's causing it.

Is Rapidfuzz compatible with this method of installing it?

pip3 install -r requirements.txt -t lib --no-binary :all: --prefix ""

Is it possible to use levenshtein as the scorer for extractOne?

Given a string and a list of strings I would like to find the best match that is < some max distance using Levenshtein distance. I think extractOne is what I'm looking for, however, when trying to use extractOne with levenshtein as the scorer I get the error TypeError: levenshtein() got an unexpected keyword argument 'processor'. I assume this is because levenshtein doesn't take a processor argument.

In my case, all my strings are preprocessed to all caps so I don't need to use a processor but still get the same error when using processor = None

To get around this I'm doing something like:

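from collections import namedtuple

from rapidfuzz import string_metric

Dist = namedtuple("Dist", ["dist", "query", "choice"])

def best_match(query, choices, max_dist):
    # Only matches strictly closer than max_dist are accepted.
    min_dist = max_dist
    best_match = None
    for choice in choices:
        # levenshtein returns -1 when the distance exceeds max
        dist = string_metric.levenshtein(query, choice, max=max_dist)
        if dist == 0:
            # An exact match cannot be beaten, so stop early.
            best_match = Dist(dist, query, choice)
            break
        if dist != -1 and dist < min_dist:
            best_match = Dist(dist, query, choice)
            min_dist = dist
    return best_match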

I would guess this is slower than extractOne but I'm not sure how to accomplish this otherwise.
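For readers without the library at hand, the same search can be sketched in pure Python with a textbook Wagner-Fischer distance. The names levenshtein and closest below are hypothetical and much slower than a compiled implementation; note that this sketch accepts matches at exactly max_dist, which is a design choice rather than the behaviour of the workaround above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest(query, choices, max_dist):
    """Return (choice, distance, index) of the nearest choice within max_dist."""
    best = None
    for index, choice in enumerate(choices):
        dist = levenshtein(query, choice)
        if dist <= max_dist and (best is None or dist < best[1]):
            best = (choice, dist, index)
            if dist == 0:
                break  # an exact match cannot be improved
    return best


print(levenshtein("kitten", "sitting"))          # 3
print(closest("ABC", ["XYZ", "ABD", "ABC"], 2))  # ('ABC', 0, 2)
```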

tuple scores are not supported

FuzzyWuzzy does work with tuple scores. Try, e.g., fzp.extractOne(query='stack', choices=['stick', 'stok'], processor=None, scorer=lambda s1, s2: (fzz.WRatio(s1, s2), fzz.QRatio(s1, s2)), score_cutoff=(0, 0)). This approach allows for lexicographical ordering of matches. Say two matches have the same first score, they may not have so with the second one
Actually, the two sets of lines that prevent tuple scores are 170:172 and 189:191 in process.py. When dealing with identical scores, fuzzywuzzy does simply return the first one it finds within (what appears to be) an unordered iterable... Admittedly debatable.

issue origin: https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column/61371170?noredirect=1#comment112028997_61371170
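The lexicographic ordering described in this issue relies on Python comparing tuples element by element. A minimal sketch of that tie-breaking idea, with hypothetical names (best_by_tuple, shared_letters, length_closeness) and toy scorers instead of WRatio and QRatio:

```python
def best_by_tuple(query, choices, scorers):
    """Return (choice, scores, index) with the lexicographically largest scores."""
    best = None
    for index, choice in enumerate(choices):
        scores = tuple(scorer(query, choice) for scorer in scorers)
        # Tuples compare element-wise, so later scorers only break ties.
        if best is None or scores > best[1]:
            best = (choice, scores, index)
    return best


def shared_letters(a: str, b: str) -> int:
    """Toy primary score: number of distinct letters in common."""
    return len(set(a) & set(b))


def length_closeness(a: str, b: str) -> int:
    """Toy secondary score: 0 for equal lengths, negative otherwise."""
    return -abs(len(a) - len(b))


print(best_by_tuple("stack", ["tack", "stick"], [shared_letters, length_closeness]))
# ('stick', (4, 0), 1)
```

Both choices share four letters with the query, so the second score decides the result.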

pybind11 with iterables

right now only lists are accepted, but it should be possible to convert any iterable to a vector

Incorrect type annotation in `extractOne`

Here is the annotation of extractOne:

https://github.com/maxbachmann/rapidfuzz/blob/9b64ad2fee665b568ff3ff04aafdbadda2d8ed99/src/rapidfuzz/process.py#L138

But the function actually returns Tuple[str, float, str] here:

https://github.com/maxbachmann/rapidfuzz/blob/9b64ad2fee665b568ff3ff04aafdbadda2d8ed99/src/rapidfuzz/process.py#L175

update library dependencies

Hello World =)

I'm using Django 3.1.2, and rapidfuzz 0.12.3 is compatible only with Django <= 3.0.9. Could you please update the dependencies so that I can use it?

Thanks ;)

my centos7 server must have gcc installed but still fails to install this library

pip3 install rapidfuzz
Collecting rapidfuzz
Using cached https://files.pythonhosted.org/packages/26/49/70355eeee8420ec4856a6e27146df8193ba14180cc416d33abdeb9e35b69/rapidfuzz-0.13.1.tar.gz
Installing collected packages: rapidfuzz
Running setup.py install for rapidfuzz ... error
Complete output from command /usr/local/bin/python3.7 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-g250_fyy/rapidfuzz/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-357g06wr/install-record.txt --single-version-externally-managed --compile:
/usr/local/lib/python3.7/distutils/dist.py:274: UserWarning: Unknown distribution option: 'long_description_content_type'
warnings.warn(msg)
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/rapidfuzz
copying src/rapidfuzz/__init__.py -> build/lib.linux-x86_64-3.7/rapidfuzz
copying src/rapidfuzz/fuzz.py -> build/lib.linux-x86_64-3.7/rapidfuzz
copying src/rapidfuzz/process.py -> build/lib.linux-x86_64-3.7/rapidfuzz
copying src/rapidfuzz/utils.py -> build/lib.linux-x86_64-3.7/rapidfuzz
running egg_info
writing src/rapidfuzz.egg-info/PKG-INFO
writing dependency_links to src/rapidfuzz.egg-info/dependency_links.txt
writing top-level names to src/rapidfuzz.egg-info/top_level.txt
reading manifest file 'src/rapidfuzz.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files found matching 'src/rapidfuzz-cpp/rapidfuzz/process.hpp'
warning: no previously-included files found matching 'src/rapidfuzz-cpp/rapidfuzz/process.txx'
warning: no files found matching '*' under directory 'extern/boost'
writing manifest file 'src/rapidfuzz.egg-info/SOURCES.txt'
running build_ext
building 'rapidfuzz.levenshtein' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/src
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Isrc/rapidfuzz-cpp/rapidfuzz -Isrc/rapidfuzz-cpp/extern -Iextern -I/usr/local/include/python3.7m -c src/py_levenshtein.cpp -o build/temp.linux-x86_64-3.7/src/py_levenshtein.o -O3 -std=c++14 -Wextra -Wall -DVERSION_INFO="0.13.1"
gcc: error: unrecognized command line option ‘-std=c++14’
error: command 'gcc' failed with exit status 1

----------------------------------------

Command "/usr/local/bin/python3.7 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-g250_fyy/rapidfuzz/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-357g06wr/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-g250_fyy/rapidfuzz/

Feature Suggestion: Select specific scorers for WRatio

Given that certain types of scorers are most useful in certain situations, I thought it would be interesting for WRatio to have a way to select only certain scorer functions for weighting.

Install on debian?

Hi, I'm trying to install this on a droplet running Debian 10.2, gcc 8.3.0-6. When I try to install this package, I get a timeout when running setup.py bdist_wheel for rapidfuzz. I'm not sure how to fix this, can you offer any guidance? I'm stuck with Debian for an unrelated reason unfortunately or I'd just switch to Ubuntu. Thanks!

Inconsistent behavior between process.extractOne and fuzz.ratio

I am using rapidfuzz to compute the similarity of Devanagari words. Here's a reproducible example.

from rapidfuzz import fuzz, process

word = 'मस्सा'
wordlist = {'शर्ट', 'वर्ट', 'वार्ट', 'वऑर्ट', 'वॉर्ट'}

process.extractOne(word, wordlist)
# gives: ('वार्ट', 100.0)

for i in wordlist: print(fuzz.ratio(i, word))
# gives
# 22.22222222222222
# 22.22222222222222
# 19.999999999999996
# 19.999999999999996
# 19.999999999999996

In the output of process.extractOne, the best score should be 22.22 instead of 100.0.

Environment

Python: 3.6.10
Rapidfuzz: 0.7.8

Error after upgrading to 1.1.0: SystemError: NULL object passed to Py_BuildValue

The fuzzy_match function below that I use in various projects has stopped working after upgrading to 1.1.0. Downgrading to 1.0.0 fixes the issue:

from rapidfuzz import process

def fuzzy_match(query, mapped_choices, score_cutoff):
    best_matches = [
        {"match": match, "score": score, "result": result}
        for (match, score, result) in process.extract(query, mapped_choices, score_cutoff=score_cutoff)
    ]
    return (
        best_matches
        if best_matches
        else [
            {"match": match, "score": score, "result": result}
            for (match, score, result) in process.extract(query, mapped_choices, limit=3)
        ]
    )

For example, the code below raises a SystemError after upgrading:

mapped_choices = {"a932": "foo", "f657": "bar", "8edb": "baz"}
best_matches = fuzzy_match("biz", mapped_choices, score_cutoff=80)

SystemError                               Traceback (most recent call last)
<ipython-input-7-374d57f669dd> in <module>
----> 1 for (match, score, result) in process.extract(query, mapped_choices, score_cutoff=score_cutoff)

src/cpp_process.pyx in cpp_process.extract()

src/cpp_process.pyx in cpp_process.extract_dict()

SystemError: NULL object passed to Py_BuildValue

Downgrading to 1.0.0 does not raise any exception and produces a list of matches, as expected:

$ print(best_matches)
  [{'match': 'baz', 'score': 66.66666666666666, 'result': '8edb'},
   {'match': 'bar', 'score': 33.33333333333333, 'result': 'f657'},
   {'match': 'foo', 'score': 0.0, 'result': 'a932'}]

This is occurring on Mac OS X. I upgraded rapidfuzz in a virtualenv running Python 3.8.6 when the issue was first encountered. Since 1.1.0 involved changing to a CPython binary from C++, I thought installing from a brand new virtualenv might fix the issue. However, the same error occurred after creating a new virtualenv running 3.8.7.

Any idea what the root cause of this issue could be?

[Questions]: Weighting characters to handle acronyms / fixed numbers

Hi - thanks for all your work on this library, it works very nicely.

I have a question about applying weights to certain characters in order to cater for things like acronyms and numbers that should remain the same:

For example, if I had a test string Guide V16 LDV and an array of strings which included the 'true' version: Guide V16 Laker Deep V, I might want to upweight the capital letters and numerical characters in the test string (I don't want the numbers to change) so that a match to Guide V16 Laker Deep V is more likely than a match to Guide V18 SD for example.

I have noticed that token_set_ratio gives more accurate results than token_sort_ratio when acronyms are involved. However, when the 'true' version is longer than the test string, the 100% match from token_set_ratio causes it to match the wrong thing. For example, there are instances where Guide V16 LDV should match Guide V16 Laker Deep V and not Guide V16 Laker Deep V Hidden Pro.

For context, I'm dealing with a large dataset consisting of product makes & models entered by users. I also have a list of 'ground truth' make & models provided by the product manufacturers. My goal is to try and match up the user entered models with the manufacturer's model name. Is the inclusion of a custom weighting complex, or is it incorporated in some other method I've overlooked?
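
One possible approach, short of custom character weights, is to treat tokens that contain digits as hard constraints and only rank the candidates that share them. The sketch below assumes that rule is acceptable for the data; `numeric_tokens` and `best_model_match` are hypothetical helpers, and difflib is used as a stand-in scorer where a RapidFuzz scorer could be substituted.

```python
import re
from difflib import SequenceMatcher


def numeric_tokens(text):
    """Return the tokens that contain a digit (e.g. 'V16'); these must not change."""
    return {token for token in text.split() if re.search(r"\d", token)}


def best_model_match(query, candidates):
    """Hypothetical helper: most similar candidate among those sharing the query's numeric tokens."""
    required = numeric_tokens(query)
    allowed = [c for c in candidates if required <= numeric_tokens(c)]
    if not allowed:
        return None
    # Stand-in scorer (assumption); any similarity function could be used here.
    return max(allowed, key=lambda c: SequenceMatcher(None, query, c).ratio())


models = ["Guide V18 SD", "Guide V16 Laker Deep V", "Guide V16 Laker Deep V Hidden Pro"]
match = best_model_match("Guide V16 LDV", models)
```

Filtering first keeps Guide V18 SD out of the ranking entirely, and the plain similarity then prefers the shorter of the two remaining V16 models.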

PyPy implementation interprets characters wrong

Characters in the PyPy implementation appear to be handled incorrectly. E.g.

string_metric.levenshtein('00000', '000\x800')

returns 2 instead of 1.
However

string_metric.levenshtein('0000', '000\x80')

returns a distance of 1 correctly.

This bug only exists in PyPy and not in the normal CPython implementation, which returns 1 in both cases.

manylinux wheel build failing

The manylinux build is currently failing because of pypa/cibuildwheel#299.
This is already fixed but not deployed yet.

pybind11 default arguments slow

This should be improved by adding small Python functions that call the C++ functions and supply the default arguments themselves, so that the defaults can be removed from the pybind11 bindings. This way, people who really need even more performance can call the C++ function directly and specify all arguments, while everyone else keeps the convenient interface with default arguments.
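
The wrapper pattern described here could look roughly like the following. This is a sketch only: `_ratio_impl` is a pure-Python stand-in for the compiled function (an assumption made so the example runs on its own), and the names and parameters are illustrative rather than the library's actual API.

```python
from difflib import SequenceMatcher


def _ratio_impl(s1, s2, score_cutoff):
    """Stand-in for the compiled function (assumption): every argument is required."""
    score = 100.0 * SequenceMatcher(None, s1, s2).ratio()
    return score if score >= score_cutoff else 0.0


def ratio(s1, s2, score_cutoff=0.0):
    """Thin Python wrapper: the defaults live here, not in the binding."""
    return _ratio_impl(s1, s2, score_cutoff)


# Convenient call with defaults, and the direct call with all arguments spelled out.
convenient = ratio("fuzzy", "fuzzy")
direct = _ratio_impl("fuzzy", "fuzzy", 0.0)
```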

Asian languages usage

Hello,

Not an issue per se, but I was curious about possible Asian language compatibility. Chinese, Korean, and Japanese all perform very poorly with Levenshtein distance matching because they’re not alphabet-based.

A solution would be to use a romanisation library to translate them to alphabet characters, in the same way that all characters are converted to lowercase.

I’d love to give a hand, but unfortunately my C++ is pretty rusty and I’d do more harm than good.

Exact matches in differing cases do not return 100 %

While testing RapidFuzz against the current version of FuzzyWuzzy, I came across an interesting issue. No matter how I preprocess the strings, e.g. "I'LL NEVER LET YOU GO" with Python's .strip().lower() vs. "I'll Never Let You Go" preprocessed the same way, FuzzyWuzzy returns 100% using token_sort_ratio, whereas v1.0.2 of RapidFuzz returns 96%. Very strange. Any idea why this would be occurring? I really like the speed of RapidFuzz and look forward to using more extensive features, e.g. list matching, as I've seen somewhere in the change requests.

Parameterizing the weights of the Levenshtein distance

Hi!

I would like to set some of the weights of the Levenshtein distance myself. For instance, one of my projects requires that deletions be weighted more heavily than replacements.
Could you make this possible?

Thanks =)
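
To make the request concrete, a weighted Levenshtein distance can be computed with the usual dynamic programming table, using a separate cost for each operation. The function below is a plain-Python sketch under that assumption; the name `weighted_levenshtein` and the `weights=(insertion, deletion, substitution)` ordering are choices made for this example, not something the issue specifies.

```python
def weighted_levenshtein(s1, s2, weights=(1, 1, 1)):
    """Cost of turning s1 into s2 with (insertion, deletion, substitution) weights."""
    insert_cost, delete_cost, replace_cost = weights
    # previous[j] = cost of turning the already processed prefix of s1 into s2[:j]
    previous = [j * insert_cost for j in range(len(s2) + 1)]
    for i, ch1 in enumerate(s1, start=1):
        current = [i * delete_cost]  # turning s1[:i] into the empty string
        for j, ch2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + delete_cost,     # drop ch1
                current[j - 1] + insert_cost,  # add ch2
                previous[j - 1] + (0 if ch1 == ch2 else replace_cost),
            ))
        previous = current
    return previous[-1]


# Deletions cost three times as much as insertions or replacements.
costly_deletion = weighted_levenshtein("abc", "ab", weights=(1, 3, 1))
cheap_insertion = weighted_levenshtein("ab", "abc", weights=(1, 3, 1))
```

With these weights the distance is no longer symmetric: removing a character costs 3, while adding one costs 1.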

Add Rust Implementation

There is already a Rust crate, but it is based on an outdated version, so the new code should be ported to Rust again.

Different Behavior than fuzzywuzzy

Just as the title says - fuzzywuzzy and rapidfuzz output different results.
Here's an example:

# rapidfuzz:
from rapidfuzz import process
process.extractOne("A", ["B", "C", "D"]) # -> None

# fuzzywuzzy:
from fuzzywuzzy import process
process.extractOne("A", ["B", "C", "D"]) # -> ("B", 0)

The line responsible for that is this

If you intend to keep this behavior, please document it, since it is not backward compatible.

ImportError

ImportError                               Traceback (most recent call last)
in <module>
----> 1 from rapidfuzz import process

C:\Program Files\Anaconda3\envs\proj\lib\site-packages\rapidfuzz\__init__.py in <module>
      2 rapid string matching library
      3 """
----> 4 from rapidfuzz import process, fuzz, levenshtein, utils

C:\Program Files\Anaconda3\envs\proj\lib\site-packages\rapidfuzz\process.py in <module>
      3 # Copyright © 2011 Adam Cohen
      4
----> 5 from rapidfuzz import fuzz, utils
      6 from typing import Iterable, List, Tuple, Optional, Union, Callable, Generator
      7 import heapq

ImportError: DLL load failed: The specified module could not be found.

I'm currently getting the above ImportError when importing the package in Python. Is it because the modules are supposed to be from fuzzywuzzy or another source?

Add sdist on PyPI as well

Hello,

Could you please add the sdist on PyPI as well?
A lot of package managers use the sdist instead of wheels (Debian, conda, etc.).

suggestion: many to many match

Great project,
Is there a possibility of optimization when matching a list of queries against a list of choices?
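
One optimization that does not depend on the library internals is to preprocess every choice once instead of once per query. The sketch below illustrates that idea only; `score_matrix` is a hypothetical helper, `simple_ratio` is a difflib stand-in for a real scorer, and lowercasing is an assumed preprocessing step.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Stand-in scorer (assumption): difflib similarity scaled to 0..100."""
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


def score_matrix(queries, choices, scorer=simple_ratio, processor=str.lower):
    """Hypothetical helper: rows are queries, columns are choices."""
    processed_choices = [processor(c) for c in choices]  # done once, not once per query
    rows = []
    for query in queries:
        processed_query = processor(query)  # done once per query, not once per pair
        rows.append([scorer(processed_query, c) for c in processed_choices])
    return rows


matrix = score_matrix(["APPLE", "banan"], ["banana", "apple"])
```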

Matching error: TypeError: Choices must be a sequence of strings

Here is the reproducible example:

from rapidfuzz import process
ch = {'abc\x00', 'abcd:\x00', 'abc:\x00d'}
process.extractOne("abc", ch)

And here's the traceback

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Anaconda/envs/hing/lib/python3.7/site-packages/rapidfuzz/process.py", line 70, in extractOne
    return rapidfuzz._process.extractOne(query, choices, score_cutoff, bool(processor))
TypeError: Choices must be a sequence of strings

As a workaround, I can remove the offending character (\x00) from each of the choices, but I have a great many of these lists, and the costs will add up.
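
For reference, the workaround can be done once per list while keeping a mapping back to the original strings, so a match on the cleaned text can still be reported with its original value. This is a sketch of that idea only; `strip_nulls` is a hypothetical helper, and it assumes that no two choices become identical after cleaning.

```python
def strip_nulls(choices):
    """Hypothetical helper: map each cleaned string back to its original choice."""
    # Assumption: cleaned strings stay unique, otherwise later entries overwrite earlier ones.
    return {choice.replace("\x00", ""): choice for choice in choices}


ch = {'abc\x00', 'abcd:\x00', 'abc:\x00d'}
cleaned = strip_nulls(ch)

# The keys can be passed as choices; the values recover the original strings.
original_of_abc = cleaned["abc"]
```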

Implement process.extract in C++

There are a couple of performance improvements that can be achieved by implementing process.extract in C++:

Smaller overhead than Python function calls
Improvements to scorers that are already implemented for process.extractOne, so e.g. for fuzz.token_sort_ratio the query is only sorted once
Maintain a sorted vector of size limit with all results while extracting results. This way the last element can be used as score_cutoff. This is very slow using Python lists, but might bring performance improvements when implemented in C++.
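
The third point can be illustrated in Python, even though the performance benefit is expected from the C++ version. The sketch below swaps the sorted vector for a min-heap of size limit, whose smallest element plays the role of the last element and is fed back as score_cutoff. `extract_top` and `simple_ratio` are hypothetical names, and the scorer is a difflib stand-in that returns 0 below the cutoff.

```python
import heapq
from difflib import SequenceMatcher


def simple_ratio(s1, s2, score_cutoff=0.0):
    """Stand-in scorer (assumption): returns 0 when the score is below score_cutoff."""
    score = 100.0 * SequenceMatcher(None, s1, s2).ratio()
    return score if score >= score_cutoff else 0.0


def extract_top(query, choices, limit=3, scorer=simple_ratio):
    """Hypothetical helper: keep the `limit` best results (assumes limit >= 1)."""
    heap = []     # min-heap of (score, -index, choice); heap[0] is the current worst
    cutoff = 0.0
    for index, choice in enumerate(choices):
        score = scorer(query, choice, score_cutoff=cutoff)
        entry = (score, -index, choice)  # -index keeps the earlier choice on ties
        if len(heap) < limit:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)
        if len(heap) == limit:
            cutoff = heap[0][0]  # the worst kept score becomes the new cutoff
    return [(choice, score) for score, _, choice in sorted(heap, reverse=True)]


top = extract_top("apple", ["apply", "banana", "apple", "ape"], limit=2)
```

Once the heap is full, a scorer that honours score_cutoff can stop early on any choice that cannot beat the current worst result.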

process_extract from rapidfuzz taking a long time

I'm trying to use process.extract from rapidfuzz, but it takes a very long time to run, which makes me think something is wrong. I'm sharing the code below so you can tell me whether there is a mistake that makes it so slow:

name = []
match_to= []
no_match= []
similarity= []

for i in range(len(data)):
       if (data.index in mapping.index):
               match= process.extract(data['TEXT_name'], mapping['Description_name'].str.split(' '),
                                      scorer= fuzz.partial_ratio, limit= 3)
               if(match[0][1] > 75):
                   try:
                       name.append(data['TEXT_name'])
                       match_to.append(match[0][0])

                   except:
                       no_match.append(['TEXT_name'])
                       similarity.append(match[0][1])
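
A few things in the loop above look likely to cause the slowdown: data['TEXT_name'] passes the whole column as the query on every iteration, the choices are split into lists of words, and the index check does not depend on i. The sketch below shows the intended structure with plain lists and one query string per iteration. It assumes both columns hold strings (converted with something like .tolist()), and `match_names` and `simple_ratio` are hypothetical stand-ins rather than RapidFuzz functions.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Stand-in scorer (assumption): difflib similarity scaled to 0..100."""
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


def match_names(names, descriptions, scorer=simple_ratio, threshold=75):
    """Hypothetical helper: best description per name, split by threshold."""
    matched, unmatched = [], []
    for name in names:  # one string per iteration, not the whole column
        best, best_score = None, 0.0
        for description in descriptions:  # choices stay plain strings
            score = scorer(name, description)
            if score > best_score:
                best, best_score = description, score
        if best_score > threshold:
            matched.append((name, best, best_score))
        else:
            unmatched.append(name)
    return matched, unmatched


matched, unmatched = match_names(["red apple", "zzz"], ["red apples", "green pear"])
```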

Add support to release aarch64 wheels

Problem

On aarch64, pip install RapidFuzz builds the wheel from source and then installs it. This requires the user to have a development environment installed on their system. It also takes more time to build the wheel than to download and extract a prebuilt wheel from PyPI.

Resolution

On aarch64, pip install RapidFuzz should download the wheels from PyPI.

@maxbachmann, please let me know if you are interested in releasing aarch64 wheels. I can help with this.