gandersen101 / spaczz
Fuzzy matching and more functionality for spaCy.
License: MIT License
This is not a report, but a question: if I have the same token with two different labels, how will spaczz handle it? The question comes up because spaCy seems to pick the label unpredictably: explosion/spaCy#6752
Questions:
a) Is it possible to get both matches somehow? I'm interested in getting a list of all matches of a LABEL sometimes, and the "best ones" in other cases, for some definition of BEST :)
b) If I can't get both, is it possible to get a callback to decide myself what to do?
First of all, thanks for the library @gandersen101. I'm starting to use it and it's really powerful.
When using SpaczzRuler with fuzzy patterns, strings are compared case-insensitively by default. Is there a way of changing this behaviour?
Similarly, is there a way of comparing strings without taking accents into account? That is, making "test" equivalent to "tést". It could be hacked by swapping the string for an accent-stripped version of it (since that maintains the token structure), but maybe there is an easier way.
import sys
import spacy
import spaczz
from spaczz.pipeline import SpaczzRuler

print(f"{sys.version = }")
print(f"{spacy.__version__ = }")
print(f"{spaczz.__version__ = }")

nlp = spacy.blank("en")
fuzzy_ruler = SpaczzRuler(nlp, name="test_ruler")
fuzzy_ruler.add_patterns([{"label": "TEST",
                           "pattern": "test",
                           "type": "fuzzy"}])

doc = fuzzy_ruler(nlp("this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst"))
print(f"\nText:\n{doc}\n")

print("Fuzzy Matches:")
for ent in doc.ents:
    if ent._.spaczz_type == "fuzzy":
        print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))
Output
sys.version = '3.9.0 (default, Nov 15 2020, 06:25:35) \n[Clang 10.0.0 ]'
spacy.__version__ = '3.0.6'
spaczz.__version__ = '0.5.2'

Text:
this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst

Fuzzy Matches:
('test', 3, 4, 'TEST', 100)
('TEST', 9, 10, 'TEST', 100)
('tast', 13, 14, 'TEST', 75)
('TesT', 18, 19, 'TEST', 100)
('tést', 20, 21, 'TEST', 75)
('tëst', 22, 23, 'TEST', 75)
My task is querying medical texts for institute names using a rule as below:
[{'ENT_TYPE': 'institute_name'}, {'TEXT': 'Hospital'}]
The rule will extract the hospital name only if it ends with the word 'Hospital', including for example "Mount Sinai Hospital" but excluding "Mount Sinai".
spaczz works great for a single term or phrase, but I did not see an option to build multi-word rules like the one above.
Can I use spaczz to identify typos for this entity, for example, "Mount Sinai Mospital"?
Add functionality to trim fuzzy match spans to keep unwanted tokens (e.g. spaces, punctuation, and/or stop words) from populating the start/end of a match span.
Is it possible to add an "id" to the pattern like you can with the original spaCy API? https://spacy.io/usage/rule-based-matching#entityruler-ent-ids
https://github.com/gandersen101/spaczz#spaczzruler
p = {
    "label": entity_type,
    "pattern": d[col],
    "id": d["id"],
    "type": "fuzzy"
}
Pattern: {'label': 'PERSON', 'pattern': 'DrDisrespect', 'id': '148265b0-847b-414c-9f8e-916561412c55', 'type': 'fuzzy'}
Text: "Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options"
This fails to find "Dr Disrespect", and I'm unable to print the id:
data = [{
    "label": ent.label_,
    "name": ent.text,
    "id": ent.ent_id_
} for ent in doc.ents]
print(data)
I need this feature because when I find an entity I need to match it to an "id" stored in my database.
Aside: is it also possible to limit the fuzzy search edit distance to 1 character?
Need to develop a way to return match quality details (fuzzy ratios and fuzzy regex counts) from TokenMatcher matches. I currently only do the fuzzy matching of token patterns in spaczz before dropping the fuzzy-matched patterns into spaCy's Matcher. While utilizing spaCy's Matcher means less work on my end and better performance, I don't have an easy way to map the fuzzy details back to the Matcher matches.
How to reproduce: fuzzy pattern with "Goldriesling", "Riesling", default fuzzy_func. Search on a phrase like "They sell many Rieslings".
Found: "Goldriesling". Expected: "Riesling".
I have a custom component in a spaCy pipeline where I am using the FuzzyMatcher. The tutorial does a good job showing how to implement a threshold using the spaczz_ruler but it is less clear how to do this using the FuzzyMatcher. I am struggling to implement a threshold system where a user can configure a threshold. The following system is not effective. What would be a better way to implement a threshold in this design pattern?
from spacy.language import Language
from spacy.tokens import Span, Token
from spaczz.matcher import FuzzyMatcher


class TermPipeline:
    # NOTE: the original snippet subclassed a `Component` base whose import
    # was not shown; a plain class works as a spaCy v3 pipeline component.
    def __init__(self, nlp, term_list, fuzzy_threshold=75):
        self.fuzzy_threshold = fuzzy_threshold
        self.term_list = term_list
        self.label_hash = nlp.vocab.strings['TERM_MATCH']
        self.matcher = FuzzyMatcher(nlp.vocab, attr="LEMMA")
        Token.set_extension("parent_term", force=True, default=None)
        # Creates the term word patterns and adds them to the matcher.
        if isinstance(self.term_list[0], dict):
            patterns = [nlp(text['term']) for text in term_list]
        else:
            patterns = [nlp(text) for text in term_list]
        # add_event_ent is defined elsewhere in the original component.
        self.matcher.add('TerminologyList', on_match=self.add_event_ent, patterns=patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        if isinstance(self.term_list[0], dict):
            for _, start, end, ratio in matches:
                if self.fuzzy_threshold <= ratio:
                    entity = Span(doc, start, end, label=self.label_hash)
                    for term in self.term_list:
                        if term['term'].lower() == entity.text.lower():
                            doc[start]._.set('parent_term', term['parent_term'])
        return doc


@Language.factory("term_component", default_config={})
def create_term_component(nlp, name, term_list, fuzzy_threshold):
    return TermPipeline(nlp, term_list, fuzzy_threshold)
NAME Grint Anderson 86
GPE Nashv1le 82
@gandersen101 Both 'Grint' and 'Nashv1le' are from the text, and their corresponding dictionary items are "Grant Andersen" and "Nashville". Is there a way to know which dictionary item matched the text? For example, if I want to find misspellings, how do I know whose misspelling 'Grint' is?
I tried pip install spaczz, but:
clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -Iextern/rapidfuzz-cpp/ -Icapi/ -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c src/cpp_process.cpp -o build/temp.macosx-10.9-universal2-cpython-311/src/cpp_process.o -O3 -std=c++14 -Wextra -Wall -Wconversion -g0 -DNDEBUG -stdlib=libc++ -mmacosx-version-min=10.9 -DVERSION_INFO="1.9.1"
src/cpp_process.cpp:253:12: fatal error: 'longintrepr.h' file not found
#include "longintrepr.h"
^~~~~~~~~~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for rapidfuzz
Failed to build rapidfuzz
ERROR: Could not build wheels for rapidfuzz, which is required to install pyproject.toml-based projects
Refactoring code to better align with long-term plans of further integrating with spaCy vocabs and eventually implementing pieces in Cython.
Add Windows testing to GitHub actions and modify noxfile.py to support Windows.
I am able to get this regex working using the code below.
import spacy
from spaczz.matcher import RegexMatcher

nlp = spacy.blank("en")
text = "Hello how are you? Proficiency in ETL tools like Informatica, Talend, Alteryx and Visualization tools like PowerBi, Tableau and Qlikview"
doc = nlp(text)

matcher = RegexMatcher(nlp.vocab)
matcher.add(
    "SKILL",
    [r"""(?i)proficiency in ([\w\s]+) tools like (.*$)"""],
)
matches = matcher(doc)
for match_id, start, end, counts in matches:
    print(match_id, doc[start:end], counts)
And I get the matched sentence as output, as expected.
However, I am unsure if there is a way to access the match captures ([\w\s]+) and (.*$). Looking for any suggestions or advice.
Once I get the matched result/sentence, I would like to access the captures "ETL" and "Informatica, Talend, Alteryx and Visualization tools like PowerBi, Tableau and Qlikview".
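One workaround sketch: spaczz hands back the span boundaries, so re-running the same pattern with Python's re module on the matched span's text exposes the capture groups (this assumes the pattern matches the span it produced, which should hold by construction):

```python
import re

# Same pattern as in the spaczz matcher above.
pattern = re.compile(r"(?i)proficiency in ([\w\s]+) tools like (.*$)")

# In practice this would be doc[start:end].text from the spaczz match;
# it is inlined here to keep the sketch self-contained.
span_text = ("Proficiency in ETL tools like Informatica, Talend, Alteryx "
             "and Visualization tools like PowerBi, Tableau and Qlikview")

m = pattern.search(span_text)
if m:
    print(m.group(1))  # "ETL"
    print(m.group(2))  # the remaining tools list
```

Note that [\w\s]+ cannot cross a comma, which is why group(1) stops at "ETL" rather than greedily running to the second "tools like".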
Add Python 3.9 into the nox build process. Don't anticipate issues but want to have it covered.
Hi,
A very useful feature would be to have access to the original pattern matched by the SpaczzRuler, because when similar patterns are added there may be doubt about which pattern was actually matched. I guess this issue connects to a potential link with spaCy knowledge base IDs.
Thank you
The fuzzy matcher is unable to process matcher(doc). It was working 48 hours ago; it's not working now.
File "xxxxxxxxxxxxxxxxxxxxxxxxxx", line 46, in pattern_matcher
    matched_by_fuzzy_phrase = matcher_fuzzy(doc)
File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/matcher/fuzzymatcher.py", line 105, in __call__
    matches_wo_label = self.match(doc, pattern, **kwargs)
File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 216, in match
    matches_w_nones = [
File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 217, in <listcomp>
    self._adjust_left_right_positions(
File "/home/aravind/nlu_endpoint/NLUSQL_ENV3/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 326, in _adjust_left_right_positions
    r = self.compare(query.text, doc[bp_l:bp_r].text, fuzzy_func, ignore_case)
File "span.pyx", line 503, in spacy.tokens.span.Span.text.get
File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__
IndexError: [E201] Span index out of range.
Fuzzy matching currently provided by fuzzywuzzy should be switched to rapidfuzz. The latter has a more liberal license and runs faster.
Hello, I am really liking spaczz for fuzzy-matching entity patterns.
Quick question: is there a way to also add, for example, POS-tagging constraints? For example: I want to extract only noun phrases of "AS", but the fuzzy match is also getting me "as" from "as above function".
('i' below is each string from the list of vocab terms to match)
{'label': "ECHO", 'pattern': [{'TEXT': i, 'POS': 'NOUN'}], 'type': 'fuzzy'}
Currently spaczz does not install alongside the new spaCy 3 release. Attempting to upgrade a project to spaCy 3 while also using spaczz gives the following Pip error message:
The conflict is caused by:
    The user requested spacy==3.0.3
    spaczz 0.4.1 depends on spacy<3.0.0 and >=2.2.0
Extend API to allow for adding/removing user-defined predefined regexes and fuzzy matchers.
Currently rapidfuzz is required in versions >=1.0.0 and <2.0.0 in pyproject.toml:
rapidfuzz = "^1"
However, rapidfuzz is currently available in version 2.15.1. This breaks other packages or installations which require a more recent version of rapidfuzz, directly or indirectly.
After updating to:
rapidfuzz = ">=1.0.0"
all unit tests still pass.
Is there a reason why versions 2.x are held back?
Would it be possible to use this matching library for some smarter phrase search that would take spaCy's word vectors into consideration?
For instance, if I create a matcher object like this:
import spacy
import spaczz
nlp = spacy.load('en_core_web_lg')
matcher = spaczz.matcher.FuzzyMatcher(nlp.vocab)
matcher.add("my_phrase", [nlp('humorous story')])
Then it would perhaps be interesting to also see a match for a sentence like this:
matcher(nlp('He told me a very funny story.'))
which contains the sub-phrase "funny story", a synonym of the phrase "humorous story" that we added to the matcher.
Hi,
I could not find a way to set the various matching parameters using spaCy 2; the SpaczzRuler documentation (https://github.com/gandersen101/spaczz/blob/master/examples/fuzzy_matching_tweaks.md) only shows the spaCy 3 syntax.
I also found #18 and tried to apply it using the following:
ruler = SpaczzRuler(self.nlp, spaczz_fuzzy_defaults={'min_r2': 98, 'min_r1': 95, 'flex': 2})
but changing the parameters didn't seem to change anything.
Is that correct, or am I missing something?
Thank you for the cool library!
There are some redundant methods in the fuzzy searcher and the naming of methods should be updated.
In order to sort SpaczzRuler matches across the different matchers by quality, I need a method or methods for comparing fuzzy ratios (ints between 0 and 100) and fuzzy regex counts (tuples of insertion, deletion and substitution counts). Standardizing the fuzzy regex counts back to a ratio is probably the most effective method at first thought.
Is it possible to get ratio (some sort of distance/similarity score) for each match?
Similar to another post, but it still doesn't install even after successfully installing RapidFuzz separately.
This is what I get:
cl : Command line warning D9025 : overriding '/W3' with '/W4'
cpp_process.cpp
src/cpp_process.cpp(16): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.34.31933\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for rapidfuzz
Failed to build rapidfuzz
ERROR: Could not build wheels for rapidfuzz, which is required to install pyproject.toml-based projects
Is it possible to restrict the fuzzy search? In my example it is returning unwanted entities.
Text: "Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options" (spelling errors intentional)
patterns = [
    {'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
    {'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
]
('Dr Disrespect', 'PERSON')
('YouTube', 'PERSON')
The unwanted entity here is ('YouTube', 'PERSON'), is there some way to restrict the fuzzy search so that it does not identify YouTube in the text to be a person?
Full Code:
import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank("en")
text = """Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options"""  # Spelling errors intentional.
doc = nlp(text)

patterns = [
    {'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
    {'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
]

ruler = SpaczzRuler(nlp)
ruler.add_patterns(patterns)
doc = ruler(doc)

data = [{
    "label": ent.label_,
    "name": ent.text,
} for ent in doc.ents]

for ent in doc.ents:
    print((ent.text, ent.label_))
EDIT:
I noticed the rapidfuzz library provides a score_cutoff parameter. I'm looking to set this to 95 so it's strict. I was hoping something like this could be exposed.
Running my tests with spaczz@master, they seem to get into an infinite loop at the nlp() call. Stack dumps:
File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
    matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
    matches_w_nones = [
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
    self._optimize(
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 217, in _optimize
    r = self.compare(query, doc[bp_l:bp_r], *args, **kwargs)
File "doc.pyx", line 308, in spacy.tokens.doc.Doc.__getitem__
File "/usr/lib64/python3.8/site-packages/spacy/util.py", line 491, in normalize_slice
    if not (step is None or step == 1):
Another Ctrl-C during another run:
    self._doc = nlp(text)
File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
    matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
    matches_w_nones = [
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
    self._optimize(
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 205, in _optimize
    rl = self.compare(query, doc[p_l : p_r - f], *args, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/fuzzysearcher.py", line 109, in compare
    return round(self._fuzzy_funcs.get(fuzzy_func)(a_text, b_text))
There are 1 million patterns I am trying to add. On adding them to a blank spaCy model:

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank('en')
spaczz_ruler = SpaczzRuler(nlp)
spaczz_ruler = nlp.add_pipe("spaczz_ruler")  # spaCy v3 syntax
spaczz_ruler.add_patterns(patterns)

it takes 8 GB of RAM and inference time is around 28 seconds.
If I try to add the SpaczzRuler to the current ner pipeline using
spaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner")  # spaCy v3 syntax
it takes a lot of RAM and time; it fails even with 32 GB of RAM.
patterns = [{
    "label": "NAME",
    "pattern": "Grant Andersen",
    "type": "fuzzy",
    "kwargs": {"min_r2": 90}
}]
Matchers could benefit from inheriting from a base class and the searchers used in them should be composed rather than inherited.
I'm using the SpaczzRuler pipeline as specified here to detect companies based on patterns. It is a very simple pipeline in which I'm trying to match uppercase tokens, but when using the + operator it matches only one uppercase token, not as many as possible. The documentation says:
+ | Require the pattern to match 1 or more times.
However, using the * operator it does match all possible tokens, as expected.
import spacy
from spaczz.pipeline import SpaczzRuler

model = spacy.blank('es')
spaczz_ruler = SpaczzRuler(model)
spaczz_ruler.add_patterns([
    {"label": "COMPANY", 'pattern': [
        {"IS_UPPER": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "?"},
        {"TEXT": {"REGEX": "S\.\s?[A-Z]\.?\s?[A-Z]?\.?"}},
        {"IS_PUNCT": True, "OP": "?"}],
     "type": "token", "id": "COMPANY SL"}
])
model.add_pipe(spaczz_ruler)
doc = model("My company is called LARGO AND MARMG S.L.")
print(doc.ents)
# (MARMG S.L.,)
model = spacy.blank('es')
spaczz_ruler = SpaczzRuler(model)
spaczz_ruler.add_patterns([
    {"label": "COMPANY", 'pattern': [
        {"IS_UPPER": True, "OP": "*"}, {"IS_PUNCT": True, "OP": "?"},
        {"TEXT": {"REGEX": "S\.\s?[A-Z]\.?\s?[A-Z]?\.?"}},
        {"IS_PUNCT": True, "OP": "?"}],
     "type": "token", "id": "COMPANY SL"}
])
model.add_pipe(spaczz_ruler)
doc = model("My company is called LARGO AND MARMG S.L.")
print(doc.ents)
# (LARGO AND MARMG S.L.,)
First of all, I really appreciate your work and time.
With a small number of input patterns it does a good job, but when the input patterns cross 100,000 (1 lakh), it takes too much time. Is there any possibility of speeding up the process (maybe using a GPU)?
Add pipeline component to "clean" entities after setting (primarily intended for spaczz entities). I.e. if punctuation is included at the start/end of a fuzzy matched entity the span can be trimmed to remove punctuation.
Will also require registering a custom span attribute on entities created through spaczz.
Build out Read the Docs .rst documentation for comprehensive details.
Thanks a lot for your fabulous package; it is really helpful. However, when I tried to reproduce your results, I ran into this error:
"UserWarning: [W036] The component 'matcher' does not have any patterns defined. matches = matcher(doc)"
Specifically, I used this code snippet:
import spacy
from spacy.pipeline import EntityRuler
from spaczz.pipeline import SpaczzRuler

nlp = spacy.load("en_core_web_sm")
entity_ruler = nlp.add_pipe("entity_ruler", before="ner")
entity_ruler.add_patterns(
    [{"label": "GPE", "pattern": "Nashville"},
     {"label": "GPE", "pattern": "TN"}]
)

spaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner")  # spaCy v3 syntax
spaczz_ruler.add_patterns(
    [
        {
            "label": "NAME",
            "pattern": "Grant Andersen",
            "type": "fuzzy",
            "kwargs": {"fuzzy_func": "token_sort"},
        },
    ]
)
When I add patterns to spaCy's EntityRuler, it works OK; when I add patterns to the SpaczzRuler, it throws the error specified above.
I am using Python 3.9 and spaCy 3.2.2 on Ubuntu 16.04.
Any help will be highly appreciated.
There is a lot of redundancy between fuzzy search and similarity search - they both essentially use the same search algorithm. They should inherit the same base and only make the small tweaks necessary.