gandersen101 / spaczz
Fuzzy matching and more functionality for spaCy.
License: MIT License
This is not a report, but a question: if I have the same token with two different labels, how will spaczz handle it? The question comes up because spaCy seems to pick the label unpredictably: explosion/spaCy#6752
Questions:
a) Is it possible to get both matches somehow? I'm interested in getting a list of all matches of a LABEL sometimes, and the "best ones" in other cases, for some definition of BEST :)
b) If I can't get both, is it possible to get a callback to decide myself what to do?
First of all, thanks for the library @gandersen101. I'm starting to use it and it's really powerful.
When using SpaczzRuler with fuzzy patterns, strings are compared case-insensitively by default. Is there a way of changing this behaviour?
Similarly, is there a way of comparing strings without taking accents into account? That is, making "test" equivalent to "tést". It could be hacked by swapping the string for an accent-stripped version of it (since that maintains the token structure), but maybe there is an easier way.
import sys
import spacy
import spaczz
from spaczz.pipeline import SpaczzRuler

print(f"{sys.version = }")
print(f"{spacy.__version__ = }")
print(f"{spaczz.__version__ = }")

nlp = spacy.blank("en")
fuzzy_ruler = SpaczzRuler(nlp, name="test_ruler")
fuzzy_ruler.add_patterns([{"label": "TEST",
                           "pattern": "test",
                           "type": "fuzzy"}])

doc = fuzzy_ruler(nlp("this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst"))
print(f"\nText:\n{doc}\n")

print("Fuzzy Matches:")
for ent in doc.ents:
    if ent._.spaczz_type == "fuzzy":
        print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))
Output
sys.version = '3.9.0 (default, Nov 15 2020, 06:25:35) \n[Clang 10.0.0 ]'
spacy.__version__ = '3.0.6'
spaczz.__version__ = '0.5.2'

Text:
this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst

Fuzzy Matches:
('test', 3, 4, 'TEST', 100)
('TEST', 9, 10, 'TEST', 100)
('tast', 13, 14, 'TEST', 75)
('TesT', 18, 19, 'TEST', 100)
('tést', 20, 21, 'TEST', 75)
('tëst', 22, 23, 'TEST', 75)
My task is querying medical texts for institute names using a rule as below:
[{'ENT_TYPE': 'institute_name'}, {'TEXT': 'Hospital'}]
The rule will extract the hospital name only if it ends with the word 'Hospital', including for example "Mount Sinai Hospital" but excluding "Mount Sinai".
spaczz works great for a single term or phrase, but I did not see an option to build multi-word rules like the one above.
Can I use spaczz to identify typos for this entity, for example, "Mount Sinai Mospital"?
Add functionality to trim fuzzy match spans to keep unwanted tokens (e.g. spaces, punctuation, and/or stop words) from populating the start/end of a match span.
Is it possible to add an "id" to the pattern like you can with the original spaCy API? https://spacy.io/usage/rule-based-matching#entityruler-ent-ids
https://github.com/gandersen101/spaczz#spaczzruler
p = {
    "label": entity_type,
    "pattern": d[col],
    "id": d["id"],
    "type": "fuzzy"
}
Pattern: {'label': 'PERSON', 'pattern': 'DrDisrespect', 'id': '148265b0-847b-414c-9f8e-916561412c55', 'type': 'fuzzy'}
Text: "Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options"
This fails to find "Dr Disrespect", and I'm unable to print the id:
data = [{
    "label": ent.label_,
    "name": ent.text,
    "id": ent.ent_id_
} for ent in doc.ents]
print(data)
I need this feature because when I find an entity I need to match it to an "id" stored in my database.
Aside: is it also possible to limit the fuzzy search edit distance to 1 character?
Need to develop a way to return match quality details (fuzzy ratios and fuzzy regex counts) from TokenMatcher matches. I currently only do the fuzzy matching of token patterns in spaczz before dropping the fuzzy-matched patterns into spaCy's Matcher. While utilizing spaCy's Matcher means less work on my end and better performance, I don't have an easy way to map the fuzzy details back to the Matcher matches.
How to reproduce: fuzzy pattern with "Goldriesling", "Riesling", default fuzzy_func. Search on a phrase like "They sell many Rieslings".
Found: "Goldriesling". Expected: "Riesling".
I have a custom component in a spaCy pipeline where I am using the FuzzyMatcher. The tutorial does a good job showing how to implement a threshold using the spaczz_ruler but it is less clear how to do this using the FuzzyMatcher. I am struggling to implement a threshold system where a user can configure a threshold. The following system is not effective. What would be a better way to implement a threshold in this design pattern?
from spacy.language import Language
from spacy.tokens import Span, Token
from spaczz.matcher import FuzzyMatcher


class TermPipeline:
    # NOTE: the original snippet subclassed a `Component` base whose import
    # was not shown; a plain class works as a spaCy v3 pipeline component.
    def __init__(self, nlp, term_list, fuzzy_threshold=75):
        self.fuzzy_threshold = fuzzy_threshold
        self.term_list = term_list
        self.label_hash = nlp.vocab.strings['TERM_MATCH']
        self.matcher = FuzzyMatcher(nlp.vocab, attr="LEMMA")
        Token.set_extension("parent_term", force=True, default=None)
        # Creates the term word patterns and adds them to the matcher.
        if isinstance(self.term_list[0], dict):
            patterns = [nlp(text['term']) for text in term_list]
        else:
            patterns = [nlp(text) for text in term_list]
        # add_event_ent is defined elsewhere in the original component.
        self.matcher.add('TerminologyList', on_match=self.add_event_ent, patterns=patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        if isinstance(self.term_list[0], dict):
            for _, start, end, ratio in matches:
                if self.fuzzy_threshold <= ratio:
                    entity = Span(doc, start, end, label=self.label_hash)
                    for term in self.term_list:
                        if term['term'].lower() == entity.text.lower():
                            doc[start]._.set('parent_term', term['parent_term'])
        return doc


@Language.factory("term_component", default_config={})
def create_term_component(nlp, name, term_list, fuzzy_threshold):
    return TermPipeline(nlp, term_list, fuzzy_threshold)
NAME Grint Anderson 86
GPE Nashv1le 82
@gandersen101 Both 'Grint' and 'Nashv1le' are from the text, and their corresponding dictionary items are "Grant Andersen" and "Nashville". Is there a way to know which dictionary item matched the text? For example, if I want to find misspellings, how do I know whose misspelling 'Grint' is?
I tried pip install spaczz, but:
clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -Iextern/rapidfuzz-cpp/ -Icapi/ -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c src/cpp_process.cpp -o build/temp.macosx-10.9-universal2-cpython-311/src/cpp_process.o -O3 -std=c++14 -Wextra -Wall -Wconversion -g0 -DNDEBUG -stdlib=libc++ -mmacosx-version-min=10.9 -DVERSION_INFO="1.9.1"
src/cpp_process.cpp:253:12: fatal error: 'longintrepr.h' file not found
#include "longintrepr.h"
^~~~~~~~~~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for rapidfuzz
Failed to build rapidfuzz
ERROR: Could not build wheels for rapidfuzz, which is required to install pyproject.toml-based projects
Refactoring code to better align with long-term plans of further integrating with spaCy vocabs and eventually implementing pieces in Cython.
Add Windows testing to GitHub actions and modify noxfile.py to support Windows.
I am able to get this regex working using the code below.
import spacy
from spaczz.matcher import RegexMatcher

nlp = spacy.blank("en")
text = "Hello how are you? Proficiency in ETL tools like Informatica, Talend, Alteryx and Visualization tools like PowerBi, Tableau and Qlikview"
doc = nlp(text)

matcher = RegexMatcher(nlp.vocab)
matcher.add(
    "SKILL",
    [r"""(?i)proficiency in ([\w\s]+) tools like (.*$)"""],
)
matches = matcher(doc)
for match_id, start, end, counts in matches:
    print(match_id, doc[start:end], counts)
And I get the matched sentence as output, as expected.
However, I am unsure if there is a way to access the match captures ([\w\s]+) and (.*$). Looking for any suggestions or advice.
Once I get the matched result/sentence, I would like to access the captures "ETL" and "Informatica, Talend, Alteryx and Visualization tools like PowerBi, Tableau and Qlikview".
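One workaround sketch: spaczz hands back the span boundaries, so re-running the same pattern with Python's re module on the matched span's text exposes the capture groups (this assumes the pattern matches the span it produced, which should hold by construction):

```python
import re

# Same pattern as in the spaczz matcher above.
pattern = re.compile(r"(?i)proficiency in ([\w\s]+) tools like (.*$)")

# In practice this would be doc[start:end].text from the spaczz match;
# it is inlined here to keep the sketch self-contained.
span_text = ("Proficiency in ETL tools like Informatica, Talend, Alteryx "
             "and Visualization tools like PowerBi, Tableau and Qlikview")

m = pattern.search(span_text)
if m:
    print(m.group(1))  # "ETL"
    print(m.group(2))  # the remaining tools list
```

Note that [\w\s]+ cannot cross a comma, which is why group(1) stops at "ETL" rather than greedily running to the second "tools like".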
Add Python 3.9 into the nox build process. Don't anticipate issues but want to have it covered.
Hi,
A very useful feature would be to have access to the original pattern matched by the SpaczzRuler, because when similar patterns are added there may be doubt about which pattern was actually matched. I guess this issue connects to a potential link with spaCy knowledge base IDs.
Thank you
The fuzzy matcher is unable to process matcher(doc). It was working 48 hours ago; it's not working now.
File "xxxxxxxxxxxxxxxxxxxxxxxxxx", line 46, in pattern_matcher
    matched_by_fuzzy_phrase = matcher_fuzzy(doc)
File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/matcher/fuzzymatcher.py", line 105, in __call__
    matches_wo_label = self.match(doc, pattern, **kwargs)
File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 216, in match
    matches_w_nones = [
File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 217, in <listcomp>
    self._adjust_left_right_positions(
File "/home/aravind/nlu_endpoint/NLUSQL_ENV3/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 326, in _adjust_left_right_positions
    r = self.compare(query.text, doc[bp_l:bp_r].text, fuzzy_func, ignore_case)
File "span.pyx", line 503, in spacy.tokens.span.Span.text.get
File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__
IndexError: [E201] Span index out of range.
Fuzzy matching currently provided by fuzzywuzzy should be switched to rapidfuzz. The latter has a more liberal license and runs faster.
Hello, I am really liking spaczz for fuzzy-matching entity patterns.
Quick question: is there a way to also add, for example, POS-tagging constraints? For example: I want to extract only noun phrases of "AS", but the fuzzy match is also getting me "as" from "as above function".
('i' below is each string from the list of vocab terms to match)
{'label': "ECHO", 'pattern': [{'TEXT': i, 'POS': 'NOUN'}], 'type': 'fuzzy'}
Currently spaczz does not install alongside the new spaCy 3 release. Attempting to upgrade a project to spaCy 3 while also using spaczz gives the following Pip error message:
The conflict is caused by:
    The user requested spacy==3.0.3
    spaczz 0.4.1 depends on spacy<3.0.0 and >=2.2.0
Extend API to allow for adding/removing user-defined predefined regexes and fuzzy matchers.
Currently rapidfuzz is required in versions >=1.0.0 and <2.0.0 in pyproject.toml:
rapidfuzz = "^1"
However, rapidfuzz is currently available in version 2.15.1. This breaks other packages or installations which require a more recent version of rapidfuzz, directly or indirectly.
After updating to:
rapidfuzz = ">=1.0.0"
all unit tests still pass.
Is there a reason why versions 2.x are held back?
Would it be possible to use this matching library for some smarter phrase search that would take spaCy's word vectors into consideration?
For instance, if I create a matcher object like this:
import spacy
import spaczz
nlp = spacy.load('en_core_web_lg')
matcher = spaczz.matcher.FuzzyMatcher(nlp.vocab)
matcher.add("my_phrase", [nlp('humorous story')])
Then it would perhaps be interesting to also see a match for a sentence like this:
matcher(nlp('He told me a very funny story.'))
which contains the sub-phrase "funny story", a synonym of the phrase "humorous story" that we added to the matcher.
Hi,
I could not find a way to set the various matching parameters using spaCy 2; the SpaczzRuler documentation (https://github.com/gandersen101/spaczz/blob/master/examples/fuzzy_matching_tweaks.md) only shows the spaCy 3 syntax.
I also found #18 and tried to apply it using the following:
ruler = SpaczzRuler(self.nlp, spaczz_fuzzy_defaults={'min_r2': 98, 'min_r1': 95, 'flex': 2})
but changing the parameters didn't seem to change anything.
Is that correct, or am I missing something?
Thank you for the cool library!
There are some redundant methods in the fuzzy searcher and the naming of methods should be updated.
In order to sort SpaczzRuler matches across the different matchers by quality, I need a method or methods for comparing fuzzy ratios (ints between 0 and 100) and fuzzy regex counts (tuples of insertion, deletion and substitution counts). Standardizing the fuzzy regex counts back to a ratio is probably the most effective method at first thought.
Is it possible to get ratio (some sort of distance/similarity score) for each match?
Similar to another post, but it still doesn't install even after successfully installing RapidFuzz separately.
This is what I get:
cl : Command line warning D9025 : overriding '/W3' with '/W4'
cpp_process.cpp
src/cpp_process.cpp(16): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.34.31933\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for rapidfuzz
Failed to build rapidfuzz
ERROR: Could not build wheels for rapidfuzz, which is required to install pyproject.toml-based projects
Is it possible to restrict the fuzzy search? In my example it is returning unwanted entities.
Text: "Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options" (spelling errors intentional)
patterns = [
    {'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
    {'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
]
('Dr Disrespect', 'PERSON')
('YouTube', 'PERSON')
The unwanted entity here is ('YouTube', 'PERSON'), is there some way to restrict the fuzzy search so that it does not identify YouTube in the text to be a person?
Full Code:
import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank("en")
text = """Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options"""  # Spelling errors intentional.
doc = nlp(text)

patterns = [
    {'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
    {'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
]

ruler = SpaczzRuler(nlp)
ruler.add_patterns(patterns)
doc = ruler(doc)

data = [{
    "label": ent.label_,
    "name": ent.text,
} for ent in doc.ents]

for ent in doc.ents:
    print((ent.text, ent.label_))
EDIT:
I noticed the rapidfuzz library provides a score_cutoff parameter. I'm looking to set this to 95 so it's strict. I was hoping something like this could be exposed.
Running my tests with spaczz@master, they seem to get into an infinite loop at the nlp() call. Stack dumps:
File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
    matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
    matches_w_nones = [
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
    self._optimize(
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 217, in _optimize
    r = self.compare(query, doc[bp_l:bp_r], *args, **kwargs)
File "doc.pyx", line 308, in spacy.tokens.doc.Doc.__getitem__
File "/usr/lib64/python3.8/site-packages/spacy/util.py", line 491, in normalize_slice
    if not (step is None or step == 1):
Another Ctrl-C during another run:
    self._doc = nlp(text)
File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
    matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
    matches_w_nones = [
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
    self._optimize(
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 205, in _optimize
    rl = self.compare(query, doc[p_l : p_r - f], *args, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/fuzzysearcher.py", line 109, in compare
    return round(self._fuzzy_funcs.get(fuzzy_func)(a_text, b_text))
There are 1 million patterns I am trying to add. On adding them to a blank spaCy model:

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank('en')
spaczz_ruler = SpaczzRuler(nlp)
spaczz_ruler = nlp.add_pipe("spaczz_ruler")  # spaCy v3 syntax
spaczz_ruler.add_patterns(patterns)

it takes 8 GB of RAM and inference time is around 28 seconds.
If I try to add the SpaczzRuler to the current ner pipeline using
spaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner")  # spaCy v3 syntax
it takes a lot of RAM and time; it fails even with 32 GB of RAM.
patterns = [{
    "label": "NAME",
    "pattern": "Grant Andersen",
    "type": "fuzzy",
    "kwargs": {"min_r2": 90}
}]
Matchers could benefit from inheriting from a base class and the searchers used in them should be composed rather than inherited.
I'm using the SpaczzRuler pipeline as specified here to detect companies based on patterns. It is a very simple pipeline in which I'm trying to match uppercase tokens, but when using the + operator it matches only one uppercase token, not as many as possible. The documentation says:
+ | Require the pattern to match 1 or more times.
However, using the * operator it does match all possible tokens, as expected.
import spacy
from spaczz.pipeline import SpaczzRuler

model = spacy.blank('es')
spaczz_ruler = SpaczzRuler(model)
spaczz_ruler.add_patterns([
    {"label": "COMPANY", 'pattern': [
        {"IS_UPPER": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "?"},
        {"TEXT": {"REGEX": "S\.\s?[A-Z]\.?\s?[A-Z]?\.?"}},
        {"IS_PUNCT": True, "OP": "?"}],
     "type": "token", "id": "COMPANY SL"}
])
model.add_pipe(spaczz_ruler)
doc = model("My company is called LARGO AND MARMG S.L.")
print(doc.ents)
# (MARMG S.L.,)
model = spacy.blank('es')
spaczz_ruler = SpaczzRuler(model)
spaczz_ruler.add_patterns([
    {"label": "COMPANY", 'pattern': [
        {"IS_UPPER": True, "OP": "*"}, {"IS_PUNCT": True, "OP": "?"},
        {"TEXT": {"REGEX": "S\.\s?[A-Z]\.?\s?[A-Z]?\.?"}},
        {"IS_PUNCT": True, "OP": "?"}],
     "type": "token", "id": "COMPANY SL"}
])
model.add_pipe(spaczz_ruler)
doc = model("My company is called LARGO AND MARMG S.L.")
print(doc.ents)
# (LARGO AND MARMG S.L.,)
First of all, I really appreciate your work and time.
With a small number of input patterns it does a good job, but when the input patterns cross 100,000 (1 lakh), it takes too much time. Is there any possibility of speeding up the process (maybe using a GPU)?
Add pipeline component to "clean" entities after setting (primarily intended for spaczz entities). I.e. if punctuation is included at the start/end of a fuzzy matched entity the span can be trimmed to remove punctuation.
Will also require registering a custom span attribute on entities created through spaczz.
Build out Read the Docs .rst documentation for comprehensive details.
Thanks a lot for your fabulous package; it is really helpful. However, when I tried to reproduce your results, I ran into this error:
"UserWarning: [W036] The component 'matcher' does not have any patterns defined. matches = matcher(doc)"
Specifically, I used this code snippet:
import spacy
from spacy.pipeline import EntityRuler
from spaczz.pipeline import SpaczzRuler

nlp = spacy.load("en_core_web_sm")
entity_ruler = nlp.add_pipe("entity_ruler", before="ner")
entity_ruler.add_patterns(
    [{"label": "GPE", "pattern": "Nashville"},
     {"label": "GPE", "pattern": "TN"}]
)

spaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner")  # spaCy v3 syntax
spaczz_ruler.add_patterns(
    [
        {
            "label": "NAME",
            "pattern": "Grant Andersen",
            "type": "fuzzy",
            "kwargs": {"fuzzy_func": "token_sort"},
        },
    ]
)
When I add patterns to spaCy's EntityRuler, it works OK; when I add patterns to the SpaczzRuler, it throws the error specified above.
I am using Python 3.9 and spaCy 3.2.2 on Ubuntu 16.04.
Any help will be highly appreciated.
There is a lot of redundancy between fuzzy search and similarity search - they both essentially use the same search algorithm. They should inherit the same base and only make the small tweaks necessary.