Comments (6)
Can you add a code snippet that reproduces the behaviour?
from textdescriptives.
@dvirnimrod when I try to reproduce the stated behaviour, I get the following (Python 3.8):
import textdescriptives as td
td.__version__
# 2.8.0
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
docs = nlp.pipe(["lorem ipsum"])
doc = next(docs)
doc._.passed_quality_check
# False
doc._.quality
# QualityOutput(
# passed=False, ...
# contains={'lorem ipsum': ThresholdsOutput(value=1.0, passed=False, threshold=False)}, ...
Hi, thanks for the quick response!
Here's a code snippet for example:
import textdescriptives as td
import spacy
from spacy.cli import download
QUALITY_THRESHOLDS = td.QualityThresholds(
    n_stop_words=(None, None),
    alpha_ratio=(0.6, None),
    mean_word_length=(3, 10),
    doc_length=(1, 1000),
    symbol_to_word_ratio={"@": (None, 0.3)},
    proportion_ellipsis=(None, None),
    proportion_bullet_points=(None, 0.7),
    contains={"fake": False},
    duplicate_line_chr_fraction=(None, 0.2),
    duplicate_paragraph_chr_fraction=(None, 0.2),
    duplicate_ngram_chr_fraction={
        "5": (None, 0.15),
        "6": (None, 0.14),
        "7": (None, 0.13),
        "8": (None, 0.12),
        "9": (None, 0.11),
        "10": (None, 0.1),
    },
    top_ngram_chr_fraction={"2": (None, 0.2), "3": (None, 0.18), "4": (None, 0.16)},
    oov_ratio=(None, 0.3),
)
download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
quality_pipe.set_quality_thresholds(QUALITY_THRESHOLDS)
text = "This is fake @@@@@"
doc = nlp(text)
print(doc._.quality)
And here's the output:
passed=True
n_stop_words=ThresholdsOutput(value=2.0, passed=True, threshold=(None, None))
alpha_ratio=ThresholdsOutput(value=0.75, passed=True, threshold=(0.6, None))
mean_word_length=ThresholdsOutput(value=3.75, passed=True, threshold=(3.0, 10.0))
doc_length=ThresholdsOutput(value=4.0, passed=True, threshold=(1.0, 1000.0))
symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=None)}
proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, None))
proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.7))
contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=None)}
duplicate_line_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2))
duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2))
duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))}
top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.16))}
oov_ratio=ThresholdsOutput(value=0.25, passed=True, threshold=(None, 0.3))
As you can see, other attributes that I've set are updated to a new value (like "alpha_ratio" and "doc_length"), but the attributes "contains" and "symbol_to_word_ratio" haven't...
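To make the mismatch explicit, here is a small sketch that compares the keys I configured with the keys that show up in the output. It uses plain dicts copied from the snippet and output above instead of the real objects, and `stale_keys` is a hypothetical helper name of mine, not part of textdescriptives:

```python
def stale_keys(configured, reported):
    """Return (missing, unexpected): keys that were configured but not
    reported, and keys that were reported but never configured."""
    configured_keys = set(configured)
    reported_keys = set(reported)
    missing = sorted(configured_keys - reported_keys)
    unexpected = sorted(reported_keys - configured_keys)
    return missing, unexpected


# Keys taken from QUALITY_THRESHOLDS above; the values are not needed here.
configured_contains = {"fake": False}
configured_symbols = {"@": (None, 0.3)}

# Keys taken from the printed output above; values are placeholders.
reported_contains = {"lorem ipsum": None}
reported_symbols = {"#": None}

print(stale_keys(configured_contains, reported_contains))
print(stale_keys(configured_symbols, reported_symbols))
```

In both cases the configured key is missing from the output and a key I never configured is reported instead.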
Hi @dvirnimrod. The td.QualityThresholds has defaults for these. You can disable them, e.g., by setting:
...
contains = {} # nothing should be checked
symbol_to_word_ratio = {}
...
Edit: Ah, sorry, it seems I misread the code. @HLasse caught it, though.
Ah, I see. It seems that .set_quality_thresholds updates the thresholds correctly, but does not set self.contains and self.symbols (which it should). I'll take a look.
EDIT: Fixed in #353
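For anyone reading along, the failure mode described above can be sketched in plain Python. This is a minimal model under stated assumptions, not the library's code: `Thresholds`, `BuggyQuality`, `FixedQuality` and `check` are hypothetical names, and the actual change in #353 may look different.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Thresholds:
    # Hypothetical stand-in for the thresholds object, with a default entry.
    contains: Dict[str, bool] = field(
        default_factory=lambda: {"lorem ipsum": False}
    )


class BuggyQuality:
    """Hypothetical component that caches a derived attribute at init."""

    def __init__(self, thresholds: Optional[Thresholds] = None) -> None:
        self.thresholds = thresholds or Thresholds()
        # Derived from the thresholds once, at construction time.
        self.contains: List[str] = list(self.thresholds.contains)

    def set_quality_thresholds(self, thresholds: Thresholds) -> None:
        # Bug: the thresholds are replaced, but the cached self.contains
        # still holds the strings from the defaults.
        self.thresholds = thresholds

    def check(self, text: str) -> Dict[str, bool]:
        # Reports whether each cached string occurs in the text.
        return {s: s in text for s in self.contains}


class FixedQuality(BuggyQuality):
    def set_quality_thresholds(self, thresholds: Thresholds) -> None:
        self.thresholds = thresholds
        # Refresh the derived attribute together with the thresholds.
        self.contains = list(thresholds.contains)


new_thresholds = Thresholds(contains={"fake": False})

buggy = BuggyQuality()
buggy.set_quality_thresholds(new_thresholds)
print(buggy.check("This is fake @@@@@"))  # still keyed by "lorem ipsum"

fixed = FixedQuality()
fixed.set_quality_thresholds(new_thresholds)
print(fixed.check("This is fake @@@@@"))  # keyed by "fake"
```

The same reasoning would apply to the symbols: any attribute derived from the thresholds needs to be recomputed whenever the thresholds are replaced.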
Great!
Thank you guys :)