How to reproduce the behaviour: I try to set new quality thresholds

quality_test/contains doesn't function about textdescriptives HOT 6 CLOSED

dvirnimrod commented on June 9, 2024

quality_test/contains doesn't function

from textdescriptives.

Comments (6)

HLasse commented on June 9, 2024

Can you add a code snippet that reproduces the behaviour?

KennethEnevoldsen commented on June 9, 2024

@dvirnimrod when I try to reproduce the stated behaviour, I get the following (Python 3.8):

import textdescriptives as td

td.__version__
# 2.8.0
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
docs = nlp.pipe(["lorem ipsum"])
doc = next(docs)
doc._.passed_quality_check
# False
doc._.quality
# QualityOutput(
#     passed=False, ...
#     contains={'lorem ipsum': ThresholdsOutput(value=1.0, passed=False, threshold=False)}, ...

dvirnimrod commented on June 9, 2024

Hi, thanks for the quick response!

Here's a code snippet for example:

import textdescriptives as td
import spacy
from spacy.cli import download

QUALITY_THRESHOLDS = td.QualityThresholds(
    n_stop_words=(None, None),
    alpha_ratio=(0.6, None),
    mean_word_length=(3, 10),
    doc_length=(1, 1000),
    symbol_to_word_ratio={"@": (None, 0.3)},
    proportion_ellipsis=(None, None),
    proportion_bullet_points=(None, 0.7),
    contains={"fake": False},
    duplicate_line_chr_fraction=(None, 0.2),
    duplicate_paragraph_chr_fraction=(None, 0.2),
    duplicate_ngram_chr_fraction={
        "5": (None, 0.15),
        "6": (None, 0.14),
        "7": (None, 0.13),
        "8": (None, 0.12),
        "9": (None, 0.11),
        "10": (None, 0.1),
    },
    top_ngram_chr_fraction={"2": (None, 0.2), "3": (None, 0.18), "4": (None, 0.16)},
    oov_ratio=(None, 0.3)
)

download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
quality_pipe.set_quality_thresholds(QUALITY_THRESHOLDS)

text = "This is fake @@@@@"
doc = nlp(text)
print(doc._.quality)

And here's the output:

passed=True 
	n_stop_words=ThresholdsOutput(value=2.0, passed=True, threshold=(None, None)) 
	alpha_ratio=ThresholdsOutput(value=0.75, passed=True, threshold=(0.6, None)) 
	mean_word_length=ThresholdsOutput(value=3.75, passed=True, threshold=(3.0, 10.0)) 
	doc_length=ThresholdsOutput(value=4.0, passed=True, threshold=(1.0, 1000.0)) 
	symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=None)} 
	proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, None)) 
	proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.7)) 
	contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=None)} 
	duplicate_line_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)) 
	duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)) 
	duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))} 
	top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.16))} 
	oov_ratio=ThresholdsOutput(value=0.25, passed=True, threshold=(None, 0.3))

As you can see, the other attributes I set (like "alpha_ratio" and "doc_length") are updated to the new values, but "contains" and "symbol_to_word_ratio" are not: the output still reports the 'lorem ipsum' and '#' keys rather than the 'fake' and '@' keys I passed in.

KennethEnevoldsen commented on June 9, 2024

Hi @dvirnimrod. td.QualityThresholds has defaults for these. You can disable them, e.g. by setting:

    ...
    contains={},  # nothing should be checked
    symbol_to_word_ratio={},
    ...

Edit: Ah, sorry, it seems I misread the code. @HLasse caught it, though.

HLasse commented on June 9, 2024

Ah, I see. It seems that .set_quality_thresholds updates the thresholds correctly, but does not set self.contains and self.symbols (which it should). I'll take a look.

EDIT: Fixed in #353
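
For anyone reading along, the kind of bug described above can be sketched with a toy class. This is a hypothetical stand-in and not the textdescriptives source: ToyQualityComponent, set_thresholds_buggy, set_thresholds_fixed and check are made-up names, and the sketch only assumes that the component caches the strings to search for in self.contains when it is constructed.

```python
from typing import Dict, List


class ToyQualityComponent:
    """Hypothetical stand-in for a quality pipe (not the real library code)."""

    def __init__(self, contains: Dict[str, bool]) -> None:
        self.thresholds = {"contains": dict(contains)}
        # Cached list of strings to search for, derived once at construction.
        self.contains: List[str] = list(contains)

    def set_thresholds_buggy(self, contains: Dict[str, bool]) -> None:
        # Updates the thresholds but leaves the cached self.contains stale.
        self.thresholds = {"contains": dict(contains)}

    def set_thresholds_fixed(self, contains: Dict[str, bool]) -> None:
        # Updates the thresholds and refreshes the cached list as well.
        self.thresholds = {"contains": dict(contains)}
        self.contains = list(contains)

    def check(self, text: str) -> Dict[str, bool]:
        # Reports, for each cached string, whether it occurs in the text.
        return {needle: needle in text for needle in self.contains}


pipe = ToyQualityComponent({"lorem ipsum": False})

pipe.set_thresholds_buggy({"fake": False})
buggy_result = pipe.check("This is fake @@@@@")

pipe.set_thresholds_fixed({"fake": False})
fixed_result = pipe.check("This is fake @@@@@")

print(buggy_result)
print(fixed_result)
```

With the buggy setter the check still looks for the old 'lorem ipsum' key, which matches the symptom in the output posted above; with the fixed setter it looks for 'fake'. Whether the actual fix in #353 takes this exact shape is not shown in this thread.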

dvirnimrod commented on June 9, 2024

Great!
Thank you guys :)
