Git Product home page Git Product logo

Comments (6)

HLasse avatar HLasse commented on June 9, 2024

Can you add a code snippet that reproduces the behaviour?

from textdescriptives.

KennethEnevoldsen avatar KennethEnevoldsen commented on June 9, 2024

@dvirnimrod when I try to reproduce the stated behaviour I get the following behavior (python 3.8):

import textdescriptives as td

td.__version__
# 2.8.0
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
docs = nlp.pipe(["lorem ipsum"])
doc = next(docs)
doc._.passed_quality_check
# False
doc._.quality
# QualityOutput(
# 	passed=False, ...
#	contains={'lorem ipsum': ThresholdsOutput(value=1.0, passed=False, threshold=False)}, ...

from textdescriptives.

dvirnimrod avatar dvirnimrod commented on June 9, 2024

Hi, thanks for the quick respond!

Here's a code snippet for example:

import textdescriptives as td
import spacy
from spacy.cli import download

QUALITY_THRESHOLDS = td.QualityThresholds(
    n_stop_words=(None, None),
    alpha_ratio=(0.6, None),
    mean_word_length=(3, 10),
    doc_length=(1, 1000),
    symbol_to_word_ratio={"@": (None, 0.3)},
    proportion_ellipsis=(None, None),
    proportion_bullet_points=(None, 0.7),
    contains={"fake": False},
    duplicate_line_chr_fraction=(None, 0.2),
    duplicate_paragraph_chr_fraction=(None, 0.2),
    duplicate_ngram_chr_fraction={
        "5": (None, 0.15),
        "6": (None, 0.14),
        "7": (None, 0.13),
        "8": (None, 0.12),
        "9": (None, 0.11),
        "10": (None, 0.1),
    },
    top_ngram_chr_fraction={"2": (None, 0.2), "3": (None, 0.18), "4": (None, 0.16)},
    oov_ratio=(None, 0.3)
)

download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
quality_pipe.set_quality_thresholds(QUALITY_THRESHOLDS)

text = "This is fake @@@@@"
doc = nlp(text)
print(doc._.quality)

And here's the output:

passed=True 
	n_stop_words=ThresholdsOutput(value=2.0, passed=True, threshold=(None, None)) 
	alpha_ratio=ThresholdsOutput(value=0.75, passed=True, threshold=(0.6, None)) 
	mean_word_length=ThresholdsOutput(value=3.75, passed=True, threshold=(3.0, 10.0)) 
	doc_length=ThresholdsOutput(value=4.0, passed=True, threshold=(1.0, 1000.0)) 
	symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=None)} 
	proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, None)) 
	proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.7)) 
	contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=None)} 
	duplicate_line_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)) 
	duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)) 
	duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))} 
	top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.16))} 
	oov_ratio=ThresholdsOutput(value=0.25, passed=True, threshold=(None, 0.3))

As you can see, other attributes that I've set are updated to a new value (like "alpha_ratio" and "doc_length"), but the attributes "contains" and "symbol_to_word_ratio" haven't...

from textdescriptives.

KennethEnevoldsen avatar KennethEnevoldsen commented on June 9, 2024

Hi @dvirnimrod. The td.QualityThresholds have default for these. You can disable them e.g. by setting:

    ...
    contains = {} # nothing should be checked
    symbol_to_word_ratio = {} 
    ...

Edit: Aahh sorry It seems like a misread the code, @HLasse caught it though

from textdescriptives.

HLasse avatar HLasse commented on June 9, 2024

Ah, I see. It seems that .set_quality_threshold updates the thresholds correctly, but does not set self.contains and self.symbols (which it should). I'll take a look.

EDIT: Fixed in #353

from textdescriptives.

dvirnimrod avatar dvirnimrod commented on June 9, 2024

Great!
Thank you guys :)

from textdescriptives.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.