
hlasse / textdescriptives

288 stars · 6 watchers · 22 forks · 20.35 MB

A Python library for calculating a large variety of metrics from text

Home Page: https://hlasse.github.io/TextDescriptives/

License: Apache License 2.0

Python 86.72% TeX 13.28%
nlp python statistics syntactic-analysis readability readability-scores descriptive-statistics spacy spacy-extension dependency-distance

textdescriptives's Introduction

TextDescriptives


A Python library for calculating a large variety of metrics from text(s) using spaCy v.3 pipeline components and extensions.

🔧 Installation

pip install textdescriptives

📰 News

  • We now have a TextDescriptives-powered web app, so you can extract and download metrics without a single line of code! Check it out here
  • Version 2.0 is out with a new API, a new component, updated documentation, and tutorials! Components are now named "textdescriptives/{metric_name}". New coherence component for calculating the semantic coherence between sentences. See the documentation for tutorials and more information!

⚡ Quick Start

Use extract_metrics to quickly extract your desired metrics. To see the available metrics, simply run:

import textdescriptives as td
td.get_valid_metrics()
# {'quality', 'readability', 'all', 'descriptive_stats', 'dependency_distance', 'pos_proportions', 'information_theory', 'coherence'}

Set the spacy_model parameter to specify which spaCy model to use; otherwise, TextDescriptives will auto-download an appropriate one based on lang. If lang is set, spacy_model is not necessary, and vice versa.

Specify which metrics to extract in the metrics argument. None extracts all metrics.

import textdescriptives as td

text = "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it."
# will automatically download the relevant model (`en_core_web_lg`) and extract all metrics
df = td.extract_metrics(text=text, lang="en", metrics=None)

# specify spaCy model and which metrics to extract
df = td.extract_metrics(text=text, spacy_model="en_core_web_lg", metrics=["readability", "coherence"])

Usage with spaCy

To integrate with other spaCy pipelines, import the library and add the component(s) to your pipeline using the standard spaCy syntax. Available components are descriptive_stats, readability, dependency_distance, pos_proportions, coherence, and quality, all prefixed with textdescriptives/.

If you want to add all components you can use the shorthand textdescriptives/all.

import spacy
import textdescriptives as td
# load your favourite spacy model (remember to install it first using e.g. `python -m spacy download en_core_web_sm`)
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all") 
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length

TextDescriptives includes convenience functions for extracting metrics from a Doc to a Pandas DataFrame or a dictionary.

td.extract_dict(doc)
td.extract_df(doc)
The resulting (single-row) DataFrame, shown transposed:

text                                      The world is changed(...)
first_order_coherence                     0.633002
second_order_coherence                    0.573323
pos_prop_DET                              0.097561
pos_prop_NOUN                             0.121951
pos_prop_AUX                              0.0731707
pos_prop_VERB                             0.170732
pos_prop_PUNCT                            0.146341
pos_prop_PRON                             0.195122
pos_prop_ADP                              0.0731707
pos_prop_ADV                              0.0731707
pos_prop_SCONJ                            0.0487805
flesch_reading_ease                       107.879
flesch_kincaid_grade                      -0.0485714
smog                                      5.68392
gunning_fog                               3.94286
automated_readability_index               -2.45429
coleman_liau_index                        -0.708571
lix                                       12.7143
rix                                       0.4
n_stop_words                              24
alpha_ratio                               0.853659
mean_word_length                          2.95122
doc_length                                41
proportion_ellipsis                       0
proportion_bullet_points                  0
duplicate_line_chr_fraction               0
duplicate_paragraph_chr_fraction          0
duplicate_5-gram_chr_fraction             0.232258
duplicate_6-gram_chr_fraction             0.232258
duplicate_7-gram_chr_fraction             0
duplicate_8-gram_chr_fraction             0
duplicate_9-gram_chr_fraction             0
duplicate_10-gram_chr_fraction            0
top_2-gram_chr_fraction                   0.0580645
top_3-gram_chr_fraction                   0.174194
top_4-gram_chr_fraction                   0
symbol_#_to_word_ratio                    0
contains_lorem ipsum                      False
passed_quality_check                      False
dependency_distance_mean                  1.77524
dependency_distance_std                   0.553188
prop_adjacent_dependency_relation_mean    0.457143
prop_adjacent_dependency_relation_std     0.0722806
token_length_mean                         3.28571
token_length_median                       3
token_length_std                          1.54127
sentence_length_mean                      7
sentence_length_median                    6
sentence_length_std                       3.09839
syllables_per_token_mean                  1.08571
syllables_per_token_median                1
syllables_per_token_std                   0.368117
n_tokens                                  35
n_unique_tokens                           23
proportion_unique_tokens                  0.657143
n_characters                              121
n_sentences                               5

📖 Documentation

TextDescriptives has detailed documentation as well as a series of Jupyter notebook tutorials. All the tutorials are located in the docs/tutorials folder and can also be found on the documentation website.

Documentation
📚 Getting started Guides and instructions on how to use TextDescriptives and its features.
👩‍💻 Demo A live demo of TextDescriptives.
😎 Tutorials Detailed tutorials on how to make the most of TextDescriptives.
📰 News and changelog New additions, changes, and version history.
🎛 API References The detailed reference for TextDescriptives' API, including function documentation.
📄 Paper The preprint of the TextDescriptives paper.

textdescriptives's People

Contributors

actions-user, dependabot[bot], eltociear, frillecode, hlasse, kennethenevoldsen, ludvigolsen, martinbernstorff, pre-commit-ci[bot], rbroc, sdruskat, sondalex


textdescriptives's Issues

Update CI

  • Auto bump version ID, added in #110
  • Auto accept dependabot, added in #110 (or was added beforehand, but at least here)
  • Add concurrency to pytests, added in #110
  • CI for tutorials
  • #98
  • #99

Add word embedding coherence/similarity

Use spaCy's word embeddings to calculate the similarity with the preceding n words.

Alberto, P., Mary, L. J., Arndis, S., Vibeke, B., Yuan, Z., Huiling, W., Lana, I., Katja, K., & Riccardo, F. (2022). Speech disturbances in schizophrenia: Assessing cross-linguistic generalizability of NLP automated measures of coherence. medRxiv.
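
A sentence-level sketch of the idea, assuming a spaCy model with word vectors (the function name is hypothetical, not a proposed API):

import spacy

nlp = spacy.load("en_core_web_lg")  # needs a model with word vectors

def first_order_coherence(text: str) -> list[float]:
    """Cosine similarity between each sentence and the previous one (sketch)."""
    sents = list(nlp(text).sents)
    return [sents[i].similarity(sents[i - 1]) for i in range(1, len(sents))]

first_order_coherence("The world is changed. I feel it in the water.")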

Separate component loaders

We should consider separating some of the components from each other, so the user doesn't get overloaded (and for speed).

I propose:

  1. textdescriptives: includes everything but `quality`
  2. quality: only quality

Thoughts, @KennethEnevoldsen ?

Create advanced tutorial

How to use TD for quality filtering -- other cool use cases of quality filtering?

Usage for feature extraction for model training?

Make output of doc._.quality specify if each filter was passed

For instance, instead of returning:

{'n_stop_words': 435,
 'alpha_ratio': 0.79,
 'mean_word_length': 1.3,
 'doc_length': 894,
...
}

We could return the values as tuples, with the second value indicating whether the filter passed (given the current threshold):

{'n_stop_words': (435, True),
 'alpha_ratio': (0.79, True),
 'mean_word_length': (1.3, False),
 'doc_length': (894, True),
...
}

Alternatively, return a dataclass with a .value and a .passed attribute?
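
A minimal sketch of the dataclass variant; the class and attribute names here are hypothetical, not part of the current API:

from dataclasses import dataclass

@dataclass
class ThresholdResult:
    """Hypothetical container pairing a metric value with its filter outcome."""
    value: float
    passed: bool

# doc._.quality would then look like:
# {'n_stop_words': ThresholdResult(value=435, passed=True),
#  'alpha_ratio': ThresholdResult(value=0.79, passed=True),
#  ...}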

Create wrapper function

Create a wrapper function to remove spaCy boilerplate. Suggested API:

td.extract(texts, lang="en", metrics=["dependency_distance", "readability"])

Outputs a dataframe.

Can use the language tag to infer which spaCy model to download (always lang_core_news/web_lg). Alternatively, supply a spaCy pipeline:

td.extract(texts, spacy_model="da_core_news_sm", metrics=["descriptive_stats"])

Make clever checks to only add spaCy models if the metrics actually need them (e.g. no model for descriptive stats and readability).
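
A rough sketch of the lang-to-model inference under the lang_core_news/web_lg convention above (the web-vs-news split per language is an assumption):

def infer_spacy_model(lang: str) -> str:
    # English models are named *_core_web_*; most other languages use *_core_news_* (assumption)
    genre = "web" if lang == "en" else "news"
    return f"{lang}_core_{genre}_lg"

infer_spacy_model("en")  # 'en_core_web_lg'
infer_spacy_model("da")  # 'da_core_news_lg'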

Could easily be extended to a CLI:

textdescriptives extract -i file1.txt file2.txt -m readability quality -l en

Add_pipe problem

Hello,
I am trying textdescriptives with Python 3.6 + spaCy 3.1 on a Linux system through JupyterLab.

I found this error. Not quite sure how to deal with the spaCy decorator. Could you please help? Thanks!

nlp.add_pipe("textdescriptives")

ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

Make `quality` work with `n_process` > 1

In descriptive_statistics the issue was saving spans/tokens to an attribute. As they are not serializable, we get errors.

Workarounds:

  1. Don't set attributes with tokens/spans.
  2. Set the attribute as (token, start_idx, end_idx). This is serializable and can be reconstructed into the original span; see the sketch below.
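
A sketch of workaround 2, simplified to storing only (start, end) token indices; the extension name is hypothetical:

import spacy
from spacy.tokens import Doc

Doc.set_extension("flagged_spans", default=None, force=True)  # hypothetical attribute

nlp = spacy.blank("en")
doc = nlp("The world is changed. I feel it in the water.")

# store plain (start, end) token indices -- these pickle fine across processes
doc._.flagged_spans = [(0, 4), (5, 11)]

# reconstruct the original Span objects from the Doc
spans = [doc[start:end] for start, end in doc._.flagged_spans]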

Add measure of 'surprise' in next token

Use a generative model to calculate the average perplexity of documents (or pseudo-perplexity for masked language models); a sketch follows the list below.

  • Load a model from huggingface
  • Calculate token level perplexity
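
A minimal sketch of the two steps for a causal LM, using Hugging Face transformers (the model choice and function name are placeholders, not a proposed implementation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average token-level perplexity of a document under a causal LM (sketch)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return torch.exp(loss).item()

perplexity("The world is changed.")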

@KennethEnevoldsen Any other ideas for calculating surprise?

Formulae for dependency distance calculation on Doc level

Hi,

first of all thanks for this very helpful library. I have a question regarding the way dependency distance (DD) for Doc objects is calculated.

Your function for calculating DD for a Doc returns the DD value here: "dependency_distance_mean": np.mean(dep_dists). As far as I can see, the mean returned is the mean over the mean DD of every sentence (contained in dep_dists) constituting the Doc object.

The two sources you cite on dependency distance in your documentation (Liu, 2008 and Oya, 2008), however, take a different approach.

For calculating the DD of a text, Liu seems to take the sum of the absolute DDs found in the whole text and multiply it by 1 / (number of words - number of sentences). Oya seems to take a mean of means like you do, but for each sentence averages the sum of absolute DDs over the number of dependency links in the utterance. Neither in your documentation nor in your code can I find how exactly you calculate DD for a text.
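
In LaTeX, the two readings would be as follows (with n the number of words, s the number of sentences; these formulas are my reconstruction of the descriptions above, not the package's code):

% Liu (2008): a single mean over all dependency links in the text
MDD_{Liu} = \frac{1}{n - s} \sum_{i=1}^{n-s} |DD_i|

% Oya (2008): a mean of per-sentence means, each sentence normalized
% by its number of dependency links L_k
MDD_{Oya} = \frac{1}{s} \sum_{k=1}^{s} \frac{1}{L_k} \sum_{j=1}^{L_k} |DD_{kj}|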

Would you please be so kind as to explain with what approach you calculate DD for Doc objects, and provide pointers on how we may adapt the code to e.g. implement approaches by Liu and Oya? Thanks!

Which page or section is this issue related to?

https://hlasse.github.io/TextDescriptives/dependencydistance.html

Update quality metrics docs

Some of the metrics are a bit hard to understand - mainly the repetitious-text metrics:

Duplicate lines character fraction (duplicate_lines_chr_fraction): Fraction of characters in a document which are contained within duplicate lines.

Duplicate paragraphs character fraction (duplicate_paragraphs_chr_fraction): Fraction of characters in a document which are contained within duplicate paragraphs.

Duplicate n-gram character fraction (duplicate_{n}_gram_chr_fraction): Fraction of characters in a document which are contained within duplicate n-grams, for a specified n-gram range.

Top n-gram character fraction (top_{n}_gram_chr_fraction): Fraction of characters in a document which are contained within the top n-grams, for a specified n-gram range.
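
For the docs, a small worked sketch may help; e.g. the duplicate-lines fraction can be read as below (the function name is hypothetical, and the exact counting rules, such as whether newlines count as characters, are an assumption):

from collections import Counter

def duplicate_lines_chr_fraction(text: str) -> float:
    """Fraction of characters that sit in lines occurring more than once (sketch)."""
    lines = text.split("\n")
    counts = Counter(lines)
    total = sum(len(line) for line in lines)
    duplicated = sum(len(line) for line in lines if counts[line] > 1)
    return duplicated / total if total else 0.0

duplicate_lines_chr_fraction("a\nb\na")  # 2/3: two of three characters sit in duplicate lines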

Optimize pos-stats

Calculating POS stats seems to slow things down significantly. TODO:

  • Profile the package, what causes the slowdown?
  • Calculate the sum of values once and then reuse it in the dict comprehension in PosStatistics

Ideas for speedup:

  • Identify which pos_tags the model can produce and predefine the counter/dictionary with those keys (would also solve the issue of different numbers of keys across docs/sentences); see the sketch below
  • Alternatives to Counter?

Other options:

  • Remove posstats from default TextDescriptives and make it an optional component that takes in which specific POS tags the user is interested in and extracts those (+ 'others')
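
A sketch of the predefined-keys idea from "Ideas for speedup" above: initialize the dictionary once from the universal POS tag set so every doc/sentence yields the same keys (a sketch, not the current implementation):

import spacy

nlp = spacy.load("en_core_web_sm")

# full Universal POS tag set, so every doc gets the same keys
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]

def pos_proportions(doc) -> dict:
    counts = dict.fromkeys(UPOS, 0)
    for token in doc:
        counts[token.pos_] = counts.get(token.pos_, 0) + 1
    n = max(len(doc), 1)  # guard against empty docs
    return {f"pos_prop_{tag}": count / n for tag, count in counts.items()}

pos_proportions(nlp("The world is changed."))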

Write JOSS paper

  • Summary describing the purpose of the software
  • Statement of need
  • Package features & functionality
  • Target audience
  • References to other software addressing related needs
  • Past or ongoing research projects using the software

Create simple tutorial

Tutorial describing basic functionality -- any good examples?

E.g. development of sentence complexity over time in historical texts? (ADL data)

Avoid calculating metrics multiple times

Currently, a series of metrics use the same underlying metrics or variables, e.g. document length. There is no reason these should be calculated repeatedly; if we change the underlying getter function to a setter, we can avoid this.
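
A sketch of the caching idea via a memoized getter (attribute names are hypothetical): compute on first access, store the result on the Doc, and reuse it afterwards:

from spacy.tokens import Doc

Doc.set_extension("_n_tokens_cache", default=None, force=True)

def n_tokens_getter(doc: Doc) -> int:
    # compute once on first access, then reuse the cached value
    if doc._._n_tokens_cache is None:
        doc._._n_tokens_cache = len(doc)
    return doc._._n_tokens_cache

Doc.set_extension("n_tokens", getter=n_tokens_getter, force=True)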

Update docs

Update docs to correctly add and format docstrings, and document all functionality that has been added gradually.

Update README

Very minimal example in README -- move focus to documentation

ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en).

How to reproduce the behaviour

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
doc._.readability
doc._.token_length

Environment

Name: textdescriptives Version: 0.1.1
Windows 10
Python 3.6

Error message

ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, spancat, textcat_multilabel, en.lemmatizer

ValueError Traceback (most recent call last)
<ipython-input> in <module>
6 import textdescriptives as td
7 nlp = spacy.load('en_core_web_sm')
----> 8 nlp.add_pipe('textdescriptives')
9 doc = nlp('This is a short test text')
10 doc._.readability # access some of the values

~.conda\envs\tdstanza\lib\site-packages\spacy\language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
780 config=config,
781 raw_config=raw_config,
--> 782 validate=validate,
783 )
784 pipe_index = self._get_pipe_index(before, after, first, last)

~.conda\envs\tdstanza\lib\site-packages\spacy\language.py in create_pipe(self, factory_name, name, config, raw_config, validate)
639 lang_code=self.lang,
640 )
--> 641 raise ValueError(err)
642 pipe_meta = self.get_factory_meta(factory_name)
643 config = config or {}
