Comments (3)
Thank you very much for your comprehensive and thoughtful explanations and comments! This indeed answers my question. This is very helpful, I appreciate that.
All the best!
from textdescriptives.
Removing the bug label as it isn't clear that it is a bug.
Thanks for posting this issue. Our formulas are available here, and the implementation generally follows that of the package spacy-readability. However, do note that all of these metrics rely on estimated properties. For instance, determining the average sentence length for flesch_reading_ease requires detection of sentence boundaries (using a different underlying model in textdescriptives will yield different results in edge cases, but generally it is fairly robust). This also seems to be supported by the fact that the mean sentence length is not a perfect match. The difference, however, seems too big... (in cases of disagreement, spaCy typically seems to be better than textstat).
Actually, it seems that textstat uses a different constant for Flesch reading ease than our implementation (probably the main cause). The source is unclear, but from googling it seems to stem from a German thesis. So we assume a language-agnostic constant (which is fitted on English). It might be better to use a language-specific constant instead. Hmm, there might also be a better default constant than the English one (@HLasse, @LudvigOlsen, what are your thoughts?). The other difference seems to be the syllable threshold for what constitutes hard words: textstat seems to use a default of 2 for German, where we use 3 (same as English), which is probably too low for German. It should probably be higher due to compound words.
Actually, IIRC, we based our implementation on textstat, so the values should be fairly similar (except for the differences in sentence detection etc. that you bring up, Kenneth).
+1 for Kenneth's comment: the minor deviations are most likely due to different sentence boundary detection and tokenization methods and are fairly negligible and tend to even out with longer texts.
Re. Flesch reading ease: the implementations and constants are identical between textdescriptives and textstat for the English language. My initial suspicion was that we use different modules for hyphenation (counting syllables), but both use pyphen. So, there is likely to be a bug or at least an implementation difference in the way syllables are counted between the two libraries (textstat implementation, textdescriptives implementation). We should definitely look into this.
Update: hyphenation works the same across textstat and textdescriptives. The difference is because of the different constants for the German language that textstat uses.
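To make the effect of the constants concrete, here is a minimal sketch (not the code of either library). The English constants are the standard Flesch values. The German set is the commonly cited Amstad adaptation and is an assumption here; it has not been checked against textstat's source. The function name, parameter names and word counts are made up for illustration.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables,
                        base, sentence_weight, syllable_weight):
    """Flesch reading ease with the constants passed in explicitly."""
    avg_sentence_length = n_words / n_sentences
    avg_syllables_per_word = n_syllables / n_words
    return (base
            - sentence_weight * avg_sentence_length
            - syllable_weight * avg_syllables_per_word)


# Standard English constants.
ENGLISH = dict(base=206.835, sentence_weight=1.015, syllable_weight=84.6)
# Assumption: Amstad's German adaptation, as commonly cited.
GERMAN_AMSTAD = dict(base=180.0, sentence_weight=1.0, syllable_weight=58.5)

# Hypothetical counts for one and the same text.
counts = dict(n_words=100, n_sentences=5, n_syllables=180)

score_en = flesch_reading_ease(**counts, **ENGLISH)
score_de = flesch_reading_ease(**counts, **GERMAN_AMSTAD)
print(f"English constants: {score_en:.2f}")
print(f"German constants:  {score_de:.2f}")
```

With identical word, sentence and syllable counts, the two sets of constants give scores that differ by about 20 points, which is far more than differences in sentence detection would explain.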
Re. Gunning fog: This boils down to what Kenneth says: textstat uses a different threshold for hard words for German (2 syllables) than for English (3 syllables), where we use the threshold of 3 syllables regardless of language.
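A small sketch of how much the threshold matters, assuming the usual Gunning fog formula and assuming a word counts as hard when it has at least `hard_word_threshold` syllables. The function and the syllable counts below are hypothetical and do not come from either library.

```python
def gunning_fog(syllables_per_word, n_sentences, hard_word_threshold=3):
    """Gunning fog index from per-word syllable counts.

    Assumption: a word is "hard" if it has at least
    `hard_word_threshold` syllables.
    """
    n_words = len(syllables_per_word)
    n_hard = sum(1 for s in syllables_per_word if s >= hard_word_threshold)
    return 0.4 * (n_words / n_sentences + 100 * n_hard / n_words)


# Hypothetical syllable counts for a 10-word, 2-sentence text.
syllables = [1, 2, 3, 1, 2, 4, 1, 1, 2, 3]

fog_threshold_3 = gunning_fog(syllables, n_sentences=2, hard_word_threshold=3)
fog_threshold_2 = gunning_fog(syllables, n_sentences=2, hard_word_threshold=2)
print(f"threshold 3: {fog_threshold_3:.1f}")
print(f"threshold 2: {fog_threshold_2:.1f}")
```

Lowering the threshold from 3 to 2 doubles the number of hard words in this toy text (3 versus 6 out of 10), so the index rises from about 14 to about 26 without any change in the text itself.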
Re. your last point on the reliability of metrics for languages besides English: We have sought to implement metrics that are broadly reliable across all languages. What we mean by this is that the rank-ordering of texts in terms of any of the metrics will correctly order them by reading ease. However, some metrics (e.g. Flesch-Kincaid Grade) have constants that have been derived through modelling English text. The grade level will therefore likely only be reliable for English, but the metric is still useful for other languages for ranking the difficulty of texts (and is likely not that far off).
Hope this answers your question, @rnckp!