
Comments (3)

rnckp commented on June 9, 2024

@HLasse @KennethEnevoldsen

Thank you very much for your comprehensive and thoughtful explanations and comments! This indeed answers my question. This is very helpful, I appreciate that.

All the best!

from textdescriptives.

KennethEnevoldsen commented on June 9, 2024

Removing the bug label as it isn't clear that it is a bug.

Thanks for posting this issue. Our formulas for the calculations are available here, and the implementations generally follow those of the package spacy-readability. However, do note that all of these metrics rely on estimated properties. For instance, determining the average sentence length for flesch_reading_ease requires detecting sentence boundaries (using a different underlying model in textdescriptives will yield different results in edge cases, but generally it is fairly robust). This also seems to be supported by the fact that the mean sentence length is not a perfect match. The difference, however, seems too big... (in cases of disagreement, spacy typically seems to be better than textstat)

Actually, it seems like textstat uses a different constant for Flesch reading ease than our implementation (probably the main cause). The source is unclear, but from googling it seems to stem from a German thesis. So we assume a language-agnostic constant (which is fitted on English). It might be better to use a language-specific constant instead. Hmm, there might also be a better default constant than the English one (@HLasse, @LudvigOlsen what are your thoughts?). For Gunning fog, the difference seems to be the syllable threshold for what constitutes a hard word: textstat seems to use a default of 2 for German, where we use 3 (same as English), which is probably too low for German. It should probably be higher due to compound words.


HLasse commented on June 9, 2024

Actually, IIRC, we based our implementation off of textstat so the values should be fairly similar (except for the differences in sentence detection etc., that you bring up, Kenneth).

+1 for Kenneth's comment: the minor deviations are most likely due to different sentence boundary detection and tokenization methods and are fairly negligible and tend to even out with longer texts.

Re. Flesch reading ease: the implementations and constants are identical between textdescriptives and textstat for the English language. My initial suspicion was that we use different modules for hyphenation (counting syllables), but both use pyphen. So, there is likely to be a bug or at least an implementation difference in the way syllables are counted between the two libraries (textstat implementation, textdescriptives implementation). We should definitely look into this. Update: hyphenation works the same across textstat and textdescriptives. The difference is due to the different constants that textstat uses for the German language.
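To illustrate how much the choice of constants alone can move the score, here is a minimal sketch. It is not the code of either library. The English constants are the standard Flesch ones. The German set (180, 1.0, 58.5) is the variant commonly attributed to Amstad, and I am assuming it is the one textstat applies for German. The class and function names are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FleschConstants:
    """Hypothetical container for the three constants of the formula."""
    base: float
    sentence_weight: float
    syllable_weight: float


# Standard English Flesch reading ease constants.
ENGLISH = FleschConstants(base=206.835, sentence_weight=1.015, syllable_weight=84.6)
# German variant commonly attributed to Amstad (assumed to match textstat's "de" setting).
GERMAN_AMSTAD = FleschConstants(base=180.0, sentence_weight=1.0, syllable_weight=58.5)


def flesch_reading_ease(n_sentences: int, n_words: int, n_syllables: int,
                        constants: FleschConstants = ENGLISH) -> float:
    """Compute Flesch reading ease from raw counts and a set of constants."""
    avg_sentence_length = n_words / n_sentences
    avg_syllables_per_word = n_syllables / n_words
    return (constants.base
            - constants.sentence_weight * avg_sentence_length
            - constants.syllable_weight * avg_syllables_per_word)


if __name__ == "__main__":
    # Same counts, different constants: only the constants change the result.
    counts = dict(n_sentences=5, n_words=100, n_syllables=180)
    print(flesch_reading_ease(**counts, constants=ENGLISH))
    print(flesch_reading_ease(**counts, constants=GERMAN_AMSTAD))
```

With identical counts (5 sentences, 100 words, 180 syllables) the two sets of constants give roughly 34.3 and 54.7, so a gap of this size does not require any difference in tokenization or syllable counting.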

Re. gunning fog: This boils down to what Kenneth says: textstat uses a different threshold for hard words for German (2 syllables) than for English (3 syllables), where we use the threshold of 3 syllables regardless of language.
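A rough sketch of the effect of that threshold, using the usual Gunning fog formula 0.4 * (words per sentence + 100 * share of hard words). This is not the code of either library. It assumes a word counts as hard when it has at least `hard_word_threshold` syllables, and it ignores any easy-word lists a library might additionally apply. The function name and the toy syllable counts are invented for the example.

```python
from typing import Sequence


def gunning_fog(n_sentences: int, syllables_per_word: Sequence[int],
                hard_word_threshold: int = 3) -> float:
    """Gunning fog from per-word syllable counts.

    A word is treated as "hard" when its syllable count is at least
    `hard_word_threshold` (an assumption made for this sketch).
    """
    n_words = len(syllables_per_word)
    n_hard = sum(1 for s in syllables_per_word if s >= hard_word_threshold)
    avg_sentence_length = n_words / n_sentences
    percent_hard = 100 * n_hard / n_words
    return 0.4 * (avg_sentence_length + percent_hard)


if __name__ == "__main__":
    # Toy input: one sentence of ten words with these syllable counts.
    syllables = [1, 2, 3, 1, 2, 4, 1, 2, 1, 3]
    print(gunning_fog(1, syllables, hard_word_threshold=3))  # 3 of 10 words are hard
    print(gunning_fog(1, syllables, hard_word_threshold=2))  # 6 of 10 words are hard
```

Lowering the threshold from 3 to 2 doubles the number of hard words in this toy input, which moves the score from about 16 to about 28 without any change to the text.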

Re. your last point on the reliability of metrics for languages besides English: We have sought to implement metrics that are broadly reliable across all languages. What we mean by this is that the rank-ordering of texts in terms of any of the metrics will correctly order them by reading ease. However, some metrics (e.g. Flesch-Kincaid Grade) have constants that have been derived through modelling English text. The grade level will therefore likely only be reliable for English, but the metric is still useful for other languages for ranking the difficulty of texts (and is likely not that far off).

Hope this answers your question, @rnckp!
