
Comments (7)

hopp19 avatar hopp19 commented on July 4, 2024

Further tests show that the minimum line length at which the repetition error is caught is somewhere around 115 characters.

The following loop wraps the file to different line lengths, sends each version to the LT server, and greps the result for the string RULE:

$ for i in $(seq 111 120); do echo $i; TEXT=$(fmt -w $i die-bahn.txt); curl -s --data "language=de-DE&text=$TEXT" http://localhost:8081/v2/check | jq | grep RULE; done
111
112
113
114
115
116
117
        "id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
        "id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
118
        "id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
        "id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
119
        "id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
        "id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
120
        "id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
        "id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",

Two matches, because the file attached to the original report contains the same paragraph twice.

A similar call to the command-line tool (which runs much slower) would look like this:

$ for i in $(seq 111 120); do echo $i; fmt -w $i die-bahn.txt | java -jar /opt/LanguageTool/languagetool-commandline.jar -l de-DE --json 2> /dev/null | jq | grep RULE; done

That is, the error is signaled when fmt wraps at a width of 117 characters or more. Inspecting the actual maximum line length of the wrapped text, which is not necessarily equal to the fmt -w argument, via

$ fmt -w 117 die-bahn.txt | wc -L
116

reveals that the longest line in fact contains 116 characters. When the text is wrapped to a width of 116 characters, where the repetition error wasn't caught, the maximum line length drops to 111 characters. So the repetition rule seems to start working at some magic limit between 112 and 116 characters.
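The gap between the fmt -w argument and the longest line actually produced can be tabulated per width. This is only a minimal sketch: it assumes GNU coreutils (wc -L is a GNU extension), the helper name max_line_len is made up for illustration, and the sample text is a stand-in, not the content of die-bahn.txt.

```shell
#!/bin/sh
# Hypothetical helper: wrap stdin to width $1 and print the length of the
# longest resulting line. Assumes GNU fmt and GNU wc (-L is a GNU extension).
max_line_len() {
    fmt -w "$1" | wc -L
}

# Stand-in text (an assumption); with the real file use e.g.
#   max_line_len 117 < die-bahn.txt
sample='Die Bahn kommt heute wieder einmal nicht rechtzeitig an. Die Bahn hat sicher einen Grund. Die Bahn wird sich entschuldigen.'

# Print "requested width <TAB> actual longest line" for a few widths.
for w in 60 80 117; do
    printf '%s\t%s\n' "$w" "$(printf '%s\n' "$sample" | max_line_len "$w")"
done
```

The second column is the value that matters when looking for the limit, since fmt may break lines well before the requested width.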

from languagetool.

hopp19 avatar hopp19 commented on July 4, 2024

Referring to the edited title, a common line length in a text editor is somewhere in the range 60 to 80 characters per line. *_WORD_REPEAT_BEGINNING_RULE fails for people working with such a setup.


hopp19 avatar hopp19 commented on July 4, 2024

The problem affects Thunderbird as well. I've copied the text from the file attached to the original report, opened a new mail in Thunderbird and pasted the text there. It is then automatically wrapped to some standard line length, but LanguageTool was only able to catch the repetition in the second paragraph (with the longer source line length):

[Screenshot: tb-pc]

Xubuntu 20.04
Thunderbird 115.8.1
LanguageTool-Addon 8.3.0


hopp19 avatar hopp19 commented on July 4, 2024

For what it's worth, meanwhile, I have been able to reproduce the problem in the Firefox add-on, too. Here's how:

  1. Open https://pastebin.com/.
  2. Copy all text from the file attached to the original report into the input field labelled "New Paste".
  3. Wait for the LT add-on to do the checking.

The result should look like this:

[Screenshot: dbff-2-pc]

Xubuntu 20.04
Firefox 124.0 (deb)
LanguageTool add-on 8.6.0


hopp19 avatar hopp19 commented on July 4, 2024

Not too enthusiastic about giving back (a little) via bug reports anymore, given the phenomenal feedback rate visible here and on the forum for issues other than plain word error suggestions. How easy can a bug be to reproduce? No interaction with a third-party application is necessary. Really, I didn't expect this issue to stay open for more than a week or so. (And no, this is not meant as a demand. What I'm questioning is giving no feedback at all, or only the tersest possible, by default. I know you make money with this code and that's OK. But keep in mind that obviously non-paying users giving technical feedback may be staff who give a thumbs up or down before the software is installed on individual users' computers, whether paying or non-paying ones.)

Anyway, I can confirm that this bug affects the stand-alone LT application, too. As before, the text file can be found attached to the first comment.

[Screenshot: bug-LT-6 4-REPEAT_RULE]

Ubuntu 22.04
LanguageTool 6.4
OpenJDK 11


jaumeortola avatar jaumeortola commented on July 4, 2024

In general, the end of line character creates a new sentence for LanguageTool. Considering this, the rule matches as expected when there are three sentences starting with the same word. In other words, LanguageTool doesn't work well with hard-coded newlines.
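If every newline is treated as a sentence boundary, one way to check this explanation is to undo the hard wraps inside each paragraph before handing the text to LanguageTool. The following is only a sketch of such a preprocessing step, not a LanguageTool feature; the helper name unwrap_paragraphs is made up, and it assumes paragraphs are separated by blank lines.

```shell
#!/bin/sh
# Hypothetical helper: join the lines of each paragraph into a single line.
# With RS="" awk reads blank-line-separated paragraphs as records, so the
# newlines left inside a record are the hard wraps we want to remove.
unwrap_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }'
}

# Example with two hard-wrapped paragraphs (made-up text).
printf 'Die Bahn kommt.\nDie Bahn wartet.\n\nZweiter Absatz,\nzweite Zeile.\n' \
    | unwrap_paragraphs
```

To try it on the report's file, something like TEXT=$(fmt -w 70 die-bahn.txt | unwrap_paragraphs) could replace the TEXT assignment in the loop from the first comment; if the explanation holds, the rule should then match regardless of the wrap width.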


hopp19 avatar hopp19 commented on July 4, 2024

Thank you for giving this insight. That could explain the issue. If sentences are indeed broken at line breaks, that would render grammar rules largely pointless. Will watch that.

On the other hand, I don't think this is the whole truth. Given the behaviour you described, shouldn't lines starting with a lowercase word trigger the UPPERCASE_SENTENCE_START rule, e.g., on line 3 or 4 in the last screenshot?

