Comments (7)
Further tests show that the minimum line length at which the repetition error is caught is somewhere around 115 characters.
The following loop wraps the file to different line lengths, then sends it to the LT server and greps the result for the string RULE:
$ for i in $(seq 111 120); do echo $i; TEXT=$(fmt -w $i die-bahn.txt); curl -s --data "language=de-DE&text=$TEXT" http://localhost:8081/v2/check | jq | grep RULE; done
111
112
113
114
115
116
117
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
118
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
119
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
120
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
Two matches, because the file attached to the original report contains the same paragraph twice.
A similar call to the command-line tool (which runs much slower) would look like this:
$ for i in $(seq 111 120); do echo $i; fmt -w $i die-bahn.txt | java -jar /opt/LanguageTool/languagetool-commandline.jar -l de-DE --json 2> /dev/null | jq | grep RULE; done
That is, the error is signaled for a maximum line length of 117 characters and above. Inspecting the actual maximum line length, which is not necessarily equal to the fmt -w argument, via
$ fmt -w 117 die-bahn.txt | wc -L
116
reveals that the longest line in fact contains 116 characters. When the text is wrapped with fmt -w 116, where the repetition error wasn't caught, the maximum line length drops to 111 characters. So the threshold at which the repetition rule starts working seems to lie somewhere between 112 and 116 characters.
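The comparison between the fmt -w argument and the actual longest line can be tabulated for a whole range of widths. The following is only a sketch: max_line_length and wrap_report are hypothetical helper names, not part of LanguageTool or coreutils. It uses awk instead of wc -L (a GNU extension), and awk's length() may count bytes rather than characters depending on the implementation and locale, so the numbers can differ slightly from wc -L for text with umlauts.

```shell
#!/bin/sh
# Hypothetical helper: print the length of the longest line read from
# the given files (or from standard input if no file is given).
max_line_length() {
    awk '{ n = length($0); if (n > max) max = n } END { print max + 0 }' "$@"
}

# Hypothetical helper: for each width from $2 to $3, wrap file $1 with
# fmt and print "<requested width> <actual longest line>".
wrap_report() {
    file=$1
    for w in $(seq "$2" "$3"); do
        printf '%s %s\n' "$w" "$(fmt -w "$w" "$file" | max_line_length)"
    done
}

# Example call, assuming die-bahn.txt is in the current directory:
# wrap_report die-bahn.txt 111 120
```

Each output line pairs the requested width with the longest line fmt actually produced, which makes it easier to see which real line lengths fall on either side of the threshold.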
from languagetool.
Referring to the edited title, a common line length in a text editor is somewhere in the range of 60 to 80 characters per line. *_WORD_REPEAT_BEGINNING_RULE fails for people working with such a setup.
from languagetool.
The problem affects Thunderbird as well. I've copied the text from the file attached to the original report, opened a new mail in Thunderbird and pasted the text there. It is then automatically wrapped to some standard line length, but LanguageTool was only able to catch the repetition in the second paragraph (with the longer source line length):
Xubuntu 20.04
Thunderbird 115.8.1
LanguageTool-Addon 8.3.0
from languagetool.
For what it's worth, meanwhile, I have been able to reproduce the problem in the Firefox add-on, too. Here's how:
- Open https://pastebin.com/.
- Copy all text from the file attached to the original report into the input field labelled "New Paste".
- Wait for LT add-on to do the checking.
The result should look like this:
Xubuntu 20.04
Firefox 124.0 (deb)
LanguageTool add-on 8.6.0
from languagetool.
I'm not too enthusiastic about giving back (a little) via bug reports anymore, given the phenomenal feedback rate visible here and on the forum for issues other than plain word error suggestions. How easy can a bug be to reproduce? No interaction with a third-party application is necessary. Really, I didn't expect this issue to stay open for more than a week or so. (And no, this is not meant to be demanding. What I'm calling into question is giving no feedback at all, or only the tersest possible, by default. I know you make money with this code and that's OK. But keep in mind that users who are obviously non-paying and give technical feedback may be staff who give the thumbs up or down before the software is installed on individual users' computers, whether paying or non-paying ones.)
Anyway, I can confirm that this bug affects the stand-alone LT application, too. As before, the text file can be found attached to the first comment.
Ubuntu 22.04
LanguageTool 6.4
OpenJDK 11
from languagetool.
In general, the end of line character creates a new sentence for LanguageTool. Considering this, the rule matches as expected when there are three sentences starting with the same word. In other words, LanguageTool doesn't work well with hard-coded newlines.
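If line breaks are indeed treated as sentence boundaries, one possible workaround is to join the lines of each paragraph before sending the text to LanguageTool. The following is a minimal sketch under that assumption: unwrap_paragraphs is a hypothetical helper name, and it assumes paragraphs are separated by one or more blank lines.

```shell
#!/bin/sh
# Hypothetical helper: join the lines of every paragraph into one long
# line. awk's paragraph mode (RS = "") splits records at blank lines.
unwrap_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" }
         {
             gsub(/\n/, " ")   # replace hard line breaks with spaces
             sub(/ +$/, "")    # drop a trailing space, if any
             print
         }' "$@"
}

# Example usage, assuming a LanguageTool server on localhost:8081 as in
# the report and die-bahn.txt in the current directory:
# unwrap_paragraphs die-bahn.txt > unwrapped.txt
# curl -s --data-urlencode "language=de-DE" \
#      --data-urlencode "text@unwrapped.txt" \
#      http://localhost:8081/v2/check
```

This only sidesteps the problem for plain-text input. It does not help in editors or mail clients that insert the line breaks themselves.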
from languagetool.
Thank you for giving this insight. That could explain the issue. If sentences are indeed broken at line breaks, that would render grammar rules largely pointless. Will watch that.
On the other hand, I don't think this is the whole story: given the behaviour you described, shouldn't lines starting with a lowercase word trigger the UPPERCASE_SENTENCE_START rule, e.g., on line 3 or 4 in the last screenshot?
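To check this against a given wrapped file, the lines that begin with a lowercase letter can be listed and compared with the rule matches LanguageTool reports. This is only a sketch: lowercase_line_starts is a hypothetical helper name, and in the C locale [[:lower:]] matches ASCII letters only, so lines starting with ä, ö or ü may be missed.

```shell
#!/bin/sh
# Hypothetical helper: print "<line number>:<line>" for every input line
# that starts with a lowercase letter.
lowercase_line_starts() {
    grep -n '^[[:lower:]]' "$@"
}

# Example usage, assuming die-bahn.txt is in the current directory:
# fmt -w 80 die-bahn.txt | lowercase_line_starts
```

If every line break really started a new sentence, each listed line would be a candidate for an UPPERCASE_SENTENCE_START match.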
from languagetool.