Comments (17)
We should first measure if this really helps before we introduce any complexity.
from languagetool.
Stefan Lotties [email protected] wrote:
I've noticed that quite a few tokens in the grammar.xml (at least in the German one) use regexp tokens but just define a list of possible values, e.g. like this: an|auf|durch|für|gegen|über
It would be nice to introduce multivalue tokens for such cases, maybe like this: an|auf|durch|für|gegen|über
Either multivalue or regexp may be defined, but not both.
I think it would be best to store the token values as a Set, because Element.isStringTokenMatched() could then simply use Set.contains(), which is much faster than a regular expression that just holds a list of possible values.
It's not clear whether existing rules should be changed, or whether there should be a mechanism that checks the token value and interprets it as a multivalue token instead of a regular expression.
It's already known that most of the time is spent in the method for matching rules (see New Pattern Matching @ http://wiki.languagetool.org/missing-features), so this improvement will not change the world, but it will slightly improve performance.
Constraints:
- the old way must still work
- the old way must still work
The question is: will it be 10% faster, or 0.01% faster?
I'm guessing that the speed gain will be negligible, and if so, it's not worth it.
Regexps can be surprisingly fast once compiled (but admittedly a multi-value token can cause lots of backtracking, so it can be slow).
If we can prove that it will speed up (with a prototype), then I'm all for
it.
I think that there is more room for improvement elsewhere.
For example, I only glanced at the code and it seems that
thread load balancing is done by giving each thread
more or less the same number of rules to process. Correct?
However, the Hunspell rule, for example, is much slower than the other grammar rules. So the thread that has to check Hunspell (plus other grammar rules) will be a bottleneck, and the other threads will wait for it. I may be wrong since I don't know this code well, so it's just a hunch. Having a queue, and having the threads in the pool pick rules to process from that queue rather than being assigned a fixed number of rules per thread, may improve load balancing and thus speed up LT, hopefully significantly. Or fewer rules could be assigned to the thread doing Hunspell. I remember seeing that CPU utilization was much lower than the number of cores on my machine, so it seems that something can be improved there.
Dominique
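A minimal sketch of the queue-based approach described above, using a plain ExecutorService (the rule names and timings are made up for illustration, not taken from the LT codebase). Worker threads pull tasks from the executor's internal queue, so one slow task doesn't pin a fixed, unlucky partition to a single thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RuleQueueSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Callable<String>> tasks = new ArrayList<>();
        // One slow task, standing in for the Hunspell rule ...
        tasks.add(() -> { Thread.sleep(50); return "slow-rule"; });
        // ... and many fast tasks, standing in for ordinary pattern rules.
        for (int i = 0; i < 20; i++) {
            final int n = i;
            tasks.add(() -> "fast-rule-" + n);
        }
        // invokeAll feeds a shared queue; any idle thread picks up the next
        // task, so the fast rules don't wait behind the slow one.
        List<Future<String>> results = pool.invokeAll(tasks);
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

The per-task overhead mentioned later in this thread is real, though: wrapping every rule in its own Callable only pays off if the tasks are coarse enough.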
Dominique, you're right, it would make more sense to feed the rules to free threads instead of splitting them into groups before the work begins. Another thing that could be parallelized is text analysis (splitting sentences into words, assigning POS tags).
It's probably as easy as this: http://stackoverflow.com/questions/9916264/java-thread-simple-queue/9916299#9916299, but I just tried it and it actually makes checking slower, probably because creating all those Callables also comes with some overhead.
I just did a quick performance test on Pattern vs. Set.
Code:
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String[] tokensToTest = {
            // Hits
            "a", "de", "pa", "so", "con", "en", "per", "por", "tou", "que",
            // Misses
            "abc", "def", "ghi", "x", "y", "z", "0", "1", "2", "test"
        };
        Pattern pattern = Pattern.compile("a|de|pa|so|con|en|per|por|tou|que");
        Set<String> values = new HashSet<>();
        values.add("a");
        values.add("de");
        values.add("pa");
        values.add("so");
        values.add("con");
        values.add("en");
        values.add("per");
        values.add("por");
        values.add("tou");
        values.add("que");
        int testRuns = 1000000;
        long t0_set = System.currentTimeMillis();
        for (int i = 0; i < testRuns; i++) {
            for (String token : tokensToTest) {
                values.contains(token);
            }
        }
        long t1_set = System.currentTimeMillis();
        long t0_pattern = System.currentTimeMillis();
        for (int i = 0; i < testRuns; i++) {
            for (String token : tokensToTest) {
                pattern.matcher(token).matches();
            }
        }
        long t1_pattern = System.currentTimeMillis();
        System.out.printf("Set test    : %d\n", t1_set - t0_set);
        System.out.printf("Pattern test: %d\n", t1_pattern - t0_pattern);
    }
}
On my Intel Core 2 Duo (3 GHz), Java 7 (u40, 64 Bit, no special JVM parameters) the results are:
Set test : 311
Pattern test: 3423
Then I searched for all regexp tokens and exceptions.
Regexp used to find regexp tokens: <(token|exception)[\s\w"=]* regexp="yes"[\s\w"=]*>
Regexp used to find possible multivalue tokens: <(token|exception)[\s\w"=]* regexp="yes"[\s\w"=]*>[ a-zA-Z0-9,|-]*</(token|exception)
Here's a list of all languages that have at least 200 regexp tokens defined. I guess all languages with less than this amount are not relevant to this improvement. All XML files of a language were scanned (grammar.xml, disambiguation.xml, bitext.xml).
Language | # of regexp tokens | # of multivalue tokens
-------------|--------------------|--------------------
br (disamb.) | 38 | 5
br (grammar) | 762 | 85
ca (disamb.) | 475 | 171
ca (grammar) | 1180 | 325
de (disamb.) | 1 | 0
de (grammar) | 2045 | 458
en (disamb.) | 77 | 30
en (grammar) | 583 | 417
eo (disamb.) | 4 | 0
eo (grammar) | 243 | 28
fr (disamb.) | 89 | 26
fr (grammar) | 2966 | 657
nl (disamb.) | 4 | 0
nl (grammar) | 287 | 133
pl (disamb.) | 52 | 6
pl (bitext) | 2 | 2
pl (grammar) | 745 | 111
ro (grammar) | 336 | 137
ru (disamb.) | 18 | 0
ru (bitext) | 1 | 1
ru (grammar) | 235 | 3
zh (grammar) | 201 | 0
That is much less than I expected. The major languages here (ca, de, fr) have like 1/4 to 1/3 possible multivalue tokens.
I expected this change to be measurable, but with these results I guess there won't be any benefit unless you do thousands of checks in one of the languages with big rule files. And even then the difference cannot be felt, only measured, and only as a very small difference.
@dpelle , @danielnaber
Parallelizing the rule processing might work for standalone applications or for server farms designed to "map and reduce", but on old-fashioned servers you cannot parallelize as much as you want to. In such scenarios LanguageTool is just a minor part of the whole system and should therefore use as few resources as possible.
From my experience, such servers have a small number of CPUs (or cores), typically up to 16, and they already have to serve many HTTP requests (in the old-fashioned one-thread-per-request model).
BTW, the standalone application is using the MultithreadedJLanguageTool, right? The combination might become a problem then.
@slotties From your estimate, it seems that multivalue tokens are processed ten times quicker when stored as Sets. If this is correct, then we'd expect a 10-fold speedup for the roughly 1/3 of regexp tokens that are plain disjunctions. I think it's worth trying. Also, remember that some multivalue tokens contain additional regexes for speedup (like 'a[bc]|ada|foobars?'); if we linearize such regexes, there might be more multivalue tokens to start with. Disjunction is THE bottleneck of regexes.
The reason why I think it's worth trying is that I had a fairly long multivalue token in one of the rules (comprising around 100 items), and removing it (by adding a new POS tag) gave a significant speed increase.
Anyway, for short texts, LT is already quite fast.
Stefan Lotties [email protected] wrote:
I just did a quick performance test on Pattern vs. Set. [...]
That is much less than I expected. The major languages here (ca, de, fr) have like 1/4 to 1/3 possible multivalue tokens.
So a Set is 10x faster to check than a regexp in your example. That's useful to know. But we'd need the overall performance gain, since checking multi-value regexps may be a small fraction of what LT does. I see many regexps though with ? [..] Sometimes they can still be transformed into multivalue tokens, but not all the time. Example:
les?|la|du|des|aux?
could be written as:
le|les|la|du|des|au|aux
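As a sanity check, a small test (illustrative, not from the LT codebase; Set.of assumes Java 9+) can verify that such a hand-linearized Set accepts exactly the same tokens as the original regexp:

```java
import java.util.Set;
import java.util.regex.Pattern;

public class LinearizeCheck {
    public static void main(String[] args) {
        // The original regexp token with '?' quantifiers ...
        Pattern regexp = Pattern.compile("les?|la|du|des|aux?");
        // ... and its hand-linearized multivalue form as a Set.
        Set<String> multivalue = Set.of("le", "les", "la", "du", "des", "au", "aux");
        // Both forms should accept and reject exactly the same tokens.
        for (String token : new String[] {"le", "les", "la", "du", "des", "au", "aux", "de", "lesa"}) {
            boolean r = regexp.matcher(token).matches();
            boolean s = multivalue.contains(token);
            System.out.println(token + " regexp=" + r + " set=" + s);
        }
    }
}
```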
Regexp used to find possible multivalue tokens:
<(token|exception)[\s\w"=]* regexp="yes"[\s\w"=]*>[ a-zA-Z0-9,|-]*</(token|exception)
This does not find multivalues with diacritics or non-ASCII characters, for example, so you get 0 for Russian or Chinese (and smaller numbers for French, Esperanto, etc.).
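Java's regex engine supports Unicode property classes, so a diacritics-safe variant of the scan would be possible. A small illustration (the patterns here are simplified stand-ins, not the exact scan regexes):

```java
import java.util.regex.Pattern;

public class UnicodeScan {
    public static void main(String[] args) {
        // ASCII-only character class, as in the original scan: misses diacritics.
        Pattern ascii = Pattern.compile("[ a-zA-Z0-9,|-]+");
        // \p{L} matches any Unicode letter, so umlauts and Cyrillic count too.
        Pattern unicode = Pattern.compile("[ \\p{L}0-9,|-]+");
        for (String content : new String[] {"an|auf|durch", "an|auf|für", "кто|что"}) {
            System.out.println(content + " ascii=" + ascii.matcher(content).matches()
                    + " unicode=" + unicode.matcher(content).matches());
        }
    }
}
```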
Another thing to keep in mind: I think that some regexps are "hotter" than others. What I mean by that is that some regexps are checked far more often than others. I suppose that the first token of a rule is checked more often than the other tokens, because if it fails to match, we don't even bother checking the 2nd, 3rd (etc.) tokens. So intuitively, the regexp in the first token needs to be more optimized than those in the other tokens. It might be interesting to add debug counters to find out how many times each regexp is checked. It's mostly the "hot" regexps (checked many times) that deserve to be optimized.
Dominique
Yes, the search for regexp tokens was quite naive, just a first approximation for rough numbers. The difference between Pattern and Set might be 10 to 1, but take care of the scale: it takes at least one million loop runs to measure a difference in this test.
Do you know of any better, practical tests that already exist? Maybe the language-specific PatternRuleTests?
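One caveat about the micro-benchmark above: the loop results are discarded, so a JIT could in principle optimize the work away, and there is no warmup phase before timing. A slightly safer sketch (still no substitute for a harness like JMH, or for LT's own profiling mode) consumes the results and warms up first:

```java
import java.util.Set;

public class BenchSketch {
    // Returns elapsed millis; counting hits keeps the loop observable,
    // so the JIT cannot eliminate the lookups as dead code.
    static long timeSetLookups(Set<String> values, String[] tokens, int runs) {
        long hits = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            for (String token : tokens) {
                if (values.contains(token)) {
                    hits++;
                }
            }
        }
        long t1 = System.nanoTime();
        System.out.println("hits=" + hits); // consume the result
        return (t1 - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        Set<String> values = Set.of("a", "de", "pa", "so", "con");
        String[] tokens = {"a", "de", "xyz", "test"};
        timeSetLookups(values, tokens, 10_000); // warmup for the JIT
        long ms = timeSetLookups(values, tokens, 1_000_000);
        System.out.println("Set test: " + ms + " ms");
    }
}
```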
LT has a profiling mode that you can run from the command line. It runs the same rule(s) and gives the average runtime.
Does it make sense to keep this issue open? BTW, we'll have another optimization in 2.4 that helps with large rule sets and makes checking 20-30% faster.
Well, AFAIK, no tests using the profiling mode have been made. Maybe we should keep it open?
I haven't had time to perform the tests you suggested. And since we have another optimization (20-30%? that's a lot!), I think we can close this issue for now.
By the way, I did some experiments converting the regexes that already have disjunctions, and each one is just a little bit faster; overall, the speed (for Polish, on a 500-thousand-word corpus) increases from 460 to 470 sentences per minute. So this is not worth the effort.
Marcin Miłkowski [email protected] wrote:
By the way, I did some experiments by converting the regexes that already
have disjunctions and it's just a little bit faster, but overall the speed
increases (for Polish, on a 500 thousand word corpus) from 460 sentences to
470 sentences per minute.
Hi Marcin
Did you use a set to store the list of words in each disjunction?
Since what is performance critical is the lookup, a binary search in
a sorted array would be more efficient.
I don't code a lot in Java, but in C++ doing a binary search in a sorted std::vector<> (~ ArrayList in Java) is at least an order of magnitude faster than a find in a std::set<> (~ TreeSet in Java). Both of them are O(log(n)) of course, but:
- the std::vector<> is not only more compact in memory, but all its elements are stored next to each other (as opposed to being allocated all over the heap, as the elements of a std::set<> are), so in C++ the std::vector is more cache-friendly than the std::set
- furthermore, a find in a std::set has to follow pointers in a red-black tree, which is not as efficient as a simple binary search in a std::vector
Regards
Dominique
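In Java, the sorted-array lookup described above maps to Arrays.binarySearch. Note, though, that Marcin used a HashSet, which is hash-based with O(1) average lookup, unlike C++'s tree-based std::set, so the comparison doesn't carry over directly. A minimal sketch of both lookups (word list is illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SortedArrayLookup {
    public static void main(String[] args) {
        // The disjunction stored as a sorted array (compact, cache-friendly;
        // Arrays.binarySearch requires the array to be sorted) ...
        String[] sorted = {"au", "aux", "de", "des", "du", "la", "le", "les"};
        // ... and as a hash set (O(1) average lookup in Java).
        Set<String> set = new HashSet<>(Arrays.asList(sorted));
        for (String token : new String[] {"des", "les", "foo"}) {
            boolean inArray = Arrays.binarySearch(sorted, token) >= 0;
            boolean inSet = set.contains(token);
            System.out.println(token + " array=" + inArray + " set=" + inSet);
        }
    }
}
```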
Hi Dominique,
I used a HashSet, but it wouldn't matter much anyway, as these lists are usually quite small, so even a linear search would be fast. I did some profiling, and lowercasing took more time than the call to contains(). I could have precomputed the lowercase token (and used the precomputed version instead of equalsIgnoreCase), but even if matching gives a maximum speed increase of 10% on rules with long disjunctions, that's negligible compared with the rest of the rules. We could apply this as well, but it only makes the code slightly more complex, and harder to replace with finite-state machine matching (which won't be entirely possible anyway, I guess, because of the variables, i.e. dynamic token references to elements in the rule).
Regards,
Marcin
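The precomputation Marcin mentions would look roughly like this (a sketch with made-up values, not LT code): lowercase the incoming token once, instead of calling equalsIgnoreCase against every element of the list:

```java
import java.util.Locale;
import java.util.Set;

public class LowercaseOnce {
    public static void main(String[] args) {
        // Disjunction values stored already lowercased at rule-load time.
        Set<String> values = Set.of("le", "la", "les", "des");
        for (String token : new String[] {"Les", "DES", "foo"}) {
            // One toLowerCase per token, instead of one case-insensitive
            // comparison per list element.
            String lower = token.toLowerCase(Locale.ROOT);
            System.out.println(token + " -> " + values.contains(lower));
        }
    }
}
```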