Comments (17)
We should first measure if this really helps before we introduce any complexity.
from languagetool.
Stefan Lotties [email protected] wrote:
I've noticed that quite a few tokens in the grammar.xml (at least in the German one) use regexp tokens but just define a list of possible values, e.g. like this: an|auf|durch|für|gegen|über
It would be nice to introduce multivalue tokens for such cases, maybe like this: an|auf|durch|für|gegen|über
Either multivalue or regexp may be defined, but not both.
I think it would be best to store the token values as a Set, because Element.isStringTokenMatched() could then simply use Set.contains(), which is much faster than a regular expression that just holds a list of possible values.
It's not clear whether existing rules should be changed, or whether there should be a mechanism that checks the token value and interprets it as a multivalue token instead of a regular expression.
It's already known that most of the time is spent in the method for matching rules (see New Pattern Matching @ http://wiki.languagetool.org/missing-features), so this improvement will not change the world, but it will slightly improve performance.
Constraints:
- the old way must still work
- the old way must still work
The question is: will it be 10% faster, or 0.01% faster?
I'm guessing that the speed gain will be negligible, and if so, it's not worth it.
Regexps can be surprisingly fast once compiled (but admittedly a multi-value token can cause lots of backtracking, so it can be slow).
If we can prove that it will speed up (with a prototype), then I'm all for
it.
I think that there is more room for improvement elsewhere.
For example, I only glanced at the code and it seems that
thread load balancing is done by giving each thread
more or less the same number of rules to process. Correct?
However, the Hunspell rule, for example, is much slower than the other grammar rules. So the thread that has to check Hunspell (plus other grammar rules) will be a bottleneck, and the other threads will wait for it. I may be wrong since I don't know this code well, so it's just a hunch. Having a queue, and having the threads in the pool pick rules to process from that queue rather than being assigned a fixed number of rules per thread, may improve load balancing and thus speed up LT, hopefully significantly. Or fewer rules could be assigned to the thread doing Hunspell. I remember seeing that CPU utilization was much lower than the number of cores on my machine, so it seems that something can be improved there.
Dominique
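A minimal sketch of the queue-based approach described above, using a plain ExecutorService (the rule names and timings are made up for illustration, not taken from the LT codebase). Worker threads pull tasks from the executor's internal queue, so one slow task doesn't pin a fixed, unlucky partition to a single thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RuleQueueSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Callable<String>> tasks = new ArrayList<>();
        // One slow task, standing in for the Hunspell rule ...
        tasks.add(() -> { Thread.sleep(50); return "slow-rule"; });
        // ... and many fast tasks, standing in for ordinary pattern rules.
        for (int i = 0; i < 20; i++) {
            final int n = i;
            tasks.add(() -> "fast-rule-" + n);
        }
        // invokeAll feeds a shared queue; any idle thread picks up the next
        // task, so the fast rules don't wait behind the slow one.
        List<Future<String>> results = pool.invokeAll(tasks);
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

The per-task overhead mentioned later in this thread is real, though: wrapping every rule in its own Callable only pays off if the tasks are coarse enough.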
Dominique, you're right, it would make more sense to feed the rules to free threads instead of splitting them into groups before the work begins. Another thing that could be parallelized is text analysis (splitting sentences into words, assigning POS tags).
It's probably as easy as this: http://stackoverflow.com/questions/9916264/java-thread-simple-queue/9916299#9916299, but I just tried it and it actually makes checking slower, probably because creating all those Callables also comes with some overhead.
I just did a quick performance test on Pattern vs. Set.
Code:
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String[] tokensToTest = {
            // Hits
            "a", "de", "pa", "so", "con", "en", "per", "por", "tou", "que",
            // Misses
            "abc", "def", "ghi", "x", "y", "z", "0", "1", "2", "test"
        };
        Pattern pattern = Pattern.compile("a|de|pa|so|con|en|per|por|tou|que");
        Set<String> values = new HashSet<>();
        values.add("a");
        values.add("de");
        values.add("pa");
        values.add("so");
        values.add("con");
        values.add("en");
        values.add("per");
        values.add("por");
        values.add("tou");
        values.add("que");
        int testRuns = 1000000;
        long t0_set = System.currentTimeMillis();
        for (int i = 0; i < testRuns; i++) {
            for (String token : tokensToTest) {
                values.contains(token);
            }
        }
        long t1_set = System.currentTimeMillis();
        long t0_pattern = System.currentTimeMillis();
        for (int i = 0; i < testRuns; i++) {
            for (String token : tokensToTest) {
                pattern.matcher(token).matches();
            }
        }
        long t1_pattern = System.currentTimeMillis();
        System.out.printf("Set test    : %d\n", t1_set - t0_set);
        System.out.printf("Pattern test: %d\n", t1_pattern - t0_pattern);
    }
}
On my Intel Core 2 Duo (3 GHz), Java 7 (u40, 64 Bit, no special JVM parameters) the results are:
Set test : 311
Pattern test: 3423
Then I searched for all regexp tokens and exceptions.
Regexp used to find regexp tokens: <(token|exception)[\s\w"=]* regexp="yes"[\s\w"=]*>
Regexp used to find possible multivalue tokens: <(token|exception)[\s\w"=]* regexp="yes"[\s\w"=]*>[ a-zA-Z0-9,|-]*</(token|exception)
Here's a list of all languages that have at least 200 regexp tokens defined. I guess all languages with less than this amount are not relevant to this improvement. All XML files of a language were scanned (grammar.xml, disambiguation.xml, bitext.xml).
Language | # of regexp tokens | # of multivalue tokens
-------------|--------------------|--------------------
br (disamb.) | 38 | 5
br (grammar) | 762 | 85
ca (disamb.) | 475 | 171
ca (grammar) | 1180 | 325
de (disamb.) | 1 | 0
de (grammar) | 2045 | 458
en (disamb.) | 77 | 30
en (grammar) | 583 | 417
eo (disamb.) | 4 | 0
eo (grammar) | 243 | 28
fr (disamb.) | 89 | 26
fr (grammar) | 2966 | 657
nl (disamb.) | 4 | 0
nl (grammar) | 287 | 133
pl (disamb.) | 52 | 6
pl (bitext) | 2 | 2
pl (grammar) | 745 | 111
ro (grammar) | 336 | 137
ru (disamb.) | 18 | 0
ru (bitext) | 1 | 1
ru (grammar) | 235 | 3
zh (grammar) | 201 | 0
That is much less than I expected. The major languages here (ca, de, fr) have like 1/4 to 1/3 possible multivalue tokens.
I expected this change to be measurable, but with these results I guess there won't be any benefit unless you do thousands of checks in one of the languages with big rule files. And even then the difference cannot be felt, only measured, and only as a very small difference.
@dpelle , @danielnaber
Parallelizing the rule processing might work for standalone applications or for server farms designed to "map and reduce", but on old-fashioned servers you cannot parallelize as much as you want to. In such scenarios LanguageTool is just a minor part of the whole system and should therefore use as few resources as possible.
From my experience, such servers have a small number of CPUs (or cores), typically up to 16, and they already have to serve many HTTP requests (in the old-fashioned one-thread-per-request model).
BTW, the standalone application is using the MultithreadedJLanguageTool, right? The combination might become a problem then.
@slotties From your estimate, it seems that multivalue tokens are processed ten times quicker when stored as Sets. If this is correct, then we'd expect a 10-fold speedup for the roughly 1/3 of regexp tokens that are plain disjunctions. I think it's worth trying. Also, remember that some multivalue tokens contain additional regexes for speedup (like 'a[bc]|ada|foobars?'); if we linearize such regexes, there might be more multivalue tokens to start with. Disjunction is THE bottleneck of regexes.
The reason why I think it's worth trying is that I had a fairly long multivalue token in one of the rules (comprising around 100 items), and removing it (by adding a new POS tag) gave a significant speed increase.
Anyway, for short texts, LT is already quite fast.
Stefan Lotties [email protected] wrote:
I just did a quick performance test on Pattern vs. Set. [...]
That is much less than I expected. The major languages here (ca, de, fr) have like 1/4 to 1/3 possible multivalue tokens.
So a Set is 10x faster to check than a regexp in your example. That's useful to know. But we'd need the overall performance gain, since checking multi-value regexps may be a small fraction of what LT does. I see many regexps though with ? [..] Sometimes they can still be transformed into multivalue tokens, but not all the time. Example:
les?|la|du|des|aux?
could be written as:
le|les|la|du|des|au|aux
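As a sanity check, a small test (illustrative, not from the LT codebase; Set.of assumes Java 9+) can verify that such a hand-linearized Set accepts exactly the same tokens as the original regexp:

```java
import java.util.Set;
import java.util.regex.Pattern;

public class LinearizeCheck {
    public static void main(String[] args) {
        // The original regexp token with '?' quantifiers ...
        Pattern regexp = Pattern.compile("les?|la|du|des|aux?");
        // ... and its hand-linearized multivalue form as a Set.
        Set<String> multivalue = Set.of("le", "les", "la", "du", "des", "au", "aux");
        // Both forms should accept and reject exactly the same tokens.
        for (String token : new String[] {"le", "les", "la", "du", "des", "au", "aux", "de", "lesa"}) {
            boolean r = regexp.matcher(token).matches();
            boolean s = multivalue.contains(token);
            System.out.println(token + " regexp=" + r + " set=" + s);
        }
    }
}
```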
Regexp used to find possible multivalue tokens:
<(token|exception)[\s\w"=]* regexp="yes"[\s\w"=]*>[ a-zA-Z0-9,|-]*</(token|exception)
This does not find multivalues with diacritics or non-ASCII characters, for example, so you get 0 for Russian or Chinese (and smaller numbers for French, Esperanto, etc.).
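Java's regex engine supports Unicode property classes, so a diacritics-safe variant of the scan would be possible. A small illustration (the patterns here are simplified stand-ins, not the exact scan regexes):

```java
import java.util.regex.Pattern;

public class UnicodeScan {
    public static void main(String[] args) {
        // ASCII-only character class, as in the original scan: misses diacritics.
        Pattern ascii = Pattern.compile("[ a-zA-Z0-9,|-]+");
        // \p{L} matches any Unicode letter, so umlauts and Cyrillic count too.
        Pattern unicode = Pattern.compile("[ \\p{L}0-9,|-]+");
        for (String content : new String[] {"an|auf|durch", "an|auf|für", "кто|что"}) {
            System.out.println(content + " ascii=" + ascii.matcher(content).matches()
                    + " unicode=" + unicode.matcher(content).matches());
        }
    }
}
```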
Another thing to keep in mind: I think that some regexps are "hotter" than others. What I mean by that is that some regexps are checked far more often than others. I suppose that the first token of a rule is checked more often than the other tokens, because if it fails to match, we don't even bother checking the 2nd, 3rd (etc.) tokens. So intuitively, the regexp in the first token needs to be more optimized than those in the other tokens. It might be interesting to add debug counters to find out how many times each regexp is checked. It's mostly the "hot" regexps (checked many times) that deserve to be optimized.
Dominique
Yes, the search for regexp tokens was quite naive, just a first approximation for rough numbers. The difference between Pattern and Set might be 10 to 1, but take care of the scale: it takes at least one million loop runs to measure a difference in this test.
Do you know of any better, practical tests that already exist? Maybe the language-specific PatternRuleTests?
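One caveat about the micro-benchmark above: the loop results are discarded, so a JIT could in principle optimize the work away, and there is no warmup phase before timing. A slightly safer sketch (still no substitute for a harness like JMH, or for LT's own profiling mode) consumes the results and warms up first:

```java
import java.util.Set;

public class BenchSketch {
    // Returns elapsed millis; counting hits keeps the loop observable,
    // so the JIT cannot eliminate the lookups as dead code.
    static long timeSetLookups(Set<String> values, String[] tokens, int runs) {
        long hits = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            for (String token : tokens) {
                if (values.contains(token)) {
                    hits++;
                }
            }
        }
        long t1 = System.nanoTime();
        System.out.println("hits=" + hits); // consume the result
        return (t1 - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        Set<String> values = Set.of("a", "de", "pa", "so", "con");
        String[] tokens = {"a", "de", "xyz", "test"};
        timeSetLookups(values, tokens, 10_000); // warmup for the JIT
        long ms = timeSetLookups(values, tokens, 1_000_000);
        System.out.println("Set test: " + ms + " ms");
    }
}
```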
LT has a profiling mode that you can run from the command line. It runs the same rule(s) and gives the average runtime.
Does it make sense to keep this issue open? BTW, we'll have another optimization in 2.4 that helps with large rule sets and makes checking 20-30% faster.
Well, AFAIK, no tests using the profiling mode have been made. Maybe we should keep it open?
I haven't had time to perform the tests you suggested. And since we have another optimization (20-30%? that's a lot!), I think we can close this issue for now.
By the way, I did some experiments converting the regexes that already have disjunctions, and each one is just a little bit faster; overall, the speed (for Polish, on a 500-thousand-word corpus) increases from 460 to 470 sentences per minute. So this is not worth the effort.
Marcin Miłkowski [email protected] wrote:
By the way, I did some experiments by converting the regexes that already
have disjunctions and it's just a little bit faster, but overall the speed
increases (for Polish, on a 500 thousand word corpus) from 460 sentences to
470 sentences per minute.
Hi Marcin
Did you use a set to store the list of words in each disjunction?
Since what is performance critical is the lookup, a binary search in
a sorted array would be more efficient.
I don't code a lot in Java, but in C++ doing a binary search in a sorted std::vector<> (~ ArrayList in Java) is at least an order of magnitude faster than a find in a std::set<> (~ TreeSet in Java). Both of them are O(log(n)) of course, but:
- the std::vector<> is not only more compact in memory, but all its elements are stored next to each other (as opposed to being allocated all over the heap, as the elements of a std::set<> are), so in C++ the std::vector is more cache-friendly than the std::set
- furthermore, a find in a std::set has to follow pointers in a red-black tree, which is not as efficient as a simple binary search in a std::vector
Regards
Dominique
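In Java, the sorted-array lookup described above maps to Arrays.binarySearch. Note, though, that Marcin used a HashSet, which is hash-based with O(1) average lookup, unlike C++'s tree-based std::set, so the comparison doesn't carry over directly. A minimal sketch of both lookups (word list is illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SortedArrayLookup {
    public static void main(String[] args) {
        // The disjunction stored as a sorted array (compact, cache-friendly;
        // Arrays.binarySearch requires the array to be sorted) ...
        String[] sorted = {"au", "aux", "de", "des", "du", "la", "le", "les"};
        // ... and as a hash set (O(1) average lookup in Java).
        Set<String> set = new HashSet<>(Arrays.asList(sorted));
        for (String token : new String[] {"des", "les", "foo"}) {
            boolean inArray = Arrays.binarySearch(sorted, token) >= 0;
            boolean inSet = set.contains(token);
            System.out.println(token + " array=" + inArray + " set=" + inSet);
        }
    }
}
```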
Hi Dominique,
I used a HashSet, but it wouldn't matter much anyway, as these lists are usually quite small, so even a linear search would be fast. I did some profiling, and lowercasing took more time than the call to contains(). I could have precomputed the lowercase token (and used the precomputed version instead of equalsIgnoreCase), but even if matching gives a maximum speed increase of 10% on rules with long disjunctions, that's negligible compared with the rest of the rules. We could apply this as well, but it only makes the code slightly more complex, and harder to replace with finite-state machine matching (which won't be entirely possible anyway, I guess, because of the variables, i.e. dynamic token references to elements in the rule).
Regards,
Marcin
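The precomputation Marcin mentions would look roughly like this (a sketch with made-up values, not LT code): lowercase the incoming token once, instead of calling equalsIgnoreCase against every element of the list:

```java
import java.util.Locale;
import java.util.Set;

public class LowercaseOnce {
    public static void main(String[] args) {
        // Disjunction values stored already lowercased at rule-load time.
        Set<String> values = Set.of("le", "la", "les", "des");
        for (String token : new String[] {"Les", "DES", "foo"}) {
            // One toLowerCase per token, instead of one case-insensitive
            // comparison per list element.
            String lower = token.toLowerCase(Locale.ROOT);
            System.out.println(token + " -> " + values.contains(lower));
        }
    }
}
```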