Hi, I think "multiLanguage" method doesn't work with Japanese.

"multiLanguage" method doesn't work with Japanese. about lunr-languages HOT 8 OPEN

mihaivalentin commented on August 21, 2024 3

"multiLanguage" method doesn't work with Japanese.

from lunr-languages.

Comments (8)

biosocket commented on August 21, 2024 1

It seems like the Japanese segmenter is not running on the search string.
If you search for more than one Japanese word without a space between the words (which is how the Japanese write, right?), no results are returned.

To clarify, when you index the sentence, この文章は日本語で書かれています, the segmenter breaks it down into:
い
て
ます
れ
文章
日本語
書か

But if you search "書かれ" without a space between the phrases, nothing is found.
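A minimal sketch of that mismatch, using a simplified term lookup rather than lunr's actual search. The `matchesAll` helper is hypothetical, and the segmentation of the query into 書か and れ is an assumption that mirrors how the indexed sentence was split above:

```javascript
// Terms produced when the sentence was indexed (taken from the list above).
const indexedTerms = new Set(["い", "て", "ます", "れ", "文章", "日本語", "書か"]);

// Hypothetical helper: true only if every query token is an indexed term.
// This is a simplified lookup, not lunr's scoring or query parser.
function matchesAll(queryTokens) {
  return queryTokens.every((token) => indexedTerms.has(token));
}

// The query passed through unsegmented, as a single token.
const unsegmented = ["書かれ"];

// Assumed segmentation of the same query, mirroring the indexed sentence.
const segmented = ["書か", "れ"];

console.log(matchesAll(unsegmented)); // false
console.log(matchesAll(segmented));   // true
```

If the query is never segmented, the single token 書かれ has no counterpart among the indexed terms, which would explain the empty result.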


railsstudent commented on August 21, 2024

The stemmer function of Japanese looks like the following:
lunr.jp.stemmer = (function() {
  /* TODO japanese stemmer */
  return function(word) {
    return word;
  }
})();

Could this be the cause of the empty result array?

My work projects also require Japanese search capability, and the results are less than ideal.


railsstudent commented on August 21, 2024

For Japanese, lunr behaves like an exact-match search rather than an index search.


skoji commented on August 21, 2024

I found that use(lunr.multiLanguage('ja')); is worse than use(lunr.ja)

Here is a slightly modified version of @rikuson's example:
https://jsbin.com/hanacul/edit?html,output


knubie commented on August 21, 2024

This happens because the Japanese language module implements a custom tokenizer in addition to the standard pipeline functions. The multilang plugin only adds the pipeline functions for each language and doesn't touch the tokenizer. The multilang plugin should be modified to compose the tokenizers of all languages in the list.

For the record this is how to work around the issue:

var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ru.js')(lunr);
require('./lunr.multi.js')(lunr);

var idx = lunr(function () {
  // "en" is not required above because English support is built into lunr.js
  this.use(lunr.multiLanguage('en', 'ru'));
  // Compose the japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    lunr.ja.tokenizer(x).concat(lunr.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});


simonbate commented on August 21, 2024

I've attempted this fix, but it fails.
It's unclear why the example uses 'ru', where I would think we want 'ja' throughout.
It failed when I tried using 'ru' as indicated; it also fails when using 'ja' where it would seem more appropriate.

This is the error I get when generating the index:

/Users/simonfbate/node_modules/lunr/lunr.js:673
for (var j = 0; j < tokens.length; j++) {
^
TypeError: Cannot read properties of undefined (reading 'length')
at lunr.Pipeline.run (/Users/simonfbate/node_modules/lunr/lunr.js:673:32)
at lunr.Builder.add (/Users/simonfbate/node_modules/lunr/lunr.js:2482:31)
at lunr.Builder.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT/out/XXX_jp/ui/js/build_index.js:53:12)
at Array.forEach (<anonymous>)
at lunr.Builder.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT/out/XXX_jp/ui/js/build_index.js:52:15)
at lunr (/Users/simonfbate/node_modules/lunr/lunr.js:53:10)
at Socket.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT_/out/XXX_jp/ui/js/build_index.js:36:13)
at Socket.emit (node:events:532:35)
at endReadableNT (node:internal/streams/readable:1346:12)
at processTicksAndRejections (node:internal/process/task_queues:83:21)



knubie commented on August 21, 2024

@simonbate Sorry, the 'ru' in the example is probably a typo. I think I also forgot to add a return statement to the function (I normally write Clojure and Ruby, which don't need explicit returns, so that often trips me up).

var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ja.js')(lunr);
require('./lunr.multi.js')(lunr);

var idx = lunr(function () {
  // "en" is not required above because English support is built into lunr.js
  this.use(lunr.multiLanguage('en', 'ja'));
  // Compose the japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    return lunr.tokenizer(x).concat(lunr.ja.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});
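
To illustrate why the missing return mattered, here is a small sketch that runs without lunr. `whitespaceTokenizer` and `charTokenizer` are hypothetical stand-ins, not `lunr.tokenizer` or `lunr.ja.tokenizer`, and they only return plain strings:

```javascript
// Hypothetical stand-ins for lunr.tokenizer and lunr.ja.tokenizer.
const whitespaceTokenizer = (text) => text.split(/\s+/).filter(Boolean);
const charTokenizer = (text) => Array.from(text.replace(/\s+/g, ""));

// Without `return`, a block-bodied function yields undefined,
// even though the concatenated array is computed inside it.
const broken = function (x) {
  whitespaceTokenizer(x).concat(charTokenizer(x));
};

// With `return`, the caller receives the combined token array.
const fixed = function (x) {
  return whitespaceTokenizer(x).concat(charTokenizer(x));
};

console.log(broken("ab c")); // undefined
console.log(fixed("ab c"));  // [ 'ab', 'c', 'a', 'b', 'c' ]
```

A tokenizer that returns undefined is consistent with the "Cannot read properties of undefined (reading 'length')" error reported earlier in the thread, since the pipeline then tries to loop over a token list that does not exist.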


simonbate commented on August 21, 2024

THANK YOU! Works great now.

Simon
