
Comments (8)

biosocket commented on August 21, 2024

It seems the Japanese segmenter is not being run on the search string.
If you search for more than one Japanese word without a space between the words (which is how Japanese is normally written, right?), no results are returned.

To clarify, when you index the sentence, この文章は日本語で書かれています, the segmenter breaks it down into:


ます

文章
日本語
書か

But if you search "書かれ" without a space between the phrases, nothing is found.
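
A library-free sketch of the mismatch described above, assuming the index stores segmented tokens and the query is looked up as a single unsegmented term. The `lookup` helper and the `れ` query piece are hypothetical illustrations, not lunr's API or its actual segmenter output:

```javascript
// Tokens taken from the segmenter output listed above (illustrative subset).
const indexTokens = new Set(["ます", "文章", "日本語", "書か"]);

// Hypothetical exact-match lookup: a query term matches only when the whole
// term equals one indexed token.
function lookup(queryTerms) {
  return queryTerms.filter((term) => indexTokens.has(term));
}

// Query passed through as one term, i.e. not segmented.
const unsegmented = lookup(["書かれ"]);
// Query split the same way the indexed text was (assumed split).
const segmented = lookup(["書か", "れ"]);

console.log(unsegmented, segmented);
```

If this is what is happening, the query would need to go through the same segmenter as the indexed text before any term can match.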

from lunr-languages.

railsstudent commented on August 21, 2024

The stemmer function of Japanese looks like the following:
lunr.jp.stemmer = (function() {
    /* TODO japanese stemmer */
    return function(word) {
        return word;
    };
})();

Could this be the cause of the empty result array?

My work projects also require Japanese search capability, and the results are less than ideal.


railsstudent commented on August 21, 2024

For Japanese, lunr behaves like an exact search instead of an index search.


skoji commented on August 21, 2024

I found that use(lunr.multiLanguage('ja')); is worse than use(lunr.ja)

Here is a slightly modified version of @rikuson's example:
https://jsbin.com/hanacul/edit?html,output


knubie commented on August 21, 2024

This is because Japanese implements a custom tokenizer in addition to the standard pipeline functions. The multilang plugin only adds the pipeline functions for each language and doesn't touch the tokenizer. The multilang plugin should be modified to compose all of the tokenizers in the language list.

For the record, this is how to work around the issue:

var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ru.js')(lunr);
require('./lunr.multi.js')(lunr);

var idx = lunr(function () {
  // "en" is not required above because English is built into lunr.js
  this.use(lunr.multiLanguage('en', 'ru'));
  // Compose the japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    lunr.ja.tokenizer(x).concat(lunr.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});


simonbate commented on August 21, 2024

I've attempted this fix, but it fails.
It's unclear why the example uses 'ru', where I would think we want 'ja' throughout.
It failed when I tried using 'ru' as indicated; it also fails when using 'ja' where it would seem more appropriate.

This is the error I get when generating the index:

/Users/simonfbate/node_modules/lunr/lunr.js:673
    for (var j = 0; j < tokens.length; j++) {
                                           ^
TypeError: Cannot read properties of undefined (reading 'length')
at lunr.Pipeline.run (/Users/simonfbate/node_modules/lunr/lunr.js:673:32)
at lunr.Builder.add (/Users/simonfbate/node_modules/lunr/lunr.js:2482:31)
at lunr.Builder.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT/out/XXX_jp/ui/js/build_index.js:53:12)
at Array.forEach (<anonymous>)
at lunr.Builder.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT/out/XXX_jp/ui/js/build_index.js:52:15)
at lunr (/Users/simonfbate/node_modules/lunr/lunr.js:53:10)
at Socket.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT_/out/XXX_jp/ui/js/build_index.js:36:13)
at Socket.emit (node:events:532:35)
at endReadableNT (node:internal/streams/readable:1346:12)
at processTicksAndRejections (node:internal/process/task_queues:83:21)


knubie commented on August 21, 2024

@simonbate Sorry, the ru in the example is probably a typo. I think I also forgot to add a return statement to the function (I normally write Clojure and Ruby, which don't need explicit returns, so that often trips me up).

var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ja.js')(lunr);
require('./lunr.multi.js')(lunr);

var idx = lunr(function () {
  // "en" is not required above because English is built into lunr.js
  this.use(lunr.multiLanguage('en', 'ja'));
  // Compose the japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    return lunr.tokenizer(x).concat(lunr.ja.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});
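
A minimal, library-free sketch of the same idea, in case it helps to see why the missing return mattered. The two stand-in tokenizers and the `composeTokenizers` helper are hypothetical and return plain string arrays; they are not lunr's or lunr-languages' implementations:

```javascript
// Stand-in tokenizers (hypothetical; not lunr's implementations).
const spaceTokenizer = (text) => text.split(/\s+/).filter(Boolean);
const charTokenizer = (text) => Array.from(text.replace(/\s+/g, ""));

// Compose any number of tokenizers by concatenating their token arrays.
function composeTokenizers(...tokenizers) {
  return (text) => tokenizers.flatMap((tokenize) => tokenize(text));
}

// Buggy variant: the concatenated array is built but never returned,
// so callers receive undefined and fail when they read `.length`.
const broken = function (text) {
  spaceTokenizer(text).concat(charTokenizer(text));
};

const composed = composeTokenizers(spaceTokenizer, charTokenizer);
const tokens = composed("ab c");
const brokenResult = broken("ab c");

console.log(tokens, brokenResult);
```

A helper along these lines is roughly what the multilang plugin would need in order to combine the tokenizers of every language in its list.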


simonbate commented on August 21, 2024

THANK YOU! Works great now.

Simon
