It seems the Japanese segmenter is not run on the search string.
If you search for more than one Japanese word without a space between the words (which is how Japanese is normally written, right?), no results are returned.
To clarify, when you index the sentence この文章は日本語で書かれています, the segmenter breaks it down into:
い
て
ます
れ
文章
日本語
書か
But if you search for "書かれ" without a space between the words, nothing is found.
from lunr-languages.
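The behaviour described above can be reproduced without lunr at all. The following is a minimal sketch, assuming the query is only split on whitespace while the index holds segmented terms, as this comment suspects. The `segment` function is a hypothetical toy (greedy longest match against the known terms), not the segmenter that lunr.ja actually uses.

```javascript
// Terms the thread reports for the indexed sentence (stand-in for a real index).
const indexedTerms = new Set(['い', 'て', 'ます', 'れ', '文章', '日本語', '書か']);

// Hypothetical toy segmenter: greedy longest match against the known terms.
// A real setup would call the same segmenter that was used at index time.
function segment(text, vocabulary) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let match = text[i]; // fall back to a single character
    for (const term of vocabulary) {
      if (text.startsWith(term, i) && term.length > match.length) {
        match = term;
      }
    }
    tokens.push(match);
    i += match.length;
  }
  return tokens;
}

// Whitespace splitting leaves the unspaced query as one term.
const byWhitespace = '書かれ'.split(/\s+/);
// Segmenting the query yields terms that exist in the index.
const bySegmenter = segment('書かれ', indexedTerms);

const allIndexed = (terms) => terms.every((t) => indexedTerms.has(t));
console.log(byWhitespace, allIndexed(byWhitespace)); // [ '書かれ' ] false
console.log(bySegmenter, allIndexed(bySegmenter)); // [ '書か', 'れ' ] true
```

The point of the sketch is only that the query has to pass through the same segmentation as the indexed text before its terms can match.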
The stemmer function of Japanese looks like the following:
lunr.jp.stemmer = (function() {
  /* TODO japanese stemmer */
  return function(word) {
    return word;
  }
})();
Could this be the cause of the empty result array?
My work projects also require Japanese search capability, and the results are less than ideal.
For Japanese, lunr behaves like an exact-match search rather than an index search.
I found that use(lunr.multiLanguage('ja')); is worse than use(lunr.ja).
Here is a slightly modified version of @rikuson's example:
https://jsbin.com/hanacul/edit?html,output
This happens because the Japanese plugin implements a custom tokenizer in addition to the standard pipeline functions. The multilang plugin only adds the pipeline functions for each language and doesn't touch the tokenizer. The multilang plugin should be modified to compose all of the tokenizers in the language list.
For the record, this is how to work around the issue:
var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ru.js')(lunr);
require('./lunr.multi.js')(lunr);
var idx = lunr(function () {
  // the reason "en" does not appear above is that "en" is built into lunr js
  this.use(lunr.multiLanguage('en', 'ru'));
  // Compose the japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    lunr.ja.tokenizer(x).concat(lunr.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});
I've attempted this fix, but it fails.
It's unclear why the example uses 'ru', where I would think we want 'ja' throughout.
It failed when I tried using 'ru' as indicated; it also fails when using 'ja' where it would seem more appropriate.
This is the error I get when generating the index:
/Users/simonfbate/node_modules/lunr/lunr.js:673
for (var j = 0; j < tokens.length; j++) {
^
TypeError: Cannot read properties of undefined (reading 'length')
at lunr.Pipeline.run (/Users/simonfbate/node_modules/lunr/lunr.js:673:32)
at lunr.Builder.add (/Users/simonfbate/node_modules/lunr/lunr.js:2482:31)
at lunr.Builder.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT/out/XXX_jp/ui/js/build_index.js:53:12)
at Array.forEach (<anonymous>)
at lunr.Builder.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT/out/XXX_jp/ui/js/build_index.js:52:15)
at lunr (/Users/simonfbate/node_modules/lunr/lunr.js:53:10)
at Socket.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT_/out/XXX_jp/ui/js/build_index.js:36:13)
at Socket.emit (node:events:532:35)
at endReadableNT (node:internal/streams/readable:1346:12)
at processTicksAndRejections (node:internal/process/task_queues:83:21)
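One plausible reading of this trace, assuming the tokenizer was assigned exactly as in the workaround: the function body evaluates the concat expression but never returns it, so the caller receives undefined and fails when it reads the length. The sketch below reproduces that with two hypothetical stand-in tokenizers in place of lunr.tokenizer and lunr.ja.tokenizer.

```javascript
// Stand-in tokenizers (hypothetical); the real ones would be
// lunr.tokenizer and lunr.ja.tokenizer.
const tokenizerA = (text) => text.split(/\s+/);
const tokenizerB = (text) => Array.from(text);

// Same shape as the workaround: the concat result is computed but never returned.
const brokenTokenizer = function (x) {
  tokenizerA(x).concat(tokenizerB(x));
};

const tokens = brokenTokenizer('abc');
console.log(tokens); // undefined

let errorName = null;
try {
  // Mirrors the loop in the stack trace, which reads tokens.length.
  for (let j = 0; j < tokens.length; j++) {}
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // TypeError
```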
@simonbate Sorry, the 'ru' in the example is probably a typo. I think I also forgot to add a return statement to the function (I normally write Clojure and Ruby, which don't need explicit returns, so that often trips me up).
var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ja.js')(lunr);
require('./lunr.multi.js')(lunr);
var idx = lunr(function () {
  // the reason "en" does not appear above is that "en" is built into lunr js
  this.use(lunr.multiLanguage('en', 'ja'));
  // Compose the japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    return lunr.tokenizer(x).concat(lunr.ja.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});
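The same idea generalizes to any number of languages, which is roughly what the earlier comment suggests the multilang plugin should do. This is a minimal sketch, not the plugin's code: `composeTokenizers` is a hypothetical helper, and the two tokenizers are simplified stand-ins that return plain strings, whereas lunr's own tokenizers return token objects.

```javascript
// Hypothetical helper: builds one tokenizer from several by
// concatenating the output of each, in order.
function composeTokenizers(...tokenizers) {
  return function (text) {
    return tokenizers.reduce((all, tokenize) => all.concat(tokenize(text)), []);
  };
}

// Stand-ins for lunr.tokenizer and lunr.ja.tokenizer (assumptions for illustration only).
const whitespaceTokenizer = (text) => text.split(/\s+/).filter(Boolean);
const toySegmenter = (text) => (text === '日本語検索' ? ['日本語', '検索'] : []);

const tokenize = composeTokenizers(whitespaceTokenizer, toySegmenter);
console.log(tokenize('日本語検索')); // [ '日本語検索', '日本語', '検索' ]
```

Concatenation keeps every tokenizer's view of the text, so an unsegmented term and its segments all end up in the token list.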
THANK YOU! Works great now.
Simon