
Comments (4)

karussell commented on August 17, 2024

Hmmh, the plugin could be overkill, as we only need decompounding for a small subset of the German language. It could also influence relevance negatively if we decompound 'baerwaldstraße' into "baer", "wald" and "straße". So we should just use the normal decompounding support from Elasticsearch and provide our own word list.

from photon.
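For reference, a minimal sketch of what the built-in approach could look like: Elasticsearch's `dictionary_decompounder` token filter splits compounds against an explicit word list. The filter name, analyzer name and the word-list entries below are illustrative placeholders, not photon's actual configuration.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["wald", "strasse"]
        }
      },
      "analyzer": {
        "german_street_names": {
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  }
}
```

A curated `word_list` (or `word_list_path` pointing to a file) keeps control over which subwords are produced, which is exactly how a too-aggressive split like "baer" could be avoided.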

hbruch commented on August 17, 2024

I explored elasticsearch's hyphenation_decompounder. Currently there are some issues that need further work:

  • ES currently requires an explicit dictionary (word_list) that must contain all subwords to be returned as tokens. The underlying Lucene token filter does not, so a custom plugin that instantiates the token filter without a word list would work (see this discussion).
  • the hyphenation token filter returns all subwords with offsets identical to the compound word's, which results in all subwords being treated as synonyms in the query phase. As a consequence, searching e.g. 'Erlangerstraße' would treat 'erlanger' and 'strasse' as synonyms, which is not intended (see the same discussion).
  • Lucene's hyphenation pattern handling has a bit-shifting issue (see LUCENE-8124), so patterns are restricted to hyphenation markers 1 to 6. As we'd provide a custom hyphenation pattern file, this is just a minor, avoidable issue.
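To illustrate the first point, this is roughly what ES expects today: `hyphenation_decompounder` must be given both a hyphenation pattern file and an explicit word list, and only subwords present in that list are emitted. The pattern path and word-list entries below are placeholders for illustration.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "german_hyphenation_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/de_DR.xml",
          "word_list": ["erlanger", "strasse"]
        }
      }
    }
  }
}
```

Dropping `word_list` from this configuration is rejected by ES, even though the underlying Lucene filter can run without a dictionary; that gap is what the custom plugin would close.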

I'd prepare a WIP PR. The custom plugin should become an artifact of its own, so what would you recommend: creating a GitHub project of its own, or creating it as a Maven submodule?


karussell commented on August 17, 2024

Thanks, sounds good to me! I'll then close #46?

Is a separate plugin necessary because of the required internal changes? Or could we utilize it directly in photon somehow?

> creating it as maven submodule?

I would prefer this, yes.


hbruch commented on August 17, 2024

> Thanks, sounds good to me! I'll then close #46?

Yes. I'll reuse the patterns in de.txt and see if applying the decompounder only for .de. helps.

> Is a separate plugin necessary because of the required internal changes? Or could we utilize it directly in photon somehow?

Yes, internal changes are required to allow a null word_list and to tweak the subword offsets. To avoid JarHell exceptions, I won't patch the original code but copy and adapt it. If I get things working, I'd submit them to ES/Lucene as well. But I have no idea if hyphenation is really used in the ES/Lucene community. The current behavior looks too odd, IMHO.
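Assuming the adapted filter keeps the stock parameter names, configuring the custom plugin might then look like the sketch below. The filter type name `photon_hyphenation_decompounder` is purely hypothetical here; the point is that `word_list` can simply be omitted, letting the hyphenation patterns alone drive the splitting.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "photon_decompounder": {
          "type": "photon_hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/de_DR.xml"
        }
      }
    }
  }
}
```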

