
Comments (4)

karussell commented on August 17, 2024

Hmmh, the plugin could be overkill, as we only need decompounding for a small subset of the German language. It could also influence relevance negatively if we decompound 'baerwaldstraße' into "baer", "wald" and "straße". So we should just use the normal decompounding support from Elasticsearch and provide our own word list.

from photon.
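For reference, a minimal sketch of what the built-in approach could look like: Elasticsearch's `dictionary_decompounder` token filter splits compounds against an explicit word list. The filter name, analyzer name and the word-list entries below are illustrative placeholders, not photon's actual configuration.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["wald", "strasse"]
        }
      },
      "analyzer": {
        "german_street_names": {
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  }
}
```

A curated `word_list` (or `word_list_path` pointing to a file) keeps control over which subwords are produced, which is exactly how a too-aggressive split like "baer" could be avoided.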

hbruch commented on August 17, 2024

I explored elasticsearch's hyphenation_decompounder. Currently there are some issues that need further work:

  • ES currently requires an explicit dictionary (word_list) that must contain all subwords to be returned as tokens. The underlying Lucene token filter does not, so a custom plugin that instantiates the token filter without a word list would work (see this discussion).
  • the hyphenation token filter returns all subwords with offsets identical to the compound word's, which results in all subwords being treated as synonyms in the query phase. As a consequence, searching e.g. 'Erlangerstraße' would treat 'erlanger' and 'strasse' as synonyms, which is not intended (see the same discussion).
  • Lucene's hyphenation pattern handling has a bit-shifting issue (see LUCENE-8124), so patterns are restricted to hyphenation markers 1 to 6. As we'd provide a custom hyphenation pattern file, this is just a minor, avoidable issue.
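To illustrate the first point, this is roughly what ES expects today: `hyphenation_decompounder` must be given both a hyphenation pattern file and an explicit word list, and only subwords present in that list are emitted. The pattern path and word-list entries below are placeholders for illustration.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "german_hyphenation_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/de_DR.xml",
          "word_list": ["erlanger", "strasse"]
        }
      }
    }
  }
}
```

Dropping `word_list` from this configuration is rejected by ES, even though the underlying Lucene filter can run without a dictionary; that gap is what the custom plugin would close.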

I'd prepare a WIP PR. The custom plugin should become an artifact of its own, so what would you recommend: creating a GitHub project of its own, or creating it as a Maven submodule?


karussell commented on August 17, 2024

Thanks, sounds good to me! I'll then close #46?

Is a separate plugin necessary because of the required internal changes? Or could we utilize it directly in photon somehow?

> creating it as maven submodule?

I would prefer this, yes.


hbruch commented on August 17, 2024

> Thanks, sounds good to me! I'll then close #46?

Yes. I'll reuse the patterns in de.txt and see if applying the decompounder only for .de. helps.

> Is a separate plugin necessary because of the required internal changes? Or could we utilize it directly in photon somehow?

Yes, internal changes are required to allow a null word_list and to tweak the subword offsets. To avoid JarHell exceptions, I won't patch the original code but copy and adapt it. If I get things working, I'd submit them to ES/Lucene as well. But I have no idea if hyphenation is really used in the ES/Lucene community. The current behavior looks too odd, IMHO.
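Assuming the adapted filter keeps the stock parameter names, configuring the custom plugin might then look like the sketch below. The filter type name `photon_hyphenation_decompounder` is purely hypothetical here; the point is that `word_list` can simply be omitted, letting the hyphenation patterns alone drive the splitting.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "photon_decompounder": {
          "type": "photon_hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/de_DR.xml"
        }
      }
    }
  }
}
```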

