Comments (4)

olivernn commented on August 21, 2024

The problem is where the list of stop words is declared.

For reasons that I can't remember now, the stopWordFilter uses a lunr.SortedSet as the data structure for keeping track of the stop words. A key property of the SortedSet is that its values are kept sorted. The SortedSet usually maintains this itself, but the way the list of words is being loaded bypasses that check by setting SortedSet#elements directly.

Because the list is not sorted, the SortedSet#indexOf check fails to find the word "en" (and probably others as well).

The quick fix is to always sort the list of stop words, or to ensure they are already sorted in the source file; see the English stop word filter for an example.
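To illustrate the failure mode, here is a minimal sketch that assumes the lookup behaves like a plain binary search. `binaryIndexOf` is a hypothetical stand-in written for this example, not lunr's actual SortedSet#indexOf, and the word list is just the short sample from this thread.

```javascript
// Hypothetical stand-in for a sorted-set lookup: a plain binary search.
// It only gives correct answers when `elements` is sorted ascending.
function binaryIndexOf(elements, target) {
  let lo = 0;
  let hi = elements.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    if (elements[mid] === target) return mid;
    if (elements[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1;
}

// Words in arbitrary (unsorted) order, as if assigned to `elements` directly.
const unsorted = 'de en van ik te dat die in een hij het niet'.split(' ');

// The quick fix: sort a copy before using it for lookups.
const sorted = unsorted.slice().sort();

const missInUnsorted = binaryIndexOf(unsorted, 'en'); // -1, although "en" is present
const hitInSorted = binaryIndexOf(sorted, 'en');      // 4
```

The search discards half of the array at each step based on ordering, so on an unsorted array it can step past a word that is actually there.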

The SortedSet could be replaced entirely by a JavaScript object with the stop words as properties, taking care not to mistake inherited properties of a JavaScript object (e.g. "valueOf") for stop words.
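A rough sketch of that object-based approach, assuming a prototype-less object is acceptable. `makeStopWordLookup` and `isStopWord` are hypothetical names made up for this example and are not part of lunr.

```javascript
// Build a lookup object with no prototype, so inherited names such as
// "valueOf" or "toString" can never be mistaken for stop words.
function makeStopWordLookup(words) {
  const lookup = Object.create(null);
  for (const word of words) {
    lookup[word] = true;
  }
  return lookup;
}

// Hypothetical helper: true only for words that were explicitly added.
function isStopWord(lookup, token) {
  return lookup[token] === true;
}

const lookup = makeStopWordLookup(['de', 'en', 'van', 'ik', 'te']);

const enIsStopWord = isStopWord(lookup, 'en');           // true
const valueOfIsStopWord = isStopWord(lookup, 'valueOf'); // false

// For contrast: an ordinary object literal inherits "valueOf".
const naiveHasValueOf = 'valueOf' in {};                 // true
```

With an ordinary object literal, a check based on `in` or on truthiness would report "valueOf" as a stop word, which is the pitfall mentioned above. Insertion order does not matter here, so the sorting problem goes away as well.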

A better, long-term fix would be for lunr to provide an easier way of creating custom stop word lists, something with an API like this:

    dutchStopWords = lunr.stopWordFilter('de en van ik te dat die in een hij het niet')
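The API above is only a proposal. As a sketch of how such a factory could work, the standalone `createStopWordFilter` below is a hypothetical function, not something lunr provides. It assumes the filter is called with a token as its first argument and returns `undefined` to drop a token; both of those conventions are assumptions made for this example.

```javascript
// Hypothetical factory: takes a whitespace-separated list of stop words
// and returns a filter function. A Set is used for lookups, so the words
// do not need to be sorted.
function createStopWordFilter(wordList) {
  const stopWords = new Set(wordList.split(/\s+/).filter(Boolean));

  return function stopWordFilter(token) {
    // Assumed convention: return undefined to drop a token,
    // otherwise pass the token through unchanged.
    return stopWords.has(token) ? undefined : token;
  };
}

const dutchStopWords = createStopWordFilter(
  'de en van ik te dat die in een hij het niet'
);

// Run some tokens through the filter and keep the survivors.
const kept = ['de', 'kat', 'en', 'hond', 'valueOf']
  .map(dutchStopWords)
  .filter((token) => token !== undefined);
// kept is ['kat', 'hond', 'valueOf']
```

A `Set` also avoids the inherited-property problem, since "valueOf" is only a member if it was added explicitly.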

from lunr-languages.

marcselman commented on August 21, 2024

I understand. I have changed the stop words part of the Dutch source file:

    // Looking at the original English stoplist the length should be equal to the length of the array minus 1.
    // I guess the empty element at the beginning of the array is not counted.
    lunr.du.stopWordFilter.stopWords.length = 101;
    lunr.du.stopWordFilter.stopWords.elements = [
        '',
        'aan',
        'al',
        'alles',
        'als',
        'altijd',
        'andere',
        'ben',
        'bij',
        'daar',
        'dan',
        'dat',
        'de',
        'der',
        'deze',
        'die',
        'dit',
        'doch',
        'doen',
        'door',
        'dus',
        'een',
        'eens',
        'en',
        'er',
        'ge',
        'geen',
        'geweest',
        'haar',
        'had',
        'heb',
        'hebben',
        'heeft',
        'hem',
        'het',
        'hier',
        'hij',
        'hoe',
        'hun',
        'iemand',
        'iets',
        'ik',
        'in',
        'is',
        'ja',
        'je',
        'kan',
        'kon',
        'kunnen',
        'maar',
        'me',
        'meer',
        'men',
        'met',
        'mij',
        'mijn',
        'moet',
        'na',
        'naar',
        'niet',
        'niets',
        'nog',
        'nu',
        'of',
        'om',
        'omdat',
        'onder',
        'ons',
        'ook',
        'op',
        'over',
        'reeds',
        'te',
        'tegen',
        'toch',
        'toen',
        'tot',
        'u',
        'uit',
        'uw',
        'van',
        'veel',
        'voor',
        'want',
        'waren',
        'was',
        'wat',
        'werd',
        'wezen',
        'wie',
        'wil',
        'worden',
        'wordt',
        'zal',
        'ze',
        'zelf',
        'zich',
        'zij',
        'zijn',
        'zo',
        'zonder',
        'zou'
    ];

This seems to work correctly. Maybe you can update your source files.

Thank you.

MihaiValentin commented on August 21, 2024

Hi @marcselman,

Thanks for pointing out the issue. I continued the discussion of this in #8 and will fix it as soon as possible.

Thanks!

MihaiValentin commented on August 21, 2024

Fixed! Check #8 for more details.
