Git Product home page Git Product logo

Comments (5)

mrkamel avatar mrkamel commented on May 26, 2024 1

i probably need to evaluate the mysql fulltext syntax/behaviour once again.
If i remember correctly, removing the + led to strange mysql results sometimes (imo).
That said, adding an option to remove it wouldn't break things for anyone.
However, the option would probably be an all-or-nothing one, such that you can only fully disable all + operators.

from search_cop.

mrkamel avatar mrkamel commented on May 26, 2024

Hi, thanks for your feedback!

There is no stop word handling implemented within search_cop currently and no DMBS related code regarding stopwords either. So, if MySQL isn't handling the stopwords properly, you first need to double-check the MySQL docs to fix it (if possible). I'd guess the + operators prevent MySQL from removing them, but the + operators are neccessary and can't be removed.

Personally, i don't want the DMBS/Searchengine/etc. to handle synonyms/stopwords/etc, as this often fails short when i18n or multi-word synonyms/stopwords are desired. Thus, i would do synonym/stopword mapping before passing the query to search cop as well. However, search_cop maybe could provide a synoym/stopword mapping engine:

search_scope :search do
  attributes :title, :description

 stopwords do |phrase|
    redis.sismember("stopwords:en", phrase)
  end
end

such that phrase will be removed from the query in case the stopword is member of the redis stopword set for english. However, maybe you could double check the MySQL docs and document you findings first.

from search_cop.

parthm avatar parthm commented on May 26, 2024

I'd guess the + operators prevent MySQL from removing them

That's correct. The search query entered is in this example is stranger in a strange land. The same query done in SQL without the + preceding the stopwords works fine.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name              | title                                                                |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge.                                     |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.01 sec)

However, search_cop maybe could provide a synoym/stopword mapping engine:

Agreed. Some sort of an optional pull in makes sense as different users may have different needs. It would certainly be very helpful if the users could simply set it up.

At the moment I have fixed it as below in the application controller code as shown below.

 
    @posts = Post.search(remove_stopwords(params[:search][:query]))

   # Private method much later ....

    # SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
    STOP_WORDS = %w{ a about an are as at be by com de en for from how i
      in is it la of on or that the this to was what when where who will
      with und the www }.freeze
    def remove_stopwords(query)
      words = query.split(/\s+/).select do |w|
        (w.length >= 3) && (w.length <= 84) && !STOP_WORDS.include?(w)
      end
      words.join(' ')
    end

Regarding the length checks above, The MySQL manual here has a note for InnoDB stating:

By default, words less than 3 characters in length or greater than 84 characters in length do not appear in an InnoDB full-text search index.

So, if the user were to put in x, it would become +x and cause the search to fail even though it's not a stop word. Hence, the length checks.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +x +strange' IN BOOLEAN MODE)));
Empty set (0.00 sec)

The stop words for InnoDB are pull using the query mentioned on the page. The page also mentions MyISAM stopwords.

from search_cop.

parthm avatar parthm commented on May 26, 2024

Just another thought from a solution perspective. Is it possible to disable the + being added via an option? I would think this should work for InnoDB and MyISAM any perhaps PostgreSQL without having to muck around with stopwords, language specific stop words etc.

Without the plus, the query works just fine.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name              | title                                                                |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge.                                     |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)

This would make the query more flexible and the user could add + to specific words. At the moment the user needs to be right about each words, however, most of the time the user may be OK with the highest number of matches rather than all.

It may be OK to give results without foo if the search query is stranger in a strange foo.

mysql> SELECT  artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange foo' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name              | title                                                                |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge.                                     |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)

from search_cop.

parthm avatar parthm commented on May 26, 2024

Thank you for considering this and for search_cop 😃 . Just wanted to mention that the above behavior is for MySQL 8.0. I don't have access to MySQL 5.7 or another version to check if that's any different.

from search_cop.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.