Comments (5)
i probably need to evaluate the mysql fulltext syntax/behaviour once again.
If i remember correctly, removing the +
led to strange mysql results sometimes (imo).
That said, adding an option to remove it wouldn't break things for anyone.
However, the option would probably be an all-or-nothing one, such that you can only fully disable all +
operators.
from search_cop.
Hi, thanks for your feedback!
There is no stop word handling implemented within search_cop currently and no DMBS related code regarding stopwords either. So, if MySQL isn't handling the stopwords properly, you first need to double-check the MySQL docs to fix it (if possible). I'd guess the +
operators prevent MySQL from removing them, but the +
operators are neccessary and can't be removed.
Personally, i don't want the DMBS/Searchengine/etc. to handle synonyms/stopwords/etc, as this often fails short when i18n or multi-word synonyms/stopwords are desired. Thus, i would do synonym/stopword mapping before passing the query to search cop as well. However, search_cop maybe could provide a synoym/stopword mapping engine:
search_scope :search do
attributes :title, :description
stopwords do |phrase|
redis.sismember("stopwords:en", phrase)
end
end
such that phrase
will be removed from the query in case the stopword is member of the redis stopword set for english. However, maybe you could double check the MySQL docs and document you findings first.
from search_cop.
I'd guess the
+
operators prevent MySQL from removing them
That's correct. The search query entered is in this example is stranger in a strange land
. The same query done in SQL without the +
preceding the stopwords works fine.
mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name | title |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge. |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.01 sec)
However, search_cop maybe could provide a synoym/stopword mapping engine:
Agreed. Some sort of an optional pull in makes sense as different users may have different needs. It would certainly be very helpful if the users could simply set it up.
At the moment I have fixed it as below in the application controller code as shown below.
@posts = Post.search(remove_stopwords(params[:search][:query]))
# Private method much later ....
# SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
STOP_WORDS = %w{ a about an are as at be by com de en for from how i
in is it la of on or that the this to was what when where who will
with und the www }.freeze
def remove_stopwords(query)
words = query.split(/\s+/).select do |w|
(w.length >= 3) && (w.length <= 84) && !STOP_WORDS.include?(w)
end
words.join(' ')
end
Regarding the length checks above, The MySQL manual here has a note for InnoDB stating:
By default, words less than 3 characters in length or greater than 84 characters in length do not appear in an InnoDB full-text search index.
So, if the user were to put in x
, it would become +x
and cause the search to fail even though it's not a stop word. Hence, the length checks.
mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +x +strange' IN BOOLEAN MODE)));
Empty set (0.00 sec)
The stop words for InnoDB are pull using the query mentioned on the page. The page also mentions MyISAM stopwords.
from search_cop.
Just another thought from a solution perspective. Is it possible to disable the +
being added via an option? I would think this should work for InnoDB and MyISAM any perhaps PostgreSQL without having to muck around with stopwords, language specific stop words etc.
Without the plus, the query works just fine.
mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name | title |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge. |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)
This would make the query more flexible and the user could add +
to specific words. At the moment the user needs to be right about each words, however, most of the time the user may be OK with the highest number of matches rather than all.
It may be OK to give results without foo
if the search query is stranger in a strange foo
.
mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange foo' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name | title |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge. |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)
from search_cop.
Thank you for considering this and for search_cop
from search_cop.
Related Issues (20)
- globalize with search_cop - unknown attribute HOT 2
- Support for PostgreSQL ENUM HOT 3
- Matching multiple in a has_many HOT 1
- multiple where conditions and concatenated with AND HOT 3
- Attribute (case insensitive / ILIKE) HOT 3
- generator for attribute options HOT 3
- Error when using with non-standard pluralization in class name HOT 2
- DEPRECATION WARNING HOT 2
- How to handle special search logic / virtual attributes HOT 4
- Suggestion: Option for `exact: true` (or `right_wildcard: false`) HOT 1
- Can search_cop work with Postgres hstore? HOT 8
- Suggestion: Add support for sort in the free text search HOT 4
- Searching for empty or non empty with the free text expression HOT 2
- RAILS 6 - Search in ActionText HOT 3
- Search word that include % does not work correctly HOT 2
- Ability to apply Custom Operators or Options to entire project HOT 1
- ActiveRecord::ConfigurationError in v1.2.1 HOT 10
- README error? HOT 6
- Words containing brackets HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from search_cop.