Comments (12)

benwtrent commented on August 21, 2024

@atsushi-matsui I still don't understand. Using your most recent example, could you give me one document you would expect to match and one you would not (thus requiring the feature change)?

I am just trying to confirm the behavior as it still isn't clear to me how omitting a clause is any different than making that clause a match_all.

from elasticsearch.

benwtrent commented on August 21, 2024

@atsushi-matsui what mapping is configured for your docs? Please include any custom analyzers.

Thank you for your patience :). Excluding vs. including vs. match_none vs. match_all is tricky to reason about.

elasticsearchmachine commented on August 21, 2024

Pinging @elastic/es-search (Team:Search)

benwtrent commented on August 21, 2024

> Stop words are excluded by the token filter, so we expect zero hits, but all hits are returned

I don't understand this, @atsushi-matsui. Omitting a clause is the same as having that clause match all docs.

In your first example, it seems the following would work fine:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": { "query": "Quick", "zero_terms_query": "all" }
          }
        },
        {
          "match": {
            "title": { "query": "the", "zero_terms_query": "all" }
          }
        },
        {
          "match": {
            "title": { "query": "Brown", "zero_terms_query": "all" }
          }
        },
        {
          "match": {
            "title": { "query": "Fox", "zero_terms_query": "all" }
          }
        }
      ]
    }
  }
}

Then in your second example, omitting BOTH clauses (which is what would happen in this case) is exactly the same as a match_all query. Consider the query:

{"query": {"bool": {"must": []}}}

That is the exact same as a match_all query.
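
The equivalence can be pictured with a toy model in plain Python. This is only a sketch, not how Elasticsearch evaluates queries: each `must` clause is treated as a predicate over a document, and a document matches when every predicate holds. The `matches_must` helper, the predicates, and the sample documents are made up for illustration.

```python
def matches_must(doc, clauses):
    """Toy stand-in for bool/must: the document must satisfy every clause."""
    return all(clause(doc) for clause in clauses)


# Hypothetical sample documents.
docs = [{"title": "Quick Brown Fox"}, {"title": "Lazy Dog"}]


def match_all(doc):
    """Hypothetical clause that accepts every document."""
    return True


def title_contains_fox(doc):
    """Hypothetical clause that accepts titles containing 'fox'."""
    return "fox" in doc["title"].lower()


# An empty list of clauses holds vacuously, so every document matches.
no_clauses = [d for d in docs if matches_must(d, [])]

# A single match_all clause gives the same result.
only_match_all = [d for d in docs if matches_must(d, [match_all])]

# Adding match_all next to a real clause does not change that clause's result.
with_fox = [d for d in docs if matches_must(d, [title_contains_fox, match_all])]
```

In this model, dropping a clause and replacing it with a match_all clause are indistinguishable, which is the point being made above.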

atsushi-matsui commented on August 21, 2024

@benwtrent
Thanks for the reply!!!

> Then in your second example, omitting BOTH clauses (which is what would happen in this case), is the exact same as a match_all query. Consider the query:

I understand that the second example is equivalent to match_all, but there are cases where we want to omit the clause, so I'll show you another example.

When building a search system with Elasticsearch in Japan, it is common to set up both a kuromoji analyzer and a 2-gram analyzer.
Here is an example configuration.

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji_tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "search"
        },
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "kuromoji_baseform",
            "kuromoji_part_of_speech",
            "cjk_width",
            "stop",
            "kuromoji_stemmer",
            "lowercase"
          ]
        },
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_ja": {
        "type": "text",
        "analyzer": "kuromoji_analyzer"
      },
      "text_cjk": {
        "type": "text",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}

In Japan, it is common to search by entering phrases separated by spaces, so we build a bool query with one phrase clause per space-separated word.
When searching for the anime "遊☆戯☆王", a user may enter "遊 ☆ 戯 ☆ 王" separated by spaces.
In that case, if we search both text_ja and text_cjk and set zero_terms_query to all, every document is returned, which is not a user-friendly result.

{
    "query": {
      "bool": {
        "must": [
          {
            "multi_match": {
              "query": "遊",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          },
          {
            "multi_match": {
              "query": "☆",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          },
          {
            "multi_match": {
              "query": "戯",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          },
          {
            "multi_match": {
              "query": "☆",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          },
          {
            "multi_match": {
              "query": "王",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          }
        ]
      }
    }
  }

If we omit "☆" from the search, we can find works titled "遊☆戯☆王".
Omitting "☆" is equivalent to removing the "☆" clauses and setting zero_terms_query to none, as shown below.

{
    "query": {
      "bool": {
        "must": [
          {
            "multi_match": {
              "query": "遊",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "none"
            }
          },
          {
            "multi_match": {
              "query": "戯",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "none"
            }
          },
          {
            "multi_match": {
              "query": "王",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "none"
            }
          }
        ]
      }
    }
  }

Therefore, I would like the bool query to have an option that omits such clauses.
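
Until such an option exists, one possible workaround is to drop the empty terms on the client before sending the query. The sketch below is only an illustration under assumptions: `build_bool_query` and `stub_count_tokens` are hypothetical names, and the stub merely approximates the analyzers (2-grams for text_cjk, one token per letter or digit for text_ja). In a real system the stub could presumably be replaced by a function that asks the `_analyze` API how many tokens each field produces.

```python
def build_bool_query(user_input, fields, count_tokens):
    """Build a bool/must body, skipping terms that yield no tokens in any field.

    count_tokens(field, text) is a caller-supplied function (hypothetical here)
    that returns how many tokens the field's analyzer produces for the text.
    """
    must = []
    for term in user_input.split():
        # Skip the term when no field's analyzer produces a token for it.
        if not any(count_tokens(field, term) > 0 for field in fields):
            continue
        must.append(
            {
                "multi_match": {
                    "query": term,
                    "fields": list(fields),
                    "type": "phrase",
                    "zero_terms_query": "none",
                }
            }
        )
    return {"query": {"bool": {"must": must}}}


def stub_count_tokens(field, text):
    """Rough stand-in for the analyzers, used only for this illustration."""
    if field == "text_cjk":
        # A 2-gram tokenizer needs at least two characters to emit a token.
        return max(len(text) - 1, 0)
    # Assumption: the text_ja analyzer keeps letters and digits, drops symbols.
    return sum(1 for ch in text if ch.isalnum())


body = build_bool_query("遊 ☆ 戯 ☆ 王", ["text_ja", "text_cjk"], stub_count_tokens)
kept_terms = [c["multi_match"]["query"] for c in body["query"]["bool"]["must"]]
```

Under these assumptions `kept_terms` holds only 遊, 戯, and 王. Note that if every term is dropped, `must` ends up empty and the query matches all documents, so a caller may want to handle that case separately.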

atsushi-matsui commented on August 21, 2024

The organization I work for is actually facing this problem.
Even if my proposal is not accepted, I would appreciate it if you could let me know if there is another solution!

atsushi-matsui commented on August 21, 2024

@benwtrent
I'm sorry that the issue is difficult to understand.
I will try my best to convey it as accurately as possible.

Index the following documents.
If a user looking for "遊☆戯☆王" enters "遊 ☆", the search system should return only the document in Example 2-1.
If zero_terms_query is set to "all" as in Example 1-1, all documents are returned, which is not the desired result.
The likely cause is that text_cjk uses a 2-gram analyzer, so a single-character query produces zero tokens there and match_all is returned.
If zero_terms_query is set to "none" as in Example 1-2, there are 0 hits, which is also not the desired result.
The likely cause is that the query produces zero tokens in text_cjk.
In such a case, the document in Example 2-1 could be retrieved by omitting the "☆" clause, which the analyzer reduces to zero tokens.
In other words, the search would run only for the valid "遊" term on text_ja.

# queries
### Example 1-1
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "遊",
            "fields": ["text_ja", "text_cjk"],
            "type": "phrase",
            "zero_terms_query": "all"
          }
        },
        {
          "multi_match": {
            "query": "☆",
            "fields": ["text_ja", "text_cjk"],
            "type": "phrase",
            "zero_terms_query": "all"
          }
        }
      ]
    }
  }
}

### Example 1-2
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "遊",
            "fields": ["text_ja", "text_cjk"],
            "type": "phrase",
            "zero_terms_query": "none"
          }
        },
        {
          "multi_match": {
            "query": "☆",
            "fields": ["text_ja", "text_cjk"],
            "type": "phrase",
            "zero_terms_query": "none"
          }
        }
      ]
    }
  }
}
# documents
### Example 2-1
{
  "text_ja": "遊☆戯☆王",
  "text_cjk": "遊☆戯☆王",
  "release_date": "2023-01-01",
  "views": 123
}

### Example 2-2
{
  "text_ja": "ドラゴンボール",
  "text_cjk": "ドラゴンボール",
  "release_date": "2023-01-01",
  "views": 123
}

### Example 2-3
{
  "text_ja": "ナルト",
  "text_cjk": "ナルト",
  "release_date": "2023-01-01",
  "views": 123
}
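
The three outcomes described above can be reproduced with a toy model. This is a sketch under stated assumptions, not Elasticsearch's implementation: `ja_tokens` and `cjk_tokens` are hypothetical stand-ins for the two analyzers, phrase matching is simplified to "every query token appears in the document", and "omit" is the proposed policy of ignoring fields (and then clauses) that produce zero tokens.

```python
# Each document stores the same text in text_ja and text_cjk (Examples 2-1 to 2-3).
DOCS = {
    "2-1": "遊☆戯☆王",
    "2-2": "ドラゴンボール",
    "2-3": "ナルト",
}


def ja_tokens(text):
    # Assumption: stand-in for kuromoji_analyzer, one token per letter, symbols dropped.
    return [ch for ch in text if ch.isalnum()]


def cjk_tokens(text):
    # Stand-in for the 2-gram analyzer: every pair of adjacent characters.
    return [text[i:i + 2] for i in range(len(text) - 1)]


ANALYZERS = {"text_ja": ja_tokens, "text_cjk": cjk_tokens}


def field_matches(field, term, doc_text, policy):
    """Return True/False for a field, or None when the field is omitted."""
    analyze = ANALYZERS[field]
    query_tokens = analyze(term)
    if not query_tokens:
        # Zero tokens: the policy decides what the field contributes.
        return {"all": True, "none": False, "omit": None}[policy]
    doc_tokens = analyze(doc_text)
    return all(tok in doc_tokens for tok in query_tokens)


def search(terms, policy):
    hits = []
    for doc_id, text in DOCS.items():
        matched = True
        for term in terms:
            results = [field_matches(f, term, text, policy) for f in ANALYZERS]
            results = [r for r in results if r is not None]
            if not results:
                continue  # every field was omitted, so the whole clause is skipped
            if not any(results):  # multi_match modeled as "any field matches"
                matched = False
                break
        if matched:
            hits.append(doc_id)
    return hits


hits_all = search(["遊", "☆"], "all")
hits_none = search(["遊", "☆"], "none")
hits_omit = search(["遊", "☆"], "omit")
```

In this model, "all" returns every document, "none" returns nothing, and only "omit" returns the single document from Example 2-1.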

atsushi-matsui commented on August 21, 2024

If you pass "遊 ☆" to query_string as shown below, the search appears to run only for "遊".
Although query_string does not expose this as an option, the source code suggests that "☆" is omitted because zero_terms_query is set to null internally.
I would like the bool query to provide a similar option.

{
  "query": {
    "query_string": {
      "query": "遊 ☆",
      "default_operator": "AND",
      "fields": ["text_ja", "text_cjk"], 
      "type": "phrase"
    }
  }
}

atsushi-matsui commented on August 21, 2024

@benwtrent

> for your docs, what is the mapping configured? including any custom analyzers please.

These are the settings I used to verify the behavior.

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji_tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "normal"
        },
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "kuromoji_stemmer",
            "lowercase"
          ]
        },
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_ja": {
        "type": "text",
        "analyzer": "kuromoji_analyzer"
      },
      "text_cjk": {
        "type": "text",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}

atsushi-matsui commented on August 21, 2024

I created a verification environment, so please use it if you like.
https://github.com/atsushi-matsui/sample-elastic

atsushi-matsui commented on August 21, 2024

Hi, @benwtrent.
I would like to know if there is any progress.

elasticsearchmachine commented on August 21, 2024

Pinging @elastic/es-search-relevance (Team:Search Relevance)
