Comments (11)

nemphys commented on August 18, 2024

@benwtrent OK, clear. Thank you for the detailed explanation!

from elasticsearch.

elasticsearchmachine commented on August 18, 2024

Pinging @elastic/es-search (Team:Search)

benwtrent commented on August 18, 2024

I have been trying to replicate and I cannot.

Could you provide data & steps to replicate?

One thing to try is the _explain API: it indicates whether the document is in the top-k and whether its similarity is within the configured similarity threshold.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

GET search-index-test/_explain/<doc_id_that_matches_top_level_knn>
{
  "query": {
    "knn": {
      "field": "_text_embeddings.vector",
      "query_vector": [ .... ],
      "num_candidates": 10000,
      "similarity": "0.55"
    }
  }
}

nemphys commented on August 18, 2024

@benwtrent unfortunately it is not possible to share actual data, but I executed the suggested explain request with the same query vector against the first result of the top-level knn query (very handy btw, I was not aware of this functionality) and the response is as follows:

"explanation": {
    "value": 0,
    "description": "Failure to meet condition(s) of required/prohibited clause(s)",
    "details": [
      {
        "value": 0,
        "description": "no match on required clause (VectorSimilarityQuery[similarity=0.5, docScore=0.75, innerKnnQuery=DocAndScore[10000]])",
        "details": [
          {
            "value": 0,
            "description": "not in top 10000",
            "details": []
          }
        ]
      },
      {
        "value": 0,
        "description": "match on required clause, product of:",
        "details": [
          {
            "value": 0,
            "description": "# clause",
            "details": []
          },
          {
            "value": 1,
            "description": "FieldExistsQuery [field=_primary_term]",
            "details": []
          }
        ]
      }
    ]
  }

Are you sure that you tried with a nested type (as in the provided mapping example)?

I also tried lowering the query similarity parameter as low as 0.1, but the result is the same.
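
For readers who want to scan such a response quickly, here is a minimal sketch that flattens an explanation tree into indented rows. The helper name `flatten_explanation` is hypothetical, and `sample` is a trimmed copy of the response quoted above (the second clause is omitted); it relies only on the `value`, `description`, and `details` keys visible in that response.

```python
from typing import Any, Dict, List, Tuple


def flatten_explanation(node: Dict[str, Any], depth: int = 0) -> List[Tuple[int, Any, str]]:
    """Return (depth, value, description) for a node and all of its nested details."""
    rows = [(depth, node.get("value"), node.get("description", ""))]
    for child in node.get("details", []):
        rows.extend(flatten_explanation(child, depth + 1))
    return rows


# Trimmed copy of the explanation quoted in this comment (assumption: same shape).
sample = {
    "value": 0,
    "description": "Failure to meet condition(s) of required/prohibited clause(s)",
    "details": [
        {
            "value": 0,
            "description": "no match on required clause (VectorSimilarityQuery[similarity=0.5, docScore=0.75, innerKnnQuery=DocAndScore[10000]])",
            "details": [
                {"value": 0, "description": "not in top 10000", "details": []}
            ],
        }
    ],
}

rows = flatten_explanation(sample)
for depth, value, description in rows:
    # Indent each row by its depth so the failing leaf stands out.
    print("  " * depth + f"{value}: {description}")
```

The deepest row is the one that names the actual reason, here "not in top 10000".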

nemphys commented on August 18, 2024

I could also debug the actual code step by step with breakpoints if you point me to the right classes. It seems that somewhere along the path all search results in the second case get lost.

benwtrent commented on August 18, 2024

@nemphys let me try to debug again with nested. I somehow skipped that in my initial reading of this issue.

benwtrent commented on August 18, 2024

@nemphys I just noticed your nested knn query syntax. Could you try the same query but within a nested query? When you use knn as a query, you must wrap it in the usual nested query syntax.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-knn-query.html#knn-query-with-nested-query

When using the top-level knn object, we can easily infer from the field whether you want the search to run within a nested context.

But the knn query could be combined with other nested or non-nested clauses, so it's not as easy to determine the context you want the query to run in.

GET /search-index-test/_search
{
  "query": {
    "nested": {
      "path": "_text_embeddings",
      "query": {
        "knn": {
          "field": "_text_embeddings.vector",
          "query_vector": [ .... ],
          "num_candidates": 10000,
          "similarity": "0.55"
        }
      }
    }
  },
  "size": 10
}
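
The same request body can be assembled programmatically. This is a minimal sketch, not an official client API: the helper `nested_knn_query` is hypothetical, and the three-element query vector is an illustrative placeholder (the real vector is elided above and must match the field's dimensions).

```python
import json
from typing import Any, Dict, List


def nested_knn_query(
    path: str,
    field: str,
    query_vector: List[float],
    num_candidates: int,
    similarity: float,
) -> Dict[str, Any]:
    """Build a search body that wraps a knn query in a nested query on `path`."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "knn": {
                        "field": field,
                        "query_vector": query_vector,
                        "num_candidates": num_candidates,
                        "similarity": similarity,
                    }
                },
            }
        }
    }


# Placeholder vector for illustration only; substitute the real embedding.
body = nested_knn_query(
    path="_text_embeddings",
    field="_text_embeddings.vector",
    query_vector=[0.1, 0.2, 0.3],
    num_candidates=10000,
    similarity=0.55,
)
body["size"] = 10
print(json.dumps(body, indent=2))
```

Taking `path` as an explicit argument avoids guessing the nested path from the field name.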

nemphys commented on August 18, 2024

Right! I missed that because the knn query parameters are almost identical to the top-level knn (except for k) and there was no error thrown.

It now works as expected, producing the same results as the top-level knn query.

PS. Is there a reason why the top-level knn search is the "preferred" way to perform ANN search (according to the documentation)? Does the "normal" knn query have any disadvantages compared to the top-level one?

benwtrent commented on August 18, 2024

@nemphys the difference has to do with the number of documents collected and scored. We consider it preferred as it provides the most consistent experience, just maybe not the most flexible or powerful one.

The top-level kNN utilizes the DFS phase to make sure it only counts the global top-k no matter the number of shards. This way, your total hit count won't change based on the number of shards.

For the kNN query, the hit count may vary with the number of shards. The number of nearest neighbors returned will be the same, but your total hit count is now num_candidates multiplied by the number of shards instead of just num_candidates.

Here is an example:

  • Gathering 10 num_candidates from an index with 3 shards.
  • Top-level knn will indicate 10 total hits.
  • Query-level knn will indicate 30 total hits (10 per shard × 3 shards).
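
The arithmetic in this example can be written as a small sketch. The function `reported_total_hits` is hypothetical and simply mirrors the figures given here; it assumes every shard collects the full num_candidates, so for the query-level case the result is best read as an upper bound.

```python
def reported_total_hits(num_candidates: int, num_shards: int, top_level: bool) -> int:
    """Total hit count as described in the example above.

    Top-level knn reduces to a global set, so the shard count does not matter.
    Query-level knn collects up to num_candidates on each shard.
    """
    if top_level:
        return num_candidates
    return num_candidates * num_shards


top_level_total = reported_total_hits(10, 3, top_level=True)
query_level_total = reported_total_hits(10, 3, top_level=False)
print(top_level_total, query_level_total)
```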

nemphys commented on August 18, 2024

@benwtrent does this apply even if the size parameter is explicitly set? I.e., could a kNN query return 30 results if the size parameter is set to 10 (or will it just do the fetch/calculations for 30 and return the top 10)?

benwtrent commented on August 18, 2024

@nemphys it's not about the actual hits (nearest neighbors) returned, but about hits.total.value, the total hit count.

You will still only get the size results you want, and we will still only collect at most num_candidates per shard. The only difference is the total hit count provided.
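
To make the distinction concrete, here is a minimal sketch that reads both numbers from a search response. The `response` dictionary is a hand-built mock for illustration, not real output; it only assumes the usual `hits.total.value` and `hits.hits` layout, with figures taken from the 3-shard example earlier in the thread.

```python
from typing import Any, Dict, Tuple


def hit_counts(response: Dict[str, Any]) -> Tuple[int, int]:
    """Return (number of hits actually returned, reported total hit count)."""
    hits = response["hits"]
    return len(hits["hits"]), hits["total"]["value"]


# Mock response (assumption): size=10, query-level knn over 3 shards.
response = {
    "hits": {
        "total": {"value": 30, "relation": "eq"},
        "hits": [{"_id": str(i)} for i in range(10)],
    }
}

returned, total = hit_counts(response)
print(f"returned={returned}, hits.total.value={total}")
```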
