opensearch-project / neural-search
Plugin that adds dense neural retrieval into the OpenSearch ecosystem
License: Apache License 2.0
Is your feature request related to a problem?
Following a problem in another plugin (opensearch-project/opensearch-build#2043), and coming from opensearch-project/opensearch-build#58, there is no automated testing verifying that this plugin runs as part of the OpenSearch distribution.
What solution would you like?
Run integration tests as part of the distribution.
Perform benchmarks for queries issued via the new "neural" query type.
This will provide insight into the performance of the new query type (neural) that we are adding to OpenSearch. We will use OpenSearch Benchmark to perform this.
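A minimal invocation sketch, assuming a custom OpenSearch Benchmark workload containing neural queries has been defined (the workload name and host below are placeholders):
opensearch-benchmark execute-test \
  --pipeline=benchmark-only \
  --workload=<neural-query-workload> \
  --target-hosts=localhost:9200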
Metrics Identified:
This benchmark will validate the results we obtained from experiments that combined the BM-25 and k-NN scores separately. It will also provide insight into whether scores for one query type need to be boosted, and when to boost them.
Metrics Identified:
Science Experiment Metrics
Release Version 2.5.0
This is a component issue for 2.5.0.
Coming from opensearch-build#2908. Please follow the checklist below.
Please refer to the DATES / CAMPAIGNS in that post.
This issue captures the state of the OpenSearch release, on component/plugin level; its assignee is responsible for driving the release. Please contact them or @mention them on this issue for help.
Any release related work can be linked to this issue or added as comments to create visibility into the release status.
There are several steps to the release process; these steps are completed as the whole component release, and components that fall behind present risk to the release. The component owner resolves the tasks in this issue and communicates with the overall release owner to make sure each component is moving along as expected.
Steps have completion dates for coordinating efforts between the components of a release; components can start as soon as they are ready far in advance of a future release. The most current set of dates is on the overall release issue linked at the top of this issue.
Linked at the top of this issue, the overall release issue captures the state of the entire OpenSearch release, including references to this issue; the release owner (the assignee) is responsible for communicating the release status broadly. Please contact them or @mention them on that issue for help.
If including changes in this release, increment the version on the 2.5 branch to 2.5.0 for Min/Core, and 2.5.0.0 for components. Otherwise, keep the version number unchanged for both.
- [ ] All issues tagged v2.5.0 are complete.
- [ ] The version increment to 2.5.0 on the 2.5 branch is complete.
- [ ] The 2.5 release branch is included in the distribution manifest as 2.5.0.
- [ ] All changes tagged 2.5.0 have been merged.
Not able to build and work on the main branch of the repo.
Pull the neural-search main branch and try running the ./gradlew run command.
./gradlew run should succeed.
macOS
NA
This could be related to the breaking change introduced in OpenSearch by moving org.opensearch.common.xcontent.XContentBuilder to org.opensearch.core.xcontent.XContentBuilder.
Currently, the neural-search plugin search functionality relies on the user to pass the "model_id" with each query.
"query": {
"neural": {
"<vector_field>": {
"query_text": "hello world",
"model_id": "csdsdcsdsadasdcsad",
"k": int
}
}
This offers a suboptimal user experience. The model IDs are randomized strings that add confusion to a given query. Additionally, search behavior has to change when the model is updated (the ID needs to be updated). While it may be possible to come up with some kind of alias scheme for the model ID (see opensearch-project/ml-commons#554), the best user experience would be for the user writing the query to not need to know any details about the model_id.
We want to offer a user experience like this:
"query": {
"neural": {
"<vector_field>": {
"query_text": "hello world",
"k": int
}
}
Similarly, for indexing, the same information could be used if no model id is specified, so the experience would look like:
PUT _ingest/pipeline/<pipeline_name>
{
"description": "string",
"processors" : [
{
"text_embedding": {
"field_map": {
"<input_field_name>": "<output_field_name>",
...
}
}
},
...
]
}
In this option, we would associate the model mapping in a field in the _meta field of the index.
PUT my-neural-index
{
"mappings": {
"_meta": {
"neural_field_map": {
"field_name_1": "<model_id>",
"field_name_2": "<model_id>",
"field_name_3": "<model_id>",
}
}
}
}
Similar to the _meta field option, we could make the map an index setting (we would need to validate that settings can in fact be maps). Index settings would give us more control over validation of input model IDs, as well as hooks to trigger actions when settings are updated.
PUT my-neural-index/_settings
{
"index": {
"neural_field_map": {
"field_name_1": "<model_id>",
"field_name_2": "<model_id>",
"field_name_3": "<model_id>",
}
}
}
Using a system index is another approach to associating this model information with a given index. However, maintaining a system index is heavier than relying on a _meta field. This would require several APIs to manage the functionality. If we are to create a system index for model management, it would be better to group this functionality with ml-commons, which already has a model system index (see next option).
Another alternative is to delegate the functionality of associating a model with an index/field/function to the model management APIs of ml-commons. In this solution, users would provide metadata during upload about which indices/fields/functions to associate a model with. This has the benefit of abstracting all model management (including association) behind the ml-commons APIs.
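Hypothetically, the association could then be declared at upload time. In the sketch below, the index_field_map parameter is purely illustrative and not an existing ml-commons field; the rest of the body follows the existing _upload API:
POST /_plugins/_ml/models/_upload
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "index_field_map": {
    "my-neural-index": ["field_name_1", "field_name_2"]
  }
}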
Currently, I am on the fence between approaches 1, 2, and 4 as my preferred solution and am looking for feedback. Additionally, if there are other alternatives we should consider, please post them here.
This issue describes the details of the low-level query clause design in scope of the Score Normalization and Combination feature. This is one of multiple LLDs in scope of that feature. Pre-reading the high-level design, [RFC] High Level Approach and Design For Normalization and Score Combination, is highly recommended. We expect the antecedent API design to be published soon; for now we will reference its draft version.
As per the HLD and the API LLD, this feature needs a new query clause in OpenSearch. The new query will fetch results at the shard level for different sub-queries during the query phase of request handling. Query results will be processed in a later phase by a new processor on the coordinator node. The proposed name for the new query is "Hybrid".
Execution at each shard will be independent of other shards. The focus of the change is to get all top scores for each sub-query; all packing and reduction will happen in later stages.
The new query will be added as part of the Neural Search plugin, and most of the code changes will be done in the plugin repo.
Different sub-queries should be abstracted and not limited to particular query types like k-NN or text match based on terms or keywords.
The new query should keep added latency (for functions like query parsing) to a minimum and not degrade performance in either latency or resource utilization compared to a similar query that does combination at the shard level.
In this document we propose a solution for the questions below:
The following items will be covered in other design documents (e.g., the API design):
The new Hybrid query will be registered in the Neural Search plugin using a new QueryBuilder class. The builder class will create a new Query class implementation that contains the logic to execute each sub-query and combine weights per sub-query at the shard level. The new query needs a doc collector that will process the results of each sub-query and get the top x results (top docs) at the shard level. This information should clearly identify which sub-query each result belongs to. Metrics like max and min score can be added if needed.
These results will be used by the Query Phase Searcher to pack and send shard results to the coordinator node for normalization and score combination.
The overall class structure is very similar to the DisjunctionMax query that is part of Lucene. See also Defining _search API Input (#1): Score Combination and Normalization for Semantic Search [HLD].
The feature will be available to users of the Neural Search plugin, which is experimental. Once the user enables the plugin, the Normalization and Score Combination feature becomes available automatically.
In this design we will not create a new DTO object to store the scores of individual sub-queries. That is required for the doc collector and will be added later, together with the custom QueryPhaseSearcher implementation. For this implementation we will use existing core DTOs, which only allows collecting and returning scores from the first sub-query.
In the initial implementation phase we are skipping pagination for query results. This is mainly due to implementation complexity and the foreseen performance overhead for some query types. For instance, a k-NN query (the base query for semantic search in the neural-search plugin) must collect not only the last "page" of results but also all previous pages (e.g., to get results 60-80, k-NN will select the first 80 results and discard results 0-60). Such an approach is very inefficient and breaks the functional requirement for minimal added latency.
In the initial implementation phase we are skipping query explain. The feature will be released in experimental mode, and we want to make it stable before providing details on how query results are formed.
In the initial implementation phase we will use the default sequential execution of sub-queries.
We’re going to use the existing plugin class NeuralSearch as the entry point to register the new HybridQueryBuilder. The builder class creates an instance of HybridQuery, which encapsulates the logic for getting results for each sub-query. Part of the Query's responsibility is the creation of the Weight and Scorer, which both provide scores for the results of each sub-query at the shard level.
All the classes and logic above are agnostic to the type of sub-query, which targets the functional requirement of being query-type agnostic.
Below is the general data flow for getting query results for the new Hybrid query. This represents a single shard; the same execution happens on every shard in the index.
For multiple sub-queries, the query supports a JSON array:
POST <index-name>/_search
{
"query": {
"hybrid": {
"queries": [
{ /* neural query */ }, // added as an example
{ /* standard text search */ } // if a user wants to boost or adjust scores, it must be done within this sub-query clause
]
}
}
}
A single sub-query can be passed as a JSON object:
POST <index-name>/_search
{
"query": {
"hybrid": {
"queries": { /* neural query */ } // added as an example
}
}
}
Parses input and produces an instance of the Query class. For parsing each sub-query we use existing core logic from AbstractQueryBuilder.
The class has a collection of query builders, one per sub-query.
Rewrites each of the sub-queries.
The class has a collection of Query objects, one per sub-query.
Constructs a weight object for each sub-query. Returns a HybridScorer object that has scorers for each sub-query.
Throws an exception if explain is called.
Responsible for iterating over the results of each sub-query in descending score order. Keeps a priority queue of doc IDs. For each next doc ID, it gets the score from each sub-query. Sub-query scores are stored in an array in the same order the sub-queries appeared in the input query, which allows mapping each sub-query to its score.
Registers the Hybrid query and returns a collection of QuerySpecs for the NeuralSearch plugin.
Main implementation details related to potential security threats:
The query is testable via the existing _search REST API and lower-level direct API calls. Main testing will be done via unit and integration tests. We don't need backward compatibility tests, as neural-search is in experimental mode and there is no commitment to support previous versions.
Tests will be focused on overall query stability. Below are main test cases that will be covered:
Mentioned tests are part of the plugin repo CI and also can be executed on demand from development environment.
Tests for metrics like score correctness, performance, etc. will be added in later implementations, when the end-to-end solution is available.
The integration test failed at distribution level for component neural-search
Version: 2.7.0
Distribution: tar
Architecture: x64
Platform: linux
Please check the logs: https://build.ci.opensearch.org/job/integ-test/4534/display/redirect
* Steps to reproduce: See https://github.com/opensearch-project/opensearch-build/tree/main/src/test_workflow#integration-tests
* Access cluster logs:
- With security (if applicable)
- Without security (if applicable)
Note: An all-in-one test report manifest with all the details is coming soon. See opensearch-project/opensearch-build#1274
When the embedding fields are all absent from a document, or when doing a partial update that does not include the embedding fields, an IllegalArgumentException is raised as below:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "empty docs"
}
],
"type": "illegal_argument_exception",
"reason": "empty docs"
},
"status": 400
}
This issue is reported here: https://forum.opensearch.org/t/feedback-neural-search-plugin-experimental-release/11501/8.
First, create an ingest pipeline and an index as below:
PUT _ingest/pipeline/nlp-pipeline
{
"description": "An example neural search pipeline",
"processors": [
{
"text_embedding": {
"model_id": "kI6NhoQB3oLQzIJTkldg",
"field_map": {
"text": "text_knn"
}
}
}
]
}
PUT /my-nlp-index-1
{
"settings": {
"index.knn": true,
"default_pipeline": "nlp-pipeline"
},
"mappings": {
"properties": {
"passage_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "nmslib",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
},
"passage_text": {
"type": "text"
}
}
}
}
PUT my-nlp-index-1/_doc/1
{
"name": "doc1"
}
PUT my-nlp-index-1/_doc/1
{
"text": "doc1"
}
Then do a partial update to this document:
PUT my-nlp-index-1/_doc/1
{
"name": "doc2"
}
No exception should be raised, and the document should be updated successfully.
Not related to the environment.
When defining a field_map containing nested fields, the pipeline fails to compute embeddings.
With the following configuration, using non-nested field types, embeddings are computed:
PUT /_ingest/pipeline/neural_pipeline
{
"description": "Neural Search Pipeline for message content",
"processors": [
{
"text_embedding": {
"model_id": "SXXx8YUBR2ZWhVQIkghB",
"field_map": {
"message": "message_embedding"
}
}
}
]
}
PUT /neural-test-index
{
"settings": {
"index.knn": true,
"default_pipeline": "neural_pipeline"
},
"mappings": {
"properties": {
"message_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"engine": "lucene"
}
},
"message": {
"type": "text"
},
"color": {
"type": "text"
}
}
}
}
POST /_bulk
{"create":{"_index":"neural-test-index","_id":"0"}}
{"message":"Text 1","color":"red"}
{"create":{"_index":"neural-test-index","_id":"1"}}
{"message":"Text 2","color":"black"}
GET /neural-test-index/_search
DELETE /neural-test-index
With the following configuration using a nested source field, embeddings are not computed:
PUT /_ingest/pipeline/neural_pipeline_nested
{
"description": "Neural Search Pipeline for message content",
"processors": [
{
"text_embedding": {
"model_id": "SXXx8YUBR2ZWhVQIkghB",
"field_map": {
"message.text": "message_embedding"
}
}
}
]
}
PUT /neural-test-index-nested
{
"settings": {
"index.knn": true,
"default_pipeline": "neural_pipeline_nested"
},
"mappings": {
"properties": {
"message_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"engine": "lucene"
}
},
"message.text": {
"type": "text"
},
"color": {
"type": "text"
}
}
}
}
POST /_bulk
{"create":{"_index":"neural-test-index-nested","_id":"0"}}
{"message":{"text":"Text 1"},"color":"red"}
{"create":{"_index":"neural-test-index-nested","_id":"1"}}
{"message":{"text":"Text 2"}, "color": "black"}
GET /neural-test-index-nested/_search
The neural ingestion pipeline should be able to handle nested fields.
docker image: opensearchproject/opensearch:2.5.0
The models referenced above were uploaded with the following configuration:
{
"name": "all-MiniLM-L6-v2",
"version": "1.0.0",
"description": "sentence transformers model",
"model_format": "TORCH_SCRIPT",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "sentence_transformers"
},
"url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}
Copying the customer request from Forum post: https://forum.opensearch.org/t/extending-neural-search-pipeline-to-named-entity-recognition-and-other-metadata-extracting-models/13078
I have a use case involving a named entity recognition model for documents and queries during indexing and querying. The documents will be filtered based on the presence of extracted entities matched against the query's extracted entities. The pipeline will work similarly to the existing neural search pipeline, with one difference: in this use case, the queries and documents will be passed through an NER (named entity recognition) model and enriched with extra metadata, such as entities, instead of the vectors provided by an embedding model.
It would help if we could extend the neural-search pipeline to include model(s) that enable named entity extraction, embeddings, image segments (finding image components for image search), etc., so that the query/document extracts enough metadata through the various models in my neural search pipeline before matching.
Please give a +1 if you are looking for this feature. If possible, comment explaining your use case.
Received Error: Error building neural-search, retry with: ./build.sh manifests/2.7.0/opensearch-2.7.0.yml --component neural-search.
The distribution build for neural-search has failed.
Please see build log at https://build.ci.opensearch.org/job/distribution-build-opensearch/7395/consoleFull
Create a new query type: "neural". Internally, it will use ML-Commons to take a query string and create a vector from it. From there, it will build a k-NN query.
Interface will look like:
GET <index_name>/_search
{
"size": int,
"query": {
"neural": {
"<vector_field>": {
"query_text": "string",
"model_id": "string",
"k": int
}
}
}
}
vector_field — Field to execute the k-NN query against
query_text — (string) Query text used to produce the vector query
model_id — (string) ID of the model used to encode the query text into a vector
k — (int) Number of results to return from the k-NN search
For more details, refer to #11 (comment)
BM25 works well in exact-match use cases, and the k-NN score works well at understanding context and retrieving relevant documents. It is important to get the benefits of both of these relevancy mechanisms, and one could achieve this by combining their scores. One caveat is that the scores are on different scales, and hence some kind of normalization is required.
High Level Tasks:
Follow opensearch-project/.github#125 to baseline MAINTAINERS, CODEOWNERS, and external collaborator permissions.
Close this issue when:
If this repo's permissions were already baselined, please confirm the above when closing this issue.
Coming from opensearch-build#3185
We are de-coupling the task of publishing the maven snapshots from the centralized build workflow to individual repositories. This means each repo can now publish maven snapshots using GitHub Actions.
This change unblocks dependent components from waiting for a successful build before they can consume the snapshots. It also ensures that all snapshots are independent and up to date.
Add the following snippet to the publishing section in build.gradle:
publishing {
    repositories {
        maven {
            name = "Snapshots"
            url = "https://aws.oss.sonatype.org/content/repositories/snapshots"
            credentials {
                username "$System.env.SONATYPE_USERNAME"
                password "$System.env.SONATYPE_PASSWORD"
            }
        }
    }
}
./gradlew publishPluginZipPublicationToSnapshotsRepository
Please feel free to reach out to @opensearch-project/engineering-effectiveness.
This is a component issue for 2.6.0.
Coming from opensearch-build#3081. Please follow the checklist below.
Please refer to the DATES in that post.
This issue captures the state of the OpenSearch release, on component/plugin level; its assignee is responsible for driving the release. Please contact them or @mention them on this issue for help.
Any release related work can be linked to this issue or added as comments to create visibility into the release status.
There are several steps to the release process; these steps are completed as the whole component release, and components that fall behind present risk to the release. The component owner resolves the tasks in this issue and communicates with the overall release owner to make sure each component is moving along as expected.
Steps have completion dates for coordinating efforts between the components of a release; components can start as soon as they are ready far in advance of a future release. The most current set of dates is on the overall release issue linked at the top of this issue.
Linked at the top of this issue, the overall release issue captures the state of the entire OpenSearch release, including references to this issue; the release owner (the assignee) is responsible for communicating the release status broadly. Please contact them or @mention them on that issue for help.
If including changes in this release, increment the version on the 2.x branch to 2.6.0 for Min/Core, and 2.6.0.0 for components. Otherwise, keep the version number unchanged for both.
- [ ] All issues tagged v2.6.0 are complete.
- [ ] The version increment to 2.6.0 is complete.
- [ ] The 2.6 release branch is included in the distribution manifest as 2.6.0.
- [ ] All changes tagged 2.6.0 have been merged.
Make the NeuralSearch class implement the Extensible interface.
Related to the RFC. The current problem with the RFC is that when we are combining scores from different queries (e.g. BM25 and kNN), we need the min and max score of each query part. However, when using approximate kNN, we cannot accurately calculate the min score unless we do an exact kNN search on the index, which is not feasible. This leads to inconsistent score normalization, particularly when using pagination.
As discussed in detail in the RFC, one solution is to rely on the statistics we get from the documents we see during the current query. However, in specific scenarios where the min score can be known, we can do better. For example, when using BM25 or Cosine similarity in kNN, the user can optionally define the min score in the query to be 0 and -1, respectively.
By allowing the user to optionally define a min/max score in the query for normalization, we can ensure consistent score normalization across different queries for specific scenarios, particularly when using pagination. This would improve the accuracy and reliability of the search results for users.
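As a sketch, the query input could optionally carry these bounds. The score_bounds parameter below is purely illustrative and not a committed API; the hybrid/neural clauses follow the shapes proposed elsewhere for this feature:
POST my-index/_search
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "passage_text": "hello world" } },
        { "neural": { "passage_vector": { "query_text": "hello world", "k": 100 } } }
      ],
      "score_bounds": [
        { "min": 0 },            // BM25 scores are known to start at 0
        { "min": -1, "max": 1 }  // cosine similarity lies in [-1, 1]
      ]
    }
  }
}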
Here is an example of the pagination inconsistency that arises when we use the general solution:
Let's assume we have a query that consists of a text match query and a kNN query and we use this formula for score normalization:
x_normalized = (x - min) / (max - min)
and we set the page size to 10. Assume the top 10 kNN scores are between 0.9 and 1, and the scores for the rest of the documents fall to 0. Moving to the next page then changes the post-normalization scores drastically, and we might get pagination inconsistency with missing or duplicated results.
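To make this concrete: on page 1, min = 0.9 and max = 1.0, so a document scoring 0.95 normalizes to (0.95 - 0.9) / (1.0 - 0.9) = 0.5. Once page 2 pulls in documents scoring near 0, min drops to 0 and the same document normalizes to 0.95, so the combined ranking is not stable across pages.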
This is a component issue for 2.7.0.
Coming from opensearch-build#3230. Please follow the checklist below.
Please refer to the DATES in that post.
This issue captures the state of the OpenSearch release, on component/plugin level; its assignee is responsible for driving the release. Please contact them or @mention them on this issue for help.
Any release related work can be linked to this issue or added as comments to create visibility into the release status.
There are several steps to the release process; these steps are completed as the whole component release, and components that fall behind present risk to the release. The component owner resolves the tasks in this issue and communicates with the overall release owner to make sure each component is moving along as expected.
Steps have completion dates for coordinating efforts between the components of a release; components can start as soon as they are ready far in advance of a future release. The most current set of dates is on the overall release issue linked at the top of this issue.
Linked at the top of this issue, the overall release issue captures the state of the entire OpenSearch release, including references to this issue; the release owner (the assignee) is responsible for communicating the release status broadly. Please contact them or @mention them on that issue for help.
If including changes in this release, increment the version on the 2.x branch to 2.7.0 for Min/Core, and 2.7.0.0 for components. Otherwise, keep the version number unchanged for both.
- [ ] All issues tagged v2.7.0 are complete.
- [ ] The version increment to 2.7.0 is complete.
- [ ] The 2.7 release branch is included in the distribution manifest as 2.7.0.
- [ ] All changes tagged 2.7.0 have been merged.
As multiple developers work on this repo, adding a style check can help maintain consistent coding standards. I'm proposing we add a Spotless style check like OpenSearch core or the k-NN plugin (https://github.com/opensearch-project/k-NN/blob/main/gradle/formatting.gradle).
Ensure a MAJOR_VERSION.x branch exists; the main branch acts as the source of truth, effectively working on 2 versions at the same time.
opensearch-project/opensearch-plugins#142
Currently, plugins follow a branching strategy where they work on main for the next development iteration, effectively working on 2 versions at the same time. This is not always true for all plugins; the release branch or branch pattern is not consistent, and this lack of standardization limits multiple automation workflows and alignment with the core repo. More details are in the META ISSUE.
Follow OpenSearch core branching: create 1.x and 2.x branches; do not create 2.0 as a branch of main, instead create main -> 2.x -> 2.0. Maintain working CI for 3 releases at any given time.
A customer got an error while trying out the neural plugin. Steps followed:
Forum link: https://forum.opensearch.org/t/feedback-neural-search-plugin-experimental-release/11501/4
I got an error with the example in the documentation. Below is the code I tried. I tested the model, and it loads and works fine. However, ingesting the document fails with the error below.
Environment: This was done on windows in development mode (one node as cluster_manager, data, ingest and ml)
Error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "empty docs"
}
],
"type": "illegal_argument_exception",
"reason": "empty docs"
},
"status": 400
}
My Code:
POST /_plugins/_ml/models/_upload
{
"name": "all-MiniLM-L6-v2",
"version": "1.0.0",
"description": "test model",
"model_format": "TORCH_SCRIPT",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "sentence_transformers"
},
"url": "https://github.com/ylwu-amzn/ml-commons/blob/2.x_custom_m_helper/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}
POST /_plugins/_ml/models/kI6NhoQB3oLQzIJTkldg/_load
POST /_plugins/_ml/models/kI6NhoQB3oLQzIJTkldg/_predict
{
"text_docs": ["today is sunny"]
}
PUT _ingest/pipeline/nlp-pipeline
{
"description": "An example neural search pipeline",
"processors": [
{
"text_embedding": {
"model_id": "kI6NhoQB3oLQzIJTkldg",
"field_map": {
"text": "text_knn"
}
}
}
]
}
PUT /my-nlp-index-1
{
"settings": {
"index.knn": true,
"default_pipeline": "nlp-pipeline"
},
"mappings": {
"properties": {
"passage_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "nmslib",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
},
"passage_text": {
"type": "text"
}
}
}
}
POST my-nlp-index-1/_doc
{
"passage_text": "Hello world"
}
Traditionally, OpenSearch has relied on keyword matching for search result ranking. From a high level, these ranking techniques work by scoring documents based on the relative frequency of occurrences of the terms in the document compared with the other documents in the index. One shortcoming of this approach is that it can fail to understand the surrounding context of the term in the search.
With recent advancements in natural language understanding, language models have become very adept at deriving additional context from sentences or passages. In search, the field of dense neural retrieval (referred to as neural search) has sprung up to take advantage of these advancements (here is an interesting paper on neural search in open-domain question answering). The general idea of dense neural retrieval is, during indexing, to pass the text of a document to a neural-network-based language model, which produces dense vector(s), and to index these dense vectors into a vector search index. Then, during search, the text of the query is passed to the model, which again produces a dense vector, and a k-NN search is executed with this dense vector against the dense vectors in the index.
For OpenSearch, we have created (or are currently creating) several building blocks to support dense neural retrieval: fast and effective k-NN search can be achieved using the Approximate Nearest Neighbor algorithms exposed through the k-NN plugin; transformer based language models will be able to be uploaded into OpenSearch and used for inference through ml-commons.
However, given that these are building blocks, setting them up to achieve dense neural retrieval can be complex. For example, to use k-NN search, you need to create your vectors somewhere. To use ml-commons neural-network support, you need to create a custom plugin.
In all use cases, we assume that the user knows what language model they want to use to vectorize their text data.
The user understands how they want their documents structured and which fields they want vectorized. From there, they want an easy way to provide the text to be vectorized to OpenSearch and then not have to work with vectors directly for the rest of the process (indexing or search).
User wants OpenSearch to handle all vectorization during indexing. However, for search, to minimize latencies, they want to generate vectors offline and build their own queries directly.
User already has index configured for search. However, they want OpenSearch to handle vectorization during search.
We will create a new OpenSearch plugin that will lower the barrier of entry for using neural search within the OpenSearch ecosystem. The plugin will host all functionality needed to provide neural search, including ingestion APIs/tools and also search APIs/tools. The plugin will rely on ml-commons for model management (i.e. upload, train, delete, inference). Initially, the plugin will focus on automatic vectorization of documents during ingestion as well as a custom query type API that vectorizes a text query into a k-NN query. The high level architecture will look like this:
For indexing, the plugin will provide a custom ingestion processor that will allow users to convert text fields into vector fields during ingestion. For search, the plugin will provide a new query type that can be used to create a vector from user provided query text.
The document ingestion will be implemented as an ingestion processor, which can be included in customer-defined ingestion pipelines.
The processor definition interface will look like this:
PUT _ingest/pipeline/pipeline-1
{
"description": "text embedding pipeline",
"processors": [
{
"text_embedding": {
"model_id": "model_1",
"field_map": {
"title": "title.knn",
"body": "body.knn"
}
}
}
]
}
model_id — the ID of the model used for text embedding
field_map — the mapping of input/output fields. The output field will hold the embedding of each input field.
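For illustration, a document ingested through the pipeline above (index name and values are placeholders) would get embeddings written to the mapped output fields, title.knn and body.knn:
PUT my-index/_doc/1?pipeline=pipeline-1
{
  "title": "Hello world",
  "body": "A short passage about greetings"
}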
In addition to the ingestion processor, we will provide a custom query type, “neural”, that will translate user provided text into a k-NN vector query using a user provided model_id.
The neural query can be used with the search API. The interface will look like
GET <index_name>/_search
{
"size": int,
"query": {
"neural": {
"<vector_field>": {
"query_text": "string",
"model_id": "string",
"k": int
}
}
}
}
vector_field — Field to execute the k-NN query against
query_text — (string) Query text used to produce the vector query
model_id — (string) ID of the model used to encode the query text into a vector
k — (int) Number of results to return from the k-NN search
Further, the neural query type can be used anywhere in the query DSL. For instance, it can be wrapped in a script_score and used in a boolean query with a text matching query:
GET my_index/_search
{
"query": {
"bool" : {
"filter": {
"range": {
"distance": { "lte" : 20 }
}
},
"should" : [
{
"script_score": {
"query": {
"neural": {
"passage_vector": {
"query_text": "Hello world",
"model_id": "xzy76xswsd",
"k": 100
}
}
},
"script": {
"source": "_score * 1.5"
}
}
},
{
"script_score": {
"query": {
"match": { "passage_text": "Hello world" }
},
"script": {
"source": "_score * 1.7"
}
}
}
]
}
}
}
In the future, we will explore different ways to combine scores between BM25 and k-NN (see related discussion).
We appreciate any and all feedback the community has.
Specifically, we are particularly interested in information around the following topics
In order to develop against k-NN, we will need to depend on its produced jar artifact.
This task involves setting up the Neural Search plugin for development.
Received Error: Error building neural-search, retry with: ./build.sh manifests/2.4.0/opensearch-2.4.0.yml --component neural-search.
The distribution build for neural-search has failed.
Please see build log at https://build.ci.opensearch.org/job/distribution-build-opensearch/6267/consoleFull
As part of the discussion around implementing an organization-wide testing policy, I am visiting each repo to see what tests they currently perform. I am conducting this work on GitHub so that it is easy to reference.
Looking at the Neural Search repository, it appears there is:
Repository | Unit Tests | Integration Tests | Backwards Compatibility Tests | Additional Tests | Link |
---|---|---|---|---|---|
Neural Search | | | | Certificate of Origin, Link Checker, Benchmarking Tool (in progress) | #124 |
I don't see any requirements for code coverage in the testing documentation. If there are any specific requirements, could you respond to this issue to let me know?
If there are any tests I missed, or anything you think all repositories in OpenSearch should have for testing, please respond to this issue with details.
The ask of this issue is to run the CI actions at a regular frequency, for example daily. This will make sure that if any dependency breaks, we catch the error before releases.
Example:
Issues:
The integration test failed at distribution level for component neural-search
Version: 2.7.0
Distribution: tar
Architecture: arm64
Platform: linux
Please check the logs: https://build.ci.opensearch.org/job/integ-test/4482/display/redirect
* Steps to reproduce: See https://github.com/opensearch-project/opensearch-build/tree/main/src/test_workflow#integration-tests
* Access cluster logs:
- With security (if applicable)
- Without security (if applicable)
Note: An all-in-one test report manifest with all the details is coming soon. See opensearch-project/opensearch-build#1274
For the Normalization and Score Combination feature, we need a way to normalize the scores received for different sub-queries from different shards at the coordinator node, before we can start combining the scores. This needs to be done after the query phase is completed and before the fetch phase starts in a _search API call.
The solution I am proposing is to extend the search pipeline processors to create a new processor interface, called a search phase processor, that can run between the phases of search (there are many search phases: DFS, Query, Fetch, Expand, etc.).
As the first use case, we will create a new search phase processor that runs between the query and fetch phases, normalizes the scores, and combines them after normalization.
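A rough sketch of how such a processor might be declared in a search pipeline; the section and processor names below are placeholders, not a finalized interface:
PUT /_search/pipeline/normalization_pipeline
{
  "description": "Runs between the query and fetch phases",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": { "technique": "arithmetic_mean" }
      }
    }
  ]
}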
The alternative approaches are described in the section "Obtaining Relevant Information for Normalization and Score Combination" in #126.
Enable OpenSearch Benchmark in the Neural Search repo, so that we can do benchmarking for ingest, query, and relevance.
Example: https://github.com/opensearch-project/k-NN/tree/main/benchmarks/osb
Received Error: Error building neural-search, retry with: ./build.sh manifests/2.4.0/opensearch-2.4.0.yml --component neural-search.
The distribution build for neural-search has failed.
Please see build log at https://build.ci.opensearch.org/job/distribution-build-opensearch/6344/consoleFull
Coming from opensearch-project/opensearch-plugins#95, add Windows support.
As part of the Neural Search plugin, both ingestion and search will require an inference API to convert the text or query string into embeddings by calling the right model. This issue tracks the development of the embedding module that will provide the abstraction in the neural search plugin for calling ML APIs via the ML client.
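For reference, the abstraction would ultimately wrap an ml-commons prediction call; the raw REST equivalent (the model ID is a placeholder) looks like:
POST /_plugins/_ml/models/<model_id>/_predict
{
  "text_docs": ["today is sunny"]
}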
This issue describes the various high-level directions being proposed for score combination and normalization techniques to improve semantic search / neural search queries in OpenSearch (META ISSUE: #123). The proposals try to reuse existing OpenSearch extension points as much as possible. We also try to make sure that the directions we choose are long-term and provide different levels of customization for users to tweak semantic search to their needs. The document also proposes a phased design and implementation plan, which will help us improve and add new features with every phase.
For information on how normalization improves the overall quality of results, please refer to this OpenSearch blog on science benchmarks: https://opensearch.org/blog/semantic-science-benchmarks/
For simplicity, let us consider a 3-node OpenSearch cluster, with 2 data nodes and 1 coordinator node. The data nodes store the data, and the coordinator node helps run the request. This is how OpenSearch works at a very high level.
The customer here refers to OpenSearch customers who want to use OpenSearch and run semantic search in their applications.
Currently, OpenSearch uses a query-then-fetch model to do the search. First, the whole query object is passed to all the shards to obtain the top results from those shards; then the fetch phase begins, which fetches the relevant information for the documents. A typical semantic search use case involves 2 types of queries, a k-NN query and a text match query, which produce scores using different scoring methods.
So at first, it is difficult to combine result sets whose scores are produced via different scoring methods. In order to effectively combine results, the different queries' scores need to be put on the same scale (refer to https://arxiv.org/abs/2210.11934). By this, we mean that a score needs to meet 2 requirements: (1) it indicates the relative relevance between its document and the other documents scored in the query, and (2) it is comparable with the relative relevance of results from other queries. For example, for k-NN the score range may be 0-1, while BM25 scores range from 0 to Float.MAX. Hence any score-combining query clause, like bool, will suffer from these problems.
Second, it is not possible to consider global hits for re-ranking. Because scores are assigned at the shard level, any rescoring will be done at the shard level. Hence if we try to normalize the scores, the normalization will be local, not global.
Let's use the example below to understand the problem in more detail.
Using the same cluster setup as defined above, let's look at an example to understand the above 2 problems. For this example, assume we have an index whose schema looks like this:
PUT product-info
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"description": {
"type": "text"
},
"tile_and_descrption_knn": {
"type": "knn_vector",
"dimension": 768
}
}
},
"settings": {
"index": {
"refresh_interval": "-1",
"number_of_shards": "2",
"knn": "true",
"number_of_replicas": "0",
"default_pipeline": "text-embedding-trec-covid"
}
}
}
The title and description fields are the product title and description. The tile_and_descrption_knn field is a k-NN vector field that holds a 768-dimension dense vector created using a dense vector model.
Query
We are using the bool query clause to combine the results of k-NN (the neural query converted to a k-NN query) and a text-based search query. Bool query should clauses have their scores combined: the more matching clauses, the better.
POST product-info/_search
{
"query" : {
"bool": {
"should": [
{
"multi-match": {
"query": "sofa-set for living room",
"fields": ["tile", "description"]
}
},
{
"neural": {
"tile_and_descrption_knn": {
"query_text": "sofa-set for living room",
"model_id": "dMGrLIQB8JeDTsKXjXNn"
}
}
}
]
}
}
}
Scores Computation Examples Happening at different Levels
Combination using the query provided above: because the k-NN and BM-25 scores are on different scales, if one of the queries behaves badly (as for document d8 in the example), the document still comes back as the first result because of BM-25. The standard boolean combination does not take advantage of relative ranking. Documents like d7 remain lower in the results even when they have good scores in both BM-25 and k-NN. This problem becomes more pronounced because BM-25 scores are unbounded.
The way to solve this problem of joining scores from queries running on different scales is to first normalize the scores of both queries and then combine them. We can read about this here: https://arxiv.org/abs/2210.11934.
Using normalization: to see how normalization works, let's look at the table below. In it we have done 2 types of normalization: one done per shard/data node, and another at the coordinator node (global normalization).
Note: I did the normalization using the scores present here, hence some documents have 0 scores after normalization.
Final sorted documents to be returned based on the above examples:
If we focus on document d8, we can see how it changes position without normalization, with local normalization, and with global normalization. Since local normalization only considers the scores at the per-shard level, it can suffer when one of the scores is the lowest. But with global normalization (i.e., normalization at the coordinator node level), since we look at scores from the whole corpus, this problem is smoothed out, because more bad scores can come from other shards. We ran experiments to verify this.
Functional Requirements:
Good to Have but not as P0:
If we look at the requirements, we see that we need solutions at different levels of the whole search API flow. We can divide the whole flow into 3 parts:
The proposal for the input is to use the _search API and define a new compound query clause. This new compound query clause will hold the array of queries to be executed in parallel at the data-node level. The name of the new query clause is not yet decided. The interface of this new query clause is inspired by the dis_max query clause, but dis_max runs its queries sequentially; the new query clause will make sure that scores are calculated at the shard level independently for each sub-query. Sub-query rewriting will be done at the coordinator level to avoid duplicate computations.
Note: The interfaces defined here are not finalized. They will be refined as part of the LLD GitHub proposals. But we first want to make sure that we align ourselves on the high-level approach.
POST <index-name>/_search
{
"query": {
"<new-compound-query-clause>": {
"queries": [
{ /* neural query */ }, // added as an example
{ /* standard text search */ } // if a user wants to boost or adjust scores, it must be done within this sub-query clause
],
... other things to be added and will come as part of next sections
}
}
}
Pros:
Cons:
Alternative-1: Implement a new REST handler instead of creating a new compound query
The idea here is to create a new REST handler that defines the list of queries whose scores need to be normalized and combined.
Pros:
Cons:
This section talks about how OpenSearch will get the relevant information required for doing the normalization. For example, say the customer has defined min-max normalization; then for every sub-query we will need the min and max score over the whole corpus.
During the query phase, OpenSearch uses a QueryPhaseSearcher class to run the query and collect the documents at the shard level using the TopDocsCollector interface. There is no extension point in QueryPhaseSearcher to plug in a different TopDocsCollector implementation; the only extension we have is that a plugin can define a new QueryPhaseSearcher implementation. So we will define a new QueryPhaseSearcher implementation that uses a new TopDocsCollector implementation at the shard level to gather the relevant information for normalization.
Pros:
Cons:
Alternative-1: Enhance the DFS query search type, or create a new query search type, to get information for doing normalization
The default search type in OpenSearch is query-then-fetch, which first queries the results and then fetches the actual source from the shards. DFS query-then-fetch is another search type that the customer can set as a query param to change the search type. In DFS query-then-fetch, OpenSearch will first pre-query each shard, asking about term and document frequencies, and send this information to all the shards, where scores are calculated using the global term/document frequencies.
We can build something similar, where we do a pre-query to find the min/max scores from all the shards and then pass this information along so each shard can normalize the scores for each sub-query.
Pros:
Cons:
The idea here is to extend the search pipeline request and response transformers to create another type of transformer that is called after the query phase is completed. We will use this transformer interface to do the normalization and score combination for the document IDs returned from the query phase, as per the user's input. The transformed result will then be passed on to the fetch phase, which will run as-is.
Below is a modified version of the API input proposed above. It adds the relevant fields for normalization and score combination.
Note: The interfaces defined here are not finalized. The interfaces will be refined as part of LLD github proposals. But we want to make sure that we align ourselves with high level approach.
PUT /_search_processing/pipeline/my_pipeline
{
"description": "A pipeline that helps in doing the normalization",
"<in-between-query-fetch-phase-processor>": [
{
"normalizaton-processor": {
// we can bring in the normalization info from _search api to this place if required
// It will be discussed as part of LLD.
}
}
]
}
POST <index-name>/_search?pipeline=my_pipeline
{
"query": {
"<new-compound-query-clause>": {
"queries": [
{ /* neural query */ }, // this is added for an example
{ /* standard text search */ } // If a user want to boost some scores or update
// the scores he need to go ahead and do it in this query clause
],
// The below items be a part of processor also
"normalization-technique" : "min-max" // min-max etc.., Optional Object
"combination" : {
"algorithm" : "harmonic-mean", // all the values defined in #3 above, interleave, harmonic mean etc
"parameters" : {
// list of all the parameters that can be required for above algo
"weights" : [0.4, 0.7] // a floating pointing array which can be used in the algorithm
}
}
}
}
}
Alternative-1: Create a new phase between the query and fetch phases
The high-level idea here is to create a phase that runs between the query and fetch phases and does the normalization.
Pros:
Cons:
Alternative-2: Create a fetch subphase that does the normalization and score combination
OpenSearch provides an extension point where plugins can add fetch subphases that run at the end, after all core subphases have executed. We could create a fetch subphase that does the normalization and score combination. The problem with this: since we have multiple sub-queries, we would need to change the interfaces to make sure all the information required for normalization is passed in. This would result in duplicated information and multiple hacks to pass data through the earlier fetch phases.
Pros:
Cons:
Alternative-3: Use the SearchOperationListeners interface
SearchOperationListener runs at the shard level, not the coordinator node level (see the linked code references). Since we need the normalization to be done at the coordinator node level, we cannot use SearchOperationListeners.
Based on the above 3 directions, below is the high-level flow diagram. The 2 sub-queries are provided as an example. There will be a limit on how many sub-queries a customer can define in the query clause; we will keep this maximum at 10 (there is no specific reason for this limit, we just want limits imposed to avoid long-running queries leading to circuit breaker and cluster failures).
**There can be many sub-queries in the new compound query.**
The proposed design doesn't support paginated queries in the first phase, to reduce the scope of the phase-1 launch. We also have not done a deep dive on how this could be implemented or on the current pagination solution.
With the phase-1 implementation of the new query we won't provide the explain functionality for the query clause. The Explain API provides information about why a specific document matches (or doesn't match) a query; it is a very useful API for customers to understand and debug results.
The idea here is to enable parallel search for all the different queries provided in the compound query to improve its performance. Parallel search on segments is already in the sandbox for OpenSearch (https://github.com/opensearch-project/OpenSearch/tree/main/sandbox/plugins/concurrent-search).
The initial proposal provides customers only a fixed set of functions to combine the scores. With script-based combination, customers could define custom scripts to combine the scores before we rank them.
The idea here is that the new compound query clause can become overwhelming, hence we want to integrate it with different query-writing helpers like Querqy to facilitate easy query writing for customers.
Below is a high-level phased approach for building the feature. These phases are not set in stone and may change as we start making progress in the implementation.
Given that we are defining a new compound query clause for OpenSearch, we will launch the features defined in this document and the high-level design under a feature flag. High-level items:
Phase-2 will focus on these items:
Phase-3 will focus on these items:
By this time we will have a good understanding of how customers are using this new compound query. This phase will start to focus on how we can make it easier for customers to start using this new query clause. The items below help us do that:
Yes, Elasticsearch supports this feature, but it only combines the results of a k-NN query with a text match query; it is not generic. Also, ES doesn't support normalizing scores globally. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/master/knn-search.html#_combine_approximate_knn_with_other_features.
As per the documentation, the limitation is:
Approximate kNN search always uses the dfs_query_then_fetch search type in order to gather the global top k matches across shards. You cannot set the search_type explicitly when running kNN search.
POST image-index/_search
{
"query": {
"match": {
"title": {
"query": "mountain lake",
"boost": 0.9
}
}
},
"knn": {
"field": "image-vector",
"query_vector": [54, 10, -2],
"k": 5,
"num_candidates": 50,
"boost": 0.1
},
"size": 10
}
After discussion, this is a very valid use case, but we don't have enough data to prove this hypothesis, and it will really depend on the customer's use case. I would suggest starting with normalization on all the sub-queries, since the blog referenced above shows that we should normalize all sub-queries. This is also not a one-way door.
We will have a min score from the different shards, but as of now we don't have a way to find the global min score for k-NN queries; to do that we would need to run exact k-NN, and I am trying to find a way to do this. For text matching, since we iterate over all the segments, we will have the min score. But more deep-diving is required on the feasibility of this solution.
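For context, a sketch of what an exact k-NN pass could look like using the k-NN plugin's script scoring (index, field, and query vector below are placeholders); having to score every document this way is exactly what makes it infeasible at scale:
POST my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "my_vector_field",
          "query_value": [1.0, 2.0, 3.0],
          "space_type": "l2"
        }
      }
    }
  }
}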
Normalization is a data transformation process that aligns data values to a common scale or distribution of values.
Normalization requires that you know, or can accurately estimate, the minimum and maximum observable values. You may be able to estimate these values from your available data.
1. y = (x - min) / (max - min)
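For example, with min = 0.2 and max = 0.9, a raw score x = 0.55 normalizes to y = (0.55 - 0.2) / (0.9 - 0.2) = 0.5.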
Now you have 2 or more result sets, and you need to combine them. There are many ways to combine the results (geometric mean, arithmetic mean, etc.).
Approach 1: Normalized arithmetic mean
Assume we have 2 sets of results, results_a and results_b. Each result has a score and a document ID. First, we will only consider the intersection of results in a and b (i.e., results_c = results_a ∩ results_b). Then, each document in results_c will have 2 scores: one from a and one from b. To combine the scores, we will first normalize all scores in results_a and results_b, and then take the arithmetic mean of them:
score = (norm(score_a) + norm(score_b)) / 2
Approach 2: Normalized geometric mean
Similar to Approach 1, but instead of taking the arithmetic mean, we will take the geometric mean:
score = sqrt(norm(score_a) * norm(score_b))
Approach 3: Normalized harmonic mean
Similar to Approach1, but instead of taking the arithmetic mean, we will take the harmonic mean:
score = 2 / (1/norm(score_a) + 1/norm(score_b))
Approach 4: Normalized Weighted Linear Combination
Instead of taking the mean of the scores, we can just try different weights for each score and combine them linearly.
score = w_a * norm(score_a) + w_b * norm(score_b)
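For instance, with w_a = 0.4, w_b = 0.6, norm(score_a) = 0.5, and norm(score_b) = 0.8, the combined score is 0.4 * 0.5 + 0.6 * 0.8 = 0.68.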
Approach 5: Normalized Weighted Geometric Combination
Similar to above approach, but instead of combining with addition, we can combine with multiplication:
score = log(1 + w_a * norm(score_a)) + log(1 + w_b * norm(score_b))
This approach has previously been recommended for score combination with OpenSearch/Elasticsearch: elastic/elasticsearch#17116 (comment).
Approach 6: Interleave results
In this approach, we will produce the ranking by interleaving the results from each set together. So rankings 1, 3, 5, ... would come from results_a and 2, 4, 6, ... would come from results_b.
Related to opensearch-project/ml-commons#688, our tests fail because the default value of the "plugins.ml_commons.only_run_on_ml_node" setting is true.
We will need to update it before running tests with:
PUT /_cluster/settings
{
"persistent" : {
"plugins.ml_commons.only_run_on_ml_node" : false
}
}
The integration test failed at distribution level for component neural-search
Version: 2.7.0
Distribution: deb
Architecture: arm64
Platform: linux
Please check the logs: https://build.ci.opensearch.org/job/integ-test/4544/display/redirect
* Steps to reproduce: See https://github.com/opensearch-project/opensearch-build/tree/main/src/test_workflow#integration-tests
* Access cluster logs:
- With security (if applicable)
- Without security (if applicable)
Note: An all-in-one test report manifest with all the details is coming soon. See opensearch-project/opensearch-build#1274
Before querying with neural search, all documents should be ingested in the form of embedded vectors. Leaving the embedding process to the user offline would raise the bar for use. Thus we propose to implement an ingestion pipeline for document embedding. It consists of two parts: a model_id will be dedicated to each processor, and this kind of mapping can be retrieved by the querying phase.
A new processor type needs to be created with the name text_embedding. This processor has two parameters: model_id and field_map. Different models can produce different embedding results, so the user can specify a model_id that is already uploaded and loaded. field_map is the configuration where the user specifies which fields text embedding should be applied to in the ingestion pipeline. An example below:
PUT _ingest/pipeline/text-embedding-pipeline
{
"description": "Text embedding pipeline for several fields",
"processors": [
{
"text_embedding": {
"model_id": "WYjkv4MBHcWxVq8Jtc8U",
"field_map": {
"title": "title_knn",
"body_list": "body_list_knn",
"favorites": {
"game": "game_knn",
"movie": "movie_knn"
}
}
}
}
]
}
This issue proposes a new Processor interface for the Search Pipeline which will run between the phases of a search. This will allow plugins to transform the results retrieved from one phase before they go to the next phase, at the coordinator node level.
This RFC proposes a new set of APIs to manage processors that transform search requests and responses in OpenSearch. The Search Pipeline will be used to create and define these processors. Example:
// Create/update a search pipeline.
PUT /_search/pipeline/mypipeline
{
"description": "A pipeline to apply custom synonyms, result post-filtering, an ML ranking model",
"request_processors" : [
{
"external_synonyms" : {
"service_url" : "https://my-synonym-service/"
}
},
{
"ml_ranker_bracket" : {
"result_oversampling" : 2, // Request 2 * size results.
"model_id" : "doc-features-20230109",
"id" : "ml_ranker_identifier"
}
}
],
"response_processors" : [
{
"result_blocker" : {
"service_url" : "https://result-blocklist-service/"
},
"ml_ranker_bracket" : {
// Placed here to indicate that it should run after result_blocker.
// If not part of response_processors, it will run before result_blocker.
"id" : "ml_ranker_identifier"
}
}
]
}
// Return identifiers for all search pipelines.
GET /_search/pipeline
// Return a single search pipeline definition.
GET /_search/pipeline/mypipeline
// Delete a search pipeline.
DELETE /_search/pipeline/mypipeline
Search API Changes
// Apply a search pipeline to a search request.
POST /my-index/_search?search_pipeline=mypipeline
{
"query" : {
"match" : {
"text_field" : "some search text"
}
}
}
// Specify an ad hoc search pipeline as part of a search request.
POST /my-index/_search
{
"query" : {
"match" : {
"text_field" : "some search text"
}
},
"pipeline" : {
"request_processors" : [
{
"external_synonyms" : {
"service_url" : "https://my-synonym-service/"
}
},
{
"ml_ranker_bracket" : {
"result_oversampling" : 2, // Request 2 * size results
"model_id" : "doc-features-20230109",
"id" : "ml_ranker_identifier"
}
}
],
"response_processors" : [
{
"result_blocker" : {
"service_url" : "https://result-blocklist-service/"
},
"ml_ranker_bracket" : {
// Placed here to indicate that it should run after result_blocker.
// If not part of response_processors, it will run before result_blocker.
"id" : "ml_ranker_identifier"
}
}
]
}
}
// Set default search pipeline for an existing index.
PUT /my-index/_settings
{
"index" : {
"default_search_pipeline" : "my_pipeline"
}
}
// Remove default search pipeline for an index.
PUT /my-index/_settings
{
"index" : {
"default_search_pipeline" : "_none"
}
}
// Create a new index with a default search pipeline.
PUT my-index
{
"mappings" : {
// ...index mappings...
},
"settings" : {
"index" : {
"default_search_pipeline" : "mypipeline",
// ... other settings ...
}
}
}
For the Normalization and Score Combination feature, we need a way to normalize the scores received for different sub-queries from different shards at the coordinator node, before we can start combining the scores. This needs to be done after the Query phase is completed and before the Fetch phase is started in a _search API call.
The proposed solution is to extend the Search Pipeline Processor interface to create a new processor interface that can run between phases. We will onboard normalization as the first use case for this processor, which will run after the Query phase and before the Fetch phase of a search request.
The flow chart (not reproduced here) assumes that the processor runs after the Query phase and before the Fetch phase for search_type=query_then_fetch, which is the default search type. But none of the interfaces assume that these are the only 2 phases in OpenSearch.
interface SearchPhaseResultsProcessor extends Processor {

    <Result extends SearchPhaseResult> void process(
        final SearchPhaseResults<Result> results, final SearchPhaseContext context);

    /**
     * This function is called by the Search Pipeline Service before invoking the processor.
     */
    <Result extends SearchPhaseResult> boolean shouldRunProcessor(
        final SearchPhaseResults<Result> results,
        final SearchPhaseContext context,
        final SearchPhaseNames beforePhase,
        final SearchPhaseNames nextPhase);
}
/**
 * Currently when we create phases we pass a string as the phase name. This enum
 * will allow us to define the phase names and use them in different places when required.
 */
// mark internal
public enum SearchPhaseNames {
    // There are many more; only a few are added here.
    QUERY_PHASE("query"), FETCH_PHASE("fetch"), DFS_QUERY_PHASE("dfs_query");

    @Getter
    private final String name;

    SearchPhaseNames(final String name) {
        this.name = name;
    }
}
The SearchPhaseNames enum will provide the necessary abstraction and a proper naming convention for OpenSearch phase names.
Pros:
Cons:
// Create/update a search pipeline.
PUT /_search/pipeline/my_pipeline
{
"description": "A pipeline that adds a Normalization and Combination Transformer",
"phase_results_processors" : [
{
"normalization-processor" : {
"technique": "min-max", // there can be others techniques. I know this only for now.
}
}
]
}
// All other APIs remain the same, e.g. for making this pipeline the default pipeline.
The idea here is that the Search Pipeline Service, before invoking the execute function of a processor, will check whether the condition to run the processor is met. It does this by calling the getBeforePhases and getAfterPhases functions and validating that the phase which has just completed is in the after-list and that the next phase to run is in the before-list (see the sketch after the code below).
interface SearchPhaseResultsProcessor extends Processor {

    <Result extends SearchPhaseResult> void process(
        final SearchPhaseResults<Result> results, final SearchPhaseContext context);

    /**
     * Returns a list of phases, before which this processor should be run.
     */
    List<SearchPhaseNames> getBeforePhases();

    /**
     * Returns a list of phases, after which this processor should be run.
     */
    List<SearchPhaseNames> getAfterPhases();
}
/**
 * Currently when we create phases we pass a string as the phase name. This enum
 * will allow us to define the phase names and use them in different places when required.
 */
public enum SearchPhaseNames {
    // There are many more; only a few are added here.
    QUERY_PHASE("query"), FETCH_PHASE("fetch"), DFS_QUERY_PHASE("dfs_query");

    @Getter
    private final String name;

    SearchPhaseNames(final String name) {
        this.name = name;
    }
}
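As mentioned above, a hypothetical sketch of the check the Search Pipeline Service could perform (the method and parameter names are assumptions based on the interface above, not existing OpenSearch code):
static boolean shouldRun(final SearchPhaseResultsProcessor processor,
                         final SearchPhaseNames completedPhase,
                         final SearchPhaseNames nextPhase) {
    // The phase that just completed must be in the processor's "after" list,
    // and the phase about to run must be in its "before" list.
    return processor.getAfterPhases().contains(completedPhase)
        && processor.getBeforePhases().contains(nextPhase);
}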
The Search Operation Listeners work at the shard level, not at the coordinator node level, and we need to do the normalization at the coordinator node; hence this solution is rejected. Please check this, and this code reference: it comes in the code path when the query is executed at the shard level.
The high-level idea here is to create a phase which runs in between the Query and Fetch phases and does the normalization.
Pros:
Cons:
OpenSearch provides an extension point where plugins can add fetch subphases which run at the end, after all core subphases are executed. We could create a fetch subphase that does the normalization and score combination. The problem with this is that, as we have multiple sub-queries, we would need to change the interfaces to make sure all the information required for normalization is passed in. This would result in duplicated information and multiple hacks to pass data through the earlier fetch phases.
Pros:
Cons:
Minor issue: https://github.com/opensearch-project/neural-search#opensearch-neural-search has a link to geospatial.
The XContent namespace refactor from common -> core is going to be merged to opensearch/2.x, which will break the 2.x build. This issue is for refactoring XContent imports from the common to the core namespace after the core namespace change is merged.
Depends on opensearch-project/OpenSearch#6470
Release Version 2.4.0
This is a component issue for 2.4.0.
Coming from opensearch-build#2649. Please follow the following checklist.
Please refer to the DATES / CAMPAIGNS in that post.
This issue captures the state of the OpenSearch release, on component/plugin level; its assignee is responsible for driving the release. Please contact them or @mention them on this issue for help.
Any release related work can be linked to this issue or added as comments to create visibility into the release status.
There are several steps to the release process; these steps are completed for the component release as a whole, and components that are behind present risk to the release. The component owner resolves the tasks in this issue and communicates with the overall release owner to make sure each component is moving along as expected.
Steps have completion dates for coordinating efforts between the components of a release; components can start as soon as they are ready far in advance of a future release. The most current set of dates is on the overall release issue linked at the top of this issue.
Linked at the top of this issue, the overall release issue captures the state of the entire OpenSearch release, including references to this issue; the release owner (its assignee) is responsible for communicating the release status broadly. Please contact them or @mention them on that issue for help.
If including changes in this release, increment the version on the 2.4 branch to 2.4.0 for Min/Core, and 2.4.0.0 for components. Otherwise, keep the version number unchanged for both.
- Tag all features and issues targeted for this release with v2.4.0.
- Create the 2.4 branch.
- Verify all issues tagged 2.4.0 are complete.
- Add the 2.4.0 release branch in the distribution manifest.
- Verify all PRs for 2.4.0 have been merged.
Received Error: Error building neural-search, retry with: ./build.sh manifests/2.7.0/opensearch-2.7.0.yml --component neural-search.
The distribution build for neural-search has failed.
Please see build log at https://build.ci.opensearch.org/job/distribution-build-opensearch/7200/consoleFull
No
I would like to have a highlighter that supports the neural search capability. It should highlight the most relevant sentences in the documents returned by a neural search.
There are no alternatives available at the moment, so the only choice is to develop one.
I tried to implement it myself but faced the following challenges:
@Override
public HighlightField highlight(FieldHighlightContext fieldContext) {
    System.out.println("Query: " + fieldContext.context.query());
    return null; // just inspecting the query for now
}
@Override
public HighlightField highlight(FieldHighlightContext fieldContext) {
    System.out.println("highlighting..");
    List<Text> responses = new ArrayList<>();
    // Pseudocode step: get the query text from fieldContext.context.query()
    String queryText = extractQueryText(fieldContext.context.query()); // hypothetical helper
    // Pseudocode step: get the sentences from the search hit
    List<String> sentences = extractSentences(fieldContext); // hypothetical helper
    sentences.add(0, queryText); // sentences = query + sentences
    List<List<Float>> vectors = clientAccessor.inferSentences("U3R9CYcBOk2JRjrls0nH", sentences);
    List<Float[]> embeddings = new ArrayList<>();
    for (List<Float> v : vectors) {
        embeddings.add(v.toArray(new Float[0]));
    }
    System.out.println("Computing similarity");
    float maxSim = 0;
    String maxSentence = null;
    if (!embeddings.isEmpty()) {
        Float[] queryEmbedding = embeddings.get(0);
        for (int i = 1; i < embeddings.size(); i++) {
            float sim = cosineSim(queryEmbedding, embeddings.get(i));
            if (sim > maxSim) { // set maxSim and maxSentence
                maxSim = sim;
                maxSentence = sentences.get(i);
            }
        }
    }
    if (maxSentence != null) {
        responses.add(new Text(maxSentence));
    }
    return new HighlightField(fieldContext.fieldName, responses.toArray(new Text[0]));
}
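The snippet above also needs the cosineSim helper it references; a minimal sketch (plain cosine similarity, no plugin API involved):
static float cosineSim(final Float[] a, final Float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    // Cosine similarity: dot(a, b) / (|a| * |b|)
    return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
}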
Having said the above, I hope you can tell me what route to take here. Is this feature going to be available in the plugin any time soon?
Thanks
This is related to the customer-created GitHub issue: #109
With the following configuration, which uses a nested source field, embeddings are not computed; this should be supported:
PUT /_ingest/pipeline/neural_pipeline_nested
{
"description": "Neural Search Pipeline for message content",
"processors": [
{
"text_embedding": {
"model_id": "SXXx8YUBR2ZWhVQIkghB",
"field_map": {
"message.text": "message_embedding"
}
}
}
]
}
PUT /neural-test-index-nested
{
"settings": {
"index.knn": true,
"default_pipeline": "neural_pipeline_nested"
},
"mappings": {
"properties": {
"message_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"engine": "lucene"
}
},
"message.text": {
"type": "text"
},
"color": {
"type": "text"
}
}
}
}
POST /_bulk
{"create":{"_index":"neural-test-index-nested","_id":"0"}}
{"message":{"text":"Text 1"},"color":"red"}
{"create":{"_index":"neural-test-index-nested","_id":"1"}}
{"message":{"text":"Text 2"}, "color": "black"}
GET /neural-test-index-nested/_search
The field_map keys should support the . operator to define nested fields.
The customer can create a nested field mapping using:
PUT /_ingest/pipeline/neural_pipeline_nested
{
"description": "Neural Search Pipeline for message content",
"processors": [
{
"text_embedding": {
"model_id": "SXXx8YUBR2ZWhVQIkghB",
"field_map": {
"message": {
"text": "message_embedding"
}
}
}
}
]
}
Update the main branch of the plugin to point to 3.0.0 for OpenSearch and all dependent plugins.
When:
- using script_score with neural queries on multiple (different) vector fields, like in this comment,
- the script references _score, and
- explain=true is set,
then, if a document is returned by some neural field queries (within the sub-query's top-k) but not by others, the query fails with a script runtime exception and the error: Null score for the docID: 2147483647
(At least I think this is why... I'm new to OpenSearch and neural search, so apologies - my explanation for why this happens is just my best guess!)
To reproduce, set up an index with two vector fields, title_embedding and description_embedding, and run:
GET /myindex/_search?explain=true
{
"from": 0,
"size": 100,
"query": {
"bool" : {
"should" : [
{
"script_score": {
"query": {
"neural": {
"title_embedding": {
"query_text": "test",
"model_id": "xGbq_YcB3ggx1CR0Nfls",
"k": 10
}
}
},
"script": {
"source": "_score * 1"
}
}
},
{
"script_score": {
"query": {
"neural": {
"description_embedding": {
"query_text": "test",
"model_id": "xGbq_YcB3ggx1CR0Nfls",
"k": 10
}
}
},
"script": {
"source": "_score * 1"
}
}
}
]
}
}
}
See an error like:
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.opensearch.knn.index.query.KNNScorer.score(KNNScorer.java:51)",
"org.opensearch.script.ScoreScript.lambda$setScorer$4(ScoreScript.java:156)",
"org.opensearch.script.ScoreScript.get_score(ScoreScript.java:168)",
"_score * 1",
"^---- HERE"
],
"script": "_score * 1",
"lang": "painless",
"position": {
"offset": 0,
"start": 0,
"end": 10
}
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "opensearch_content",
"node": "vnyA5s-aQUOmTj6IHosYXA",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.opensearch.knn.index.query.KNNScorer.score(KNNScorer.java:51)",
"org.opensearch.script.ScoreScript.lambda$setScorer$4(ScoreScript.java:156)",
"org.opensearch.script.ScoreScript.get_score(ScoreScript.java:168)",
"_score * 1",
"^---- HERE"
],
"script": "_score * 1",
"lang": "painless",
"position": {
"offset": 0,
"start": 0,
"end": 10
},
"caused_by": {
"type": "runtime_exception",
"reason": "Null score for the docID: 2147483647"
}
}
}
]
},
"status": 400
}
Note the high size and low k. You might need to adjust the query_text or k to find a combination where a document is returned in one neural query's top k and not the other.
Remove explain=true from the query and notice it succeeds.
The _score for the affected field should be 0, or the affected field should be excluded entirely - either way, the _explanation should accurately reflect this.
Environment: OpenSearch 2.7, Ubuntu 22.04.
I'm not sure why it only happens with explain=true. (I can't explain it.)
It also only happens if using script_score. If using multiple neural queries directly, there is no error, but then there is no per-field score in _explanation - the total is correct, but each field's score value is reported as 1. opensearch-project/k-NN#875 describes this problem. My use case is: I'd like to try using the similarity scores of each field as features in a Learning to Rank model, which means I need to get each score individually.
In some edge cases, e.g. an instance going down, the inference requests for either ingestion or querying can fail; adding a retry can relieve this issue dramatically.
Add a basic retry mechanism in the neural search inference client.
An alternative is a more complicated retry with a backoff policy, jitter, etc., but for our system it is an internal retry, which means we know how the retry will behave (e.g. round robin or least load). So a basic retry is enough for our system, as sketched below.
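A minimal sketch of what such a basic retry could look like (the wrapper shape and retry count are assumptions, not the plugin's actual API):
import java.util.function.Supplier;

// Bounded retry with no backoff or jitter, per the reasoning above: the retry is
// internal, so the client's routing (round robin / least load) picks the next node.
static <T> T withRetry(final Supplier<T> call, final int maxRetries) {
    RuntimeException lastFailure = null;
    for (int attempt = 0; attempt < maxRetries; attempt++) {
        try {
            return call.get();
        } catch (final RuntimeException e) {
            lastFailure = e; // remember the failure and try again
        }
    }
    throw lastFailure != null ? lastFailure : new IllegalArgumentException("maxRetries must be >= 1");
}
For example, an ingestion-time inference call could be wrapped as withRetry(() -> clientAccessor.inferSentences(modelId, sentences), 3).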
NA
As a part of ingestion and search we depend on the ML Commons library to create the embeddings for the query string and user documents, and on the k-NN plugin for doing k-NN search. This task tracks the integration of the ML Commons library into the Neural Search plugin.
K-NN Plugin: https://github.com/opensearch-project/k-NN
ML Commons: https://github.com/opensearch-project/ml-commons
In 2.4.0, the OpenSearch k-NN plugin implemented filtering for the Lucene engine for OpenSearch queries: opensearch-project/k-NN#519.
Given that the neural query is a wrapper around the k-NN query type, we can add this feature into neural search as well. We would add a filter sub-object to our current query type, as sketched below.
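One possible shape for such a query, mirroring the existing neural query examples in this document; the filter placement and its pass-through to the underlying k-NN query are assumptions to be settled during implementation:
POST /my-index/_search
{
  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "wild west",
        "model_id": "<model_id>",
        "k": 100,
        "filter": {
          "term": { "color": "red" }
        }
      }
    }
  }
}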