opensearch-project / neural-search
Plugin that adds dense neural retrieval into the OpenSearch ecosystem
License: Apache License 2.0
Is your feature request related to a problem?
Following a problem in another plugin (opensearch-project/opensearch-build#2043), and coming from opensearch-project/opensearch-build#58, there is no automated testing verifying that this plugin runs as part of the OpenSearch distribution.
What solution would you like?
Run integration tests as part of the distribution.
Perform benchmarks for queries issued via the new "neural" query type.
This will provide insight into the performance of the new query type (neural) that we are adding to OpenSearch. We will use OpenSearch Benchmark to perform this.
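A minimal invocation sketch, assuming a custom OpenSearch Benchmark workload containing neural queries has been defined (the workload name and host below are placeholders):
opensearch-benchmark execute-test \
  --pipeline=benchmark-only \
  --workload=<neural-query-workload> \
  --target-hosts=localhost:9200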
Metrics Identified:
This benchmark will validate the results we obtained from experiments that combined the BM-25 and k-NN scores separately. It will also provide insight into whether scores for one query type need to be boosted, and when to boost them.
Metrics Identified:
Science Experiment Metrics
Release Version 2.5.0
This is a component issue for 2.5.0.
Coming from opensearch-build#2908. Please follow the checklist below.
Please refer to the DATES / CAMPAIGNS in that post.
This issue captures the state of the OpenSearch release, on component/plugin level; its assignee is responsible for driving the release. Please contact them or @mention them on this issue for help.
Any release related work can be linked to this issue or added as comments to create visibility into the release status.
There are several steps to the release process; these steps are completed as the whole component release, and components that fall behind present risk to the release. The component owner resolves the tasks in this issue and communicates with the overall release owner to make sure each component is moving along as expected.
Steps have completion dates for coordinating efforts between the components of a release; components can start as soon as they are ready far in advance of a future release. The most current set of dates is on the overall release issue linked at the top of this issue.
Linked at the top of this issue, the overall release issue captures the state of the entire OpenSearch release, including references to this issue; the release owner (the assignee) is responsible for communicating the release status broadly. Please contact them or @mention them on that issue for help.
If including changes in this release, increment the version on the 2.5 branch to 2.5.0 for Min/Core, and 2.5.0.0 for components. Otherwise, keep the version number unchanged for both.
- [ ] All issues tagged v2.5.0 are complete.
- [ ] The version increment to 2.5.0 on the 2.5 branch is complete.
- [ ] The 2.5 release branch is included in the distribution manifest as 2.5.0.
- [ ] All changes tagged 2.5.0 have been merged.
Not able to build and work on the main branch of the repo.
Pull the neural-search main branch and try running the ./gradlew run command.
./gradlew run should succeed.
macOS
NA
This could be related to the breaking change introduced in OpenSearch by moving org.opensearch.common.xcontent.XContentBuilder to org.opensearch.core.xcontent.XContentBuilder.
Currently, the neural-search plugin search functionality relies on the user to pass the "model_id" with each query.
"query": {
"neural": {
"<vector_field>": {
"query_text": "hello world",
"model_id": "csdsdcsdsadasdcsad",
"k": int
}
}
This offers a suboptimal user experience. The model IDs are randomized strings that add confusion to a given query. Additionally, search behavior has to change when the model is updated (the ID needs to be updated). While it may be possible to come up with some kind of alias scheme for the model ID (see opensearch-project/ml-commons#554), the best user experience would be for the user writing the query to not need to know any details about the model_id.
We want to offer a user experience like this:
"query": {
"neural": {
"<vector_field>": {
"query_text": "hello world",
"k": int
}
}
Similarly, for indexing, the same information could be used if no model id is specified, so the experience would look like:
PUT _ingest/pipeline/<pipeline_name>
{
"description": "string",
"processors" : [
{
"text_embedding": {
"field_map": {
"<input_field_name>": "<output_field_name>",
...
}
}
},
...
]
}
In this option, we would associate the model mapping in a field in the _meta field of the index.
PUT my-neural-index
{
"mappings": {
"_meta": {
"neural_field_map": {
"field_name_1": "<model_id>",
"field_name_2": "<model_id>",
"field_name_3": "<model_id>",
}
}
}
}
Similar to the _meta field option, we could make the map an index setting (we would need to validate that settings can in fact be maps). Index settings would give us more control over validation of input model IDs, as well as hooks to trigger actions when settings are updated.
PUT my-neural-index/_settings
{
"index": {
"neural_field_map": {
"field_name_1": "<model_id>",
"field_name_2": "<model_id>",
"field_name_3": "<model_id>",
}
}
}
Using a system index is another approach to associating this model information with a given index. However, maintaining a system index is heavier than relying on a _meta field. This would require several APIs to manage the functionality. If we are to create a system index for model management, it would be better to group this functionality with ml-commons, which already has a model system index (see next option).
Another alternative is to delegate the functionality of associating a model with an index/field/function to the model management APIs of ml-commons. In this solution, users would provide metadata during upload about which indices/fields/functions to associate a model with. This has the benefit of abstracting all model management (including association) behind the ml-commons APIs.
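Hypothetically, the association could then be declared at upload time. In the sketch below, the index_field_map parameter is purely illustrative and not an existing ml-commons field; the rest of the body follows the existing _upload API:
POST /_plugins/_ml/models/_upload
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "index_field_map": {
    "my-neural-index": ["field_name_1", "field_name_2"]
  }
}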
Currently, I am on the fence between approaches 1, 2, and 4 as my preferred solution and am looking for feedback. Additionally, if there are other alternatives we should consider, please post them here.
This issue describes the details of the low-level query clause design in scope of the Score Normalization and Combination feature. This is one of multiple LLDs in scope of that feature. Pre-reading the high-level design, [RFC] High Level Approach and Design For Normalization and Score Combination, is highly recommended. We expect the antecedent API design to be published soon; for now we will reference its draft version.
As per the HLD and the API LLD, this feature needs a new query clause in OpenSearch. The new query will fetch results at the shard level for different sub-queries during the query phase of request handling. Query results will be processed in a later phase by a new processor on the coordinator node. The proposed name for the new query is "Hybrid".
Execution at each shard will be independent of other shards. The focus of the change is to get all top scores for each sub-query; all packing and reduction will happen in later stages.
The new query will be added as part of the Neural Search plugin, and most of the code changes will be done in the plugin repo.
Different sub-queries should be abstracted and not limited to particular query types like k-NN or text match based on terms or keywords.
The new query should keep added latency (for functions like query parsing) to a minimum and not degrade performance in either latency or resource utilization compared to a similar query that does combination at the shard level.
In this document we propose a solution for the questions below:
The following items will be covered in other design documents (e.g., the API design):
The new Hybrid query will be registered in the Neural Search plugin using a new QueryBuilder class. The builder class will create a new Query class implementation that contains the logic to execute each sub-query and combine weights per sub-query at the shard level. The new query needs a doc collector that will process the results of each sub-query and get the top x results (top docs) at the shard level. This information should clearly identify which sub-query each result belongs to. Metrics like max and min score can be added if needed.
These results will be used by the Query Phase Searcher to pack and send shard results to the coordinator node for normalization and score combination.
The overall class structure is very similar to the DisjunctionMax query that is part of Lucene. See also Defining _search API Input (#1): Score Combination and Normalization for Semantic Search [HLD].
The feature will be available to users of the Neural Search plugin, which is experimental. Once the user enables the plugin, the Normalization and Score Combination feature becomes available automatically.
In this design we will not create a new DTO object to store the scores of individual sub-queries. That is required for the doc collector and will be added later, together with the custom QueryPhaseSearcher implementation. For this implementation we will use existing core DTOs, which only allows collecting and returning scores from the first sub-query.
In the initial implementation phase we are skipping pagination for query results. This is mainly due to implementation complexity and the foreseen performance overhead for some query types. For instance, a k-NN query (the base query for semantic search in the neural-search plugin) must collect not only the last "page" of results but also all previous pages (e.g., to get results 60-80, k-NN will select the first 80 results and discard results 0-60). Such an approach is very inefficient and breaks the functional requirement for minimal added latency.
In the initial implementation phase we are skipping query explain. The feature will be released in experimental mode, and we want to make it stable before providing details on how query results are formed.
In the initial implementation phase we will use the default sequential execution of sub-queries.
We’re going to use the existing plugin class NeuralSearch as the entry point to register the new HybridQueryBuilder. The builder class creates an instance of HybridQuery, which encapsulates the logic for getting results for each sub-query. Part of the Query's responsibility is the creation of the Weight and Scorer, which both provide scores for the results of each sub-query at the shard level.
All the classes and logic above are agnostic to the type of sub-query, which targets the functional requirement of being query-type agnostic.
Below is the general data flow for getting query results for the new Hybrid query. This represents a single shard; the same execution happens on every shard in the index.
For multiple sub-queries, the query supports a JSON array:
POST <index-name>/_search
{
"query": {
"hybrid": {
"queries": [
{ /* neural query */ }, // added as an example
{ /* standard text search */ } // if a user wants to boost or adjust scores, it must be done within this sub-query clause
]
}
}
}
A single sub-query can be passed as a JSON object:
POST <index-name>/_search
{
"query": {
"hybrid": {
"queries": { /* neural query */ } // added as an example
}
}
}
Parses input and produces an instance of the Query class. For parsing each sub-query we use existing core logic from AbstractQueryBuilder.
The class has a collection of query builders, one per sub-query.
Rewrites each of the sub-queries.
The class has a collection of Query objects, one per sub-query.
Constructs a weight object for each sub-query. Returns a HybridScorer object that has scorers for each sub-query.
Throws an exception if explain is called.
Responsible for iterating over the results of each sub-query in descending score order. Keeps a priority queue of doc IDs. For each next doc ID, it gets the score from each sub-query. Sub-query scores are stored in an array in the same order the sub-queries appeared in the input query, which allows mapping each sub-query to its score.
Registers the Hybrid query and returns a collection of QuerySpecs for the NeuralSearch plugin.
Main implementation details related to potential security threats:
The query is testable via the existing _search REST API and lower-level direct API calls. Main testing will be done via unit and integration tests. We don't need backward compatibility tests, as neural-search is in experimental mode and there is no commitment to support previous versions.
Tests will be focused on overall query stability. Below are main test cases that will be covered:
Mentioned tests are part of the plugin repo CI and also can be executed on demand from development environment.
Tests for metrics like score correctness, performance, etc. will be added in later implementations, when the end-to-end solution is available.
The integration test failed at distribution level for component neural-search
Version: 2.7.0
Distribution: tar
Architecture: x64
Platform: linux
Please check the logs: https://build.ci.opensearch.org/job/integ-test/4534/display/redirect
* Steps to reproduce: See https://github.com/opensearch-project/opensearch-build/tree/main/src/test_workflow#integration-tests
* Access cluster logs:
- With security (if applicable)
- Without security (if applicable)
Note: An all-in-one test report manifest with all the details is coming soon. See opensearch-project/opensearch-build#1274
When the embedding fields are all absent from a document, or when doing a partial update that does not include the embedding fields, an IllegalArgumentException is raised as below:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "empty docs"
}
],
"type": "illegal_argument_exception",
"reason": "empty docs"
},
"status": 400
}
This issue is reported here: https://forum.opensearch.org/t/feedback-neural-search-plugin-experimental-release/11501/8.
First, create an ingest pipeline and an index as below:
PUT _ingest/pipeline/nlp-pipeline
{
"description": "An example neural search pipeline",
"processors": [
{
"text_embedding": {
"model_id": "kI6NhoQB3oLQzIJTkldg",
"field_map": {
"text": "text_knn"
}
}
}
]
}
PUT /my-nlp-index-1
{
"settings": {
"index.knn": true,
"default_pipeline": "nlp-pipeline"
},
"mappings": {
"properties": {
"passage_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "nmslib",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
},
"passage_text": {
"type": "text"
}
}
}
}
PUT my-nlp-index-1/_doc/1
{
"name": "doc1"
}
PUT my-nlp-index-1/_doc/1
{
"text": "doc1"
}
Then do a partial update to this document:
PUT my-nlp-index-1/_doc/1
{
"name": "doc2"
}
No exception should be raised, and the document should be updated successfully.
Not related to the environment.
When defining a field_map containing nested fields, the pipeline fails to compute embeddings.
With the following configuration, using non-nested field types, embeddings are computed:
PUT /_ingest/pipeline/neural_pipeline
{
"description": "Neural Search Pipeline for message content",
"processors": [
{
"text_embedding": {
"model_id": "SXXx8YUBR2ZWhVQIkghB",
"field_map": {
"message": "message_embedding"
}
}
}
]
}
PUT /neural-test-index
{
"settings": {
"index.knn": true,
"default_pipeline": "neural_pipeline"
},
"mappings": {
"properties": {
"message_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"engine": "lucene"
}
},
"message": {
"type": "text"
},
"color": {
"type": "text"
}
}
}
}
POST /_bulk
{"create":{"_index":"neural-test-index","_id":"0"}}
{"message":"Text 1","color":"red"}
{"create":{"_index":"neural-test-index","_id":"1"}}
{"message":"Text 2","color":"black"}
GET /neural-test-index/_search
DELETE /neural-test-index
With the following configuration using a nested source field, embeddings are not computed:
PUT /_ingest/pipeline/neural_pipeline_nested
{
"description": "Neural Search Pipeline for message content",
"processors": [
{
"text_embedding": {
"model_id": "SXXx8YUBR2ZWhVQIkghB",
"field_map": {
"message.text": "message_embedding"
}
}
}
]
}
PUT /neural-test-index-nested
{
"settings": {
"index.knn": true,
"default_pipeline": "neural_pipeline_nested"
},
"mappings": {
"properties": {
"message_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"engine": "lucene"
}
},
"message.text": {
"type": "text"
},
"color": {
"type": "text"
}
}
}
}
POST /_bulk
{"create":{"_index":"neural-test-index-nested","_id":"0"}}
{"message":{"text":"Text 1"},"color":"red"}
{"create":{"_index":"neural-test-index-nested","_id":"1"}}
{"message":{"text":"Text 2"}, "color": "black"}
GET /neural-test-index-nested/_search
The neural ingestion pipeline should be able to handle nested fields.
docker image: opensearchproject/opensearch:2.5.0
The models referenced above were uploaded with the following configuration:
{
"name": "all-MiniLM-L6-v2",
"version": "1.0.0",
"description": "sentence transformers model",
"model_format": "TORCH_SCRIPT",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "sentence_transformers"
},
"url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}
Copying the customer request from Forum post: https://forum.opensearch.org/t/extending-neural-search-pipeline-to-named-entity-recognition-and-other-metadata-extracting-models/13078
I have a use case involving a named entity recognition model for documents and queries during indexing and querying. The documents will be filtered based on the presence of extracted entities matched against the query's extracted entities. The pipeline will work similarly to the existing neural search pipeline, with one difference: in this use case, the queries and documents will be passed through an NER (named entity recognition) model and enriched with extra metadata, such as entities, instead of the vectors provided by an embedding model.
It would help if we could extend the neural-search pipeline to include model(s) that enable named entity extraction, embeddings, image segments (finding image components for image search), etc., so that the query/document extracts enough metadata through the various models in my neural search pipeline before matching.
Please give a +1 if you are looking for this feature. If possible, comment explaining your use case.
Received Error: Error building neural-search, retry with: ./build.sh manifests/2.7.0/opensearch-2.7.0.yml --component neural-search.
The distribution build for neural-search has failed.
Please see build log at https://build.ci.opensearch.org/job/distribution-build-opensearch/7395/consoleFull
Create a new query type: "neural". Internally, it will use ML-Commons to take a query string and create a vector from it. From there, it will build a k-NN query.
Interface will look like:
GET <index_name>/_search
{
"size": int,
"query": {
"neural": {
"<vector_field>": {
"query_text": "string",
"model_id": "string",
"k": int
}
}
}
}
vector_field — Field to execute the k-NN query against
query_text — (string) Query text used to produce the vector query
model_id — (string) ID of the model used to encode the query text into a vector
k — (int) Number of results to return from the k-NN search
For more details, refer to #11 (comment)
BM25 works well in exact-match use cases, and the k-NN score works well at understanding context and retrieving relevant documents. It is important to get the benefits of both of these relevancy mechanisms, and one could achieve this by combining their scores. One caveat is that the scores are on different scales, and hence some kind of normalization is required.
High Level Tasks:
Follow opensearch-project/.github#125 to baseline MAINTAINERS, CODEOWNERS, and external collaborator permissions.
Close this issue when:
If this repo's permissions were already baselined, please confirm the above when closing this issue.
Coming from opensearch-build#3185
We are de-coupling the task of publishing the maven snapshots from the centralized build workflow to individual repositories. This means each repo can now publish maven snapshots using GitHub Actions.
This change unblocks dependent components from waiting for a successful build before they can consume the snapshots. It also ensures that all snapshots are independent and up to date.
Add the following snippet to the publishing section in build.gradle:
publishing {
    repositories {
        maven {
            name = "Snapshots"
            url = "https://aws.oss.sonatype.org/content/repositories/snapshots"
            credentials {
                username "$System.env.SONATYPE_USERNAME"
                password "$System.env.SONATYPE_PASSWORD"
            }
        }
    }
}
./gradlew publishPluginZipPublicationToSnapshotsRepository
Please feel free to reach out to @opensearch-project/engineering-effectiveness.
This is a component issue for 2.6.0.
Coming from opensearch-build#3081. Please follow the checklist below.
Please refer to the DATES in that post.
This issue captures the state of the OpenSearch release, on component/plugin level; its assignee is responsible for driving the release. Please contact them or @mention them on this issue for help.
Any release related work can be linked to this issue or added as comments to create visibility into the release status.
There are several steps to the release process; these steps are completed as the whole component release, and components that fall behind present risk to the release. The component owner resolves the tasks in this issue and communicates with the overall release owner to make sure each component is moving along as expected.
Steps have completion dates for coordinating efforts between the components of a release; components can start as soon as they are ready far in advance of a future release. The most current set of dates is on the overall release issue linked at the top of this issue.
Linked at the top of this issue, the overall release issue captures the state of the entire OpenSearch release, including references to this issue; the release owner (the assignee) is responsible for communicating the release status broadly. Please contact them or @mention them on that issue for help.
If including changes in this release, increment the version on the 2.x branch to 2.6.0 for Min/Core, and 2.6.0.0 for components. Otherwise, keep the version number unchanged for both.
- [ ] All issues tagged v2.6.0 are complete.
- [ ] The version increment to 2.6.0 is complete.
- [ ] The 2.6 release branch is included in the distribution manifest as 2.6.0.
- [ ] All changes tagged 2.6.0 have been merged.
Make the NeuralSearch class implement the Extensible interface.
Related to the RFC. The current problem with the RFC is that when we are combining scores from different queries (e.g. BM25 and kNN), we need the min and max score of each query part. However, when using approximate kNN, we cannot accurately calculate the min score unless we do an exact kNN search on the index, which is not feasible. This leads to inconsistent score normalization, particularly when using pagination.
As discussed in detail in the RFC, one solution is to rely on the statistics we get from the documents we see during the current query. However, in specific scenarios where the min score can be known, we can do better. For example, when using BM25 or Cosine similarity in kNN, the user can optionally define the min score in the query to be 0 and -1, respectively.
By allowing the user to optionally define a min/max score in the query for normalization, we can ensure consistent score normalization across different queries for specific scenarios, particularly when using pagination. This would improve the accuracy and reliability of the search results for users.
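As a sketch, the query input could optionally carry these bounds. The score_bounds parameter below is purely illustrative and not a committed API; the hybrid/neural clauses follow the shapes proposed elsewhere for this feature:
POST my-index/_search
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "passage_text": "hello world" } },
        { "neural": { "passage_vector": { "query_text": "hello world", "k": 100 } } }
      ],
      "score_bounds": [
        { "min": 0 },            // BM25 scores are known to start at 0
        { "min": -1, "max": 1 }  // cosine similarity lies in [-1, 1]
      ]
    }
  }
}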
Here is an example of the pagination inconsistency that arises when we use the general solution:
Let's assume we have a query that consists of a text match query and a kNN query and we use this formula for score normalization:
x_normalized = (x - min) / (max - min)
and we set the page size to 10. Assume the top 10 kNN scores are between 0.9 and 1, and the scores for the rest of the documents fall to 0. Moving to the next page then changes the post-normalization scores drastically, and we might get pagination inconsistency with missing or duplicated results.
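To make this concrete: on page 1, min = 0.9 and max = 1.0, so a document scoring 0.95 normalizes to (0.95 - 0.9) / (1.0 - 0.9) = 0.5. Once page 2 pulls in documents scoring near 0, min drops to 0 and the same document normalizes to 0.95, so the combined ranking is not stable across pages.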
This is a component issue for 2.7.0.
Coming from opensearch-build#3230. Please follow the checklist below.
Please refer to the DATES in that post.
This issue captures the state of the OpenSearch release, on component/plugin level; its assignee is responsible for driving the release. Please contact them or @mention them on this issue for help.
Any release related work can be linked to this issue or added as comments to create visibility into the release status.
There are several steps to the release process; these steps are completed as the whole component release, and components that fall behind present risk to the release. The component owner resolves the tasks in this issue and communicates with the overall release owner to make sure each component is moving along as expected.
Steps have completion dates for coordinating efforts between the components of a release; components can start as soon as they are ready far in advance of a future release. The most current set of dates is on the overall release issue linked at the top of this issue.
Linked at the top of this issue, the overall release issue captures the state of the entire OpenSearch release, including references to this issue; the release owner (the assignee) is responsible for communicating the release status broadly. Please contact them or @mention them on that issue for help.
If including changes in this release, increment the version on the 2.x branch to 2.7.0 for Min/Core, and 2.7.0.0 for components. Otherwise, keep the version number unchanged for both.
- [ ] All issues tagged v2.7.0 are complete.
- [ ] The version increment to 2.7.0 is complete.
- [ ] The 2.7 release branch is included in the distribution manifest as 2.7.0.
- [ ] All changes tagged 2.7.0 have been merged.
As multiple developers work on this repo, adding a style check can help maintain consistent coding standards. I'm proposing we add a Spotless style check like OpenSearch core or the k-NN plugin (https://github.com/opensearch-project/k-NN/blob/main/gradle/formatting.gradle).
Ensure a MAJOR_VERSION.x branch exists; the main branch acts as the source of truth, effectively working on 2 versions at the same time.
opensearch-project/opensearch-plugins#142
Currently, plugins follow a branching strategy where they work on main for the next development iteration, effectively working on 2 versions at the same time. This is not always true for all plugins; the release branch or branch pattern is not consistent, and this lack of standardization limits multiple automation workflows and alignment with the core repo. More details are in the META ISSUE.
Follow OpenSearch core branching: create 1.x and 2.x branches; do not create 2.0 as a branch of main, instead create main -> 2.x -> 2.0. Maintain working CI for 3 releases at any given time.
A customer got an error while trying out the neural plugin. Steps followed:
Forum link: https://forum.opensearch.org/t/feedback-neural-search-plugin-experimental-release/11501/4
I got an error with the example in the documentation. Below is the code I tried. I tested the model, and it loads and works fine. However, ingesting the document fails with the error below.
Environment: This was done on windows in development mode (one node as cluster_manager, data, ingest and ml)
Error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "empty docs"
}
],
"type": "illegal_argument_exception",
"reason": "empty docs"
},
"status": 400
}
My Code:
POST /_plugins/_ml/models/_upload
{
"name": "all-MiniLM-L6-v2",
"version": "1.0.0",
"description": "test model",
"model_format": "TORCH_SCRIPT",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "sentence_transformers"
},
"url": "https://github.com/ylwu-amzn/ml-commons/blob/2.x_custom_m_helper/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}
POST /_plugins/_ml/models/kI6NhoQB3oLQzIJTkldg/_load
POST /_plugins/_ml/models/kI6NhoQB3oLQzIJTkldg/_predict
{
"text_docs": ["today is sunny"]
}
PUT _ingest/pipeline/nlp-pipeline
{
"description": "An example neural search pipeline",
"processors": [
{
"text_embedding": {
"model_id": "kI6NhoQB3oLQzIJTkldg",
"field_map": {
"text": "text_knn"
}
}
}
]
}
PUT /my-nlp-index-1
{
"settings": {
"index.knn": true,
"default_pipeline": "nlp-pipeline"
},
"mappings": {
"properties": {
"passage_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "nmslib",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
},
"passage_text": {
"type": "text"
}
}
}
}
POST my-nlp-index-1/_doc
{
"passage_text": "Hello world"
}
Traditionally, OpenSearch has relied on keyword matching for search result ranking. From a high level, these ranking techniques work by scoring documents based on the relative frequency of occurrences of the terms in the document compared with the other documents in the index. One shortcoming of this approach is that it can fail to understand the surrounding context of the term in the search.
With recent advancements in natural language understanding, language models have become very adept at deriving additional context from sentences or passages. In search, the field of dense neural retrieval (referred to as neural search) has sprung up to take advantage of these advancements (here is an interesting paper on neural search in open-domain question answering). The general idea of dense neural retrieval is, during indexing, to pass the text of a document to a neural-network-based language model, which produces dense vector(s), and to index these dense vectors into a vector search index. Then, during search, the text of the query is passed to the model, which again produces a dense vector, and a k-NN search is executed with this dense vector against the dense vectors in the index.
For OpenSearch, we have created (or are currently creating) several building blocks to support dense neural retrieval: fast and effective k-NN search can be achieved using the Approximate Nearest Neighbor algorithms exposed through the k-NN plugin; transformer based language models will be able to be uploaded into OpenSearch and used for inference through ml-commons.
However, given that these are building blocks, setting them up to achieve dense neural retrieval can be complex. For example, to use k-NN search, you need to create your vectors somewhere. To use ml-commons neural-network support, you need to create a custom plugin.
In all use cases, we assume that the user knows what language model they want to use to vectorize their text data.
The user understands how they want their documents structured and which fields they want vectorized. From there, they want an easy way to provide the text to be vectorized to OpenSearch and then not have to work with vectors directly for the rest of the process (indexing or search).
User wants OpenSearch to handle all vectorization during indexing. However, for search, to minimize latencies, they want to generate vectors offline and build their own queries directly.
User already has index configured for search. However, they want OpenSearch to handle vectorization during search.
We will create a new OpenSearch plugin that will lower the barrier of entry for using neural search within the OpenSearch ecosystem. The plugin will host all functionality needed to provide neural search, including ingestion APIs/tools and also search APIs/tools. The plugin will rely on ml-commons for model management (i.e. upload, train, delete, inference). Initially, the plugin will focus on automatic vectorization of documents during ingestion as well as a custom query type API that vectorizes a text query into a k-NN query. The high level architecture will look like this:
For indexing, the plugin will provide a custom ingestion processor that will allow users to convert text fields into vector fields during ingestion. For search, the plugin will provide a new query type that can be used to create a vector from user provided query text.
The document ingestion will be implemented as an ingestion processor, which can be included in customer-defined ingestion pipelines.
The processor definition interface will look like this:
PUT _ingest/pipeline/pipeline-1
{
"description": "text embedding pipeline",
"processors": [
{
"text_embedding": {
"model_id": "model_1",
"field_map": {
"title": "title.knn",
"body": "body.knn"
}
}
}
]
}
model_id — the ID of the model used for text embedding
field_map — the mapping of input/output fields. The output field will hold the embedding of each input field.
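For illustration, a document ingested through the pipeline above (index name and values are placeholders) would get embeddings written to the mapped output fields, title.knn and body.knn:
PUT my-index/_doc/1?pipeline=pipeline-1
{
  "title": "Hello world",
  "body": "A short passage about greetings"
}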
In addition to the ingestion processor, we will provide a custom query type, “neural”, that will translate user provided text into a k-NN vector query using a user provided model_id.
The neural query can be used with the search API. The interface will look like
GET <index_name>/_search
{
"size": int,
"query": {
"neural": {
"<vector_field>": {
"query_text": "string",
"model_id": "string",
"k": int
}
}
}
}
vector_field — Field to execute the k-NN query against
query_text — (string) Query text used to produce the vector query
model_id — (string) ID of the model used to encode the query text into a vector
k — (int) Number of results to return from the k-NN search
Further, the neural query type can be used anywhere in the query DSL. For instance, it can be wrapped in a script_score and used in a boolean query with a text matching query:
GET my_index/_search
{
"query": {
"bool" : {
"filter": {
"range": {
"distance": { "lte" : 20 }
}
},
"should" : [
{
"script_score": {
"query": {
"neural": {
"passage_vector": {
"query_text": "Hello world",
"model_id": "xzy76xswsd",
"k": 100
}
}
},
"script": {
"source": "_score * 1.5"
}
}
},
{
"script_score": {
"query": {
"match": { "passage_text": "Hello world" }
},
"script": {
"source": "_score * 1.7"
}
}
}
]
}
}
}
In the future, we will explore different ways to combine scores between BM25 and k-NN (see related discussion).
We appreciate any and all feedback the community has.
Specifically, we are particularly interested in information around the following topics
In order to develop against k-NN, we will need to depend on its produced jar artifact.
This task involves setting up the Neural Search plugin for development.
Received Error: Error building neural-search, retry with: ./build.sh manifests/2.4.0/opensearch-2.4.0.yml --component neural-search.
The distribution build for neural-search has failed.
Please see build log at https://build.ci.opensearch.org/job/distribution-build-opensearch/6267/consoleFull
As part of the discussion around implementing an organization-wide testing policy, I am visiting each repo to see what tests they currently perform. I am conducting this work on GitHub so that it is easy to reference.
Looking at the Neural Search repository, it appears there is:
Repository | Unit Tests | Integration Tests | Backwards Compatibility Tests | Additional Tests | Link |
---|---|---|---|---|---|
Neural Search | | | | Certificate of Origin, Link Checker, Benchmarking Tool (in progress) | #124 |
I don't see any requirements for code coverage in the testing documentation. If there are any specific requirements, could you respond to this issue to let me know?
If there are any tests I missed, or anything you think all repositories in OpenSearch should have for testing, please respond to this issue with details.
The ask of this issue is to run the CI actions at a regular frequency, for example daily. This will make sure that if any dependency breaks, we catch the error before releases.
Example:
Issues:
The integration test failed at distribution level for component neural-search
Version: 2.7.0
Distribution: tar
Architecture: arm64
Platform: linux
Please check the logs: https://build.ci.opensearch.org/job/integ-test/4482/display/redirect
* Steps to reproduce: See https://github.com/opensearch-project/opensearch-build/tree/main/src/test_workflow#integration-tests
* Access cluster logs:
- With security (if applicable)
- Without security (if applicable)
Note: An all-in-one test report manifest with all the details is coming soon. See opensearch-project/opensearch-build#1274
For the Normalization and Score Combination feature, we need a way to normalize the scores received for different sub-queries from different shards at the coordinator node, before we can start combining the scores. This needs to be done after the query phase is completed and before the fetch phase starts in a _search API call.
The solution I am proposing is to extend the search pipeline processors to create a new processor interface, called a search phase processor, that can run between the phases of search (there are many search phases: DFS, Query, Fetch, Expand, etc.).
As the first use case, we will create a new search phase processor that runs between the query and fetch phases, normalizes the scores, and combines them after normalization.
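A rough sketch of how such a processor might be declared in a search pipeline; the section and processor names below are placeholders, not a finalized interface:
PUT /_search/pipeline/normalization_pipeline
{
  "description": "Runs between the query and fetch phases",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": { "technique": "arithmetic_mean" }
      }
    }
  ]
}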
The alternative approaches are described in the section "Obtaining Relevant Information for Normalization and Score Combination" in #126.
Enable OpenSearch Benchmark in the Neural Search repo, so that we can do benchmarking for ingest, query, and relevance.
Example: https://github.com/opensearch-project/k-NN/tree/main/benchmarks/osb
Received Error: Error building neural-search, retry with: ./build.sh manifests/2.4.0/opensearch-2.4.0.yml --component neural-search.
The distribution build for neural-search has failed.
Please see build log at https://build.ci.opensearch.org/job/distribution-build-opensearch/6344/consoleFull
Coming from opensearch-project/opensearch-plugins#95, add Windows support.
As part of the Neural Search plugin, both ingestion and search will require an inference API to convert the text or query string into embeddings by calling the right model. This issue tracks the development of the embedding module that will provide the abstraction in the neural search plugin for calling ML APIs via the ML client.
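For reference, the abstraction would ultimately wrap an ml-commons prediction call; the raw REST equivalent (the model ID is a placeholder) looks like:
POST /_plugins/_ml/models/<model_id>/_predict
{
  "text_docs": ["today is sunny"]
}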
This issue describes the various high-level directions being proposed for score combination and normalization techniques to improve semantic search / neural search queries in OpenSearch (META ISSUE: #123). The proposals try to reuse existing OpenSearch extension points as much as possible. We also try to make sure that the directions we choose are long-term and provide different levels of customization for users to tweak semantic search to their needs. The document also proposes a phased design and implementation plan, which will help us improve and add new features with every phase.
For information on how normalization improves the overall quality of results, please refer to this OpenSearch blog on science benchmarks: https://opensearch.org/blog/semantic-science-benchmarks/
For simplicity, let us consider a 3-node OpenSearch cluster, with 2 data nodes and 1 coordinator node. The data nodes store the data, and the coordinator node helps run the request. This is how OpenSearch works at a very high level.
The customer here refers to OpenSearch customers who want to use OpenSearch and run semantic search in their applications.
Currently, OpenSearch uses a query-then-fetch model to do the search. First, the whole query object is passed to all the shards to obtain the top results from those shards; then the fetch phase begins, which fetches the relevant information for the documents. A typical semantic search use case involves 2 types of queries, a k-NN query and a text match query, which produce scores using different scoring methods.
So at first, it is difficult to combine result sets whose scores are produced via different scoring methods. In order to effectively combine results, the different queries' scores need to be put on the same scale (refer to https://arxiv.org/abs/2210.11934). By this, we mean that a score needs to meet 2 requirements: (1) it indicates the relative relevance between its document and the other documents scored in the query, and (2) it is comparable with the relative relevance of results from other queries. For example, for k-NN the score range may be 0-1, while BM25 scores range from 0 to Float.MAX. Hence any score-combining query clause, like bool, will suffer from these problems.
Second, it is not possible to consider global hits for re-ranking. Because scores are assigned at the shard level, any rescoring will be done at the shard level. Hence if we try to normalize the scores, the normalization will be local, not global.
Let's use the example below to understand the problem in more detail.
Using the same cluster setup as defined above, let's look at an example to understand the above 2 problems. For this example, assume we have an index whose schema looks like this:
PUT product-info
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"description": {
"type": "text"
},
"tile_and_descrption_knn": {
"type": "knn_vector",
"dimension": 768
}
}
},
"settings": {
"index": {
"refresh_interval": "-1",
"number_of_shards": "2",
"knn": "true",
"number_of_replicas": "0",
"default_pipeline": "text-embedding-trec-covid"
}
}
}
The title and description fields are the product title and description. The tile_and_descrption_knn field is a k-NN vector field that holds a 768-dimension dense vector created using a dense vector model.
Query
We are using the bool query clause to combine the results of k-NN (the neural query converted to a k-NN query) and a text-based search query. Bool query should clauses have their scores combined: the more matching clauses, the better.
POST product-info/_search
{
"query" : {
"bool": {
"should": [
{
"multi-match": {
"query": "sofa-set for living room",
"fields": ["tile", "description"]
}
},
{
"neural": {
"tile_and_descrption_knn": {
"query_text": "sofa-set for living room",
"model_id": "dMGrLIQB8JeDTsKXjXNn"
}
}
}
]
}
}
}
Scores Computation Examples Happening at different Levels
Combination using the query provided above: because the k-NN and BM-25 scores are on different scales, if one of the queries behaves badly (as for document d8 in the example), the document still comes back as the first result because of BM-25. The standard boolean combination does not take advantage of relative ranking. Documents like d7 remain lower in the results even when they have good scores in both BM-25 and k-NN. This problem becomes more pronounced because BM-25 scores are unbounded.
The way to solve this problem of joining scores from queries running on different scales is to first normalize the scores of both queries and then combine them. We can read about this here: https://arxiv.org/abs/2210.11934.
Using normalization: to see how normalization works, let's look at the table below. In it we have done 2 types of normalization: one done per shard/data node, and another at the coordinator node (global normalization).
Note: I did the normalization using the scores present here, hence some documents have 0 scores after normalization.
Final sorted documents to be returned based on the above examples:
If we focus on document d8, we can see how it changes position without normalization, with local normalization, and with global normalization. Since local normalization only considers the scores at the per-shard level, it can suffer when one of the scores is the lowest. But with global normalization (i.e., normalization at the coordinator node level), since we look at scores from the whole corpus, this problem is smoothed out, because more bad scores can come from other shards. We ran experiments to verify this.
Functional Requirements:
Good to Have but not as P0:
If we look at the requirements, we see that we need solutions at different levels of the whole search API flow. We can divide the whole flow into 3 parts:
The proposal for the input is to use the _search API and define a new compound query clause. This new compound query clause will hold the array of queries to be executed in parallel at the data-node level. The name of the new query clause is not yet decided. The interface of this new query clause is inspired by the dis_max query clause, but dis_max runs its queries sequentially; the new query clause will make sure that scores are calculated at the shard level independently for each sub-query. Sub-query rewriting will be done at the coordinator level to avoid duplicate computations.
Note: The interfaces defined here are not finalized. They will be refined as part of the LLD GitHub proposals. But we first want to make sure that we align ourselves on the high-level approach.
POST <index-name>/_search
{
"query": {
"<new-compound-query-clause>": {
"queries": [
{ /* neural query */ }, // added as an example
{ /* standard text search */ } // if a user wants to boost or adjust scores, it must be done within this sub-query clause
],
... other things to be added and will come as part of next sections
}
}
}
Pros:
Cons:
Alternative-1: Implement a new REST handler instead of creating a new compound query
The idea here is to create a new REST handler that defines the list of queries whose scores need to be normalized and combined.
Pros:
Cons:
This section talks about how OpenSearch will get the relevant information required for doing the normalization. For example, say the customer has defined min-max normalization; then for every sub-query we will need the min and max score over the whole corpus.
During the query phase, OpenSearch uses a QueryPhaseSearcher class to run the query and collect the documents at the shard level using the TopDocsCollector interface. There is no extension point in QueryPhaseSearcher to plug in a different TopDocsCollector implementation; the only extension we have is that a plugin can define a new QueryPhaseSearcher implementation. So we will define a new QueryPhaseSearcher implementation that uses a new TopDocsCollector implementation at the shard level to gather the relevant information for normalization.
Pros:
Cons:
Alternative-1: Enhance the DFS query search type, or create a new query search type, to get information for doing normalization
The default search type in OpenSearch is query-then-fetch, which first queries the results and then fetches the actual source from the shards. DFS query-then-fetch is another search type that the customer can set as a query param to change the search type. In DFS query-then-fetch, OpenSearch will first pre-query each shard, asking about term and document frequencies, and send this information to all the shards, where scores are calculated using the global term/document frequencies.
We can build something similar, where we do a pre-query to find the min/max scores from all the shards and then pass this information along so each shard can normalize the scores for each sub-query.
Pros:
Cons:
The idea here is to extend the search pipeline request and response transformers to create another type of transformer that is called after the query phase is completed. We will use this transformer interface to do the normalization and score combination for the document IDs returned from the query phase, as per the user's input. The transformed result will then be passed on to the fetch phase, which will run as-is.
Below is a modified version of the API input proposed above. It adds the relevant fields for normalization and score combination.
Note: The interfaces defined here are not finalized. The interfaces will be refined as part of LLD github proposals. But we want to make sure that we align ourselves with high level approach.
PUT /_search_processing/pipeline/my_pipeline
{
"description": "A pipeline that helps in doing the normalization",
"<in-between-query-fetch-phase-processor>": [
{
"normalizaton-processor": {
// we can bring in the normalization info from _search api to this place if required
// It will be discussed as part of LLD.
}
}
]
}
POST <index-name>/_search?pipeline=my_pipeline
{
"query": {
"<new-compound-query-clause>": {
"queries": [
{ /* neural query */ }, // this is added for an example
{ /* standard text search */ } // If a user want to boost some scores or update
// the scores he need to go ahead and do it in this query clause
],
// The below items be a part of processor also
"normalization-technique" : "min-max" // min-max etc.., Optional Object
"combination" : {
"algorithm" : "harmonic-mean", // all the values defined in #3 above, interleave, harmonic mean etc
"parameters" : {
// list of all the parameters that can be required for above algo
"weights" : [0.4, 0.7] // a floating pointing array which can be used in the algorithm
}
}
}
}
}
Alternative-1: Create a new phase between the query and fetch phases
The high-level idea here is to create a phase that runs between the query and fetch phases and does the normalization.
Pros:
Cons:
Alternative-2: Create a fetch subphase that does the normalization and score combination
OpenSearch provides an extension point where plugins can add fetch subphases that run at the end, after all core subphases have executed. We could create a fetch subphase that does the normalization and score combination. The problem with this: since we have multiple sub-queries, we would need to change the interfaces to make sure all the information required for normalization is passed in. This would result in duplicated information and multiple hacks to pass data through the earlier fetch phases.
Pros:
Cons:
Alternative-3: Use the SearchOperationListeners interface
SearchOperationListener runs at the shard level, not the coordinator node level (see the linked code references). Since we need the normalization to be done at the coordinator node level, we cannot use SearchOperationListeners.
Based on the above 3 directions, below is the high-level flow diagram. The 2 sub-queries are provided as an example. There will be a limit on how many sub-queries a customer can define in the query clause; we will keep this maximum at 10 (there is no specific reason for this limit, we just want limits imposed to avoid long-running queries leading to circuit breaker and cluster failures).
**There can be many sub-queries in the new compound query.**
The proposed design doesn't support paginated queries in the first phase, to reduce the scope of the phase-1 launch. We also have not done a deep dive on how this could be implemented or on the current pagination solution.
With the phase-1 implementation of the new query we won't provide the explain functionality for the query clause. The Explain API provides information about why a specific document matches (or doesn't match) a query; it is a very useful API for customers to understand and debug results.
The idea here is to enable parallel search for all the different queries provided in the compound query to improve its performance. Parallel search on segments is already in the sandbox for OpenSearch (https://github.com/opensearch-project/OpenSearch/tree/main/sandbox/plugins/concurrent-search).
The initial proposal provides customers only a fixed set of functions to combine the scores. With script-based combination, customers could define custom scripts to combine the scores before we rank them.
The idea here is that the new compound query clause can become overwhelming, hence we want to integrate it with different query-writing helpers like Querqy to facilitate easy query writing for customers.
Below is a high-level phased approach for building the feature. These phases are not set in stone and may change as we start making progress in the implementation.
Given that we are defining a new compound query clause for OpenSearch, we will launch the features defined in this document and the high-level design under a feature flag. High-level items:
Phase-2 will focus on these items:
Phase-3 will focus on these items:
By this time we will have a good understanding of how customers are using this new compound query. This phase will start to focus on how we can make it easier for customers to start using this new query clause. The items below help us do that:
Yes, Elasticsearch supports this feature, but it only combines the results of a k-NN query with a text match query; it is not generic. Also, ES doesn't support normalizing scores globally. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/master/knn-search.html#_combine_approximate_knn_with_other_features.
As per the documentation, the limitation is:
Approximate kNN search always uses the dfs_query_then_fetch search type in order to gather the global top k matches across shards. You cannot set the search_type explicitly when running kNN search.
POST image-index/_search
{
"query": {
"match": {
"title": {
"query": "mountain lake",
"boost": 0.9
}
}
},
"knn": {
"field": "image-vector",
"query_vector": [54, 10, -2],
"k": 5,
"num_candidates": 50,
"boost": 0.1
},
"size": 10
}
After discussion, this is a very valid use case, but we don't have enough data to prove this hypothesis, and it will really depend on the customer's use case. I would suggest starting with normalization on all the sub-queries, since the blog referenced above shows that we should normalize all sub-queries. This is also not a one-way door.
We will have a min score from the different shards, but as of now we don't have a way to find the global min score for k-NN queries; to do that we would need to run exact k-NN, and I am trying to find a way to do this. For text matching, since we iterate over all the segments, we will have the min score. But more deep-diving is required on the feasibility of this solution.
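For context, a sketch of what an exact k-NN pass could look like using the k-NN plugin's script scoring (index, field, and query vector below are placeholders); having to score every document this way is exactly what makes it infeasible at scale:
POST my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "my_vector_field",
          "query_value": [1.0, 2.0, 3.0],
          "space_type": "l2"
        }
      }
    }
  }
}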
Normalization is a data transformation process that aligns data values to a common scale or distribution of values.
Normalization requires that you know, or can accurately estimate, the minimum and maximum observable values. You may be able to estimate these values from your available data.
1. y = (x - min) / (max - min)
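For example, with min = 0.2 and max = 0.9, a raw score x = 0.55 normalizes to y = (0.55 - 0.2) / (0.9 - 0.2) = 0.5.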
Now you have 2 or more result sets, and you need to combine them. There are many ways to combine the results (geometric mean, arithmetic mean, etc.).
Approach 1: Normalized arithmetic mean
Assume we have 2 sets of results, results_a and results_b. Each result has a score and a document ID. First, we will only consider the intersection of results in a and b (i.e., results_c = results_a ∩ results_b). Then, each document in results_c will have 2 scores: one from a and one from b. To combine the scores, we will first normalize all scores in results_a and results_b, and then take the arithmetic mean of them:
score = (norm(score_a) + norm(score_b)) / 2
Approach 2: Normalized geometric mean
Similar to Approach 1, but instead of taking the arithmetic mean, we will take the geometric mean:
score = sqrt(norm(score_a) * norm(score_b))
Approach 3: Normalized harmonic mean
Similar to Approach1, but instead of taking the arithmetic mean, we will take the harmonic mean:
score = 2 / (1/norm(score_a) + 1/norm(score_b))
Approach 4: Normalized Weighted Linear Combination
Instead of taking the mean of the scores, we can just try different weights for each score and combine them linearly.
score = w_a * norm(score_a) + w_b * norm(score_b)
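For instance, with w_a = 0.4, w_b = 0.6, norm(score_a) = 0.5, and norm(score_b) = 0.8, the combined score is 0.4 * 0.5 + 0.6 * 0.8 = 0.68.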
Approach 5: Normalized Weighted Geometric Combination
Similar to above approach, but instead of combining with addition, we can combine with multiplication:
score = log(1 + w_a * norm(score_a)) + log(1 + w_b * norm(score_b))
This approach has previously been recommended for score combination with OpenSearch/Elasticsearch: elastic/elasticsearch#17116 (comment).
Approach 6: Interleave results
In this approach, we will produce the ranking by interleaving the results from each set together. So rankings 1, 3, 5, ... would come from results_a and 2, 4, 6, ... would come from results_b.
Related to opensearch-project/ml-commons#688, our tests fail because the default value of the "plugins.ml_commons.only_run_on_ml_node" setting is true.
We will need to update it before running tests with:
PUT /_cluster/settings
{
"persistent" : {
"plugins.ml_commons.only_run_on_ml_node" : false
}
}
The integration test failed at distribution level for component neural-search
Version: 2.7.0
Distribution: deb
Architecture: arm64
Platform: linux
Please check the logs: https://build.ci.opensearch.org/job/integ-test/4544/display/redirect
* Steps to reproduce: See https://github.com/opensearch-project/opensearch-build/tree/main/src/test_workflow#integration-tests
* Access cluster logs:
- With security (if applicable)
- Without security (if applicable)
Note: An all-in-one test report manifest with all the details is coming soon. See opensearch-project/opensearch-build#1274
Before querying with neural search, all documents should be ingested in the form of embedded vectors. Leaving the embedding process to the user offline would raise the bar for use. Thus we propose to implement an ingestion pipeline for document embedding. It consists of two parts: a model_id will be dedicated to each processor, and this kind of mapping can be retrieved by the querying phase.
A new processor type needs to be created with the name text_embedding. This processor has two parameters: model_id and field_map. Different models can produce different embedding results, so the user can specify a model_id that is already uploaded and loaded. field_map is the configuration where the user specifies which fields text embedding should be applied to in the ingestion pipeline. An example below:
PUT _ingest/pipeline/text-embedding-pipeline
{
"description": "Text embedding pipeline for several fields",
"processors": [
{
"text_embedding": {
"model_id": "WYjkv4MBHcWxVq8Jtc8U",
"field_map": {
"title": "title_knn",
"body_list": "body_list_knn",
"favorites": {
"game": "game_knn",
"movie": "movie_knn"
}
}
}
}
]
}
This issue proposes a new Processor interface for the Search Pipeline which will run between the phases of a search. This will allow plugins to transform the results retrieved from one phase before they go to the next phase, at the coordinator node level.
This RFC proposes a new set of APIs to manage processors that transform search requests and responses in OpenSearch. The Search Pipeline will be used to create and define these processors. Example:
// Create/update a search pipeline.
PUT /_search/pipeline/mypipeline
{
"description": "A pipeline to apply custom synonyms, result post-filtering, an ML ranking model",
"request_processors" : [
{
"external_synonyms" : {
"service_url" : "https://my-synonym-service/"
}
},
{
"ml_ranker_bracket" : {
"result_oversampling" : 2, // Request 2 * size results.
"model_id" : "doc-features-20230109",
"id" : "ml_ranker_identifier"
}
}
],
"response_processors" : [
{
"result_blocker" : {
"service_url" : "https://result-blocklist-service/"
},
"ml_ranker_bracket" : {
// Placed here to indicate that it should run after result_blocker.
// If not part of response_processors, it will run before result_blocker.
"id" : "ml_ranker_identifier"
}
}
]
}
// Return identifiers for all search pipelines.
GET /_search/pipeline
// Return a single search pipeline definition.
GET /_search/pipeline/mypipeline
// Delete a search pipeline.
DELETE /_search/pipeline/mypipeline
Search API Changes
// Apply a search pipeline to a search request.
POST /my-index/_search?search_pipeline=mypipeline
{
"query" : {
"match" : {
"text_field" : "some search text"
}
}
}
// Specify an ad hoc search pipeline as part of a search request.
POST /my-index/_search
{
"query" : {
"match" : {
"text_field" : "some search text"
}
},
"pipeline" : {
"request_processors" : [
{
"external_synonyms" : {
"service_url" : "https://my-synonym-service/"
}
},
{
"ml_ranker_bracket" : {
"result_oversampling" : 2, // Request 2 * size results
"model_id" : "doc-features-20230109",
"id" : "ml_ranker_identifier"
}
}
],
"response_processors" : [
{
"result_blocker" : {
"service_url" : "https://result-blocklist-service/"
},
"ml_ranker_bracket" : {
// Placed here to indicate that it should run after result_blocker.
// If not part of response_processors, it will run before result_blocker.
"id" : "ml_ranker_identifier"
}
}
]
}
}
// Set default search pipeline for an existing index.
PUT /my-index/_settings
{
"index" : {
"default_search_pipeline" : "my_pipeline"
}
}
// Remove default search pipeline for an index.
PUT /my-index/_settings
{
"index" : {
"default_search_pipeline" : "_none"
}
}
// Create a new index with a default search pipeline.
PUT my-index
{
"mappings" : {
// ...index mappings...
},
"settings" : {
"index" : {
"default_search_pipeline" : "mypipeline",
// ... other settings ...
}
}
}
For the Normalization and Score Combination feature, we need a way to normalize the scores received for different sub-queries from different shards at the coordinator node, before we can start combining the scores. This needs to be done after the Query phase is completed and before the Fetch phase is started in a _search API call.
The proposed solution is to extend the Search Pipeline Processor interface to create a new processor interface that can run between phases. We will onboard normalization as the first use case for this processor, which will run after the Query phase and before the Fetch phase of a search request.
The flow chart (not reproduced here) assumes that the processor runs after the Query phase and before the Fetch phase for search_type=query_then_fetch, which is the default search type. But none of the interfaces assume that these are the only 2 phases in OpenSearch.
interface SearchPhaseResultsProcessor extends Processor {

    <Result extends SearchPhaseResult> void process(
        final SearchPhaseResults<Result> results, final SearchPhaseContext context);

    /**
     * This function is called by the Search Pipeline Service before invoking the processor.
     */
    <Result extends SearchPhaseResult> boolean shouldRunProcessor(
        final SearchPhaseResults<Result> results,
        final SearchPhaseContext context,
        final SearchPhaseNames beforePhase,
        final SearchPhaseNames nextPhase);
}
/**
 * Currently when we create phases we pass a string as the phase name. This enum
 * will allow us to define the phase names and use them in different places when required.
 */
// mark internal
public enum SearchPhaseNames {
    // There are many more; only a few are added here.
    QUERY_PHASE("query"), FETCH_PHASE("fetch"), DFS_QUERY_PHASE("dfs_query");

    @Getter
    private final String name;

    SearchPhaseNames(final String name) {
        this.name = name;
    }
}
The SearchPhaseNames enum will provide the necessary abstraction and a proper naming convention for OpenSearch phase names.
Pros:
Cons:
// Create/update a search pipeline.
PUT /_search/pipeline/my_pipeline
{
"description": "A pipeline that adds a Normalization and Combination Transformer",
"phase_results_processors" : [
{
"normalization-processor" : {
"technique": "min-max", // there can be others techniques. I know this only for now.
}
}
]
}
// All other APIs remain the same, e.g. for making this pipeline the default pipeline.
The idea here is that the Search Pipeline Service, before invoking the execute function of a processor, will check whether the condition to run the processor is met. It does this by calling the getBeforePhases and getAfterPhases functions and validating that the phase which has just completed is in the after-list and that the next phase to run is in the before-list (see the sketch after the code below).
interface SearchPhaseResultsProcessor extends Processor {

    <Result extends SearchPhaseResult> void process(
        final SearchPhaseResults<Result> results, final SearchPhaseContext context);

    /**
     * Returns a list of phases, before which this processor should be run.
     */
    List<SearchPhaseNames> getBeforePhases();

    /**
     * Returns a list of phases, after which this processor should be run.
     */
    List<SearchPhaseNames> getAfterPhases();
}
/**
 * Currently when we create phases we pass a string as the phase name. This enum
 * will allow us to define the phase names and use them in different places when required.
 */
public enum SearchPhaseNames {
    // There are many more; only a few are added here.
    QUERY_PHASE("query"), FETCH_PHASE("fetch"), DFS_QUERY_PHASE("dfs_query");

    @Getter
    private final String name;

    SearchPhaseNames(final String name) {
        this.name = name;
    }
}
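As mentioned above, a hypothetical sketch of the check the Search Pipeline Service could perform (the method and parameter names are assumptions based on the interface above, not existing OpenSearch code):
static boolean shouldRun(final SearchPhaseResultsProcessor processor,
                         final SearchPhaseNames completedPhase,
                         final SearchPhaseNames nextPhase) {
    // The phase that just completed must be in the processor's "after" list,
    // and the phase about to run must be in its "before" list.
    return processor.getAfterPhases().contains(completedPhase)
        && processor.getBeforePhases().contains(nextPhase);
}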
The Search Operation Listeners work at the shard level, not at the coordinator node level, and we need to do the normalization at the coordinator node; hence this solution is rejected. Please check this, and this code reference: it comes in the code path when the query is executed at the shard level.
The high-level idea here is to create a phase which runs in between the Query and Fetch phases and does the normalization.
Pros:
Cons:
OpenSearch provides an extension point where plugins can add fetch subphases which run at the end, after all core subphases are executed. We could create a fetch subphase that does the normalization and score combination. The problem with this is that, as we have multiple sub-queries, we would need to change the interfaces to make sure all the information required for normalization is passed in. This would result in duplicated information and multiple hacks to pass data through the earlier fetch phases.
Pros:
Cons:
Minor issue: https://github.com/opensearch-project/neural-search#opensearch-neural-search has a link to geospatial.
The XContent namespace refactor from common -> core is going to be merged to opensearch/2.x, which will break the 2.x build. This issue is for refactoring XContent imports from the common to the core namespace after the core namespace change is merged.
Depends on opensearch-project/OpenSearch#6470
Release Version 2.4.0
This is a component issue for 2.4.0.
Coming from opensearch-build#2649. Please follow the following checklist.
Please refer to the DATES / CAMPAIGNS in that post.
This issue captures the state of the OpenSearch release, on component/plugin level; its assignee is responsible for driving the release. Please contact them or @mention them on this issue for help.
Any release related work can be linked to this issue or added as comments to create visibility into the release status.
There are several steps to the release process; these steps are completed for the component release as a whole, and components that are behind present risk to the release. The component owner resolves the tasks in this issue and communicates with the overall release owner to make sure each component is moving along as expected.
Steps have completion dates for coordinating efforts between the components of a release; components can start as soon as they are ready far in advance of a future release. The most current set of dates is on the overall release issue linked at the top of this issue.
Linked at the top of this issue, the overall release issue captures the state of the entire OpenSearch release, including references to this issue; the release owner (its assignee) is responsible for communicating the release status broadly. Please contact them or @mention them on that issue for help.
If including changes in this release, increment the version on the 2.4 branch to 2.4.0 for Min/Core, and 2.4.0.0 for components. Otherwise, keep the version number unchanged for both.
- Tag all features and issues targeted for this release with v2.4.0.
- Create the 2.4 branch.
- Verify all issues tagged 2.4.0 are complete.
- Add the 2.4.0 release branch in the distribution manifest.
- Verify all PRs for 2.4.0 have been merged.
Received Error: Error building neural-search, retry with: ./build.sh manifests/2.7.0/opensearch-2.7.0.yml --component neural-search.
The distribution build for neural-search has failed.
Please see build log at https://build.ci.opensearch.org/job/distribution-build-opensearch/7200/consoleFull
No
I would like to have a highlighter that supports the neural search capability. It should highlight the most relevant sentences in the documents returned by a neural search.
There are no alternatives available at the moment, so the only choice is to develop one.
I tried to implement it myself but faced the following challenges:
@Override
public HighlightField highlight(FieldHighlightContext fieldContext) {
    System.out.println("Query: " + fieldContext.context.query());
    return null; // just inspecting the query for now
}
@Override
public HighlightField highlight(FieldHighlightContext fieldContext) {
    System.out.println("highlighting..");
    List<Text> responses = new ArrayList<>();
    // Pseudocode step: get the query text from fieldContext.context.query()
    String queryText = extractQueryText(fieldContext.context.query()); // hypothetical helper
    // Pseudocode step: get the sentences from the search hit
    List<String> sentences = extractSentences(fieldContext); // hypothetical helper
    sentences.add(0, queryText); // sentences = query + sentences
    List<List<Float>> vectors = clientAccessor.inferSentences("U3R9CYcBOk2JRjrls0nH", sentences);
    List<Float[]> embeddings = new ArrayList<>();
    for (List<Float> v : vectors) {
        embeddings.add(v.toArray(new Float[0]));
    }
    System.out.println("Computing similarity");
    float maxSim = 0;
    String maxSentence = null;
    if (!embeddings.isEmpty()) {
        Float[] queryEmbedding = embeddings.get(0);
        for (int i = 1; i < embeddings.size(); i++) {
            float sim = cosineSim(queryEmbedding, embeddings.get(i));
            if (sim > maxSim) { // set maxSim and maxSentence
                maxSim = sim;
                maxSentence = sentences.get(i);
            }
        }
    }
    if (maxSentence != null) {
        responses.add(new Text(maxSentence));
    }
    return new HighlightField(fieldContext.fieldName, responses.toArray(new Text[0]));
}
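The snippet above also needs the cosineSim helper it references; a minimal sketch (plain cosine similarity, no plugin API involved):
static float cosineSim(final Float[] a, final Float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    // Cosine similarity: dot(a, b) / (|a| * |b|)
    return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
}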
Having said the above, I hope you can tell me what route to take here. Is this feature going to be available in the plugin any time soon?
Thanks
This is related to the customer-created GitHub issue: #109
With the following configuration, which uses a nested source field, embeddings are not computed; this should be supported:
PUT /_ingest/pipeline/neural_pipeline_nested
{
"description": "Neural Search Pipeline for message content",
"processors": [
{
"text_embedding": {
"model_id": "SXXx8YUBR2ZWhVQIkghB",
"field_map": {
"message.text": "message_embedding"
}
}
}
]
}
PUT /neural-test-index-nested
{
"settings": {
"index.knn": true,
"default_pipeline": "neural_pipeline_nested"
},
"mappings": {
"properties": {
"message_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"engine": "lucene"
}
},
"message.text": {
"type": "text"
},
"color": {
"type": "text"
}
}
}
}
POST /_bulk
{"create":{"_index":"neural-test-index-nested","_id":"0"}}
{"message":{"text":"Text 1"},"color":"red"}
{"create":{"_index":"neural-test-index-nested","_id":"1"}}
{"message":{"text":"Text 2"}, "color": "black"}
GET /neural-test-index-nested/_search
The field_map keys should support the . operator to define nested fields.
The customer can create a nested field mapping using:
PUT /_ingest/pipeline/neural_pipeline_nested
{
"description": "Neural Search Pipeline for message content",
"processors": [
{
"text_embedding": {
"model_id": "SXXx8YUBR2ZWhVQIkghB",
"field_map": {
"message": {
"text": "message_embedding"
}
}
}
}
]
}
Update the main branch of the plugin to point to 3.0.0 for OpenSearch and all dependent plugins.
When:
- using script_score with neural queries on multiple (different) vector fields, like in this comment,
- the script references _score, and
- explain=true is set,
then, if a document is returned by some neural field queries (within the sub-query's top-k) but not by others, the query fails with a script runtime exception and the error: Null score for the docID: 2147483647
(At least I think this is why... I'm new to OpenSearch and neural search, so apologies - my explanation for why this happens is just my best guess!)
To reproduce, set up an index with two vector fields, title_embedding and description_embedding, and run:
GET /myindex/_search?explain=true
{
"from": 0,
"size": 100,
"query": {
"bool" : {
"should" : [
{
"script_score": {
"query": {
"neural": {
"title_embedding": {
"query_text": "test",
"model_id": "xGbq_YcB3ggx1CR0Nfls",
"k": 10
}
}
},
"script": {
"source": "_score * 1"
}
}
},
{
"script_score": {
"query": {
"neural": {
"description_embedding": {
"query_text": "test",
"model_id": "xGbq_YcB3ggx1CR0Nfls",
"k": 10
}
}
},
"script": {
"source": "_score * 1"
}
}
}
]
}
}
}
See an error like:
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.opensearch.knn.index.query.KNNScorer.score(KNNScorer.java:51)",
"org.opensearch.script.ScoreScript.lambda$setScorer$4(ScoreScript.java:156)",
"org.opensearch.script.ScoreScript.get_score(ScoreScript.java:168)",
"_score * 1",
"^---- HERE"
],
"script": "_score * 1",
"lang": "painless",
"position": {
"offset": 0,
"start": 0,
"end": 10
}
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "opensearch_content",
"node": "vnyA5s-aQUOmTj6IHosYXA",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.opensearch.knn.index.query.KNNScorer.score(KNNScorer.java:51)",
"org.opensearch.script.ScoreScript.lambda$setScorer$4(ScoreScript.java:156)",
"org.opensearch.script.ScoreScript.get_score(ScoreScript.java:168)",
"_score * 1",
"^---- HERE"
],
"script": "_score * 1",
"lang": "painless",
"position": {
"offset": 0,
"start": 0,
"end": 10
},
"caused_by": {
"type": "runtime_exception",
"reason": "Null score for the docID: 2147483647"
}
}
}
]
},
"status": 400
}
Note the high size and low k. You might need to adjust the query_text or k to find a combination where a document is returned in one neural query's top k and not the other.
Remove explain=true from the query and notice it succeeds.
The _score for the affected field should be 0, or the affected field should be excluded entirely - either way, the _explanation should accurately reflect this.
Environment: OpenSearch 2.7, Ubuntu 22.04.
I'm not sure why it only happens with explain=true. (I can't explain it.)
It also only happens if using script_score. If using multiple neural queries directly, there is no error, but then there is no per-field score in _explanation - the total is correct, but each field's score value is reported as 1. opensearch-project/k-NN#875 describes this problem. My use case is: I'd like to try using the similarity scores of each field as features in a Learning to Rank model, which means I need to get each score individually.
In some edge cases, e.g. an instance going down, the inference requests for either ingestion or querying can fail; adding a retry can relieve this issue dramatically.
Add a basic retry mechanism in the neural search inference client.
An alternative is a more complicated retry with a backoff policy, jitter, etc., but for our system it is an internal retry, which means we know how the retry will behave (e.g. round robin or least load). So a basic retry is enough for our system, as sketched below.
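A minimal sketch of what such a basic retry could look like (the wrapper shape and retry count are assumptions, not the plugin's actual API):
import java.util.function.Supplier;

// Bounded retry with no backoff or jitter, per the reasoning above: the retry is
// internal, so the client's routing (round robin / least load) picks the next node.
static <T> T withRetry(final Supplier<T> call, final int maxRetries) {
    RuntimeException lastFailure = null;
    for (int attempt = 0; attempt < maxRetries; attempt++) {
        try {
            return call.get();
        } catch (final RuntimeException e) {
            lastFailure = e; // remember the failure and try again
        }
    }
    throw lastFailure != null ? lastFailure : new IllegalArgumentException("maxRetries must be >= 1");
}
For example, an ingestion-time inference call could be wrapped as withRetry(() -> clientAccessor.inferSentences(modelId, sentences), 3).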
NA
As a part of ingestion and search we depend on the ML Commons library to create the embeddings for the query string and user documents, and on the k-NN plugin for doing k-NN search. This task tracks the integration of the ML Commons library into the Neural Search plugin.
K-NN Plugin: https://github.com/opensearch-project/k-NN
ML Commons: https://github.com/opensearch-project/ml-commons
In 2.4.0, the OpenSearch k-NN plugin implemented filtering for the Lucene engine for OpenSearch queries: opensearch-project/k-NN#519.
Given that the neural query is a wrapper around the k-NN query type, we can add this feature into neural search as well. We would add a filter sub-object to our current query type, as sketched below.
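One possible shape for such a query, mirroring the existing neural query examples in this document; the filter placement and its pass-through to the underlying k-NN query are assumptions to be settled during implementation:
POST /my-index/_search
{
  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "wild west",
        "model_id": "<model_id>",
        "k": 100,
        "filter": {
          "term": { "color": "red" }
        }
      }
    }
  }
}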