o19s / elasticsearch-learning-to-rank

Plugin to integrate Learning to Rank (aka machine learning for better relevance) with Elasticsearch

Home Page: http://opensourceconnections.com/blog/2017/02/14/elasticsearch-learning-to-rank/

License: Apache License 2.0

Languages: Java 97.24%, Python 2.76%
Topics: elasticsearch, relevant-search, machine-learning, search-relevance, elasticsearch-plugin, elasticsearch-plugins

elasticsearch-learning-to-rank's People

Contributors

aprudhomme, ebernhardson, epugh, henrywallace, hronom, jackpf, jettro, jzonthemtn, mzaian, nathancday, ndkmath1, nomoa, pakio, philippus, richardknox, rtancman, saluev, schedutron, shibe, softwaredoug, sstults, styrmis, swonvip, taoyyu, thepanz, tmanabe, umeshdangat, worleydl, wrigleydan, xiaomeng-faire


elasticsearch-learning-to-rank's Issues

To avoid costly errors, validate features when they're added

You can have valid JSON but a semantically invalid query. For example, it's easy to accidentally include a top-level "query" key, as below, when the plugin expects the template to start at the "match" level:

 {
      "name": "user_rating",
      "type": "feature",
      "feature": {
         "name": "user_rating",
         "params": [],
         "template_language": "mustache",
         "template": {
            "query": {
               "match": {
                  "title": "{{keywords}}"
               }
            }
         }
      }
   }

Say this is feature "1" and you've built up a large feature set of 100 queries. Then you go to execute the feature set and get an error that the sltr query could not be executed.

As feature sets are append-only, your whole feature set is now broken and you have to start over. Is there any way to validate that this is a valid query (i.e. attempt to parse it) before letting the feature be created?
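
One possible shape for such a check: let the feature-creation request carry a validation block with a test index and template params, and have the plugin render and parse (or even execute) the query before storing the feature. The request below is purely illustrative; the validation block and its fields are not an existing API:

PUT _ltr/_feature/user_rating
{
   "validation": {
      "index": "tmdb",
      "params": {
         "keywords": "rambo"
      }
   },
   "feature": {
      "name": "user_rating",
      "params": ["keywords"],
      "template_language": "mustache",
      "template": {
         "match": {
            "title": "{{keywords}}"
         }
      }
   }
}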

Minor HTTP status code nitpicks in 1.0

I noted when rebuilding the demo that

  • Trying to GET a missing feature store (to check whether it exists) returns a 400 rather than a 404
  • PUT'ing a new feature store returns 200 rather than 201 Created (quick repro below)
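
Both are easy to reproduce against the default store (endpoints shown as I understand them; exact paths may differ by version):

# checking for a feature store that does not exist yet – currently 400, arguably should be 404
GET _ltr

# creating the default feature store – currently 200, arguably should be 201 Created
PUT _ltr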

Add full example for simple testing

It would be good to have a working example data set and a set of click data to serve as a simple smoke test for devs to run.

This is also mentioned in the current README; I just figured opening an issue would help track it.

Script evaluated differently in LTR?

I have a sample painless script which compares two static dates. If I run it inside script_fields I see the value I expect. However, if I run it as a script feature input to LTR, I get 1.

Baffled!

Test Setup:

POST _scripts/ranklib/dummy
{
  "script": "## LambdaMART\n## No. of trees = 1\n## No. of leaves = 10\n## No. of threshold candidates = 256\n## Learning rate = 0.1\n## Stop early = 100\n\n<ensemble>\n <tree id=\"1\" weight=\"0.1\">\n  <split>\n   <feature> 1 </feature>\n   <threshold> 0.45867884 </threshold>\n   <split pos=\"left\">\n    <feature> 1 </feature>\n    <threshold> 0.0 </threshold>\n    <split pos=\"left\">\n     <output> -2.0 </output>\n    </split>\n    <split pos=\"right\">\n     <output> -1.3413081169128418 </output>\n    </split>\n   </split>\n   <split pos=\"right\">\n    <feature> 1 </feature>\n    <threshold> 0.6115718 </threshold>\n    <split pos=\"left\">\n     <output> 0.3089442849159241 </output>\n    </split>\n    <split pos=\"right\">\n     <output> 2.0 </output>\n    </split>\n   </split>\n  </split>\n </tree>\n</ensemble>"
}  

POST /test/empty/
{}
    
GET /test/empty/_search
{
  "query": {
    "match_all": {}
  }
}

Query to Reproduce Error:

GET /test/empty/_search?explain=true
{
  "query": {
      "match_all": {}
  },
  "script_fields": {
      "days_between": {
          "script": {
              "params": {
                  "search_timestamp": "2017-03-23T00:00:00.000Z",
                  "compare_to": "2017-03-18T04:34:15.606Z"
              },
              "lang": "painless",
              "inline": "return ChronoUnit.DAYS.between(Instant.parse(params.compare_to), Instant.parse(params.search_timestamp))"
          }
      }
  },
  "rescore": {
      "query": {
          "rescore_query": {
              "ltr": {
                  "model": {
                      "stored": "dummy"
                  },
                  "features": [
                      {
                          "script": {
                              "_name": "days_between",
                              "script": {
                                "params": {
                                    "search_timestamp": "2017-03-23T00:00:00.000Z",
                                    "compare_to": "2017-03-18T04:34:15.606Z"
                                },
                                "lang": "painless",
                                "inline": "return ChronoUnit.DAYS.between(Instant.parse(params.compare_to), Instant.parse(params.search_timestamp))"
                            }
                          }
                      }
                  ]
              }
          }
      }
  }
}

For me (ES 5.3.0, plugin 0.1.0) the result comes out:

fields.days_between = [ 4 ]
_explanation.details[1].details[0].details[0].details[0].value = 1

Consider changing the error message/code when attempting to update model

Models are intended to be immutable. However, if you do try to update a model you get a fairly cryptic error.

Performing:

POST http://localhost:9200/_ltr/_featureset/movie_features/_createmodel

Returns 409 with

{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[store][model-test_9]: version conflict, document already exists (current version [1])","index_uuid":"WnhIFFEMTuyTwLXu-DVmxw","shard":"0","index":".ltrstore"}],"type":"version_conflict_engine_exception","reason":"[store][model-test_9]: version conflict, document already exists (current version [1])","index_uuid":"WnhIFFEMTuyTwLXu-DVmxw","shard":"0","index":".ltrstore"},"status":409}

I would propose a 405 with a helpful message such as "Models cannot be updated, please create a new model", or something similar.
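
Something along these lines (response body purely illustrative) would be far less cryptic:

{
   "error": {
      "type": "model_immutable_exception",
      "reason": "Models cannot be updated; delete the model and create a new one instead"
   },
   "status": 405
}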

Demonstrate xgboost usage

Apparently xgboost is the shiznit for training boosted trees. We should demonstrate how xgboost can be used with this plugin via the existing Ranklib XML method of specifying a LambdaMART model.

I would expect

  • scripts directory showing off xgboost
  • or another git repo demonstrating xgboost + ES w/ this plugin
  • or a helpful blog post on xgboost + ES

Can't log boosted sltr query

To address #69 (so I can log features for a set of ids), I added an sltr query with a boost of 0. Specifically:

GET tmdb/_search
{
    "explain": true, 
    "query": {
        "bool": {
            "should": [                
                {"sltr": {
                    "boost": 0,
                    "_name": "logged_featureset",
                    "featureset": "movie_features",
                    "params": {
                        "keywords": "rambo"
                    }
                }},
                {"match": {
                   "overview": 
                   {
                       "query": "rambo"
                   }
                }}
                ]
            }
    },
    "ext": {
        "ltr_log": {
            "log_specs": {
                "name": "log_entry1",
                "named_query": "logged_featureset"
            }
        }
    }
}

This gives the following error:

{
   "error": {
      "root_cause": [
         {
            "type": "illegal_argument_exception",
            "reason": "Query named [logged_featureset] must be a [sltr] query [BoostQuery] found"
         }
      ],
      "type": "search_phase_execution_exception",
      "reason": "all shards failed",
      "phase": "query",
      "grouped": true,
      "failed_shards": [
         {
            "shard": 0,
            "index": "tmdb",
            "node": "A1yBd5opScyEeVTitmkyCA",
            "reason": {
               "type": "illegal_argument_exception",
               "reason": "Query named [logged_featureset] must be a [sltr] query [BoostQuery] found"
            }
         }
      ]
   },
   "status": 400
}
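
One possible workaround (a sketch; it assumes the sltr query matches all documents when used for logging, so a filter clause would not restrict results, and it avoids the BoostQuery wrapper entirely) is to put the sltr query in a filter context rather than boosting it to 0:

GET tmdb/_search
{
    "query": {
        "bool": {
            "filter": [
                {"sltr": {
                    "_name": "logged_featureset",
                    "featureset": "movie_features",
                    "params": {
                        "keywords": "rambo"
                    }
                }}
            ],
            "should": [
                {"match": {
                    "overview": {
                        "query": "rambo"
                    }
                }}
            ]
        }
    },
    "ext": {
        "ltr_log": {
            "log_specs": {
                "name": "log_entry1",
                "named_query": "logged_featureset"
            }
        }
    }
}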

Change logging so that it doesn't refer to an sltr query

Consider "offline" logging use cases where a user batches a set of identifiers and simply wants the scores for each feature for a set of document identifiers (this is what happens currently in the demo). The current logging API expects to find an sltr query in the body.

At first blush, I would prefer a logging interface closer to:

GET tmdb/_search
{
    "query": {
        "terms": {
             "id": ["1234", "5678"]
        }
    },
    "ext": {
        "ltr_log": {
            "log_specs": {
                "featureset": "my_feature_set"
            }
        }
    }
}

This would log "my_feature_set" for the returned documents.

This seems more flexible than the current logging interface in the 1.0 branch, as it would support several logging use cases.

If you forget _ before a feature or feature set, you can create insidious/confusing bugs

I was about to create an issue with the list of problems below. Notice the subtle problem: I accidentally created a feature store named "feature" when I had intended to create a feature in the default feature store, then was confused about why I couldn't find my feature. (Note: some of the items below may still be bugs; I'm going to create separate issues.)

May I suggest that the following be blacklisted as feature store names?

  • feature
  • featureSet
  • feature_set
  • featureset
  • feature*
  • (others?)


==================
Original bug I was about to file

Attempting to use the CRUD API to create/list features, I encounter a number of bugs (using the latest 1_0 branch):

Features don't appear to be listed

PUT _ltr/feature
{
  "name": "foo",
  "params": ["query_string"],
  "template_language": "mustache",
  "template" : {
    "match": {
      "field": "{{query_string}}"
    }
  }
}

GET _ltr/_feature?prefix=foo

The latter GET returns 0 results for me. Additionally, the simple features "1" and "2" that I create for my demo when creating the feature set are not returned either.

Start not recognized when retrieving features or feature sets

Running the example from the docs

GET /_ltr/_featureset?prefix=set&start=20&size=30

Gives an error about start:

{
   "error": {
      "root_cause": [
         {
            "type": "illegal_argument_exception",
            "reason": "request [/_ltr/_featureset] contains unrecognized parameter: [start]"
         }
      ],
      "type": "illegal_argument_exception",
      "reason": "request [/_ltr/_featureset] contains unrecognized parameter: [start]"
   },
   "status": 400
}

Error about "store"

After running the code above, I attempt to retrieve "foo" and get the following error:

GET _ltr/feature/foo
{
   "error": {
      "root_cause": [
         {
            "type": "illegal_argument_exception",
            "reason": "request [/_ltr/feature/foo] contains unrecognized parameter: [store]"
         }
      ],
      "type": "illegal_argument_exception",
      "reason": "request [/_ltr/feature/foo] contains unrecognized parameter: [store]"
   },
   "status": 400
}

The same error happens if I attempt …

Cannot append to existing feature set

After creating foo above, I attempt to append to my existing feature set

POST /_ltr/_featureset/movie_features/_addfeatures/foo

Returns

{
   "error": {
      "root_cause": [
         {
            "type": "illegal_argument_exception",
            "reason": "The feature query [foo] returned no features"
         }
      ],
      "type": "illegal_argument_exception",
      "reason": "The feature query [foo] returned no features"
   },
   "status": 400
}

Query validation

Can we use the existing validation endpoint?
Should or can we make sure that nothing is deprecated?

Scripted feature queries returning a value > 1 are passed to the LTR reranker as 1.0

To reproduce, create the following index:

PUT /rando

PUT /rando/_mapping/fortune 
{
  "properties": {
    "msg": {
      "type": "text"
    },
    "lucky_number": {
      "type": "float"
    }
  }
}

PUT /rando/fortune/1
{
  "msg": "Be patient: in time, even an egg will walk.",
  "lucky_number": 0.9
}

PUT /rando/fortune/2
{
  "msg": "Let the deeds speak.",
  "lucky_number": 2.2
}

PUT /rando/fortune/3
{
  "msg": "Digital circuits are made from analog parts.",
  "lucky_number": 3.3
}

GET /rando/_search
{
  "query": {
    "match_all": {}
  }
}

Load the following model (with a lucky_number split threshold of 0.99):

POST _scripts/ranklib/testmodel
{
  "script": "## LambdaMART\n## No. of trees = 1\n## No. of leaves = 2\n## No. of threshold candidates = 1\n## Learning rate = 0.1\n## Stop early = 100\n\n<ensemble><tree id=\"1\" weight=\"0.1\"><split><feature> 1 </feature><threshold> 0.99 </threshold><split pos=\"left\"><output>5</output></split><split pos=\"right\"><output>10</output></split></split></tree></ensemble>"
}

And run the scoring query

GET /rando/_search
{
    "query": {
        "ltr": {
            "model": {
                "stored": "testmodel"
            },
            "features": [{
                "script": {
                  "script": {
                    "lang": "expression",
                    "inline": "doc['lucky_number']"
                  }
                }
            }]
        }
    },
    "script_fields": {
      "1": {
        "script" : {
          "lang": "expression",
          "inline" : "doc['lucky_number']"
        }
      }
    },
    "_source":true
}

As you'd expect, fortune-1 takes the left split, and fortune-2 and fortune-3 take the right split.

Now reload the same model but modify the lucky_number split threshold to be 2.5

POST _scripts/ranklib/testmodel
{
  "script": "## LambdaMART\n## No. of trees = 1\n## No. of leaves = 2\n## No. of threshold candidates = 1\n## Learning rate = 0.1\n## Stop early = 100\n\n<ensemble><tree id=\"1\" weight=\"0.1\"><split><feature> 1 </feature><threshold> 2.5 </threshold><split pos=\"left\"><output>5</output></split><split pos=\"right\"><output>10</output></split></split></tree></ensemble>"
}

And re-run the scoring query above.

(Expectation: fortune-1 and fortune-2 take the left split, while fortune-3 takes the right split.)

However, all three end up taking the left split, despite the fact that 3.3 > 2.5.

(This behavior seems to indicate that RankLib is receiving min(script_computed_value, 1.0) ... as opposed to the explicit script_computed_value)

Note: I validated this standalone with RankLib directly and the test as structured above was successful.
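
Until the truncation itself is fixed, one possible workaround (a sketch, assuming the clipping happens where the feature score is handed to the ranker) is to keep scripted features within [0, 1] by saturating the raw value, and to train the model against the same transformed values:

GET /rando/_search
{
    "query": {
        "ltr": {
            "model": {
                "stored": "testmodel"
            },
            "features": [{
                "script": {
                  "script": {
                    "lang": "expression",
                    "inline": "doc['lucky_number'] / (doc['lucky_number'] + 1)"
                  }
                }
            }]
        }
    }
}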

RankLib install

The documentation about installing the RankLib jar talks about using Maven, but the Maven command won't work without a pom. More instructions on installing the jar would be helpful.
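
For what it's worth, mvn install:install-file can install a standalone jar into the local repository without a project pom, which may be all the docs intend. The coordinates below are placeholders rather than the ones the build actually expects:

mvn install:install-file \
  -Dfile=RankLib-2.7.jar \
  -DgroupId=ciir.umass.edu \
  -DartifactId=RankLib \
  -Dversion=2.7 \
  -Dpackaging=jar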

Support Fewer Models (slim down to handful of Ranklib models)

Erik Bernhardson makes a good point that there are really only a few types of models you probably care about. Ranklib comes with all sorts of intermediate/weird models that most people probably don't care about.

Maximum Interpretability

  • Models, like the linear model, that are basically optimized boosts. Advantage: it's easy to interpret, debug, and understand these models

Maximum Flexibility / Prediction Power

  • LambdaMART/Boosted Tree model that does a great job of predicting the weird nooks and crannies of search relevance

Relatedly, we probably don't need to rely on Ranklib at all when we only care about these 2 or 3 models, and only the evaluation parts of those models at that.

Document v1.0 features

Update README and any other documentation to reflect v1.0 features. What other documentation do we need?

It would be helpful to output all feature values alongside matching documents in the response.

This would make the feature-value inputs available to be logged (as future training data) without having to re-evaluate each feature query as a script_field.

(Those weights could simply be output in the response)

E.g. Instead of having to do

GET /rando/_search
{
    "query": {
        "ltr": {
            "model": {
                "stored": "testmodel"
            },
            "features": [{
                "script": {
                  "script": {        
                    "lang": "expression",
                    "inline": "doc['lucky_number']"
                  }
                }
            }]
        }
    },
    "script_fields": {
      "1": {
        "script" : {
          "lang": "expression",
          "inline" : "doc['lucky_number']"
        }
      }
    },
    "_source":true
}

to get:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.5,
    "hits": [
      {
        "_index": "rando",
        "_type": "fortune",
        "_id": "2",
        "_score": 0.5,
        "_source": {
          "msg": "Let the deeds speak.",
          "lucky_number": 2.2
        },
        "fields": {
          "1": [
            2.2
          ]
        }
      },
      {
        "_index": "rando",
        "_type": "fortune",
        "_id": "1",
        "_score": 0.5,
        "_source": {
          "msg": "Be patient: in time, even an egg will walk.",
          "lucky_number": 0.9
        },
        "fields": {
          "1": [
            0.9
          ]
        }
      },
      {
        "_index": "rando",
        "_type": "fortune",
        "_id": "3",
        "_score": 0.5,
        "_source": {
          "msg": "Digital circuits are made from analog parts.",
          "lucky_number": 3.3
        },
        "fields": {
          "1": [
            3.3
          ]
        }
      }
    ]
  }
}

Could we generate the same output by running this (perhaps even incorporating query _names for clarity)?

GET /rando/_search
{
    "query": {
        "ltr": {
            "output_feature_values": true,
            "model": {
                "stored": "testmodel"
            },
            "features": [{
                "script": {
                  "_name": "lucky",
                  "script": {        
                    "lang": "expression",
                    "inline": "doc['lucky_number']"
                  }
                }
            }]
        }
    },
    "_source": true
}

to produce something like this?

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.5,
    "hits": [
      {
        "_index": "rando",
        "_type": "fortune",
        "_id": "2",
        "_score": 0.5,
        "_source": {
          "msg": "Let the deeds speak.",
          "lucky_number": 2.2
        },
        "feature_values": [{
          "lucky": 2.2
        }]
      },
      {
        "_index": "rando",
        "_type": "fortune",
        "_id": "1",
        "_score": 0.5,
        "_source": {
          "msg": "Be patient: in time, even an egg will walk.",
          "lucky_number": 0.9
        },
        "feature_values": [{
          "lucky": 0.9
        }]
      },
      {
        "_index": "rando",
        "_type": "fortune",
        "_id": "3",
        "_score": 0.5,
        "_source": {
          "msg": "Digital circuits are made from analog parts.",
          "lucky_number": 3.3
        },
        "feature_values": [{
          "lucky": 3.3
        }]
      }
    ]
  }
}

No support for empty feature sets

I tend to see the functionality here as a "workbook" approach to developing features. I can see a case where someone creates a feature set and wants to add more features later. So I was surprised to see this command:

PUT _ltr/_featureset/more_movie_features
{
  "name": "more_movie_features",
  "features": []
}

fail with:

{
   "error": {
      "root_cause": [
         {
            "type": "parsing_exception",
            "reason": "At least one feature must be defined in [features]",
            "line": 4,
            "col": 1
         }
      ],
      "type": "parsing_exception",
      "reason": "At least one feature must be defined in [features]",
      "line": 4,
      "col": 1
   },
   "status": 400
}

I suspect there's a reason for this, but I'd like to understand why so I can document it. If it's possible, though, it'd be great to allow empty feature sets and avoid the confusion.
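
For reference, this is the workflow I would like to be able to follow once empty sets are allowed, reusing the existing _addfeatures endpoint (the feature name title_query is just a placeholder):

PUT _ltr/_featureset/more_movie_features
{
  "name": "more_movie_features",
  "features": []
}

POST /_ltr/_featureset/more_movie_features/_addfeatures/title_query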

Build failed with gradle

Hi guys, I cannot build the project with Gradle; it gives this error:

FAILURE: Build failed with an exception.

Where:
Build file '/file/elasticsearchLtrPlugin/elasticsearch-learning-to-rank/build.gradle' line: 18

What went wrong:
A problem occurred evaluating root project 'ltr-query'.

Failed to apply plugin [id 'carrotsearch.randomized-testing']
Could not create task of type 'RandomizedTestingTask'.

Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Do you have any suggestions?

Many thanks,
Beifei

String keys for feature-generating queries

In the current implementation, feature-generating queries are matched to features in a trained model based on their position in the ltr subquery.

This is extremely terse and, without a validation step, leaves many opportunities for positional mismatches during query or model development.

What could help, at least from a query-development perspective, would be something similar to this:

http://alexbenedetti.blogspot.com/2016/08/solr-is-learning-to-rank-better-part-2.html

There, features are identified with a string key in the query, and the ordering/alignment of feature inputs against the model is specified in a separate block from the feature inputs themselves.

Significantly, those feature string keys could then be carried through to the _explain output (in place of "Feature 7:") for more readable validation.
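
A purely hypothetical sketch of what string-keyed features could look like in the ltr query; the feature_order block and the named feature map are illustrative, not an existing API:

GET /rando/_search
{
    "query": {
        "ltr": {
            "model": {
                "stored": "testmodel",
                "feature_order": ["title_match", "lucky"]
            },
            "features": {
                "title_match": {
                    "match": {
                        "msg": "deeds"
                    }
                },
                "lucky": {
                    "script": {
                        "script": {
                            "lang": "expression",
                            "inline": "doc['lucky_number']"
                        }
                    }
                }
            }
        }
    }
}

The _explain output could then refer to "title_match" and "lucky" rather than "Feature 1" and "Feature 2".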

The query example does not work

Hi,

I have used this query example; however, it did not work.

{
    "query": {...}
    "rescore": {
        "query": {
            "ltr": {
                "model": {
                    "stored": "dummy"
                },
                "features": [{
                    "match": {
                        "title": userSearchString
                    }
                },{
                    "constant_score": {
                        "query": {
                            "match_phrase": {
                                "title": "userSearchString"
                            }
                        }
                    }
                }]
            }
        }
    }
}

It seems that the query example is wrong: a "," is missing after "query": {...}. After I added the "," and ran the query, it gave this error:

{
   "error": {
      "root_cause": [
         {
            "type": "illegal_argument_exception",
            "reason": "[query] unknown field [ltr], parser not found"
         }
      ],
      "type": "illegal_argument_exception",
      "reason": "[query] unknown field [ltr], parser not found"
   },
   "status": 400
}

I have checked the source code: the class "LtrQueryBuilder" does not mention "rescore". Does this mean that "rescore" is not implemented?

I am looking forward to your reply.
Thanks,
Beifei

Feature/Feature Set relationship confused somewhat in API

Initially when I developed the demo, I had assumed from the structure of this request:

{
  "name": "my_feature_set",
  "features" : [
    {
      "name": "my_feature",
      "params": ["query_string"],
      "template_language": "mustache",
      "template" : {
        "match": {
          "field": "{{query_string}}"
        }
      }
    }
  ]
}

that features were "owned by" a feature set, i.e. a strong composition relationship. This is further reinforced by the requirement that feature sets be created with new features (see #82). For example, in this sort of relationship I assumed that deleting a feature set would also delete all associated features, or that you would do GET _ltr/my_feature_set/feature/my_feature.

However, it appears on further study that features exist independently of feature sets, and the relationship is more associative (which makes a lot of sense to me).

Can we change this API to avoid the confusion by creating a clearer associative relationship? For example, I would suggest something that did not automatically create features in the set, such as notionally something like:


PUT _ltr/_feature/my_feature
   {
      "name": "my_feature",
      "params": ["query_string"],
      "template_language": "mustache",
      "template" : {
        "match": {
          "field": "{{query_string}}"
        }
      }
   }

PUT _ltr/_featureset
{
  "name": "my_feature_set",
  "features" : [{
      "name": "my_feature"
   }]
}

I could live with the current way things are done, but as I'm documenting I'm seeing how it can be confusing.

At the very least, I think it would be good to be able to support a feature set creation syntax like what I've proposed above where feature sets are created from existing features. (maybe one exists and I missed it?).

Thoughts? Am I missing something?

Kill the Ranklib dependency

Related to #14. Scientists say we only use 1% of our brains... er, I mean Ranklib.

When you focus on what actually matters, we don't care about most of the models, and we don't care about the training side of Ranklib at all. Take all that away and you're left with a very small amount of code for evaluating a few models.

This issue is to kill the dependency and directly include just the models/code we care about.

A new feature for using the first phase scores in the second phase in the Ltr query

Hi all,

I think it would be a good feature to be able to use the first-phase query score as a feature in the Ltr query. The Solr LTR integration has this feature, called OriginalScoreFeature, designed specifically for this purpose. It seems they achieved it by customizing Solr's QueryRescorer and passing the information along as DocInfo. In Elasticsearch, however, it does not seem possible to customize the QueryRescorer.

regards
Rifat

Random test failures

com.o19s.es.ltr.feature.store.index.CachedFeatureStoreTests.testExpirationOnGet is failing randomly, and probably others that rely on the system clock are too.
We need to find a way to make these tests run reliably, or disable them to avoid the annoyance.

Create build for ES 2.4

There are a number of barriers to getting ES 5.x deployed (for us it is partly that we are still working to get off of 1.x, and supporting three major versions in production would require even more hacks). If it's possible to get a 2.x version of this plugin, that would make it a lot easier to experiment with existing indices and should grow the base of contributing users.

Start not recognized when listing features/feature sets

Run this command from the docs:

GET /_ltr/_featureset?prefix=set&start=20&size=30

You'll get the error:

{
   "error": {
      "root_cause": [
         {
            "type": "illegal_argument_exception",
            "reason": "request [/_ltr/_featureset] contains unrecognized parameter: [start]"
         }
      ],
      "type": "illegal_argument_exception",
      "reason": "request [/_ltr/_featureset] contains unrecognized parameter: [start]"
   },
   "status": 400
}

Provide a way to run CircleCI on PRs

Currently CI runs only on branches of this repo. It would be useful for contributors who do not have write access to be able to run CircleCI on their PRs.
It would also help reviewers confirm that the build passes before merging a PR, without having to fetch it locally.

Simplifying/hiding feature stores

My first (perhaps incorrect) impression of feature stores is that they are an implementation detail most users would not think about. Are there cases where people would create more than one feature store? Or can a single feature store satisfy the vast majority of use cases?

Can we

  • Initialize the default feature store on plugin installation?
  • Hide this implementation detail (i.e., don't document creating other feature stores)? See the sketch below.
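
If it helps, initializing the default store is already a single call, and other stores only come into play when an endpoint is explicitly prefixed with a store name (paths shown as I understand the current API; they may differ by version). The default store is created under the hood as the .ltrstore index:

# initialize the default feature store
PUT _ltr

# this targets the default store
GET _ltr/_featureset

# a named store only appears when explicitly prefixed
GET _ltr/my_other_store/_featureset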

Support stored features in FeatureSet creation

I'm working on improving the documentation for using the REST API to manage features and feature sets. Currently I don't see a way to use existing features when creating a new feature set, and we don't allow creating a feature set with no features. We should support creating a feature set from the names of existing features.
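
A hypothetical request shape for that (the feature_names field is illustrative, not an existing parameter):

PUT _ltr/_featureset/my_feature_set
{
  "name": "my_feature_set",
  "feature_names": ["user_rating", "title_query"]
}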
