fairsearch-fair-for-elasticsearch's People

Contributors

chatox, dependabot[bot], ivankitanovski, milkalichtblau, purbon, tsuehr

fairsearch-fair-for-elasticsearch's Issues

Implement the M table generator as a utility class

Hi,
we should port the M table generator, as specified in the paper https://arxiv.org/abs/1706.06368, so that it can be used as a library.

There should be a class that holds the following core functionality:

  • Generate an M table based on given parameters (p, alpha, tries, etc.)

Acceptance criteria:

Given valid parameters for alpha, p and k.
When a user requests an M table to be generated.
Then a valid M table is generated.
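
A minimal sketch of such a generator, assuming that the M table entry for a prefix of length k is the smallest m whose binomial CDF F(m; k, p) exceeds alpha, as defined in the paper. The names binomial_cdf and compute_mtable are hypothetical, and the sketch leaves out the alpha adjustment for multiple testing that the paper also describes.

```python
from math import comb
from typing import List


def binomial_cdf(m: int, n: int, p: float) -> float:
    """P(X <= m) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(m + 1))


def compute_mtable(k: int, p: float, alpha: float) -> List[int]:
    """Return [m(1), ..., m(k)]: minimum protected items required per prefix.

    Hypothetical helper; no alpha adjustment is applied here.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    table = []
    for prefix in range(1, k + 1):
        m = 0
        # Smallest m with F(m; prefix, p) > alpha; m never exceeds the prefix length.
        while m < prefix and binomial_cdf(m, prefix, p) <= alpha:
            m += 1
        table.append(m)
    return table


if __name__ == "__main__":
    print(compute_mtable(10, 0.5, 0.1))  # [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
```

Invalid parameters raise a ValueError, which covers the "valid parameters" precondition of the acceptance criteria.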

Deal with cases in which p and k are small, and alpha is large, so no re-ranking is necessary

For some combinations of k, p, alpha (for instance, k=10, p=0.1, alpha=0.05), the FA*IR test is always passed without the need of re-ranking.

Such a case can be detected by the AlphaAdjustment routine, which could throw a RerankingNotNeededException or somehow communicate that re-ranking will not be needed.

The output ranking should, if possible, include a flag indicating that re-ranking was skipped because it was not necessary.
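
One way to detect this case, sketched under the assumption that re-ranking is unnecessary exactly when every M table entry is zero. The function names are hypothetical and the table is computed without alpha adjustment. Since adjustment only lowers alpha, an all-zero unadjusted table would stay all-zero after adjustment.

```python
from math import comb
from typing import List


def compute_mtable(k: int, p: float, alpha: float) -> List[int]:
    """Unadjusted M table: smallest m per prefix with binomial CDF > alpha."""
    table = []
    for n in range(1, k + 1):
        m, cdf = 0, (1 - p) ** n  # cdf starts at P(X <= 0)
        while m < n and cdf <= alpha:
            m += 1
            cdf += comb(n, m) * p**m * (1 - p) ** (n - m)
        table.append(m)
    return table


def reranking_needed(k: int, p: float, alpha: float) -> bool:
    """True if at least one prefix requires a protected element."""
    return any(m > 0 for m in compute_mtable(k, p, alpha))


if __name__ == "__main__":
    # The combination from this issue: no prefix requires a protected element.
    print(reranking_needed(10, 0.1, 0.05))  # False
    print(reranking_needed(10, 0.5, 0.1))   # True
```

Returning a boolean rather than throwing lets the caller attach the result to the output ranking; an exception such as RerankingNotNeededException would work equally well if the caller prefers that style.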

Fix build against Elasticsearch 6.3.x

The current build fails against newer ES versions. Setting version_es = '6.3.1' in build.gradle produces an error like this:
A problem occurred evaluating root project 'fairsearch'.

Failed to apply plugin [id 'elasticsearch.esplugin']
Could not create task of type 'RestIntegTestTask'

This seems to be caused by the newer ES plugin in 6.3, which requires the idea Gradle plugin to be loaded first. It can be fixed by applying the idea plugin before the elasticsearch plugin, like this (lines 23-24 in build.gradle):
...
apply plugin: 'idea'
apply plugin: 'elasticsearch.esplugin'
...

Please check and include this fix if you find it useful (I could not find a way to create a branch or open a PR directly).
It would also be nice to have a pre-built artifact for 6.3.1 (in the same repo as the artifacts for the older versions).

On a related note, this fixes the build for ES 6.3.x. Building against ES 6.4.x needs a further fix: upgrading Gradle to version 4.9+. In my case this worked without deeper changes to the build, so please consider including this change too.

Verify the case where k < windowsize

As I understand it, in the current code the variable k can be specified independently of the window size of the re-ranker.

In particular, k <= windowsize must hold (k > windowsize raises an exception).

When k < windowsize, the M-Table should be applied up to position k. After that position we do not verify any minimum number of protected elements, because the M-Table has no entries there. In other words, the M-Table applies to the first k elements, and the remaining elements are returned in the order in which they arrive, regardless of whether they are protected.

This issue requires:

  1. @purbon to certify that indeed k can be < windowsize

  2. @tsuehr to check that the re-ranker works correctly when it receives more elements than the size of the M-Table
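
The expected behaviour can be sketched as a check that only inspects the first k positions. This is an illustration rather than the plugin's code: satisfies_mtable is a hypothetical name, and the window size is assumed to equal the number of elements passed in.

```python
from typing import Sequence


def satisfies_mtable(is_protected: Sequence[bool], mtable: Sequence[int]) -> bool:
    """Check the first k = len(mtable) positions of a ranking against the M table.

    is_protected[i] says whether the element at rank i + 1 is protected.
    Elements after position k are ignored on purpose.
    """
    k = len(mtable)
    if k > len(is_protected):
        # Mirrors the rule "k > windowsize => exception".
        raise ValueError("k must not exceed the window size")
    protected_so_far = 0
    for position in range(k):
        protected_so_far += 1 if is_protected[position] else 0
        if protected_so_far < mtable[position]:
            return False
    return True


if __name__ == "__main__":
    mtable = [0, 0, 0, 1, 1]  # k = 5
    window = [False, False, False, True, False, False, False, False]  # windowsize = 8
    print(satisfies_mtable(window, mtable))  # True: the tail is never inspected
```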

Fairsearch as an elasticsearch rescore function.

The first and most straightforward way to integrate the fair search methodology into an Elasticsearch plugin is to offer the algorithm as a rescore function (https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-rescore.html).

Given a query like this:

GET studies/_search
{
  "query": {
    "match": {
      "center": "Berlin"
    }
  },
  "rescore": {
    "window_size": 100,
    "fair_rescorer" : {
      "protected_key" : "gender",
      "protected_value" : "Female"
    }
  }
}

The system will execute the query

  "query": {
    "match": {
      "center": "Berlin"
    }
  }

and then apply the rescore function to raise the score of the top fair results.

This functionality allows an initial, clean and straightforward implementation that provides high value to users of this plugin early on.
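
The re-ordering step behind such a rescorer can be sketched as a merge of two score-sorted queues, in the spirit of the FA*IR algorithm. This is a simplified illustration, not the plugin's code: fair_rerank and the Candidate tuple are hypothetical, the sketch only re-orders documents and does not assign new scores, and it proceeds with non-protected documents when the protected queue runs out.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, float, bool]  # (doc id, score, is_protected), hypothetical shape


def fair_rerank(candidates: Sequence[Candidate], mtable: Sequence[int]) -> List[Candidate]:
    """Merge protected and non-protected queues so each prefix meets the M table."""
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    protected = [c for c in ordered if c[2]]
    others = [c for c in ordered if not c[2]]
    result: List[Candidate] = []
    p_idx = o_idx = n_protected = 0
    for position in range(len(ordered)):
        # Positions beyond the M table carry no fairness requirement.
        required = mtable[position] if position < len(mtable) else 0
        protected_left = p_idx < len(protected)
        others_left = o_idx < len(others)
        if protected_left and (
            n_protected < required
            or not others_left
            or protected[p_idx][1] >= others[o_idx][1]
        ):
            result.append(protected[p_idx])
            p_idx += 1
            n_protected += 1
        else:
            result.append(others[o_idx])
            o_idx += 1
    return result


if __name__ == "__main__":
    docs = [("a", 0.9, False), ("b", 0.8, False), ("c", 0.7, False),
            ("d", 0.6, True), ("e", 0.5, True)]
    print([doc_id for doc_id, _, _ in fair_rerank(docs, [0, 1, 1, 2, 2])])
    # ['a', 'd', 'b', 'e', 'c']
```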

Depends On: #1

Distribution method

Elasticsearch plugins are usually distributed from a web server. We should provide a URL from which people can download this plugin.

Remove or document the restriction that index must have at most 1 shard + 1 replica for the plugin to work

There is a hard validation in code (in FairRescoreBuilder.java) that checks that the index has only 1 shard (+1 replica).
This means that the plugin is unusable for bigger indexes, which for performance (and/or reliability) reasons need more than 1 shard (or 1 replica) when deployed (like in a production environment).

If this is an unavoidable limitation of the algorithm or the elasticsearch api used by it, please document it clearly in the readme / userguide (currently there is no mention about this), as for many potential users I think this is a very important consideration, which should be known upfront.

If on the other hand this is an avoidable limitation, and the code can be changed/improved so it works with multiple shards, then please consider improving it in this direction.

M table calculations should be cached in elasticsearch

M table calculations are expensive, so they should be cached internally in Elasticsearch.

We should cache them as documents in a special internal index.

For this we should create a special REST endpoint, together with its related transport actions. This endpoint will be responsible for:

  • Triggering the calculation of a new M table based on the given set of parameters.
  • Deleting a stored M table.
  • Triggering an update of an already calculated M table.

Endpoints draft proposal

POST _fairness_table/_create
{
  "model": {
    "name": "foo",
    "proportion": 3,
    "alpha": 0.1,
    "table": [[1,2,3],[4,5,6]]
  }
}
The table parameter is optional. If it is not set, the table is calculated on demand; otherwise it is taken from the parameter.

The same request with PUT triggers an update.

GET _fairness_table/foo

DELETE _fairness_table/foo



depends on: #1 

Catch shard failure and throw reasonable exception when doing a fair query against a non-existing mtable

Executing

POST /test/_search
{
  "query" : {
    "match" : {
      "body" : "hello"
    }
  },
  "rescore" : {
    "fair_rescorer" : {
      "protected_key" : "gender",
      "protected_value" : "f",
      "significance_level" : 0.1,
      "min_proportion_protected" : 0.8
    }
  }
}

without first running

POST /_fs/_mtable/0.8/0.1/10

leads to an "All shards failed" error within Elasticsearch.
An error message like "mtable does not exist" would be better.
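
The desired behaviour can be sketched with an in-memory lookup that fails with a descriptive error. MTableStore and MTableNotFoundError are hypothetical names; the real plugin would read the table from Elasticsearch rather than from a dictionary. The key order (proportion, alpha, k) is assumed from the path /_fs/_mtable/0.8/0.1/10 above.

```python
from typing import Dict, List, Tuple

Key = Tuple[float, float, int]  # (min proportion protected, significance level, k)


class MTableNotFoundError(LookupError):
    """Raised when a fair query refers to an M table that was never created."""


class MTableStore:
    """In-memory stand-in for the index that holds pre-computed M tables."""

    def __init__(self) -> None:
        self._tables: Dict[Key, List[int]] = {}

    def put(self, proportion: float, alpha: float, k: int, table: List[int]) -> None:
        self._tables[(proportion, alpha, k)] = list(table)

    def get(self, proportion: float, alpha: float, k: int) -> List[int]:
        try:
            return self._tables[(proportion, alpha, k)]
        except KeyError:
            raise MTableNotFoundError(
                f"mtable does not exist for proportion={proportion}, "
                f"alpha={alpha}, k={k}; create it first with "
                f"POST /_fs/_mtable/{proportion}/{alpha}/{k}"
            ) from None


if __name__ == "__main__":
    store = MTableStore()
    try:
        store.get(0.8, 0.1, 10)
    except MTableNotFoundError as error:
        print(error)
```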

Configuration options for the plugin

As a user I would love to set static configuration values as plugin settings. These options define how the fair query plugin operates.

Draft proposal


proportion_strategy = fixed / variable
Default fixed. The minimum proportion of protected attributes in every prefix.

significance_level = (float number)
Default 0.1. The significance level for the statistical test.

on_too_few_protected_elements = abort / proceed
Default proceed. What to do if there are fewer protected elements than the target.

min_proportion_protected = (float number)
Default 0.5. The minimum proportion of protected attributes at every prefix.

lookup_for_measuring_proportion = (integer number)
Default 100. The number of top elements that are examined to determine the target proportion of protected elements.

As you can see in the Elasticsearch documentation, it will be possible to change these settings at runtime.
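
The proposed options and their defaults can be summarised as a small validated settings object. This is only a sketch of the draft above: FairSearchSettings is a hypothetical name, and the range checks are assumptions about what counts as a sensible value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FairSearchSettings:
    """Draft plugin settings with the defaults proposed in this issue."""

    proportion_strategy: str = "fixed"            # fixed / variable
    significance_level: float = 0.1
    on_too_few_protected_elements: str = "proceed"  # abort / proceed
    min_proportion_protected: float = 0.5
    lookup_for_measuring_proportion: int = 100

    def __post_init__(self) -> None:
        # The range checks below are assumptions, not part of the draft.
        if self.proportion_strategy not in ("fixed", "variable"):
            raise ValueError("proportion_strategy must be 'fixed' or 'variable'")
        if self.on_too_few_protected_elements not in ("abort", "proceed"):
            raise ValueError("on_too_few_protected_elements must be 'abort' or 'proceed'")
        if not 0 < self.significance_level < 1:
            raise ValueError("significance_level must lie strictly between 0 and 1")
        if not 0 < self.min_proportion_protected < 1:
            raise ValueError("min_proportion_protected must lie strictly between 0 and 1")
        if self.lookup_for_measuring_proportion < 1:
            raise ValueError("lookup_for_measuring_proportion must be positive")


if __name__ == "__main__":
    print(FairSearchSettings())
    print(FairSearchSettings(min_proportion_protected=0.8))
```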


Objective: To agree on this issue about our first set of configuration options for the plugin.

Proposals, enhancements and discussions are very welcome.

FairRescoreBuilder: Queue construction for fair Rescore should be reviewed

The line
int max = Math.min(topDocs.scoreDocs.length, rescoreContext.getWindowSize());
at line 277 in FairRescoreBuilder.java fails for the following example:
Docs in index = 25
WindowSize = 10
k = 10
The top 9 docs are non-protected, so the window contains only one protected element.
For p = 0.8 and alpha = 0.1 we need 6 protected elements.
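
The shortfall in this example can be reproduced with a short calculation. The sketch assumes the unadjusted M table definition from the paper (smallest m with binomial CDF above alpha) and a hypothetical index layout in which every document from rank 10 onwards is protected; the function names are illustrative.

```python
from math import comb
from typing import Sequence


def required_protected(k: int, p: float, alpha: float) -> int:
    """Smallest m with P(X <= m) > alpha for X ~ Binomial(k, p)."""
    m, cdf = 0, (1 - p) ** k
    while m < k and cdf <= alpha:
        m += 1
        cdf += comb(k, m) * p**m * (1 - p) ** (k - m)
    return m


def protected_shortfall(is_protected: Sequence[bool], window_size: int,
                        k: int, p: float, alpha: float) -> int:
    """How many protected elements the window lacks to satisfy position k."""
    in_window = sum(1 for flag in is_protected[:window_size] if flag)
    return max(0, required_protected(k, p, alpha) - in_window)


if __name__ == "__main__":
    # 25 docs by descending score: top 9 non-protected, the rest protected (assumed).
    index = [False] * 9 + [True] * 16
    print(required_protected(10, 0.8, 0.1))               # 6
    print(protected_shortfall(index, 10, 10, 0.8, 0.1))   # 5
```

Under this assumed layout the index holds enough protected documents, but a queue built only from the first 10 hits cannot supply them.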

Fairsearch as query string

As a user of this plugin I want to execute the query directly as:

GET studies/_search
{
  "query": {
    "fair_query_string": {
      "query": "Berlin",
      "default_field" : "content",
      "default_operator" : "OR",
      "protected_key" : "gender",
      "protected_value" : "Female"
    }
  }
}

This query will allow users to:

Given a query string (similar functionality as in https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html),
retrieve the fair list of search results based on the given parameters.

The proposed parameters are:

  • The ones available in the Elasticsearch query string.
  • protected_key: specifies the protected attribute used by the algorithm.
  • protected_value: the value used to define the protected class.

Once #3 is finished, this query will be the next logical step that provides direct user value.

Depends On: #3
