fair-search / fairsearch-fair-for-elasticsearch
Fair search elasticsearch plugin
License: Apache License 2.0
Hi,
we should port the M table generator, as specified in the paper https://arxiv.org/abs/1706.06368, so that it can be used as a library.
There should be a class that holds this core functionality.
Acceptance criteria:
Given valid parameters for alpha, p and k.
When a user requests an M table to be generated.
Then a valid M table is generated.
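One way to read these criteria: for each prefix length i from 1 to k, entry i of the M table is the smallest number of protected candidates m whose binomial CDF F(m; i, p) reaches alpha. The Java sketch below follows that reading. The names `MTableSketch`, `generate` and `minProtected` are illustrative and not taken from the plugin, and the sketch does not apply the alpha adjustment described in the paper.

```java
/** Illustrative M table generator; all names are hypothetical, not the plugin's API. */
final class MTableSketch {

    /** Smallest m such that P(X <= m) >= alpha for X ~ Binomial(k, p). */
    static int minProtected(int k, double p, double alpha) {
        double pmf = Math.pow(1.0 - p, k); // P(X = 0)
        double cdf = pmf;
        int m = 0;
        while (cdf < alpha && m < k) {
            // Derive P(X = m + 1) from P(X = m) instead of recomputing binomials.
            pmf *= ((double) (k - m) / (m + 1)) * (p / (1.0 - p));
            m++;
            cdf += pmf;
        }
        return m;
    }

    /** Entry i - 1 holds the minimum number of protected elements in the top i. */
    static int[] generate(int k, double p, double alpha) {
        if (k < 1 || p <= 0.0 || p >= 1.0 || alpha <= 0.0 || alpha >= 1.0) {
            throw new IllegalArgumentException("expected k >= 1 and p, alpha in (0, 1)");
        }
        int[] table = new int[k];
        for (int i = 1; i <= k; i++) {
            table[i - 1] = minProtected(i, p, alpha);
        }
        return table;
    }
}
```

For example, `MTableSketch.generate(12, 0.5, 0.1)` yields 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4. For very large k the starting term (1 - p)^k can underflow, so a production version would likely work in log space.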
For some combinations of k, p, alpha (for instance, k=10, p=0.1, alpha=0.05), the FA*IR test is always passed without the need of re-ranking.
Such a case can be detected by the AlphaAdjustment routine, which could throw a RerankingNotNeededException or somehow communicate that re-ranking will not be needed.
The output ranking should, if possible, include a variable saying that re-ranking was not done because it was not necessary.
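A minimal sketch of how such a case could be detected, assuming the M table is available as an int array whose entry i is the minimum number of protected elements required in the top i + 1 positions. If every entry is zero, no prefix can fail the test. The class and method names are hypothetical; whether the plugin signals this with a RerankingNotNeededException or with a flag on the output is left open here.

```java
/** Illustrative check; the names are hypothetical, not the plugin's API. */
final class RerankingCheckSketch {

    /** Re-ranking can only change a ranking if some prefix requires a protected element. */
    static boolean isRerankingNeeded(int[] mTable) {
        for (int required : mTable) {
            if (required > 0) {
                return true;
            }
        }
        return false;
    }
}
```

A caller could copy the negated result into the response, so the output ranking states that re-ranking was skipped because it was not necessary.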
The current build fails when it is run against a newer ES (by setting version_es = '6.3.1' in build.gradle), with an error like this:
A problem occurred evaluating root project 'fairsearch'.
Failed to apply plugin [id 'elasticsearch.esplugin']
Could not create task of type 'RestIntegTestTask'
This seems to be caused by the newer ES plugin in 6.3, which requires the idea Gradle plugin to be loaded first. It can be fixed by applying the idea plugin before the elasticsearch plugin, like this (lines 23-24 in build.gradle):
...
apply plugin: 'idea'
apply plugin: 'elasticsearch.esplugin'
...
Please check and include this fix if you find it useful (I could not find a way to create a branch or open a PR directly).
Also, it would be nice to have a pre-built artifact for 6.3.1 too :) (in the same repo as the artifacts for the older versions).
On a related note, this fixes the build for ES 6.3.x. Building against ES 6.4.x needs a further fix: upgrading Gradle to version 4.9+, which in my case worked without any deeper changes to the build, so maybe consider including this change too.
The generation of table M should be covered by unit tests varying p, alpha, and k.
Typical values of alpha: 0.1, 0.2
Values of k: 50, 100, 200
Values of p: 0.4, 0.5, 0.6
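A sketch of how such tests could iterate over the listed values, written in plain Java rather than a specific test framework. The `Generator` interface is a stand-in for whatever the library ends up exposing, and the checked properties are assumptions that follow from reading the M table as an inverse binomial CDF: entries never decrease, grow by at most one per position, and never exceed the position itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative property checks for a generated M table; all names are hypothetical. */
final class MTablePropertyCheck {

    /** Stand-in for whatever generator the library ends up exposing. */
    interface Generator {
        int[] generate(int k, double p, double alpha);
    }

    /** Returns a description of every structural problem found in the table. */
    static List<String> violations(int[] table, int k) {
        List<String> problems = new ArrayList<>();
        if (table.length != k) {
            problems.add("expected " + k + " entries but got " + table.length);
        }
        int previous = 0;
        for (int i = 0; i < table.length; i++) {
            int position = i + 1;
            if (table[i] < 0 || table[i] > position) {
                problems.add("entry out of range at position " + position);
            }
            if (table[i] < previous) {
                problems.add("table decreases at position " + position);
            }
            if (table[i] > previous + 1) {
                problems.add("table jumps by more than one at position " + position);
            }
            previous = table[i];
        }
        return problems;
    }

    /** Runs the checks for every combination of the values listed in the issue. */
    static List<String> checkGrid(Generator generator) {
        double[] alphas = {0.1, 0.2};
        int[] ks = {50, 100, 200};
        double[] ps = {0.4, 0.5, 0.6};
        List<String> problems = new ArrayList<>();
        for (double alpha : alphas) {
            for (int k : ks) {
                for (double p : ps) {
                    for (String problem : violations(generator.generate(k, p, alpha), k)) {
                        problems.add("alpha=" + alpha + ", k=" + k + ", p=" + p + ": " + problem);
                    }
                }
            }
        }
        return problems;
    }
}
```

A unit test would then assert that `checkGrid` returns an empty list for the real generator, alongside a few hand-verified tables for small k.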
As I understand it, in the current code the variable k can be specified independently of the window size of the re-ranker.
In particular, k <= windowsize must hold (if k > windowsize, throw an exception).
When k < windowsize, the M-Table should be applied up to position k. Beyond that position there is no M-Table, so we do not check the number of protected elements against any minimum. In other words, the M-Table applies to the first k elements, and the remaining elements are returned in the order in which they arrive, regardless of whether they are protected.
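The behaviour described here can be sketched as follows. This is a simplified stand-in and not the code in FairRescoreBuilder.java: documents are represented only by their original rank and a protected flag, `mTable.length` plays the role of k, and when no protected element is left the sketch simply proceeds with the best remaining element. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative prefix re-ranking; names and data layout are assumptions. */
final class PrefixRerankSketch {

    /**
     * @param isProtected protected flag per document, in original score order
     * @param mTable      entry i = minimum protected elements in the top i + 1; length is k
     * @return original ranks of the window's documents in their new order
     */
    static List<Integer> rerank(boolean[] isProtected, int[] mTable, int windowSize) {
        int k = mTable.length;
        if (k > windowSize) {
            throw new IllegalArgumentException("k must not exceed the window size");
        }
        int n = Math.min(windowSize, isProtected.length);
        Deque<Integer> prot = new ArrayDeque<>();
        Deque<Integer> nonProt = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            (isProtected[i] ? prot : nonProt).add(i);
        }
        List<Integer> out = new ArrayList<>();
        int protectedSoFar = 0;
        // Positions 1..k: enforce the M table.
        for (int pos = 0; pos < k && out.size() < n; pos++) {
            boolean mustTakeProtected = protectedSoFar < mTable[pos] && !prot.isEmpty();
            boolean protectedIsBest = !prot.isEmpty()
                    && (nonProt.isEmpty() || prot.peek() < nonProt.peek());
            if (mustTakeProtected || protectedIsBest) {
                out.add(prot.poll());
                protectedSoFar++;
            } else {
                out.add(nonProt.poll());
            }
        }
        // Positions after k: original order, no M table check.
        while (!prot.isEmpty() || !nonProt.isEmpty()) {
            boolean protectedIsBest = !prot.isEmpty()
                    && (nonProt.isEmpty() || prot.peek() < nonProt.peek());
            out.add(protectedIsBest ? prot.poll() : nonProt.poll());
        }
        return out;
    }
}
```

With flags (false, false, false, true, false, true), an M table of (0, 1, 1) and a window of 6, the protected document at original rank 3 moves up to the second position and everything after position 3 keeps its original order.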
This issue requires:
We should benchmark the performance of this plugin to understand the expected performance of the FAIR search algorithm.
For this we can use https://github.com/elastic/rally (recommended) or, as a fallback, the basic ApacheBench (ab) tool available at https://httpd.apache.org/docs/2.4/programs/ab.html
The first and most straightforward way to integrate the fair search methodology into an elasticsearch plugin is to offer the algorithm as a rescore function (https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-rescore.html).
Given a query like this:
GET studies/_search
{
"query": {
"match": {
"center": "Berlin"
}
},
"rescore": {
"window_size": 100,
"fair_rescorer" : {
"protected_key" : "gender",
"protected_value" : "Female"
}
}
}
The system will execute the query
"query": {
"match": {
"center": "Berlin"
}
}
and then apply the rescore function to raise the score of the top fair results.
This functionality will allow an initial, clean and straightforward implementation that provides high value to users of this plugin from the start.
Depends On: #1
Looking at the documentation, I noticed in http://fairsearch-elasticsearch.readthedocs.io/en/feature-docs/building_the_mtable.html that mtables need a name.
The name, if needed, should be automatically generated as "mtable(proportion, alpha, k)"
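A small sketch of that naming rule, assuming the three values are passed as two doubles and an int; the class and method names are hypothetical. `Locale.ROOT` keeps the formatted digits independent of the server's locale.

```java
import java.util.Locale;

/** Illustrative name builder; not part of the plugin. */
final class MTableNameSketch {

    static String nameFor(double proportion, double alpha, int k) {
        // %s on a double uses Double.toString, so 0.1 is always rendered with a dot.
        return String.format(Locale.ROOT, "mtable(%s, %s, %d)", proportion, alpha, k);
    }
}
```

For instance, `MTableNameSketch.nameFor(0.5, 0.1, 100)` returns `mtable(0.5, 0.1, 100)`.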
When k is too small (say, k<100) or p is too small (say, p<0.05), throw an exception.
@tsuehr should indicate what the bounds are by looking at the tables he generated.
We should create valid test cases based on the XING dataset (link)
This test should cover:
Elasticsearch plugins are usually distributed from a web server, so we should provide a URL that people can use to download this plugin.
There is a hard validation in code (in FairRescoreBuilder.java) that checks that the index has only 1 shard (+1 replica).
This means that the plugin is unusable for bigger indexes, which for performance (and/or reliability) reasons need more than 1 shard (or 1 replica) when deployed (like in a production environment).
If this is an unavoidable limitation of the algorithm or the elasticsearch api used by it, please document it clearly in the readme / userguide (currently there is no mention about this), as for many potential users I think this is a very important consideration, which should be known upfront.
If on the other hand this is an avoidable limitation, and the code can be changed/improved so it works with multiple shards, then please consider improving it in this direction.
Hi,
currently the plugin just calls the mtable generator class; however, since the alpha adjustment process was added, there is no connection between the two.
@tsuehr can you make sure to include the logic necessary to generate the tables using the new alpha?
We should create valid test cases based on the German credit score dataset (link)
This test should cover:
M tables are expensive to compute; for this reason, they should be cached internally in elasticsearch.
We should cache them as documents in a special internal index.
For this we should create a special REST endpoint, together with its related transport actions. This endpoint will be responsible for:
POST _fairness_table/_create
{
"model" : {
"name": "foo",
"proportion": 3,
"alpha": 0.1,
"table": [[1,2,3],[4,5,6]]
}
}
The "table" param is optional: if it is not set, the table will be calculated on demand; otherwise it is taken from the parameter.
The same request with PUT triggers an update.
GET _fairness_table/foo
DELETE _fairness_table/foo
depends on: #1
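The create / update / get / delete semantics above can be illustrated with an in-memory stand-in. This is deliberately not the proposed design (documents in an internal index behind a REST endpoint and transport actions); it swaps in a `ConcurrentHashMap` only to show the intended behaviour, and every name in it is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** In-memory stand-in for the proposed M table store; not the Elasticsearch-backed design. */
final class MTableCacheSketch {

    private final Map<String, int[]> tables = new ConcurrentHashMap<>();

    /** POST-like create: use the supplied table if present, otherwise compute it once. */
    int[] create(String name, int[] table, Supplier<int[]> generator) {
        return tables.computeIfAbsent(name, n -> table != null ? table : generator.get());
    }

    /** PUT-like update: replace whatever is stored under the name. */
    void update(String name, int[] table) {
        tables.put(name, table);
    }

    /** GET-like lookup: returns null when no table is stored under the name. */
    int[] get(String name) {
        return tables.get(name);
    }

    /** DELETE-like removal: returns true if a table was actually removed. */
    boolean delete(String name) {
        return tables.remove(name) != null;
    }
}
```

The point of `computeIfAbsent` here is that the expensive generator runs at most once per name, which is the property the internal index is meant to provide across requests and nodes.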
Executing
POST /test/_search
{"query" : {
"match" : {
"body" : "hello"
}
},
"rescore" : {
"fair_rescorer" : {
"protected_key" : "gender",
"protected_value" : "f",
"significance_level" : 0.1,
"min_proportion_protected" : 0.8
}
}
}
without first running
POST /_fs/_mtable/0.8/0.1/10
leads to an "All shards failed" error within elasticsearch.
An error message like "mtable does not exist" would be better.
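A sketch of the kind of check that would produce such a message, assuming stored tables can be looked up by a key built from p, alpha and k. The `Map` stands in for wherever the plugin actually keeps its tables, and the class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.Map;

/** Illustrative lookup with a descriptive failure; not the plugin's code. */
final class MTableLookupSketch {

    static int[] require(Map<String, int[]> stored, double p, double alpha, int k) {
        String key = String.format(Locale.ROOT, "%s/%s/%d", p, alpha, k);
        int[] table = stored.get(key);
        if (table == null) {
            // Tell the user what is missing and how to create it.
            throw new IllegalStateException("mtable does not exist for " + key
                    + "; create it first with POST /_fs/_mtable/" + key);
        }
        return table;
    }
}
```

Failing early with this message, before the rescorer runs on each shard, would replace the generic shard failure with something the user can act on.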
As a user I would love to set static configuration values as plugin settings. These options define how the fair query plugin operates.
proportion_strategy = fixed / variable
Default fixed. The minimum proportion of protected attributes in every prefix.
significance_level = (float number)
Default 0.1. The significance level for the statistical test.
on_too_few_protected_elements = abort / proceed
Default proceed. What to do if there are fewer protected elements than the target.
min_proportion_protected = (float number)
Default 0.5. The minimum proportion of protected attributes at every prefix
lookup_for_measuring_proportion = (integer number)
Default 100. The number of top elements that are examined to determine the target proportion of protected elements.
As you can see in the elasticsearch documentation, there will be the option to change these settings at runtime.
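To make the proposal concrete, the options and their defaults could be captured in a plain value class like the one below. It does not use the Elasticsearch settings API, the names are hypothetical, and the validation ranges (proportions and significance level strictly between 0 and 1, lookup of at least 1) are assumptions rather than anything agreed in this issue.

```java
/** Illustrative holder for the proposed settings; not the Elasticsearch Setting API. */
final class FairSearchSettingsSketch {

    enum ProportionStrategy { FIXED, VARIABLE }
    enum OnTooFewProtected { ABORT, PROCEED }

    final ProportionStrategy proportionStrategy;
    final double significanceLevel;
    final OnTooFewProtected onTooFewProtectedElements;
    final double minProportionProtected;
    final int lookupForMeasuringProportion;

    FairSearchSettingsSketch(ProportionStrategy proportionStrategy,
                             double significanceLevel,
                             OnTooFewProtected onTooFewProtectedElements,
                             double minProportionProtected,
                             int lookupForMeasuringProportion) {
        // Assumed ranges; the issue itself does not state them.
        if (!(significanceLevel > 0.0 && significanceLevel < 1.0)) {
            throw new IllegalArgumentException("significance_level must be in (0, 1)");
        }
        if (!(minProportionProtected > 0.0 && minProportionProtected < 1.0)) {
            throw new IllegalArgumentException("min_proportion_protected must be in (0, 1)");
        }
        if (lookupForMeasuringProportion < 1) {
            throw new IllegalArgumentException("lookup_for_measuring_proportion must be >= 1");
        }
        this.proportionStrategy = proportionStrategy;
        this.significanceLevel = significanceLevel;
        this.onTooFewProtectedElements = onTooFewProtectedElements;
        this.minProportionProtected = minProportionProtected;
        this.lookupForMeasuringProportion = lookupForMeasuringProportion;
    }

    /** Defaults as listed in the proposal above. */
    static FairSearchSettingsSketch defaults() {
        return new FairSearchSettingsSketch(
                ProportionStrategy.FIXED, 0.1, OnTooFewProtected.PROCEED, 0.5, 100);
    }
}
```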
Objective: to agree, in this issue, on our first set of configuration options for the plugin.
Proposals, enhancements and discussions are very welcome.
int max = Math.min(topDocs.scoreDocs.length, rescoreContext.getWindowSize());
at line 277 in FairRescoreBuilder.java will fail in the following example:
Docs in index = 25
WindowSize = 10
k = 10
Top 9 docs are non-protected
The window will then contain only one protected element,
but for p = 0.8 and alpha = 0.1 we need 6 protected elements.
As a user of this plugin I want to be able to execute the query directly, as:
GET studies/_search
{
"query": {
"fair_query_string": {
"query": "Berlin",
"default_field" : "content",
"default_operator" : "OR",
"protected_key" : "gender",
"protected_value" : "Female"
}
}
}
This query will allow users to:
Given a query string (similar functionality to https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html),
retrieve the fair list of search results based on the given parameters.
The proposed list of parameters is:
Once #3 is finished, this query will be the next logical step that provides direct user value.
Depends On: #3
We should write proper documentation for our plugin and publish it using the very popular https://readthedocs.org/ platform.