Agreement

Inter-rater agreement

The Agreement library provides implementations of popular metrics used to measure inter-rater agreement. Inter-rater agreement (also known as inter-rater reliability) describes the degree of agreement among raters: it scores how much homogeneity or consensus exists in the ratings given by various judges.

If you want to learn more about this topic, you can start by reading the Wikipedia article on inter-rater reliability.

Implemented metrics

This library provides pure NumPy implementations of the extended formulas for the following metrics:

Observed agreement
Bennett et al.'s S score
Cohen's kappa
Gwet's gamma
Krippendorff's alpha
Scott's pi

The extended formulas can be used to measure agreement with:

multiple raters - support for two or more raters,
multiple categories - support for binary problems as well as problems with more categories,
missing ratings - not all raters provided answers to all the questions,
weighted agreement - used to model the distance between categories (e.g. dist(5, 4) < dist(5, 1)).
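To make the extended formula concrete, here is a minimal sketch of weighted percent agreement for many raters with missing ratings, written from the formula in Gwet's handbook: for every item rated at least twice, the per-item agreement is sum_k r_ik (r*_ik - 1) / (r_i (r_i - 1)), where r*_ik = sum_l w_kl r_il, and p_a is the mean over those items. The function name `observed_agreement` is hypothetical and not part of the library's API, and its output has not been compared against the library's.

```python
import numpy as np


def observed_agreement(table, weights=None):
    """Weighted percent agreement for many raters with missing ratings.

    table[i, k] is the number of raters who put item i in category k
    (a "questions answers" matrix). weights[k, l] is the agreement credit
    for categories k and l; the identity matrix gives unweighted agreement.
    """
    table = np.asarray(table, dtype=float)
    q = table.shape[1]
    if weights is None:
        weights = np.eye(q)
    r_i = table.sum(axis=1)          # number of ratings each item received
    rated = r_i >= 2                 # items rated once cannot show agreement
    t, r = table[rated], r_i[rated]
    r_star = t @ weights.T           # r*_ik = sum_l w_kl * r_il
    per_item = (t * (r_star - 1)).sum(axis=1) / (r * (r - 1))
    return per_item.mean()


# Counts of answers a, b, c per question; question 3 has only two ratings.
table = np.array([[2, 0, 1], [2, 1, 0], [0, 1, 1]])
print(observed_agreement(table))  # 2/9, approximately 0.2222
```

Items with fewer than two ratings are dropped rather than counted as disagreement, which is how this formula accommodates missing ratings.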

More information about implemented metrics can be found here: TODO

Implemented weights kernels

Agreement provides implementations for eight weight kernels:

identity_kernel
linear_kernel
quadratic_kernel
ordinal_kernel
radical_kernel
ratio_kernel
circular_kernel
bipolar_kernel
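As an illustration of what a weight kernel computes, the sketch below builds three of the weight matrices described in Gwet's handbook, assuming each category is mapped to a numeric score x_k: identity (w_kl = 1 only when k = l), linear (w_kl = 1 - |x_k - x_l| / (x_max - x_min)) and quadratic (w_kl = 1 - (x_k - x_l)^2 / (x_max - x_min)^2). The function names are hypothetical; the library's kernel functions may take different arguments.

```python
import numpy as np


def identity_weights(scores):
    """Full credit only for identical categories, none otherwise."""
    return np.eye(len(scores))


def linear_weights(scores):
    """Credit decreases linearly with the distance between category scores."""
    x = np.asarray(scores, dtype=float)
    diff = np.abs(x[:, None] - x[None, :])
    return 1.0 - diff / (x.max() - x.min())


def quadratic_weights(scores):
    """Credit decreases with the squared distance between category scores."""
    x = np.asarray(scores, dtype=float)
    diff = (x[:, None] - x[None, :]) ** 2
    return 1.0 - diff / (x.max() - x.min()) ** 2


w = linear_weights([1, 2, 3, 4, 5])
print(w[4, 3], w[4, 0])  # 0.75 0.0
```

On a 1-5 scale, linear weights give the pair (5, 4) a credit of 0.75 and the pair (5, 1) a credit of 0, which is one way to encode dist(5, 4) < dist(5, 1). Quadratic weights are more lenient toward near misses (0.9375 for the same adjacent pair).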

More information about implemented weights kernels can be found here: TODO

Installation

Agreement can be installed via pip from PyPI.

pip install agreement

Example usage

1. Prepare dataset

Let's assume you have a dataset in the form of a matrix with three columns: question id, rater id and answer.

import numpy as np

dataset = np.array([
    [1, 1, 'a'],
    [1, 2, 'a'],
    [1, 3, 'c'],
    [2, 1, 'a'],
    [2, 2, 'b'],
    [2, 3, 'a'],
    [3, 1, 'c'],
    [3, 2, 'b'],
])

2. Transform dataset into matrices

Next, transform the dataset into the matrices accepted by the metric functions.

Most of the metrics require a "questions answers" matrix, which contains the frequency of answers for each question. More formally, M = I x A, where I is the list of all items (questions) and A is the list of all possible answers. Matrix element M_ij is the number of times answer j was chosen for question i.
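The "questions answers" matrix is a contingency table of question ids against answers. Below is a minimal pure-NumPy sketch of how such a table can be built, shown only to illustrate its shape and contents; the function name `frequency_table` is hypothetical, and in practice the library's own helper pivot_table_frequency is the one to use.

```python
import numpy as np


def frequency_table(rows, answers):
    """Count how often each answer occurs for each row key (e.g. question id).

    Returns the sorted row labels, the sorted answer labels and a matrix M
    where M[i, j] is the number of times answer j was given for row i.
    """
    row_labels, row_idx = np.unique(rows, return_inverse=True)
    answer_labels, answer_idx = np.unique(answers, return_inverse=True)
    table = np.zeros((len(row_labels), len(answer_labels)), dtype=int)
    np.add.at(table, (row_idx, answer_idx), 1)  # unbuffered: repeated pairs all count
    return row_labels, answer_labels, table


dataset = np.array([
    [1, 1, 'a'], [1, 2, 'a'], [1, 3, 'c'],
    [2, 1, 'a'], [2, 2, 'b'], [2, 3, 'a'],
    [3, 1, 'c'], [3, 2, 'b'],
])
_, _, questions_answers = frequency_table(dataset[:, 0], dataset[:, 2])
print(questions_answers)
# [[2 0 1]
#  [2 1 0]
#  [0 1 1]]
```

Because np.array stores this mixed dataset as strings, the ids are sorted as strings in this sketch ('10' would come before '2'). Passing the rater ids (dataset[:, 1]) instead of the question ids gives per-rater answer counts.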

The second matrix, currently required only by Cohen's kappa, is the "users answers" matrix, which contains the frequency of answers selected by each user. More formally, M = U x A, where U is the list of all users and A is the list of all possible answers. Matrix element M_ij is the number of times user i chose answer j.

The library provides a helper function to prepare both matrices:

from agreement.utils.transform import pivot_table_frequency

questions_answers_table = pivot_table_frequency(dataset[:, 0], dataset[:, 2])
users_answers_table = pivot_table_frequency(dataset[:, 1], dataset[:, 2])

3. Select kernel

Weights are used to model situations where categories are represented as (at least) ordinal data. With this approach, agreement between raters is no longer binary: it varies with the weight assigned to the pair of chosen categories.

There is no formal rule for deciding which set of weights should be used in a particular study, so the choice depends on your problem and the data you are working with.

By default, the metrics use identity_kernel, which applies no weighting between answers. If you want to use an alternative kernel, you can import it from agreement.utils.kernels:

from agreement.utils.kernels import linear_kernel

4. Compute the metric

The last step is to choose the metric you want to compute and run the following code:

from agreement.metrics import cohens_kappa, krippendorffs_alpha

kappa = cohens_kappa(questions_answers_table, users_answers_table)
weighted_kappa = cohens_kappa(questions_answers_table, users_answers_table, weights_kernel=linear_kernel)

alpha = krippendorffs_alpha(questions_answers_table)
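Chance-corrected coefficients share the form (p_a - p_e) / (1 - p_e) and differ mainly in how the chance agreement p_e is defined. As a minimal sketch, Bennett et al.'s S assumes all q categories are equally likely, so p_e = sum_kl w_kl / q^2, which is 1/q with identity weights. The function `bennett_s` is hypothetical and not the library's API; the input 2/9 is a hand calculation of Gwet's multi-rater observed agreement for the example dataset, not a value produced by the library.

```python
import numpy as np


def bennett_s(p_a, weights):
    """Bennett et al.'s S from an observed agreement p_a and a q x q weight matrix.

    Chance agreement assumes uniformly distributed categories:
    p_e = sum(weights) / q**2, which equals 1/q for identity weights.
    """
    weights = np.asarray(weights, dtype=float)
    q = weights.shape[0]
    p_e = weights.sum() / q**2
    return (p_a - p_e) / (1 - p_e)


# Example dataset: three categories (a, b, c), identity weights, p_a = 2/9.
print(bennett_s(2 / 9, np.eye(3)))  # approximately -0.1667, i.e. -1/6
```

A negative value means the raters agreed less often than random, uniformly distributed answers would be expected to.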

For a more detailed example, see: TODO

Reference

All equations are based on the Handbook of Inter-Rater Reliability by Kilem Li Gwet (2014). This book provides an extensive explanation of all topics related to inter-rater agreement, including a detailed description of every metric implemented in this library, as well as the example datasets that were used to test this implementation.

I also recommend taking a look at mReliability, a MATLAB implementation of the same metrics, which provides a more detailed explanation of the metrics' formulas than the one you will find here.

heolin / agreement
