
mateuszbuda / ml-stat-util

Statistical functions based on bootstrapping for computing confidence intervals and p-values for comparing machine learning models and human readers.

Home Page: https://mateuszbuda.github.io/2019/04/30/stat.html

License: MIT License

Languages: Dockerfile 0.63%, Python 22.22%, Jupyter Notebook 77.14%
Topics: statistics, python, machine-learning, jupyter-notebook, p-value, confidence-intervals, bootstrapping

ml-stat-util's Introduction

Machine Learning Statistical Utils

Docker setup for the example Jupyter notebook

docker build -t stat-util .
docker run --rm -p 8889:8889 -v `pwd`:/workspace stat-util

Use cases

Code for all use cases is provided in the examples.ipynb notebook.

Evaluate a model with a 95% confidence interval

from sklearn.metrics import roc_auc_score

import stat_util


score, ci_lower, ci_upper, scores = stat_util.score_ci(
    y_true, y_pred, score_fun=roc_auc_score
)
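
As a quick sanity check, score_ci can be exercised on synthetic data; the labels and scores below are made up purely for illustration.

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util

# Synthetic example: 100 binary labels and random prediction scores
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
y_pred = rng.random(100)

# score with its 95% CI bounds, plus the individual bootstrap scores
score, ci_lower, ci_upper, scores = stat_util.score_ci(
    y_true, y_pred, score_fun=roc_auc_score
)
print(f"AUC = {score:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")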

Compute p-value for comparison of two models

from sklearn.metrics import roc_auc_score

import stat_util


p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)
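
For example, assuming y_pred1 and y_pred2 are two models' scores for the same y_true, the returned p can be compared against a significance level. A minimal, purely illustrative sketch with synthetic data:

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
y_pred1 = 0.6 * y_true + 0.4 * rng.random(100)  # scores correlated with the labels
y_pred2 = rng.random(100)                       # uninformative scores

p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)
print(f"p = {p:.4f}", "(significant at the 5% level)" if p < 0.05 else "(not significant)")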

Compute mean performance with a 95% confidence interval for a set of readers

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util


mean_score, ci_lower, ci_upper, scores = stat_util.score_stat_ci(
    y_true, y_pred_readers, score_fun=roc_auc_score, stat_fun=np.mean
)
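
Here y_pred_readers is assumed to hold one prediction array per reader (this shape follows the example notebook; treat it as an assumption). A minimal sketch with three hypothetical readers:

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=50)
# One array of scores per reader for the same 50 cases
y_pred_readers = [rng.random(50) for _ in range(3)]

mean_score, ci_lower, ci_upper, scores = stat_util.score_stat_ci(
    y_true, y_pred_readers, score_fun=roc_auc_score, stat_fun=np.mean
)
print(f"Mean reader AUC = {mean_score:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")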

Compute p-value for comparison of one model and a set of readers

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util


p, z = stat_util.pvalue_stat(
    y_true, y_pred, y_pred_readers, score_fun=roc_auc_score, stat_fun=np.mean
)

ml-stat-util's People

Contributors

joeranbosma, mateuszbuda


ml-stat-util's Issues

Is there a pip install method?

Dear author,

Thank you for sharing your code with us! May I ask whether there is a pip install method? I haven't used Docker before.

Looking forward to your reply. Thx.

Two-tailed p-values from pvalue are often larger than 1

Two-tailed p-values from pvalue are often larger than 1. This is easily tested by calling the function with y_pred1 = y_pred2, or by setting the scale of y_pred1 to a larger value than that of y_pred2 in the example notebook.

The issue arises here:

    p = percentileofscore(z, 0.0, kind="weak") / 100.0
    if two_tailed:
        p *= 2.0

I think the fix is the following replacement:

    p = percentileofscore(z, 0.0, kind="weak") / 100.0
    if two_tailed:
        p = 2*min(p, 1-p)

This will also restore the property that two-tailed p-values are the same regardless of the order of y_pred1 and y_pred2.
However, this will still give undesired p-values of 0.0 when y_pred1 = y_pred2, because the percentileofscore argument is kind="weak". To remedy this, one could do

    p = percentileofscore(z, 0.0, kind="mean") / 100.0
    if two_tailed:
        p = 2*min(p, 1-p)

This will result in p-values of 1.0 again in that case, but it slightly changes the p-values compared to the old calculation, even when p < 1 - p.
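
For reference, a minimal standalone sketch of the proposed correction (the helper name two_tailed_p_from_z is made up for illustration and is not part of stat_util):

import numpy as np
from scipy.stats import percentileofscore

def two_tailed_p_from_z(z, two_tailed=True):
    # Proposed fix: kind="mean" handles ties at zero, and 2 * min(p, 1 - p)
    # keeps the two-tailed p-value within [0, 1].
    p = percentileofscore(z, 0.0, kind="mean") / 100.0
    if two_tailed:
        p = 2 * min(p, 1 - p)
    return p

# With identical predictions all bootstrapped differences are zero:
# the mean percentile is 50%, so the two-tailed p-value becomes 1.0.
print(two_tailed_p_from_z(np.zeros(1000)))  # 1.0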

P-value larger than 1

Thank you so much for the nicely written code package.
I am using stat_util.pvalue to compare the AUC of two models. However, I obtain a p-value greater than 1. Is this possible? How should I interpret this? I know pred1 has AUC 0.71 and pred2 has 0.58.


method for calculating p-value

Thank you so much for your ml-stat-util.
I would like to know which method you use to compute the p-value. Is it the same as the DeLong test?

Comparing two ML models using a p-value

Thanks a lot @mateuszbuda for the tool. I am trying to compare two models (for example, random forest and XGBoost) using a p-value.
I have predictions from both trained models on the test set:

from sklearn.metrics import r2_score

import stat_util

y_pred1 = model1.predict(x_test)
y_pred2 = model2.predict(x_test)
y_true = y_test

p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=r2_score)

Output: 0.0

I am always getting p = 0.0. Could you please correct me if I have not done this correctly? Thank you.

Possibly Incorrect Calculation of p-Value

Hi Mateusz. Thanks for the super helpful repo! But if I'm not mistaken, I think there's a theoretical error in the way you're calculating your p-value.

Right now, you're computing the p-value for a difference in the performance of two models, measured with AUROC, using the following code:

from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import numpy as np

import stat_util

p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)
bins = plt.hist(z)
plt.plot([0, 0], [0, np.max(bins[0])], color="black")
print('p =', p)
plt.xlabel('Difference in AUROC')
plt.ylabel('Frequency')
plt.show()

[Histogram of the bootstrapped AUROC differences z, with a vertical line at zero]

Here, your null hypothesis is that the two models perform equally well, i.e. the difference between their AUROCs is 0, and your alternative hypothesis is that model 1 performs better than model 2 (based on the internal code of the function). z contains the differences in AUROC for randomly resampled subsets of predictions from both models, and p is the probability that the bootstrapped difference in AUROC is 0 or less. In other words, your p-value is treated as the probability of the null hypothesis (difference in AUROC = 0) being true. However, theoretically, the p-value is the probability of observing a sample statistic at least as extreme as the one actually observed, given that the null hypothesis is true. These two statements aren't the same.

Hence, the revision I propose using your same function, is this:

from sklearn.metrics import roc_auc_score
from scipy.stats import percentileofscore
import matplotlib.pyplot as plt
import numpy as np

import stat_util

_, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)

# Observed sample statistic (difference in AUROC of model 1 and 2)
sample_diff = roc_auc_score(y_true, y_pred1) - roc_auc_score(y_true, y_pred2)

# Simulate the distribution of the null hypothesis using statistics from bootstrapping
null_vals = np.random.normal(loc=0.0, scale=np.std(np.abs(z)), size=2000)
null_dist = plt.hist(null_vals, color='red', alpha=1.0)

# Display the observed sample statistic in the same plot
plt.axvline(sample_diff, color="black")

# Compute the p-value
print('p =', 1 - percentileofscore(null_vals, sample_diff, kind="weak") / 100.0)
plt.xlabel('Difference in AUROC')
plt.ylabel('Frequency')
plt.show()

[Histogram of the simulated null distribution, with the observed AUROC difference marked by a vertical black line]

Here, the p-value is the probability of observing a difference in AUROC of the two models equal to the observed sample statistic [roc_auc_score(y_true,y_pred1) - roc_auc_score(y_true, y_pred2)] or greater, given that the null hypothesis is true.

If we only check whether the p-value is below a certain significance level (e.g. 0.05), then your method and mine agree for all 10 cases of different model predictions that I've tried. However, the actual p-values themselves always differ.

What do you think?

References:
[1] https://www.khanacademy.org/math/ap-statistics/tests-significance-ap/idea-significance-tests/v/p-values-and-significance-tests
[2] https://towardsdatascience.com/bootstrapping-for-inferential-statistics-9b613a7653b2
