
mateuszbuda / ml-stat-util

Statistical functions based on bootstrapping for computing confidence intervals and p-values for comparing machine learning models and human readers.

Home Page: https://mateuszbuda.github.io/2019/04/30/stat.html

License: MIT License

Languages: Dockerfile 0.63%, Python 22.22%, Jupyter Notebook 77.14%
Topics: statistics, python, machine-learning, jupyter-notebook, p-value, confidence-intervals, bootstrapping

ml-stat-util's Introduction

Machine Learning Statistical Utils

Docker setup for the example Jupyter notebook

docker build -t stat-util .
docker run --rm -p 8889:8889 -v `pwd`:/workspace stat-util

Use cases

Code for all use cases is provided in the examples.ipynb notebook.

Evaluate a model with a 95% confidence interval

from sklearn.metrics import roc_auc_score

import stat_util


score, ci_lower, ci_upper, scores = stat_util.score_ci(
    y_true, y_pred, score_fun=roc_auc_score
)
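
As a quick sanity check, score_ci can be exercised on synthetic data; the labels and scores below are made up purely for illustration.

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util

# Synthetic example: 100 binary labels and random prediction scores
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
y_pred = rng.random(100)

# score with its 95% CI bounds, plus the individual bootstrap scores
score, ci_lower, ci_upper, scores = stat_util.score_ci(
    y_true, y_pred, score_fun=roc_auc_score
)
print(f"AUC = {score:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")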

Compute p-value for comparison of two models

from sklearn.metrics import roc_auc_score

import stat_util


p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)
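
For example, assuming y_pred1 and y_pred2 are two models' scores for the same y_true, the returned p can be compared against a significance level. A minimal, purely illustrative sketch with synthetic data:

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
y_pred1 = 0.6 * y_true + 0.4 * rng.random(100)  # scores correlated with the labels
y_pred2 = rng.random(100)                       # uninformative scores

p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)
print(f"p = {p:.4f}", "(significant at the 5% level)" if p < 0.05 else "(not significant)")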

Compute mean performance with a 95% confidence interval for a set of readers

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util


mean_score, ci_lower, ci_upper, scores = stat_util.score_stat_ci(
    y_true, y_pred_readers, score_fun=roc_auc_score, stat_fun=np.mean
)
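
Here y_pred_readers is assumed to hold one prediction array per reader (this shape follows the example notebook; treat it as an assumption). A minimal sketch with three hypothetical readers:

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=50)
# One array of scores per reader for the same 50 cases
y_pred_readers = [rng.random(50) for _ in range(3)]

mean_score, ci_lower, ci_upper, scores = stat_util.score_stat_ci(
    y_true, y_pred_readers, score_fun=roc_auc_score, stat_fun=np.mean
)
print(f"Mean reader AUC = {mean_score:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")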

Compute p-value for comparison of one model and a set of readers

import numpy as np
from sklearn.metrics import roc_auc_score

import stat_util


p, z = stat_util.pvalue_stat(
    y_true, y_pred, y_pred_readers, score_fun=roc_auc_score, stat_fun=np.mean
)

ml-stat-util's People

Contributors

joeranbosma, mateuszbuda


ml-stat-util's Issues

Is there a pip install method?

Dear author,

Thank you for sharing your code with us! May I ask whether there is a pip install method? I haven't used Docker before.

Looking forward to your reply. Thx.

Two-tailed p-values from pvalue are often larger than 1

Two-tailed p-values from pvalue are often larger than 1. This is easily tested by calling the function with y_pred1 = y_pred2, or by setting the scale of y_pred1 to a larger value than that of y_pred2 in the example notebook.

The issue arises here:

    p = percentileofscore(z, 0.0, kind="weak") / 100.0
    if two_tailed:
        p *= 2.0

I think the fix is the following replacement:

    p = percentileofscore(z, 0.0, kind="weak") / 100.0
    if two_tailed:
        p = 2*min(p, 1-p)

This will also restore the property that two-tailed p-values are the same regardless of the order of y_pred1 and y_pred2.
However, this will still give undesired p-values of 0.0 when y_pred1 = y_pred2, because the percentileofscore argument is kind="weak". To remedy this, one could do

    p = percentileofscore(z, 0.0, kind="mean") / 100.0
    if two_tailed:
        p = 2*min(p, 1-p)

This will result in p-values of 1.0 again in that case, but it slightly changes the p-values compared to the old calculation, even when p < 1 - p.
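
For reference, a minimal standalone sketch of the proposed correction (the helper name two_tailed_p_from_z is made up for illustration and is not part of stat_util):

import numpy as np
from scipy.stats import percentileofscore

def two_tailed_p_from_z(z, two_tailed=True):
    # Proposed fix: kind="mean" handles ties at zero, and 2 * min(p, 1 - p)
    # keeps the two-tailed p-value within [0, 1].
    p = percentileofscore(z, 0.0, kind="mean") / 100.0
    if two_tailed:
        p = 2 * min(p, 1 - p)
    return p

# With identical predictions all bootstrapped differences are zero:
# the mean percentile is 50%, so the two-tailed p-value becomes 1.0.
print(two_tailed_p_from_z(np.zeros(1000)))  # 1.0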

P-value larger than 1

Thank you so much for the nicely written code package.
I am using stat_util.pvalue to compare the AUC of two models. However, I obtain a p-value greater than 1. Is this possible? How should I interpret this? I know pred1 has AUC 0.71 and pred2 has 0.58.


method for calculating p-value

Thank you so much for your ml-stat-util.
I would like to know which method you use to compute the p-value. Is it the same as the DeLong test?

Comparing two ML models using a p-value

Thanks a lot @mateuszbuda for the tool. I am trying to compare two models (for example, random forest and XGBoost) using a p-value.
I have predictions from both trained models on the test set:

from sklearn.metrics import r2_score

import stat_util

y_pred1 = model1.predict(x_test)
y_pred2 = model2.predict(x_test)
y_true = y_test

p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=r2_score)

Output: 0.0

I am always getting p = 0.0. Could you please correct me if I have not done this correctly? Thank you.

Possibly Incorrect Calculation of p-Value

Hi Mateusz. Thanks for the super helpful repo! But if I'm not mistaken, I think there's a theoretical error in the way you're calculating your p-value.

Right now, you're computing the p-value for a difference in the performance of two models, measured with AUROC, using the following code:

from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import numpy as np

import stat_util

p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)
bins = plt.hist(z)
plt.plot([0, 0], [0, np.max(bins[0])], color="black")
print('p =', p)
plt.xlabel('Difference in AUROC')
plt.ylabel('Frequency')
plt.show()

[Histogram of the bootstrapped AUROC differences z, with a vertical line at zero]

Here, your null hypothesis is that the two models perform equally well, i.e. the difference between their AUROCs is 0, and your alternative hypothesis is that model 1 performs better than model 2 (based on the internal code of the function). z contains the differences in AUROC for randomly resampled subsets of predictions from both models, and p is the probability that the bootstrapped difference in AUROC is 0 or less. In other words, your p-value is treated as the probability of the null hypothesis (difference in AUROC = 0) being true. However, theoretically, the p-value is the probability of observing a sample statistic at least as extreme as the one actually observed, given that the null hypothesis is true. These two statements aren't the same.

Hence, the revision I propose using your same function, is this:

from sklearn.metrics import roc_auc_score
from scipy.stats import percentileofscore
import matplotlib.pyplot as plt
import numpy as np

import stat_util

_, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)

# Observed sample statistic (difference in AUROC of model 1 and 2)
sample_diff = roc_auc_score(y_true, y_pred1) - roc_auc_score(y_true, y_pred2)

# Simulate the distribution of the null hypothesis using statistics from bootstrapping
null_vals = np.random.normal(loc=0.0, scale=np.std(np.abs(z)), size=2000)
null_dist = plt.hist(null_vals, color='red', alpha=1.0)

# Display the observed sample statistic in the same plot
plt.axvline(sample_diff, color="black")

# Compute the p-value
print('p =', 1 - percentileofscore(null_vals, sample_diff, kind="weak") / 100.0)
plt.xlabel('Difference in AUROC')
plt.ylabel('Frequency')
plt.show()

[Histogram of the simulated null distribution, with the observed AUROC difference marked by a vertical black line]

Here, the p-value is the probability of observing a difference in AUROC of the two models equal to the observed sample statistic [roc_auc_score(y_true,y_pred1) - roc_auc_score(y_true, y_pred2)] or greater, given that the null hypothesis is true.

If we only check whether the p-value is below a certain significance level (e.g. 0.05), then your method and mine agree for all 10 cases of different model predictions that I've tried. However, the actual p-values themselves always differ.

What do you think?

References:
[1] https://www.khanacademy.org/math/ap-statistics/tests-significance-ap/idea-significance-tests/v/p-values-and-significance-tests
[2] https://towardsdatascience.com/bootstrapping-for-inferential-statistics-9b613a7653b2
