bcg-x-official / facet

Human-explainable AI.

Home Page: https://bcg-x-official.github.io/facet

License: Apache License 2.0

Jupyter Notebook 87.88% Python 12.11% Shell 0.01% CSS 0.01%
data-science machine-learning python data-analytics model-selection hyperparameter-tuning interpretability explainable-ai shap-vector-decomposition statistics

facet's Introduction

FACET is an open source library for human-explainable AI. It combines sophisticated model inspection and model-based simulation to enable better explanations of your supervised machine learning models.

FACET is composed of the following key components:

Model Inspection

FACET introduces a new algorithm to quantify dependencies and interactions between features in ML models. This new tool for human-explainable AI adds a new, global perspective to the observation-level explanations provided by the popular SHAP approach. To learn more about FACET’s model inspection capabilities, see the getting started example below.

           

Model Simulation

FACET’s model simulation algorithms use ML models for virtual experiments to help identify scenarios that optimise predicted outcomes. To quantify the uncertainty in simulations, FACET utilises a range of bootstrapping algorithms including stationary and stratified bootstraps. For an example of FACET’s bootstrap simulations, see the quickstart example below.

           

Enhanced Machine Learning Workflow

FACET offers an efficient and transparent machine learning workflow, enhancing scikit-learn's tried and tested pipelining paradigm with new capabilities for model selection, inspection, and simulation. FACET also introduces sklearndf [documentation], an augmented version of scikit-learn with enhanced support for pandas data frames that ensures end-to-end traceability of features.

Installation

FACET supports both PyPI and Anaconda. We recommend installing FACET into a dedicated environment.

Anaconda

conda create -n facet
conda activate facet
conda install -c bcg_gamma -c conda-forge gamma-facet

Pip

macOS and Linux:

python -m venv facet
source facet/bin/activate
pip install gamma-facet

Windows:

python -m venv facet
facet\Scripts\activate.bat
pip install gamma-facet

Quickstart

The following quickstart guide provides a minimal example workflow to get you up and running with FACET. For additional tutorials and the API reference, see the FACET documentation.

Changes and additions to new versions are summarized in the release notes.

Enhanced Machine Learning Workflow

To demonstrate the model inspection capability of FACET, we first create a pipeline to fit a learner. In this simple example we will use the diabetes dataset which contains age, sex, BMI and blood pressure along with 6 blood serum measurements as features. This dataset was used in this publication. A transformed version of this dataset is also available on scikit-learn here.

In this quickstart we will train a Random Forest regressor using 5-fold CV repeated 10 times to predict disease progression after one year. With the use of sklearndf we can create a pandas DataFrame compatible workflow. In addition, FACET provides enhancements to keep track of our feature matrix and target vector using a sample object (Sample), and to easily compare hyperparameter configurations and even multiple learners with the LearnerSelector.

# standard imports
import pandas as pd
from sklearn.model_selection import RepeatedKFold, GridSearchCV

# some helpful imports from sklearndf
from sklearndf.pipeline import RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF

# relevant FACET imports
from facet.data import Sample
from facet.selection import LearnerSelector, ParameterSpace

# declaring url with data
data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'

# import data from URL
diabetes_df = pd.read_csv(data_url, delimiter='\t').rename(
    # renaming columns for better readability
    columns={
        'S1': 'TC', # total serum cholesterol
        'S2': 'LDL', # low-density lipoproteins
        'S3': 'HDL', # high-density lipoproteins
        'S4': 'TCH', # total cholesterol/ HDL
        'S5': 'LTG', # possibly log of serum triglycerides level
        'S6': 'GLU', # blood sugar level
        'Y': 'Disease_progression' # measure of disease progression one year after baseline
    }
)

# create FACET sample object
diabetes_sample = Sample(observations=diabetes_df, target_name="Disease_progression")

# create a (trivial) pipeline for a random forest regressor
rnd_forest_reg = RegressorPipelineDF(
    regressor=RandomForestRegressorDF(n_estimators=200, random_state=42)
)

# define parameter space for models which are "competing" against each other
rnd_forest_ps = ParameterSpace(rnd_forest_reg)
rnd_forest_ps.regressor.min_samples_leaf = [8, 11, 15]
rnd_forest_ps.regressor.max_depth = [4, 5, 6]

# create repeated k-fold CV iterator
rkf_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

# rank your candidate models by performance
selector = LearnerSelector(
    searcher_type=GridSearchCV,
    parameter_space=rnd_forest_ps,
    cv=rkf_cv,
    n_jobs=-3,
    scoring="r2"
).fit(diabetes_sample)

# get summary report
selector.summary_report()

image

Based on this minimal workflow, we can see that a value of 11 for minimum samples in the leaf and 5 for maximum tree depth was the best-performing combination of the values considered. This approach easily extends to additional hyperparameters for the learner, and to multiple learners.
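To make the size of this search concrete, the sketch below enumerates the candidate configurations and counts the cross-validation fits using only the Python standard library. It illustrates how an exhaustive grid combines with repeated k-fold CV; it is not how FACET or scikit-learn implement the search, and the variable names are our own.

```python
from itertools import product

# hyperparameter values taken from the quickstart's parameter space
min_samples_leaf = [8, 11, 15]
max_depth = [4, 5, 6]

# an exhaustive grid search evaluates every combination of values
candidates = [
    {"min_samples_leaf": leaf, "max_depth": depth}
    for leaf, depth in product(min_samples_leaf, max_depth)
]

# repeated k-fold CV: 5 folds, repeated 10 times
n_splits, n_repeats = 5, 10
fits_per_candidate = n_splits * n_repeats
total_cv_fits = len(candidates) * fits_per_candidate

print(len(candidates), fits_per_candidate, total_cv_fits)
```

The count grows multiplicatively with every additional hyperparameter, which is worth keeping in mind before extending the grid.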

Model Inspection

FACET implements several model inspection methods for scikit-learn estimators. FACET enhances model inspection by providing global metrics that complement the local perspective of SHAP (see [arXiv:2107.12436] for a formal description).

The key global metrics for each pair of features in a model are:

  • Synergy

    The degree to which the model combines information from one feature with another to predict the target. For example, let's assume we are predicting cardiovascular health using age and gender and the fitted model includes a complex interaction between them. This means these two features are synergistic for predicting cardiovascular health. Further, both features are important to the model and removing either one would significantly impact performance. Let's assume age brings more information to the joint contribution than gender. This asymmetric contribution means the synergy for (age, gender) is less than the synergy for (gender, age). To think about it another way, imagine the prediction is a coordinate you are trying to reach. From your starting point, age gets you much closer to this point than gender; however, you need both to get there. Synergy reflects the fact that gender gets more help from age (higher synergy from the perspective of gender) than age does from gender (lower synergy from the perspective of age) to reach the prediction. This leads to an important point: synergy is a naturally asymmetric property of the global information two interacting features contribute to the model predictions. Synergy is expressed as a percentage ranging from 0% (full autonomy) to 100% (full synergy).

  • Redundancy

    The degree to which a feature in a model duplicates the information of a second feature to predict the target. For example, let's assume we had house size and number of bedrooms for predicting house price. These features capture similar information: the more bedrooms, the larger the house, and likely the higher the price on average. The redundancy for (number of bedrooms, house size) will be greater than the redundancy for (house size, number of bedrooms). This is because house size "knows" more of what number of bedrooms does for predicting house price than vice-versa. Hence, there is greater redundancy from the perspective of number of bedrooms. Another way to think about it is that removing house size will be more detrimental to model performance than removing number of bedrooms, as house size can better compensate for the absence of number of bedrooms. This also implies that house size would be a more important feature than number of bedrooms in the model. The important point here is that, like synergy, redundancy is a naturally asymmetric property of the global information feature pairs have for predicting an outcome. Redundancy is expressed as a percentage ranging from 0% (full uniqueness) to 100% (full redundancy).

# fit the model inspector
from facet.inspection import LearnerInspector
inspector = LearnerInspector(
    pipeline=selector.best_estimator_,
    n_jobs=-3
).fit(diabetes_sample)

Synergy

# visualise synergy as a matrix
from pytools.viz.matrix import MatrixDrawer
synergy_matrix = inspector.feature_synergy_matrix()
MatrixDrawer(style="matplot%").draw(synergy_matrix, title="Synergy Matrix")

image

For any feature pair (A, B), the first feature (A) is the row, and the second feature (B) the column. For example, looking across the row for LTG (possibly the log of serum triglycerides level) there is hardly any synergy with other features in the model (≤ 1%). However, looking down the column for LTG (i.e., from the perspective of other features relative to LTG) we find that many features (the rows) are aided by synergy with LTG (up to 27% in the case of LDL). We conclude that:

  • LTG is a strongly autonomous feature, displaying minimal synergy with other features for predicting disease progression after one year.
  • The contribution of other features to predicting disease progression after one year is partly enabled by the presence of LTG.

High synergy between pairs of features must be considered carefully when investigating impact, as the values of both features jointly determine the outcome. It would not make much sense to consider LDL without the context provided by LTG given close to 27% synergy of LDL with LTG for predicting progression after one year.
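The row-versus-column reading of an asymmetric matrix can be sketched with a small nested dictionary. This is only an illustration of the lookup convention: the helper names are hypothetical, and apart from the 27% for (LDL, LTG) and the values of at most 1% in the LTG row, which echo the text above, all numbers are made up.

```python
# Hypothetical synergy values in percent, keyed as matrix[row][column].
# Only (LDL, LTG) = 27 and the LTG row (<= 1) echo the text; the rest is invented.
synergy = {
    "LTG": {"LDL": 1, "BMI": 1},
    "LDL": {"LTG": 27, "BMI": 3},
    "BMI": {"LTG": 10, "LDL": 2},
}


def from_perspective_of(matrix, feature):
    """Row lookup: how much `feature` relies on each other feature."""
    return dict(matrix[feature])


def relied_on_by(matrix, feature):
    """Column lookup: how much each other feature relies on `feature`."""
    return {row: cols[feature] for row, cols in matrix.items() if feature in cols}


print(from_perspective_of(synergy, "LTG"))  # the row for LTG
print(relied_on_by(synergy, "LTG"))         # the column for LTG
```

In this toy matrix the LTG row is close to zero while the LTG column is not, which is the pattern described above for an autonomous feature that other features depend on.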

Redundancy

# visualise redundancy as a matrix
redundancy_matrix = inspector.feature_redundancy_matrix()
MatrixDrawer(style="matplot%").draw(redundancy_matrix, title="Redundancy Matrix")

image

For any feature pair (A, B), the first feature (A) is the row, and the second feature (B) the column. For example, if we look at the feature pair (LDL, TC) from the perspective of LDL (Low-Density Lipoproteins), then we look up the row for LDL and the column for TC and find 38% redundancy. This means that 38% of the information in LDL to predict disease progression is duplicated in TC. This redundancy is the same when looking "from the perspective" of TC for (TC, LDL), but need not be symmetrical in all cases (see LTG vs. TCH).

If we look at TCH, it has between 22% and 32% redundancy each with LTG and HDL, but the same does not hold between LTG and HDL, meaning TCH shares different information with each of the two features.

Clustering redundancy

As detailed above, redundancy and synergy for a feature pair are determined from the "perspective" of one of the features in the pair, and so yield two distinct values. However, a symmetric version can also be computed that not only provides a simplified perspective but also allows the use of (1 - metric) as a feature distance. With this distance, hierarchical single-linkage clustering is applied to create a dendrogram visualization. This helps to identify groups of low-distance features which activate "in tandem" to predict the outcome. Such information can then be used either to reduce clusters of highly redundant features to a subset or to highlight clusters of highly synergistic features that should always be considered together.
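As a rough illustration of the clustering step, the sketch below turns a symmetric redundancy matrix into distances via (1 - redundancy) and applies a naive single-linkage merge loop in plain Python. It assumes the symmetric values are already given (how FACET derives them is not shown here), the numbers are invented, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical symmetric redundancy values in [0, 1]; not FACET output.
redundancy = {
    frozenset({"LDL", "TC"}): 0.40,
    frozenset({"HDL", "TCH"}): 0.30,
    frozenset({"LDL", "HDL"}): 0.05,
    frozenset({"LDL", "TCH"}): 0.10,
    frozenset({"TC", "HDL"}): 0.05,
    frozenset({"TC", "TCH"}): 0.10,
}


def distance(a, b):
    """Feature distance defined as 1 - redundancy."""
    return 1.0 - redundancy[frozenset({a, b})]


def single_linkage(features):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset({f}) for f in features]
    merges = []
    while len(clusters) > 1:
        # single linkage: cluster distance is the smallest pairwise feature distance
        left, right, dist = min(
            (
                (c1, c2, min(distance(a, b) for a in c1 for b in c2))
                for c1, c2 in combinations(clusters, 2)
            ),
            key=lambda merge: merge[2],
        )
        clusters = [c for c in clusters if c not in (left, right)] + [left | right]
        merges.append((sorted(left), sorted(right), round(dist, 2)))
    return merges


for merge in single_linkage(["LDL", "TC", "HDL", "TCH"]):
    print(merge)
```

The merge history is what a dendrogram draws: the most redundant pairs join first at a low distance, and unrelated groups join last.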

Let's look at the example for redundancy.

# visualise redundancy using a dendrogram
from pytools.viz.dendrogram import DendrogramDrawer
redundancy = inspector.feature_redundancy_linkage()
DendrogramDrawer().draw(data=redundancy, title="Redundancy Dendrogram")

image

Based on the dendrogram we can see that the feature pairs (LDL, TC) and (HDL, TCH) each represent a cluster in the dendrogram and that LTG and BMI have the highest importance. As potential next actions we could explore the impact of removing TCH, and one of TC or LDL to further simplify the model and obtain a reduced set of independent features.

Please see the API reference for more detail.

Model Simulation

Taking the BMI feature as an example of an important and highly independent feature, we do the following for the simulation:

  • We use FACET's ContinuousRangePartitioner to split the range of observed values of BMI into intervals of equal size. Each partition is represented by the central value of that partition.
  • For each partition, the simulator creates an artificial copy of the original sample assuming the variable to be simulated has the same value across all observations – which is the value representing the partition. Using the best estimator acquired from the selector, the simulator now re-predicts all targets using the models trained for full sample and determines the uplift of the target variable resulting from this.
  • The FACET SimulationDrawer allows us to visualise the result; both in a matplotlib and a plain-text style.

# FACET imports
from facet.validation import BootstrapCV
from facet.simulation import UnivariateUpliftSimulator
from facet.data.partition import ContinuousRangePartitioner
from facet.simulation.viz import SimulationDrawer

# create bootstrap CV iterator
bscv = BootstrapCV(n_splits=1000, random_state=42)

SIM_FEAT = "BMI"
simulator = UnivariateUpliftSimulator(
    model=selector.best_estimator_,
    sample=diabetes_sample,
    n_jobs=-3
)

# split the simulation range into equal sized partitions
partitioner = ContinuousRangePartitioner()

# run the simulation
simulation = simulator.simulate_feature(feature_name=SIM_FEAT, partitioner=partitioner)

# visualise results
SimulationDrawer().draw(data=simulation, title=SIM_FEAT)

image

We would conclude from the figure that higher values of BMI are associated with an increase in disease progression after one year, and that for a BMI of 28 and above, there is a significant increase in disease progression after one year of at least 26 points.
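The partition-and-re-predict idea behind this simulation can be sketched in a few lines of plain Python. This is a simplified illustration under several assumptions: the model is any callable on a row, the partitions are plain equal-width intervals, uplift is measured against the mean prediction on the unmodified sample, and bootstrapping is left out entirely. FACET's own partitioner, baseline, and uncertainty estimates may differ, and all names and data below are hypothetical.

```python
from statistics import mean


def equal_width_partitions(values, n_partitions):
    """Split the observed range into equal-width intervals; return their central values."""
    low, high = min(values), max(values)
    width = (high - low) / n_partitions
    return [low + width * (i + 0.5) for i in range(n_partitions)]


def simulate_uplift(model, observations, feature, n_partitions):
    """For each partition, fix `feature` at the central value and re-predict all rows."""
    baseline = mean(model(row) for row in observations)
    centres = equal_width_partitions([row[feature] for row in observations], n_partitions)
    uplift = {}
    for centre in centres:
        # artificial copy of the sample with the simulated feature held constant
        simulated = [{**row, feature: centre} for row in observations]
        uplift[centre] = mean(model(row) for row in simulated) - baseline
    return uplift


# hypothetical stand-in for a fitted model: any callable mapping a row to a prediction
def toy_model(row):
    return 10.0 * row["BMI"] + 2.0 * row["AGE"]


# made-up observations, not the diabetes data
sample = [
    {"BMI": 20.0, "AGE": 30.0},
    {"BMI": 24.0, "AGE": 50.0},
    {"BMI": 28.0, "AGE": 40.0},
]

print(simulate_uplift(toy_model, sample, feature="BMI", n_partitions=2))
```

Because all other features keep their observed values, the result reflects the model's response to the simulated feature alone, averaged over the sample.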

Contributing

FACET is stable and is being supported long-term.

Contributions to FACET are welcome and appreciated. For any bug reports or feature requests/enhancements please use the appropriate GitHub form, and if you wish to do so, please open a PR addressing the issue.

For any major changes, we ask that you discuss them with us first via an issue or using our team email: [email protected].

For further information on contributing please see our contribution guide.

License

FACET is licensed under Apache 2.0 as described in the LICENSE file.

Acknowledgements

FACET is built on top of two popular packages for Machine Learning:

  • The scikit-learn learners and pipelining make up the implementation of the underlying algorithms. Moreover, we tried to design the FACET API to align with the scikit-learn API.
  • The SHAP implementation is used to estimate the Shapley vectors, which FACET then decomposes into synergy, redundancy, and independence vectors.

BCG GAMMA

If you would like to know more about the team behind FACET please see the about us page.

We are always on the lookout for passionate and talented data scientists to join the BCG GAMMA team. If you would like to know more you can find out about BCG GAMMA, or have a look at career opportunities.

facet's People

Contributors

bhadurishirsesh, j-ittner, jason-bentley, joerg-schneider, konst-int-i, mgelsm, mtsokol, sithankanna

facet's Issues

warnings about probabilities from weighted classifiers and calibration for simulation

Is your feature request related to a problem? Please describe.
When sample weights are applied to a classifier, the up-weighting of one class is reflected in predicted probabilities that are higher than the rates observed in the unweighted data.

Describe the solution you'd like
There are two aspects to an ideal solution:

  1. Whenever weights are used with a classifier and a simulation is performed, a warning should be displayed noting this and that uncalibrated probabilities may not align with observed rates for the positive class.
  2. Post-training calibration could be added as a toggled option for the ClassifierPipelineDF class, where the default is set to true but can be turned off if desired. This could help ensure that, even with weights applied during learning, the probabilities shown in the simulation outputs align reasonably well with those observed in the data.
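The size of the effect can be illustrated with simple arithmetic. The sketch below computes the share of total sample weight carried by the positive class, which is roughly the average probability a well-fitted probabilistic classifier would reproduce when trained with those weights. This is a back-of-the-envelope illustration with a hypothetical function name and made-up rates, not FACET or scikit-learn code.

```python
def weighted_positive_rate(observed_rate, positive_weight, negative_weight=1.0):
    """Share of the total sample weight carried by the positive class."""
    positive_mass = observed_rate * positive_weight
    negative_mass = (1.0 - observed_rate) * negative_weight
    return positive_mass / (positive_mass + negative_mass)


observed = 0.10  # hypothetical: 10% of rows are positive
for weight in (1.0, 3.0, 9.0):
    print(weight, round(weighted_positive_rate(observed, weight), 3))
```

With a weight of 9 on the positive class, a 10% observed rate looks like 50% to the learner, which is the gap a calibration step would need to close.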

Describe alternatives you've considered
None.

Additional context
None.

Versioning & Compatibility XGBoost

Hi,
Thanks for making this tool openly available. Very cool.

While I could get it to run, I do face some issues due to versioning and compatibility that I wanted to report:
(1) When following the conda Installation instructions, one will install 2.0rc2. This breaks the Quickstart Tutorial:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 11
      9 # relevant FACET imports
     10 from facet.data import Sample
---> 11 from facet.selection import LearnerRanker, LearnerGrid
     13 # declaring url with data
     14 data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'

ImportError: cannot import name 'LearnerRanker' from 'facet.selection' (C:\Users\admin\.conda\envs\jakob-facet\lib\site-packages\facet\selection\__init__.py)

(2) Ideally, I want to run facet with XGBoost. While the core functionality is in sklearndf 2.1.0, facet pins this to version ~= 1.2.
Unfortunately, the newer sklearndf is not compatible, see the following error running 2.1.0 in facet 1.2.3

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 54
     51 rkf_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
     53 # rank your candidate models by performance (default is mean CV score - 2*SD)
---> 54 ranker = LearnerRanker(
     55     grids=rnd_forest_grid, cv=rkf_cv, n_jobs=-3
     56 ).fit(sample=diabetes_sample)
     58 # get summary report
     59 ranker.summary_report()

File ~/miniconda3/envs/facet/lib/python3.8/site-packages/facet/selection/_selection.py:400, in LearnerRanker.fit(self, sample, **fit_params)
    387 """
    388 Rank the candidate learners and their hyper-parameter combinations using
    389 crossfits from the given sample.
   (...)
    396 :return: ``self``
    397 """
    398 self: LearnerRanker[T_LearnerPipelineDF]  # support type hinting in PyCharm
--> 400 ranking: List[LearnerEvaluation[T_LearnerPipelineDF]] = self._rank_learners(
    401     sample=sample, **fit_params
    402 )
    403 ranking.sort(key=lambda le: le.ranking_score, reverse=True)
    405 self._ranking = ranking
...
--> 873     obj = super().__new__(cls)
    874 else:
    875     obj = super().__new__(cls, *args, **kwds)

TypeError: Can't instantiate abstract class _FitScoreQueue with abstract methods aggregate

Is there a workaround to get it to run with XGBoost?

(3) On another note: I did experience a quadratic increase in memory consumption with the number of features. For a workstation with 128 GB RAM this effectively limits me to ~100 features (@ ~ 5000 rows). Is this expected?
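One plausible source of quadratic growth is the shape of SHAP interaction values, which form a dense array of size rows x features x features. The sketch below estimates the size of a single such array in float64. It is only an order-of-magnitude illustration: the function name is hypothetical, and the actual peak memory also depends on how many copies are held at once (for example across parallel workers or intermediate results), which is not modelled here.

```python
def interaction_array_bytes(n_rows, n_features, bytes_per_value=8):
    """Size of one dense (rows x features x features) float64 array."""
    return n_rows * n_features * n_features * bytes_per_value


for n_features in (25, 50, 100, 200):
    gib = interaction_array_bytes(5_000, n_features) / 2**30
    print(f"{n_features:>4} features: {gib:.2f} GiB per array")
```

Doubling the number of features quadruples the size of each array, which is consistent with the quadratic growth reported above.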

SHAP Feature Values Inverted

It appears the SHAP summary plot has the 'Feature Value' Inverted. Your classification example has 'Waist_to_Height' to be positively correlated with 'Prediabetes'.

image

From the SHAP output, you can see that higher values of 'Waist_to_Height' (values in red) have a negative impact on the model output, which is the opposite.

image

I've also tested this using the FACET package and then just running a separate RF model to get the SHAP output outside of FACET and the non-FACET SHAP outputs are as expected and not inverted.

User warnings and aligned data with SHAP output for low numbers of splits

Is your feature request related to a problem? Please describe.
The number of n_splits used in the crossfit impacts the coverage of observations inspected for calculating SHAP values. With low coverage the number of rows in the consolidated SHAP matrix is less than the number of observations.

Describe the solution you'd like
The ideal solution has a few elements:

  • A warning should appear for a low number of splits along with a message indicating the coverage of observations for SHAP value calculation.
  • The inspector should produce all the inputs required for utilising existing shap plotting functions. The inspector should automatically create a sample that contains only the observations that have been explained, so it is aligned with the SHAP outputs.

Describe alternatives you've considered
None - the above solution is the minimum requirement.

Additional context
As an example using 500 simulated data points we can see that in the extreme case of using n_splits = 1, we find the SHAP analysis covers 40% of observations:
image
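The relationship between the number of splits and coverage can be sketched with a small simulation. It assumes each split draws a random test set containing 40% of the observations, which matches the 40% figure above for a single split but is otherwise an assumption, since the CV scheme is not stated here. The function name is hypothetical.

```python
import random


def coverage(n_observations, n_splits, test_fraction, seed=0):
    """Fraction of observations that land in at least one test set."""
    rng = random.Random(seed)
    test_size = round(n_observations * test_fraction)
    covered = set()
    for _ in range(n_splits):
        # each split explains only the observations in its own test set
        covered.update(rng.sample(range(n_observations), test_size))
    return len(covered) / n_observations


for n_splits in (1, 2, 5, 10):
    print(n_splits, coverage(500, n_splits, test_fraction=0.4))
```

Under this assumption the expected coverage is 1 - 0.6^n for n splits, so a handful of splits already explains most observations, while a single split leaves 60% unexplained.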

understanding synergy asymmetry

Hi,

Thanks for sharing the package, it is very interesting. I'm trying to understand how synergy works in more detail, but I haven't found a satisfactory answer so far.

1. If synergy (as well as redundancy) relies on the SHAP interaction values, which are symmetric, how do you make it asymmetric ? How would you describe the big steps to compute this metric ?

Intuitively, I better understand what a symmetric interaction is - in a certain way, it quantifies what are the additional contributions on the output when the features are together - but giving an asymmetric definition is harder. In the examples, you mention terms such as "the feature is autonomous" or "a feature gets you much closer to a prediction than another" but it is a bit vague and/or I don't see how this is related to feature interaction, but rather to feature importance / correlation.

2. From a practical standpoint, redundancy is a bit easier to understand and can lead to some feature selection (basically if features share the same info, you could think of removing one), but what are the implications of synergy ?

If for example a feature pair has high synergy in one direction and low in the other, what should I conclude ? Should I do some feature selection with it ?

3. Also, if you have some details about the math behind it (a paper or a description), it would be great ! I looked into the code but it is a bit hard to understand it from there

Thanks a lot !

gamma-facet==1.0.1 not compatible with latest shap==0.38.1

Hello,
I installed gamma-facet==1.0.1, together with shap==0.38.1 (both latest versions).

When calling

from facet.inspection import LearnerInspector

I get the error
ModuleNotFoundError: No module named 'shap.explainers.explainer'
at line 13 of python3.8/site-packages/facet/inspection/_explainer.py .

Are you able to provide a fix for this?

Thanks!

ModuleNotFoundError: No module named 'facet.data'; 'facet' is not a package

Hey!

First of all, this seems like a fantastic library! The visualizations are top notch. I am trying to replicate the example for my YouTube channel (I hope that is allowed), but I have issues importing the libraries. You can find the colab link below:
https://colab.research.google.com/drive/1mYqg8b4VhKvw5d_ExFqE3nge_yr5G_ql?usp=sharing

Here is the code error:
ModuleNotFoundError Traceback (most recent call last)
in ()
4 from sklearndf.pipeline import RegressorPipelineDF
5 from sklearndf.regression import RandomForestRegressorDF
----> 6 from facet.data import Sample
7 from facet.selection import LearnerRanker, LearnerGrid

ModuleNotFoundError: No module named 'facet.data'; 'facet' is not a package


NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

Expose full distribution of outputs on simulation results

Is your feature request related to a problem? Please describe.
To facilitate investigation of simulation results, it would be great to be able to obtain the outcome distribution for each partition value.

Describe the solution you'd like
The best way would be to either obtain a list of percentiles of the distribution or the full distribution itself
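A minimal sketch of what such a percentile summary could look like, using only the standard library. The function name and the input layout (a mapping from partition value to a list of simulated outcomes) are hypothetical and do not correspond to an existing FACET API; the outcome values are made up.

```python
from statistics import quantiles


def percentile_summary(outcomes_by_partition, percentiles=(5, 50, 95)):
    """Map each partition value to selected percentiles of its simulated outcomes."""
    summary = {}
    for partition, outcomes in outcomes_by_partition.items():
        # quantiles(n=100) returns the 99 cut points for percentiles 1..99
        cuts = quantiles(outcomes, n=100, method="inclusive")
        summary[partition] = {p: cuts[p - 1] for p in percentiles}
    return summary


# made-up simulated outcomes for two partition values
outcomes = {
    22.0: [float(v) for v in range(0, 101)],
    26.0: [float(v) for v in range(50, 151)],
}
print(percentile_summary(outcomes))
```

Returning the raw outcome lists alongside the summary would cover the second option, the full distribution itself.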

LearnerRanker summary output as Pandas DataFrame

Is your feature request related to a problem? Please describe.
When looking at the summary output from the LearnerRanker() it would be great to have an alternative to the print(ranker.summary_report(5)) which prints the top 5 models for example.

An option to allow further summaries generated by the user could be to output a Pandas DataFrame. This would allow flexibility for subsequent uses, for example, outputting to csv's for reports or creating summary figures of performance. The ability to store a DF would also allow users to combine with similar DFs from future runs if updating models to see the changes, etc.

Describe the solution you'd like
One option could be to allow exporting ranker.summary_report(5) to a Pandas DataFrame with something like rank_summary = ranker.summary_report(as_dataframe=True), which I would then expect to produce something along the lines of the following as a Pandas DataFrame.

Rank  Learner                   Ranking_score  Mean_score  SD_score  Tuned_parameters              N_folds  Scoring_metric
1     LGBMClassifierDF          0.656          0.680       0.0122    classifier__n_estimators=400  10       roc_auc
2     LGBMClassifierDF          0.655          0.677       0.0111    classifier__n_estimators=500  10       roc_auc
3     RandomForestClassifierDF  0.650          0.695       0.0224    classifier__n_estimators=200  10       roc_auc
4     RandomForestClassifierDF  0.647          0.696       0.0244    classifier__n_estimators=300  10       roc_auc
5     RandomForestClassifierDF  0.646          0.697       0.0255    classifier__n_estimators=400  10       roc_auc

Describe alternatives you've considered
Have not considered alternatives.

Future Implementation for Tensorflow and Pytorch

Is your feature request related to a problem? Please describe.

Hello,
First, congratulations on the library; these are useful concepts.
I would like to know if this library has plans to be implemented on other ML libraries like Tensorflow / Pytorch? I don't see any limitations.

Racism and the "load_boston" dataset

I just read the announcement of this project here. This quote can be found;

BCG GAMMA FACET Helps Human Operators Understand Advanced Machine Learning Models So They Can Make Better and More Ethical Decisions

When I go to this repo (which isn't linked in the blog post by the way) I find a demo that is using the load_boston dataset to explain how to use the tool. It seems to focus on the LSTAT attribute of this dataset while it fails to acknowledge a bigger issue related to ethics, namely the B attribute. According to the docs it refers to:

B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

Given that this tool is marketed with themes of ethics and understanding, and that it is backed by an international consultancy company, this is really not cool. This tool is marketed to help people make more ethical decisions. So why does the guide present a model that uses skin color to determine house prices without mentioning it? For a guide that could be seen as an authoritative source on how to handle "ethical decisions" this is really dubious.

Note that this dataset is up for removal from scikit-learn because of this controversy and it's also something that's been pointed out at many conferences. Here is one talk from me if you're interested.

Please replace the guide with an example that is more fitting or at the very least acknowledges the issues with the variable.

How to calculate SRI for nonlinear models?

@mtsokol

#374 related.

Thank you. I have modified your code and considered non-linear models such as KernelRidge.

However, KernelRidge is naturally not compatible with TreeExplainerFactory, so I considered using KernelExplainerFactory or ExactExplainerFactory. However, since ExactExplainerFactory is not usable depending on the size of the dataset, I adopted KernelExplainerFactory(shap_interaction=True).

In this case, a RuntimeError occurs.
RuntimeError: SHAP interaction values have not been calculated. Create an inspector with parameter 'shap_interaction=True' to enable calculations involving SHAP interaction values.

Checking your implementation, it seems that KernelExplainerFactory does not compute shap_interaction.

def supports_shap_interaction_values(self) -> bool:

I have two questions.
1. For non-linear models, is it necessary to use ExactExplainerFactory and perform inspector.fit()? What should I do if the data size is large?
2. The specification that KernelExplainerFactory internally converts shap_interaction=True to False is confusing. Would it be better to throw an error if shap_interaction=True is specified, or change it so that the shap_interaction argument cannot be specified at all?

import pandas as pd
from sklearn.model_selection import RepeatedKFold, GridSearchCV

# some helpful imports from sklearndf
from sklearndf.pipeline import RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF

# relevant FACET imports
from facet.data import Sample
from facet.selection import LearnerSelector, ParameterSpace

from sklearn.datasets import load_diabetes
X,y = load_diabetes(return_X_y=True)
data = load_diabetes()

X = pd.DataFrame(X)
X.columns  = data["feature_names"]
y = pd.DataFrame(y)
y.columns = ["target"]
diabetes_df = pd.concat([X,y], axis=1)

# create FACET sample object
diabetes_sample = Sample(observations=diabetes_df, target_name="target")

# fit a kernel ridge regressor as an example of a non-tree model

from sklearn.kernel_ridge import KernelRidge
model = KernelRidge()
model.fit(X,y)

# fit the model inspector
from facet.inspection import NativeLearnerInspector
inspector = NativeLearnerInspector(
    model=model,
    explainer_factory=KernelExplainerFactory(),
    n_jobs=-3,
    shap_interaction=True
    
)
inspector.fit(diabetes_sample)

# visualise synergy as a matrix
from pytools.viz.matrix import MatrixDrawer
synergy_matrix = inspector.feature_synergy_matrix()

# visualise redundancy as a matrix
redundancy_matrix = inspector.feature_redundancy_matrix()
# visualise redundancy using a dendrogram
import matplotlib
from pytools.viz.dendrogram import DendrogramDrawer
redundancy = inspector.feature_redundancy_linkage()

Support SAGE values similar to SHAP

I am also interested in the SAGE value, which measures the contribution to global prediction accuracy, compared to SHAP, which measures the importance of local predictions.
We believe that if SRI decomposition can contribute to prediction accuracy, it will be an important advance in terms of model interpretability and feature selection.

On the other hand, I understand that it is not easy to support SAGE, which is a different concept, but is it likely to be realized?

https://pypi.org/project/sage-importance/
https://iancovert.com/blog/understanding-shap-sage/

README.rst dataset load can be automated for users

Describe the bug
README.rst describes the dataset being used, but then imports a CSV file that users might not have. Since the dataset ships with scikit-learn, the code block could load it from there to make the example easier to run.

To Reproduce
Steps to reproduce the behavior:

  1. Go to README.rst
  2. Follow the steps up to the code block (line 110) and run them
  3. The read_csv step (line 123) raises a 'file not found' error

Expected behavior
Automated dataload directly from sklearn, with no read_csv step
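
A minimal sketch of such an automated load, assuming scikit-learn 0.23 or later (which introduced as_frame=True). The variable name is illustrative, and the column names are the ones shipped with scikit-learn rather than the renamed ones used in the README:

```python
from sklearn.datasets import load_diabetes

# load features and target as a single pandas DataFrame; no CSV file is needed
diabetes_df = load_diabetes(as_frame=True).frame

# scikit-learn names the target column "target"
print(diabetes_df.shape)
print(list(diabetes_df.columns))
```

The resulting frame could then be passed to Sample(observations=diabetes_df, target_name="target").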

Screenshots
(https://user-images.githubusercontent.com/91214098/134666712-7c22ecca-a556-42d1-9270-10b5e54c5eb5.png)

Run times are huge

Describe the bug
I am using a dataset with 130 columns and 1000 rows. The steps below keep running for more than an hour without producing any results.

I also tried with just 20 columns and 500 rows; the behavior is the same.

Step 1:

from facet.inspection import LearnerInspector
inspector = LearnerInspector()
inspector.fit(crossfit=ranker.best_model_crossfit_)

Step 2:

boot_crossfit = LearnerCrossfit(
    pipeline=ranker.best_model_,
    cv=bscv,
    n_jobs=-3,
    verbose=True,
).fit(sample=df_sample)

Desktop (please complete the following information):

  • OS: [MacOS]

Isolated Sphinx doc does not build due to missing pytools script

Describe the bug
Sphinx doc does not build due to missing pytools script make_base if pytools is not in same parent directory

To Reproduce
Steps to reproduce the behavior:

  1. Clone the repo into a separate directory from any pytools clones
  2. Build and activate the conda environment in environment.yml
  3. From the sphinx directory, run python make.py html
  4. See error
Traceback (most recent call last):
  File "make.py", line 30, in <module>
    make()
  File "make.py", line 24, in make
    from make_base import make
ModuleNotFoundError: No module named 'make_base'

Expected behavior
Docs should build.

Desktop (please complete the following information):

  • OS: MacOS Big Sur
  • Version 1.2.x and 2.0.x

Additional context
Likely this is just a note in contributing.md to make sure that the remotes are all present before trying to build the docs.

'LearnerRanker' object has no attribute '_ensure_fitted'

Describe the bug

I have encountered the issue that the attribute _ensure_fitted, which LearnerRanker calls, is missing.
This also happens when I run the example code:


# standard imports
import pandas as pd
from sklearn.model_selection import RepeatedKFold

# some helpful imports from sklearndf
from sklearndf.pipeline import RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF

# relevant FACET imports
from facet.data import Sample
from facet.selection import LearnerRanker, LearnerGrid

# declaring url with data
data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'

# importing data from url
diabetes_df = pd.read_csv(data_url, delimiter='\t').rename(
    # renaming columns for better readability
    columns={
        'S1': 'TC', # total serum cholesterol
        'S2': 'LDL', # low-density lipoproteins
        'S3': 'HDL', # high-density lipoproteins
        'S4': 'TCH', # total cholesterol/ HDL
        'S5': 'LTG', # lamotrigine level
        'S6': 'GLU', # blood sugar level
        'Y': 'Disease_progression' # measure of progress since 1yr of baseline
    }
)

# create FACET sample object
diabetes_sample = Sample(observations=diabetes_df, target_name="Disease_progression")

# create a (trivial) pipeline for a random forest regressor
rnd_forest_reg = RegressorPipelineDF(
    regressor=RandomForestRegressorDF(n_estimators=200, random_state=42)
)

# define grid of models which are "competing" against each other
rnd_forest_grid = [
    LearnerGrid(
        pipeline=rnd_forest_reg,
        learner_parameters={
            "min_samples_leaf": [8, 11, 15],
            "max_depth": [4, 5, 6],
        }
    ),
]

# create repeated k-fold CV iterator
rkf_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

# rank your candidate models by performance (default is mean CV score - 2*SD)
ranker = LearnerRanker(
    grids=rnd_forest_grid, cv=rkf_cv, n_jobs=-3
).fit(sample=diabetes_sample)

# get summary report
ranker.summary_report()


It is the last line, ranker.summary_report() that produces the error
AttributeError: 'LearnerRanker' object has no attribute '_ensure_fitted'


Indeed, when I check for the presence of the attribute, it yields 'False':
print(hasattr(LearnerRanker, '_ensure_fitted'))

If I check the presence of ensure_fitted instead of _ensure_fitted, it yields 'True':
print(hasattr(LearnerRanker, 'ensure_fitted'))

Reduce pytest warnings

Describe the bug
Running pytest on test suite currently produces 25 warnings on develop.

To Reproduce
See here for the latest run via Azure Pipelines (or run pytest locally).

Expected behavior
No warnings, if possible.

Add methods to model inspector to return SHAP values and associated feature data

Is your feature request related to a problem? Please describe.
After using the model inspector it would be helpful to have an easy way to access the SHAP values and data for use with plotting methods that are part of the base shap library in python.

Describe the solution you'd like
Add methods to the model inspector to allow users to obtain the SHAP values and associated dataset

Describe alternatives you've considered
None.

Additional context
None.

Support for scikit-learn models

Is your feature request related to a problem? Please describe.
I've encountered an issue when trying to use scikit-learn models with the facet library, specifically with the LearnerInspector class. The library currently only supports models of type SupervisedLearnerPipelineDF or SupervisedLearnerDF.

Describe the solution you'd like
I'd appreciate it if the facet library could extend support to include scikit-learn models. The API could potentially look like this:

from facet.inspection import LearnerInspector
inspector = LearnerInspector(
    model=scikit_learn_model_instance,
    n_jobs=-3
).fit(sample_data)

In this example, scikit_learn_model_instance could be any model instance from the scikit-learn library (for example, RandomForestRegressor).

Describe alternatives you've considered
One alternative is to manually adapt scikit-learn models to a format compatible with the facet library. However, native support would streamline the process, especially for those heavily using scikit-learn.

Additional context
Extending support for scikit-learn models would enhance the utility of the facet library, especially for data scientists and ML engineers who frequently use scikit-learn.

cannot import LearnerInspector etc

Describe the bug
Not exhaustive list, but cannot run:

from facet.inspection import LearnerInspector
from facet.selection import LearnerRanker, LearnerGrid

But this runs:
from facet.data import Sample
Which means gamma-facet is installed.

I copy/paste this code from the Facet github repo: https://github.com/BCG-Gamma/facet

Issues with UnivariateProbabilitySimulator and missing values/scale y

Describe the bug
The current UnivariateProbabilitySimulator() appears to have two issues:

  1. The simulator fails to create partitions when the feature being simulated contains missing values - ValueError: cannot convert float NaN to integer
  2. If the feature does not have missing values, an error is thrown when trying to work out the scale for y - TypeError: '>' not supported between instances of 'float' and 'method'

To Reproduce
To reproduce these errors, please go to the branch: https://github.com/BCG-Gamma/facet/tree/docs/notebook_updates
and within the sphinx > source > tutorial folder run the notebook: https://github.com/BCG-Gamma/facet/blob/docs/notebook_updates/sphinx/source/tutorial/Prediabetes_classification_with_Facet.ipynb

Expected behavior
In both cases I expect to be able to get a complete figure with simulated trend and CIs for a feature displayed correctly.

Screenshots

First error:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-60-ea5a7119572e> in <module>
      2 simulator = UnivariateProbabilitySimulator(crossfit=ranker.best_model_crossfit, n_jobs=-1)
      3 partitioner = ContinuousRangePartitioner()
----> 4 univariate_simulation = simulator.simulate_feature(name=sim_feature, partitioner=partitioner)

C:\Projects\facet\facet\src\facet\simulation\_simulation.py in simulate_feature(self, name, partitioner)
    182             raise NotImplementedError("multi-output simulations are not supported")
    183 
--> 184         simulation_values = partitioner.fit(sample.features.loc[:, name]).partitions()
    185         simulation_results = self._aggregate_simulation_results(
    186             results_per_split=self._simulate_feature_with_values(

C:\Projects\facet\facet\src\facet\simulation\partition\_partition.py in fit(self, values, lower_bound, upper_bound, **fit_params)
    202         # calculate the step count based on the maximum number of partitions,
    203         # rounded to the next-largest rounded value ending in 1, 2, or 5
--> 204         self._step = step = self._step_size(lower_bound, upper_bound)
    205 
    206         # calculate centre values of the first and last partition;

C:\Projects\facet\facet\src\facet\simulation\partition\_partition.py in _step_size(self, lower_bound, upper_bound)
    334     def _step_size(self, lower_bound: float, upper_bound: float) -> float:
    335         return RangePartitioner._ceil_step(
--> 336             (upper_bound - lower_bound) / (self.max_partitions - 1)
    337         )
    338 

C:\Projects\facet\facet\src\facet\simulation\partition\_partition.py in _ceil_step(step)
    294             raise ValueError("arg step must be positive")
    295 
--> 296         return min(10 ** math.ceil(math.log10(step * m)) / m for m in [1, 2, 5])
    297 
    298     @staticmethod

C:\Projects\facet\facet\src\facet\simulation\partition\_partition.py in <genexpr>(.0)
    294             raise ValueError("arg step must be positive")
    295 
--> 296         return min(10 ** math.ceil(math.log10(step * m)) / m for m in [1, 2, 5])
    297 
    298     @staticmethod

ValueError: cannot convert float NaN to integer
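
The last frame can be reproduced in isolation. The sketch below copies the expression shown at line 296 of _partition.py; the guard condition (step <= 0) is an assumption, since the traceback shows only the raise statement. It also assumes that missing values in the feature lead to NaN bounds and therefore to a NaN step. A NaN step passes the guard because every comparison with NaN is false, and math.ceil then fails:

```python
import math


def ceil_step(step: float) -> float:
    # assumed guard: only the raise statement is visible in the traceback
    if step <= 0:
        raise ValueError("arg step must be positive")
    # round up to the next value of the form 1, 2 or 5 times a power of ten
    return min(10 ** math.ceil(math.log10(step * m)) / m for m in [1, 2, 5])


# a regular step is rounded up as intended
print(ceil_step(0.3))

# a NaN step slips past the guard and fails inside math.ceil
nan_error = None
try:
    ceil_step(float("nan"))
except ValueError as error:
    nan_error = str(error)
print(nan_error)
```

Dropping missing values before computing the bounds would avoid the NaN step, although whether that is the desired behaviour is a decision for the maintainers.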

Second error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-58-22d13aac8dd5> in <module>
----> 1 SimulationDrawer().draw(data=univariate_simulation, title=sim_feature)

C:\Projects\facet\facet\src\facet\simulation\viz\_draw.py in draw(self, data, title)
     73         if title is None:
     74             title = f"Simulation: {data.feature}"
---> 75         super().draw(data=data, title=title)
     76 
     77     @classmethod

C:\Projects\facet\pytools\src\pytools\viz\_viz.py in draw(self, data, title)
    104             # noinspection PyProtectedMember
    105             style._drawing_start(title)
--> 106             self._draw(data)
    107             # noinspection PyProtectedMember
    108             style._drawing_finalize()

C:\Projects\facet\facet\src\facet\simulation\viz\_draw.py in _draw(self, data)
     96             partitions=simulation_series.partitions,
     97             frequencies=simulation_series.frequencies,
---> 98             is_categorical_feature=data.partitioner.is_categorical,
     99         )
    100 

C:\Projects\facet\facet\src\facet\simulation\viz\_style.py in draw_uplift(self, feature, target, values_label, values_median, values_min, values_max, values_baseline, percentile_lower, percentile_upper, partitions, frequencies, is_categorical_feature)
    178 
    179         # add a horizontal line at y=0
--> 180         ax.axhline(y=values_baseline, linewidth=0.5)
    181 
    182         # remove the top and right spines

C:\Anaconda3\envs\facet-develop\lib\site-packages\matplotlib\axes\_axes.py in axhline(self, y, xmin, xmax, **kwargs)
    860         self._process_unit_info(ydata=y, kwargs=kwargs)
    861         yy = self.convert_yunits(y)
--> 862         scaley = (yy < ymin) or (yy > ymax)
    863 
    864         trans = self.get_yaxis_transform(which='grid')

TypeError: '>' not supported between instances of 'float' and 'method'

Desktop (please complete the following information):

  • Windows
  • Chrome

Trouble importing LearnerInspector

Describe the bug
I have trouble importing the LearnerInspector. When I try to import it, it throws the following error:
name 'catboost' is not defined

The code I use for this import is:
from facet.inspection import LearnerInspector
Update: the error also appears when I try to import facet.inspection (see full import output down below)

However, all of these import and (seem to) work without an issue:
from facet.data import Sample
from facet.selection import LearnerRanker, LearnerGrid
from facet.validation import BootstrapCV
from facet.data.partition import ContinuousRangePartitioner
from facet.simulation import UnivariateProbabilitySimulator
from facet.simulation.viz import SimulationDrawer
from facet.crossfit import LearnerCrossfit

Desktop (please complete the following information):

  • OS: Windows 10 pro
  • Spyder, installed via Anaconda
  • Python: 3.9.7
  • facet: 2.0.dev0
  • catboost: 0.26.1

Below is the complete output when I try to import facet.inspection:

import facet.inspection
Traceback (most recent call last):

Input In [19] in <cell line: 1>
import facet.inspection

File ~\Anaconda3\lib\site-packages\facet\inspection\__init__.py:8 in <module>
from ._explainer import *

File ~\Anaconda3\lib\site-packages\facet\inspection\_explainer.py:346 in <module>
__tracker.validate()

File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:200 in validate
update_forward_references(obj, globals_=globals_)

File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:328 in update_forward_references
_update(obj)

File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:315 in _update
_update(member, local_ns=local_ns)

File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:319 in _update
_update_annotations(_obj, local_ns)

File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:322 in _update_annotations
annotations = get_type_hints(

File ~\Anaconda3\lib\typing.py:1469 in get_type_hints
value = _eval_type(value, globalns, localns)

File ~\Anaconda3\lib\typing.py:292 in _eval_type
ev_args = tuple(_eval_type(a, globalns, localns, recursive_guard) for a in t.__args__)

File ~\Anaconda3\lib\typing.py:292 in <genexpr>
ev_args = tuple(_eval_type(a, globalns, localns, recursive_guard) for a in t.__args__)

File ~\Anaconda3\lib\typing.py:290 in _eval_type
return t._evaluate(globalns, localns, recursive_guard)

File ~\Anaconda3\lib\typing.py:551 in _evaluate
eval(self.__forward_code__, globalns, localns),

File <string>:1 in <module>

NameError: name 'catboost' is not defined
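
The final frames suggest that a string (forward-reference) type hint mentioning catboost is being evaluated in a namespace where the name catboost is not bound. The standalone sketch below reproduces the same class of error using only the standard library; inspect_model and its annotation are hypothetical and are not FACET code:

```python
from typing import get_type_hints


def inspect_model(model: "catboost.CatBoost") -> None:
    """Hypothetical function whose type hint names a module that is never imported."""


# resolving the string annotation evaluates it in the function's global namespace
error_message = None
try:
    get_type_hints(inspect_model)
except NameError as error:
    error_message = str(error)

print(error_message)
```

If this is indeed the cause, it would explain why the failure occurs at import time even though catboost itself is installed.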

Add a UnivariateTargetSimulator

Is your feature request related to a problem? Please describe.
It would be great to have a simulator for absolute target values instead of making the uplift computation for regressors

Describe the solution you'd like
Being able to call a UnivariateTargetSimulator instead of a UnivariateUpliftSimulator

AssertionError: shap interaction values of binary classifiers must add up to 0.0 for each observation and feature pair

Describe the bug
When inspecting a binary classifier, the raw_shap_tensor of class 0 does not equal the negative of the raw_shap_tensor of class 1.
The absolute difference appears to reach up to 10^-2.

The bug arises in the function raw_shap_to_df.
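
To quantify the discrepancy, the condition behind the assertion can be checked with an explicit tolerance. The following is a minimal sketch with made-up numbers; the function names are hypothetical and this is not FACET's implementation:

```python
import numpy as np


def max_abs_deviation(shap_class_0, shap_class_1) -> float:
    """Largest absolute element-wise sum of the two classes' values."""
    total = np.asarray(shap_class_0) + np.asarray(shap_class_1)
    return float(np.max(np.abs(total)))


def classes_cancel_out(shap_class_0, shap_class_1, atol: float = 1e-8) -> bool:
    """True if the values of both classes add up to zero within atol."""
    return max_abs_deviation(shap_class_0, shap_class_1) <= atol


# toy values: class 0 mirrors class 1 except for one entry that is off by 0.01
shap_class_1 = np.array([[0.2, -0.1], [0.05, 0.3]])
shap_class_0 = -shap_class_1
shap_class_0[0, 0] += 0.01

print(max_abs_deviation(shap_class_0, shap_class_1))
print(classes_cancel_out(shap_class_0, shap_class_1))
print(classes_cancel_out(shap_class_0, shap_class_1, atol=0.02))
```

Reporting the maximum deviation alongside the failed check would make it easier to tell numerical noise from a genuine inconsistency.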

To Reproduce
Steps to reproduce the behavior:

  1. Go to the NHO facet modelisation and run the 4-Facet-modeling-NewAPI
  2. Try fitting the LearnerInspector instance
  3. See error:

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
Screenshot 2020-08-14 at 16 07 01

Desktop (please complete the following information):

  • OS: iOS
  • Browser: Brave

Inspector.fit: AssertionError for GradientBoostingClassifierDF() outputs

Describe the bug
The inspector fails to fit for a GradientBoostingClassifier and throws an AssertionError: 1 outputs named ['0', '1']. This is the only classifier for which I have observed this behaviour.

To Reproduce
Using a notebook within the facet-develop env the following code will reproduce the error:

# imports
import pandas as pd
import numpy as np
from facet import Sample
from facet.inspection import LearnerInspector
from facet.selection import LearnerGrid, LearnerRanker
from facet.validation import BootstrapCV
from sklearndf.pipeline import ClassifierPipelineDF
from sklearndf.classification import GradientBoostingClassifierDF
from sklearn.datasets import make_classification

# simulate some data
X, y = make_classification(n_samples=200, n_features=5, n_informative=5, n_redundant=0, random_state=42)
y_df = pd.DataFrame(y, columns=['target'])
X_df = pd.DataFrame(X, columns=['f1', 'f2', 'f3', 'f4', 'f5'])
sim_df = pd.concat([X_df, y_df], axis=1)

# create sample object
sample_df = Sample(observations=sim_df, target='target')

# create grid
grid = [
     LearnerGrid(
        pipeline=ClassifierPipelineDF(classifier=GradientBoostingClassifierDF(random_state=42)),
        learner_parameters={}
    )
]

# fit the learner ranker
ranker = LearnerRanker(grids=grid,
                       cv=BootstrapCV(n_splits=5, random_state=42),
                       n_jobs=-1,
                       verbose = 2,
                       scoring= "roc_auc").fit(sample=sample_df)

# fit the inspector
LearnerInspector(n_jobs=-1).fit(crossfit=ranker.best_model_crossfit)

You should then see the error:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\externals\loky\process_executor.py", line 418, in _process_worker
    r = call_item()
  File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\externals\loky\process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\_parallel_backends.py", line 567, in __call__
    return self.func(*args, **kwargs)
  File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "c:\projects\nho_facet\powerco_nho\facet\src\facet\inspection\_shap.py", line 523, in _shap_for_split
    ), f"{len(shap_interaction_tensors)} outputs named {multi_output_names}"
AssertionError: 1 outputs named ['0', '1']
"""

The above exception was the direct cause of the following exception:

AssertionError                            Traceback (most recent call last)
<ipython-input-3-7b5a54a21400> in <module>
     18 
     19 # fit the inspector
---> 20 LearnerInspector(n_jobs=-1).fit(crossfit=ranker.best_model_crossfit)

c:\projects\nho_facet\powerco_nho\facet\src\facet\inspection\_inspection.py in fit(self, crossfit, **fit_params)
    244             shap_decomposer = ShapValueDecomposer()
    245 
--> 246         shap_calculator.fit(crossfit=crossfit)
    247         shap_decomposer.fit(shap_calculator=shap_calculator)
    248 

c:\projects\nho_facet\powerco_nho\facet\src\facet\inspection\_shap.py in fit(self, crossfit, **fit_params)
    129         # calculate shap values and re-order the observation index to match the
    130         # sequence in the original training sample
--> 131         shap_all_splits_df: pd.DataFrame = self._shap_all_splits(crossfit=crossfit)
    132 
    133         assert shap_all_splits_df.index.nlevels > 1

c:\projects\nho_facet\powerco_nho\facet\src\facet\inspection\_shap.py in _shap_all_splits(self, crossfit)
    226                         else (
    227                             training_sample.subsample(iloc=oob_split)
--> 228                             for _, oob_split in crossfit.splits()
    229                         )
    230                     ),

C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
    932 
    933             with self._backend.retrieval_context():
--> 934                 self.retrieve()
    935             # Make sure that we get a last message telling us we are done
    936             elapsed_time = time.time() - self._start_time

C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\parallel.py in retrieve(self)
    831             try:
    832                 if getattr(self._backend, 'supports_timeout', False):
--> 833                     self._output.extend(job.get(timeout=self.timeout))
    834                 else:
    835                     self._output.extend(job.get())

C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
    519         AsyncResults.get from multiprocessing."""
    520         try:
--> 521             return future.result(timeout=timeout)
    522         except LokyTimeoutError:
    523             raise TimeoutError()

C:\Anaconda3\envs\nho_powerco\lib\concurrent\futures\_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

C:\Anaconda3\envs\nho_powerco\lib\concurrent\futures\_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

AssertionError: 1 outputs named ['0', '1']

Expected behavior
Expect the inspector to not throw an error and allow subsequent access to redundancy, synergy etc.

Desktop (please complete the following information):

  • Windows
  • Chrome

Request to add UnivariateProbabilitySimulator

Is your feature request related to a problem? Please describe.
Hey, I saw that the UnivariateProbabilitySimulator is scheduled for a future release. It would be beneficial for the NHO tutorials to have it available.

Describe the solution you'd like
We would like to use a Univariate simulation for change in average predicted probability (CAPP) based on a classification model.

Mismatch of feature ordering (matrices vs. dendrograms)

Is your feature request related to a problem? Please describe.
For the feature affinity matrices (redundancy matrix, synergy matrix, association matrix), it's super helpful to have features already in an order that highlights clusters visually. However, in the dendrograms the order differs (due to an additional sorting step with respect to feature importance).

Describe the solution you'd like
Apply the dendrogram order to the matrices, so you can "rediscover" the already identified clusters when switching visualizations.

Describe alternatives you've considered
We could also adjust the dendrogram ordering to match the current matrix ordering, but I think it's helpful to guide the ordering by feature importance, also for the matrices.
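
As an illustration of the requested behaviour, the sketch below reorders a toy affinity matrix by the leaf order of a SciPy linkage tree. It uses plain SciPy and pandas rather than FACET's own classes, converts affinity to distance as 1 - affinity (an assumption), and omits the additional sorting by feature importance mentioned above:

```python
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

features = ["a", "b", "c", "d"]

# symmetric toy affinity matrix: a/c and b/d form two clusters
affinity = pd.DataFrame(
    [
        [1.0, 0.1, 0.9, 0.2],
        [0.1, 1.0, 0.2, 0.8],
        [0.9, 0.2, 1.0, 0.1],
        [0.2, 0.8, 0.1, 1.0],
    ],
    index=features,
    columns=features,
)

# convert affinity to distance and build the linkage tree
distance = 1.0 - affinity.to_numpy()
linkage_matrix = linkage(squareform(distance), method="single")

# leaf order of the dendrogram, expressed as feature names
leaf_order = [features[i] for i in leaves_list(linkage_matrix)]

# apply the same order to the rows and columns of the matrix
reordered = affinity.loc[leaf_order, leaf_order]
print(reordered)
```

With this ordering, features that are clustered together in the dendrogram also sit next to each other in the matrix.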

Emit warning if simulations are run with too few splits

Is your feature request related to a problem? Please describe.
Simulation confidence intervals are computed through bootstrapping. If too few bootstrap splits are used then the confidence intervals are not reliable.

Describe the solution you'd like
In the UnivariateSimulator classes, emit a warning if the number of bootstrap splits falling above or below the confidence interval is less than 25.

For example, given 1000 bootstrap splits and a confidence interval of (2.5%,97.5%) we will have 25 splits below the 2.5% threshold and 25 splits above the 97.5% threshold so no warning will be emitted.

Using fewer splits will generate a warning that recommends tightening the confidence interval or increasing the number of splits.
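
A minimal sketch of the proposed check, using only the standard library. The function name, its signature, and the warning text are hypothetical; only the threshold of 25 and the worked example come from the request above:

```python
import warnings

MIN_TAIL_SPLITS = 25  # threshold proposed in this request


def check_bootstrap_splits(n_splits: int, lower: float, upper: float) -> bool:
    """Return True if both tails are expected to hold at least MIN_TAIL_SPLITS splits."""
    # expected number of splits in the smaller of the two tails
    expected_in_tail = n_splits * min(lower, 1.0 - upper)
    # small tolerance guards against floating-point round-off
    if expected_in_tail < MIN_TAIL_SPLITS - 1e-9:
        warnings.warn(
            f"only about {expected_in_tail:.1f} of {n_splits} bootstrap splits "
            f"fall outside the confidence interval ({lower:.1%}, {upper:.1%}); "
            "tighten the interval or increase the number of splits",
            UserWarning,
        )
        return False
    return True


# 1000 splits with a (2.5%, 97.5%) interval: 25 splits per tail, no warning
print(check_bootstrap_splits(1000, 0.025, 0.975))

# 200 splits with the same interval: about 5 splits per tail, warning emitted
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(check_bootstrap_splits(200, 0.025, 0.975))
    print(len(caught))
```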
