
Interpret Community SDK

Interpret-Community extends Interpret, an open-source Python package from Microsoft Research for training interpretable models and helping to explain black-box systems, with additional interpretability techniques and utility functions for handling real-world datasets and workflows.

This repository contains an SDK and Jupyter notebooks with examples to showcase its use.


Overview of Interpret-Community

Interpret-Community incorporates community-developed interpretability techniques under one roof with a unified set of data structures and visualizations. Users can experiment with different interpretability techniques and easily perform comparative analysis to evaluate their own explainers. Using these tools, one can explain machine learning models globally on all data or locally on a specific prediction, using state-of-the-art technologies in an easy-to-use and scalable fashion. In particular, this open-source toolkit:

  1. Actively incorporates innovative interpretability techniques, and allows for further expansion by researchers and data scientists
  2. Creates a common API across the integrated libraries
  3. Applies optimizations to make it possible to run on real-world datasets at scale
  4. Provides improvements such as the capability to "reverse the feature engineering pipeline" to provide model insights in terms of the original raw features rather than engineered features
  5. Provides interactive and exploratory visualization to empower data scientists to gain significant insight into their data

Target Audience

  1. Developers/Data Scientists: Having all of the interpretability techniques in one place makes it easy for data scientists to experiment with different interpretability techniques and explain their models in a scalable and seamless manner. The set of rich interactive visualizations allows developers and data scientists to train and deploy more transparent machine learning models instead of wasting time and effort generating customized visualizations, addressing scalability issues by optimizing third-party interpretability techniques, and adopting/operationalizing interpretability techniques.
  2. Business Executives: The core logic and visualizations are beneficial for raising awareness among those involved in developing AI applications, allowing them to audit model predictions for potential unfairness and establish a strong governance framework around the use of AI applications.
  3. Machine Learning Interpretability Researchers: Interpret's extension hooks make the framework easy to extend, so interpretability researchers who are interested in adding their own techniques can easily add them to the community repository and compare them to state-of-the-art and proven interpretability techniques and/or other community techniques.

Getting Started

This repository uses Anaconda to simplify package and environment management.

To set up on your local machine:

1. Set up Environment
a. Install Anaconda with Python >= 3.6
   [Miniconda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) is a quick way to get started.


b. Create conda environment named interp and install packages
    conda create --name interp python=3.6 anaconda

Optional, additional reading:

conda cheat sheet jupyter nb_conda

c. Activate conda environment

On Linux and Windows:
    activate interp

On Mac:
    source activate interp
2. Clone the Interpret-Community repository

Clone and cd into the repository

git clone https://github.com/interpretml/interpret-community
cd interpret-community
3. Install Python module, packages and necessary distributions
pip install interpret-community

If you intend to run repository tests:

pip install -r requirements.txt
On Windows:

Pytorch installation if desired:

    pip install https://download.pytorch.org/whl/cpu/torch-1.3.0%2Bcpu-cp36-cp36m-win_amd64.whl
    pip install torchvision==0.4.1

lightgbm installation if desired:

    conda install --yes -c conda-forge lightgbm

On Linux:

Pytorch installation if desired:
    pip install torch==1.3.0
    pip install torchvision==0.4.1

lightgbm installation if desired:

    conda install --yes -c conda-forge lightgbm
On MacOS:

Pytorch installation if desired:

    pip install torch==1.3.0
    pip install torchvision==0.4.1

lightgbm installation if desired (requires Homebrew):

    brew install libomp
    conda install --yes -c conda-forge lightgbm

If installing the package gives an error about the certifi package, run the following first:

    pip install --upgrade certifi --ignore-installed
    pip install interpret-community
4. Set up and run Jupyter Notebook server

Install Jupyter Notebook if needed:

    pip install jupyter

then run it:

    jupyter notebook

Supported Models

This API supports models that are trained on datasets in Python numpy.array, pandas.DataFrame, iml.datatypes.DenseData, or scipy.sparse.csr_matrix format.

The explanation functions accept both models and pipelines as input, as long as the model or pipeline implements a prediction function, predict or predict_proba, that conforms to the scikit-learn convention. If a pipeline script is provided, the explanation function assumes that the running pipeline script returns a prediction. The repository also supports models trained with the PyTorch, TensorFlow, and Keras deep learning frameworks.
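
If a model comes from a framework without a scikit-learn-style API, one option is to wrap it in a thin class that exposes predict and predict_proba. A minimal sketch, assuming a hypothetical predict_fn callable that returns an (n_samples, n_classes) array of class probabilities:

    import numpy as np

    class ProbabilityModelWrapper:
        """Expose the scikit-learn prediction convention for an arbitrary predict_fn."""

        def __init__(self, predict_fn):
            # predict_fn is a hypothetical callable, e.g. from a deep learning framework
            self._predict_fn = predict_fn

        def predict_proba(self, X):
            # return an (n_samples, n_classes) array of class probabilities
            return np.asarray(self._predict_fn(X))

        def predict(self, X):
            # return the most likely class index for each sample
            return np.argmax(self.predict_proba(X), axis=1)

The wrapped object can then be passed to the explainers below wherever a model is expected.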

Supported Explainers

The following is a list of the experimental explainers available in the community repository:

  • SHAP Tree Explainer: SHAP's tree explainer, which focuses on a polynomial-time, fast SHAP value estimation algorithm specific to trees and ensembles of trees.

  • SHAP Deep Explainer: Based on the explanation from SHAP, Deep Explainer "is a high-speed approximation algorithm for SHAP values in deep learning models that builds on a connection with DeepLIFT described in the SHAP NIPS paper. TensorFlow models and Keras models using the TensorFlow backend are supported (there is also preliminary support for PyTorch)".

  • SHAP Kernel Explainer: SHAP's Kernel explainer uses a specially weighted local linear regression to estimate SHAP values for any model.

  • SHAP Linear Explainer: SHAP's Linear Explainer computes SHAP values for a linear model, optionally accounting for inter-feature correlations.

  • Mimic Explainer: The Mimic Explainer is based on the idea of global surrogate models. A global surrogate model is an intrinsically interpretable model that is trained to approximate the predictions of a black-box model as accurately as possible. Data scientists can interpret the surrogate model to draw conclusions about the black-box model. This repository supports the following interpretable models as surrogate models: LightGBM (LGBMExplainableModel), Linear/Logistic Regression (LinearExplainableModel), Stochastic Gradient Descent explainable model (SGDExplainableModel), and Decision Tree (DecisionTreeExplainableModel).

  • Permutation Feature Importance Explainer: Permutation Feature Importance (PFI) is a technique used to explain classification and regression models that is inspired by Breiman's Random Forests paper (see section 10). At a high level, it works by randomly shuffling the data one feature at a time for the entire dataset and calculating how much the performance metric of interest decreases. The larger the change, the more important that feature is. A minimal sketch of this idea appears after this list.

  • LIME Explainer: LIME Explainer uses the state-of-the-art Local interpretable model-agnostic explanations (LIME) algorithm to create local surrogate models. Unlike the global surrogate models, LIME focuses on training local surrogate models to explain individual predictions.

  • Tabular Explainer: Used with tabular datasets, it currently employs the following logic to invoke the direct SHAP explainers:

    Original Model                Invoked Explainer
    Tree-based models             SHAP TreeExplainer
    Deep Neural Network models    SHAP DeepExplainer
    Linear models                 SHAP LinearExplainer
    None of the above             SHAP KernelExplainer
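
To make the permutation idea above concrete, here is a minimal, framework-agnostic sketch (not the repository's implementation) that shuffles one feature at a time in a numpy array and records the resulting drop in a chosen metric:

    import numpy as np
    from sklearn.metrics import accuracy_score

    def permutation_importance_sketch(model, X, y, metric=accuracy_score, random_state=0):
        """Return, per feature, the metric drop caused by shuffling that feature once."""
        rng = np.random.RandomState(random_state)
        baseline = metric(y, model.predict(X))
        importances = []
        for col in range(X.shape[1]):
            X_shuffled = X.copy()
            rng.shuffle(X_shuffled[:, col])  # break the link between this feature and the target
            importances.append(baseline - metric(y, model.predict(X_shuffled)))
        return np.array(importances)

The PFIExplainer used later in this README applies the same principle at scale.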

Use Interpret-Community

Interpretability in training

  1. Train your model in a Jupyter notebook running on your local machine.

    # load the breast cancer dataset, a well-known small dataset that comes with scikit-learn
    from sklearn.datasets import load_breast_cancer
    from sklearn import svm
    from sklearn.model_selection import train_test_split

    breast_cancer_data = load_breast_cancer()
    classes = breast_cancer_data.target_names.tolist()

    # split data into train and test
    x_train, x_test, y_train, y_test = train_test_split(breast_cancer_data.data,
                                                        breast_cancer_data.target,
                                                        test_size=0.2,
                                                        random_state=0)
    clf = svm.SVC(gamma=0.001, C=100., probability=True)
    model = clf.fit(x_train, y_train)
  2. Call the explainer: To initialize an explainer object, pass your model and some training data to the explainer's constructor. You can also optionally pass in feature names and output class names (if doing classification), which will be used to make your explanations and visualizations more informative. Here is how to instantiate an explainer object using TabularExplainer, MimicExplainer, or PFIExplainer locally. TabularExplainer calls one of the four SHAP explainers underneath (TreeExplainer, DeepExplainer, LinearExplainer, or KernelExplainer) and automatically selects the most appropriate one for your use case. You can, however, call each of its four underlying explainers directly.

    from interpret.ext.blackbox import TabularExplainer
    
    # "features" and "classes" fields are optional
    explainer = TabularExplainer(model, 
                                 x_train, 
                                 features=breast_cancer_data.feature_names, 
                                 classes=classes)

    or

    from interpret.ext.blackbox import MimicExplainer
    
    # you can use one of the following four interpretable models as a global surrogate to the black box model
    
    from interpret.ext.glassbox import LGBMExplainableModel
    from interpret.ext.glassbox import LinearExplainableModel
    from interpret.ext.glassbox import SGDExplainableModel
    from interpret.ext.glassbox import DecisionTreeExplainableModel
    
    # "features" and "classes" fields are optional
    # augment_data is optional and, if True, oversamples the initialization examples to improve the surrogate model's accuracy in fitting the original model. Useful for high-dimensional data where the number of rows is less than the number of columns.
    # max_num_of_augmentations is optional and defines max number of times we can increase the input data size.
    # LGBMExplainableModel can be replaced with LinearExplainableModel, SGDExplainableModel, or DecisionTreeExplainableModel
    explainer = MimicExplainer(model, 
                               x_train, 
                               LGBMExplainableModel, 
                               augment_data=True, 
                               max_num_of_augmentations=10, 
                               features=breast_cancer_data.feature_names, 
                               classes=classes)

    or

    from interpret.ext.blackbox import PFIExplainer 
    
    # "features" and "classes" fields are optional
    explainer = PFIExplainer(model, 
                             features=breast_cancer_data.feature_names, 
                             classes=classes)

The following two sections demonstrate how you can get global and local feature importance values. Local measures focus on the contribution of features to a specific prediction (e.g., why did the model predict an 80% chance of breast cancer for Mary?), whereas global measures take all predictions into account (e.g., overall, what are the top K important features for predicting a high risk of breast cancer?):

Overall (Global) feature importance values

Get the global feature importance values.

# you can use the training data or the test data here
global_explanation = explainer.explain_global(x_train)

# if you used the PFIExplainer in the previous step, use the next line of code instead
# global_explanation = explainer.explain_global(x_train, true_labels=y_train)

# sorted feature importance values and feature names
sorted_global_importance_values = global_explanation.get_ranked_global_values()
sorted_global_importance_names = global_explanation.get_ranked_global_names()


# alternatively, you can print out a dictionary that holds the top K feature names and values
global_explanation.get_feature_importance_dict()
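
For example, the ranked names and values can be paired to print the top features (a small usage sketch based on the variables defined above):

# print the five most important features and their global importance values
for name, value in list(zip(sorted_global_importance_names,
                            sorted_global_importance_values))[:5]:
    print(f"{name}: {value:.4f}")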

Instance-level (local) feature importance values

Get the local feature importance values: use the following function calls to explain an individual instance or a group of instances. Please note that PFIExplainer does not support local explanations.

# explain the first data point in the test set
local_explanation = explainer.explain_local(x_test[0])

# sorted feature importance values and feature names
sorted_local_importance_names = local_explanation.get_ranked_local_names()
sorted_local_importance_values = local_explanation.get_ranked_local_values()

or

# explain the first five data points in the test set
local_explanation = explainer.explain_local(x_test[0:5])

# sorted feature importance values and feature names
sorted_local_importance_names = local_explanation.get_ranked_local_names()
sorted_local_importance_values = local_explanation.get_ranked_local_values()
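
For a classification explanation, the ranked local names and values are typically nested per class and per example (roughly [# classes][# examples][# features]); exact shapes can vary by explainer, so treat the following as a hedged sketch for inspecting the first explained example's top features:

# assumption: ranked local results are nested as [class][example][feature]
predicted_class = int(model.predict(x_test[0:1])[0])
names_for_example = sorted_local_importance_names[predicted_class][0]
values_for_example = sorted_local_importance_values[predicted_class][0]
for name, value in list(zip(names_for_example, values_for_example))[:5]:
    print(f"{name}: {value:.4f}")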

Raw feature transformations

Optionally, you can pass your feature transformation pipeline to the explainer to receive explanations in terms of the raw features before the transformation (rather than engineered features). If you skip this, the explainer provides explanations in terms of engineered features.

The format of supported transformations is the same as the one described in sklearn-pandas. In general, transformations are supported as long as they operate on a single column, so that they are clearly one-to-many.
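
As a small illustration of this constraint (the column names 'age' and 'income' are hypothetical), a transformer that reads a single raw column can be mapped back to that column, whereas a transformer that mixes several raw columns cannot:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# supported: each transformer operates on a single raw column (one-to-many)
supported_transformations = [(['age'], StandardScaler()),
                             (['income'], StandardScaler())]

# not supported for raw-feature explanations: PCA blends several raw columns
# into each engineered column (many-to-one / many-to-many)
unsupported_transformations = [(['age', 'income'], PCA(n_components=1))]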

We can explain raw features by either using a sklearn.compose.ColumnTransformer or a list of fitted transformer tuples. The cell below uses sklearn.compose.ColumnTransformer.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# assume that numeric_features and categorical_features hold the numerical and categorical column names

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# append classifier to preprocessing pipeline.
# now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

# fit the full pipeline so that clf.steps[-1][1] below is a trained model
clf.fit(x_train, y_train)


# clf.steps[-1][1] returns the trained classification model
# pass transformation as an input to create the explanation object
# "features" and "classes" fields are optional
tabular_explainer = TabularExplainer(clf.steps[-1][1],
                                     initialization_examples=x_train,
                                     features=dataset_feature_names,
                                     classes=dataset_classes,
                                     transformations=preprocessor)
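
Once constructed, the explainer can be used exactly as in the earlier examples; because transformations was supplied, the resulting importance values are reported in terms of the raw features. A short usage sketch, assuming x_test holds raw (untransformed) evaluation data:

global_explanation = tabular_explainer.explain_global(x_test)
print(global_explanation.get_feature_importance_dict())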

In case you want to run the example with the list of fitted transformer tuples, use the following code:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn_pandas import DataFrameMapper

# assume that we have created two lists, numerical and categorical, which hold the numerical and categorical feature names

numeric_transformations = [([f], Pipeline(steps=[('imputer', SimpleImputer(
    strategy='median')), ('scaler', StandardScaler())])) for f in numerical]

categorical_transformations = [([f], OneHotEncoder(
    handle_unknown='ignore', sparse=False)) for f in categorical]

transformations = numeric_transformations + categorical_transformations

# append model to preprocessing pipeline.
# now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations)),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

# clf.steps[-1][1] returns the trained classification model
# pass transformation as an input to create the explanation object
# "features" and "classes" fields are optional
tabular_explainer = TabularExplainer(clf.steps[-1][1],
                                     initialization_examples=x_train,
                                     features=dataset_feature_names,
                                     classes=dataset_classes,
                                     transformations=transformations)

Visualizations

Load the visualization dashboard in your notebook to understand and interpret your model:

Global visualizations

The following plots provide a global view of the trained model along with its predictions and explanations.

  • Data Exploration: An overview of the dataset along with prediction values.
  • Global Importance: Shows the top K (configurable K) important features globally. This chart is useful for understanding the global behavior of the underlying model.
  • Explanation Exploration: Demonstrates how a feature is responsible for a change in the model's prediction values (or probability of prediction values).
  • Summary: Uses the signed local feature importance values across all data points to show the distribution of the impact each feature has on the prediction value.

Visualization Dashboard Global

Local visualizations

You can click on any individual data point in any of the preceding plots to load the local feature importance plot for that data point.

  • Local Importance: Shows the top K (configurable K) important features for an individual prediction. This chart is useful for understanding the local behavior of the underlying model on a specific data point.
  • Perturbation Exploration: Allows you to change feature values of the selected data point and observe how those changes affect the prediction value.
  • Individual Conditional Expectation (ICE): Allows you to change a feature value from a minimum to a maximum value to see how the data point's prediction changes as that feature changes.

Visualization Dashboard Local Feature Importance

Visualization Dashboard Feature Perturbation

Visualization Dashboard ICE Plots

To load the visualization dashboard, use the following code:

from interpret_community.widget import ExplanationDashboard

ExplanationDashboard(global_explanation, model, datasetX=x_test)

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to the GitHub Developer Certificate of Origin (DCO). For details, please visit https://probot.github.io/apps/dco/.

The Developer Certificate of Origin (DCO) is a lightweight way for contributors to certify that they wrote or otherwise have the right to submit the code they are contributing to the project. Here is the full text of the DCO, reformatted for readability:

By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
Contributors sign off that they adhere to these requirements by adding a Signed-off-by line to commit messages:

    This is my commit message

    Signed-off-by: Random J Developer <[email protected]>

Git even has a -s command line option to append this automatically to your commit message:

    $ git commit -s -m 'This is my commit message'

When you submit a pull request, a DCO bot will automatically determine whether you need to certify. Simply follow the instructions provided by the bot.

Code of Conduct

This project has adopted the GitHub Community Guidelines.

Reporting Security Issues

Security issues and bugs should be reported privately, via email, to the Microsoft Security Response Center (MSRC) at [email protected]. You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Further information, including the MSRC PGP key, can be found in the Security TechCenter.
