teamhg-memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions

Home Page: http://eli5.readthedocs.io

License: MIT License

Python 5.93% HTML 0.10% Jupyter Notebook 93.95% Shell 0.02%
scikit-learn machine-learning xgboost lightgbm crfsuite inspection explanation nlp data-science python

eli5's Introduction

ELI5


ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions.

explain_prediction for text data

explain_prediction for image data

It provides support for the following machine learning frameworks and packages:

  • scikit-learn. Currently ELI5 can explain weights and predictions of scikit-learn linear classifiers and regressors, print decision trees as text or as SVG, show feature importances, and explain predictions of decision trees and tree-based ensembles. ELI5 understands text processing utilities from scikit-learn and can highlight text data accordingly. Pipeline and FeatureUnion are supported. It also lets you debug scikit-learn pipelines which contain HashingVectorizer, by undoing hashing.
  • Keras - explain predictions of image classifiers via Grad-CAM visualizations.
  • xgboost - show feature importances and explain predictions of XGBClassifier, XGBRegressor and xgboost.Booster.
  • LightGBM - show feature importances and explain predictions of LGBMClassifier and LGBMRegressor.
  • CatBoost - show feature importances of CatBoostClassifier, CatBoostRegressor and catboost.CatBoost.
  • lightning - explain weights and predictions of lightning classifiers and regressors.
  • sklearn-crfsuite. ELI5 can inspect weights of sklearn_crfsuite.CRF models.

ELI5 also implements several algorithms for inspecting black-box models (see Inspecting Black-Box Estimators):

  • TextExplainer can explain predictions of any text classifier using the LIME algorithm (Ribeiro et al., 2016). There are also utilities for using LIME with non-text data and arbitrary black-box classifiers, but this feature is currently experimental.
  • The permutation importance method can be used to compute feature importances for black-box estimators.

Explanation and formatting are separated; you can get a text-based explanation to display in a console, an HTML version embeddable in an IPython notebook or a web dashboard, a pandas.DataFrame object if you want to process results further, or a JSON version which lets you implement custom rendering and formatting on the client.
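
For example, a minimal sketch of the typical flow (clf and vec stand for any fitted scikit-learn estimator and vectorizer):

import eli5
from eli5.formatters import format_as_text, format_as_html

expl = eli5.explain_weights(clf, vec=vec, top=20)

print(format_as_text(expl))          # text for the console
html = format_as_html(expl)          # HTML for notebooks / dashboards
df = eli5.format_as_dataframe(expl)  # pandas.DataFrame for further processing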

License is MIT.

Check docs for more.



eli5's People

Contributors

ashwinb-hat, guillemgsubies, ivanprado, jnothman, kmike, krkd, lopuhin, mehaase, rg2410, rmax, teabolt, zzz4zzz


eli5's Issues

unhashing: sign of a feature can be confusing in case of collisions

A follow-up to #10 and #18: when deciding whether a feature should be among the top positive or top negative features, we should take into account the sign of the most popular term; e.g. instead of

(-)people | considered | approximately +1.739 (as it is now)

it would be better to show

people | (-)considered | (-)approximately -1.739
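
A rough sketch of the proposed rule, assuming each colliding term comes with its hashing sign and a document frequency (the data layout here is illustrative):

def display_feature(terms, weight):
    # terms: list of (term, sign, frequency) tuples sharing one hash bucket.
    # Pick the most frequent term as the anchor: show it without a sign marker
    # and flip the displayed weight (and the other terms' markers) accordingly.
    anchor_sign = max(terms, key=lambda t: t[2])[1]
    shown = [term if sign == anchor_sign else '(-)' + term
             for term, sign, _freq in terms]
    return ' | '.join(shown), weight * anchor_sign

# display_feature([('people', -1, 120), ('considered', 1, 40),
#                  ('approximately', 1, 15)], +1.739)
# -> ('people | (-)considered | (-)approximately', -1.739)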

allow filtering features by their names

Sometimes it is useful to check coefficients only for some of the features. For example, here (scroll down to "What are important features?") one may want to check how e.g. query:... features affect the result, without looking at all the other features. This can also be helpful when adding a new feature.

What about adding a 'feature_re' or 'feature_patterns' argument to the explain_weights functions?
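
A sketch of what such a call could look like (the feature_re argument is the proposal here, not an existing option):

eli5.explain_weights(clf, vec=vec, feature_re=r'^query:')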

add IPython interactive widget

A widget could allow changing options, e.g. (a rough ipywidgets sketch follows the list):

  • change the number of features to show;
  • show only some of the classes;
  • filter features by name;
  • switch between layouts;
  • etc.
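
A rough sketch using ipywidgets, assuming clf and vec are already fitted and that show_weights forwards formatting options such as horizontal_layout:

from ipywidgets import interact
from IPython.display import display
import eli5

@interact(top=(5, 50, 5), horizontal=True)
def show(top=20, horizontal=True):
    # re-render the weights table whenever a widget value changes
    display(eli5.show_weights(clf, vec=vec, top=top,
                              horizontal_layout=horizontal))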

JSON serialization of Explanation

I think it makes sense to add something like an asdict method to Explanation that returns a JSON-serializable object (it would just call attr.asdict(self)).
We should also add a test that checks the result is indeed JSON-serializable (right now it can contain some numpy ints that are not serializable).
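
A minimal sketch of such a method, assuming numpy scalars are the only non-serializable values (the helper name is illustrative):

import attr
import numpy as np

def explanation_to_dict(expl):
    # Return a JSON-serializable dict for an Explanation (sketch).
    def convert(value):
        # numpy scalars are not JSON-serializable; cast them to Python types
        if isinstance(value, np.integer):
            return int(value)
        if isinstance(value, np.floating):
            return float(value)
        if isinstance(value, (list, tuple)):
            return [convert(v) for v in value]
        if isinstance(value, dict):
            return {k: convert(v) for k, v in value.items()}
        return value
    return convert(attr.asdict(expl))

# a test could then simply assert that json.dumps(explanation_to_dict(expl)) works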

drop scikit-learn 0.17.x support

I tried to add tests for scikit-learn 0.17, but it turns out the compatibility shims in eli5.lime don't work - e.g. KFold has a different API. What do you think about dropping scikit-learn 0.17 support and supporting only 0.18.x? //cc @lopuhin

Negative feature weights have different order in text and html

Order in text is wrong:

 $ py.test tests/test_sklearn_explain_prediction.py::test_explain_linear_regression[reg0] -s
============================================================================== test session starts ===============================================================================
platform darwin -- Python 3.5.1, pytest-3.0.2, py-1.4.31, pluggy-0.3.1
rootdir: /Users/kostia/shub/memex/eli5, inifile: 
plugins: hypothesis-3.4.2
collected 25 items 

tests/test_sklearn_explain_prediction.py {'estimator': 'ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, '
              'l1_ratio=0.5,\n'
              '      max_iter=1000, normalize=False, positive=False, '
              'precompute=False,\n'
              "      random_state=42, selection='cyclic', tol=0.0001, "
              'warm_start=False)',
 'method': 'linear model',
 'targets': [{'feature_weights': {'neg': [('x10', -19.656206335733643),
                                          ('x12', -16.947217711388856),
                                          ('x9', -3.368443508747657),
                                          ('x7', -0.73147197826808674)],
                                  'neg_remaining': 0,
                                  'pos': [('<BIAS>', 38.96972344614295),
                                          ('x5', 6.8348858609128671),
                                          ('x11', 4.8082096167385444),
                                          ('x8', 1.8485323743243427),
                                          ('x0', 0.23929256935816867)],
                                  'pos_remaining': 0},
              'score': 11.997304333338633,
              'target': 'y'}]}
Explained as: linear model
'y' (score=11.997) top features
----------------
 +38.970  <BIAS>
  +6.835  x5    
  +4.808  x11   
  +1.849  x8    
  +0.239  x0    
 -19.656  x10   
 -16.947  x12   
  -3.368  x9    
  -0.731  x7    

It is hard to customize formatting in IPython notebook

Currently, in order to change formatting options in an IPython notebook, the user has to do something like this:

from IPython.display import HTML
from eli5 import explain_weights
from eli5.formatters import format_as_html

expl = explain_weights(clf, vec=fe, top=20)
HTML(format_as_html(expl, highlight_spaces=False, horizontal_layout=False))

It'd be nice to reduce it to a one-liner.
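
For comparison, something along these lines could be the one-liner, assuming show_weights forwards formatter options:

eli5.show_weights(clf, vec=fe, top=20,
                  highlight_spaces=False, horizontal_layout=False)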

add helpers for non-text data to eli5.lime

Extra white borders in html table for feature importances

At least when the table has no extra styles. Reproducing:

py.test tests/test_sklearn_explain_weights.py::test_explain_random_forest -s
open .html/test_sklearn_explain_weights_test_explain_random_forest_RandomForestClassifier.html

(screenshot)

TODO:

  • check weights table styles
  • check styles in ipython notebook

Make _weight_range and _weight_color functions from formatters.html public

And maybe also some other functions? They are needed if we want to render weights in HTML similarly to how the HTML formatter does it.
Another option would be to use an object instead of a (name, weight) tuple and add an hsl_color attribute to it. I'm not sure which is better; making the functions public feels like less of a commitment.
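
For reference, a rough standalone sketch of the kind of mapping these helpers implement (not the exact eli5 code): hue encodes the sign and lightness encodes magnitude relative to the largest absolute weight.

def weight_range(weights):
    # largest absolute weight, used to normalize colors
    return max((abs(w) for _, w in weights), default=1.0)

def weight_color(weight, w_range, min_lightness=0.6):
    # green-ish hue for positive weights, red-ish for negative;
    # smaller magnitudes get a lighter color
    hue = 120 if weight > 0 else 0
    rel = abs(weight) / w_range if w_range else 0.0
    lightness = 1.0 - (1.0 - min_lightness) * rel
    return 'hsl({}, 100%, {:.0f}%)'.format(hue, lightness * 100)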

Unstable test test_lime_utils.py::test_fit_proba

https://travis-ci.org/TeamHG-Memex/eli5/jobs/173112065 - I think this is the same failure I already saw; I added random_state but it did not help:

=================================== FAILURES ===================================
________________________________ test_fit_proba ________________________________
    def test_fit_proba():
        X = np.array([
            [0.0, 0.8],
            [0.0, 0.5],
            [1.0, 0.1],
            [0.9, 0.2],
            [0.7, 0.3],
        ])
        y_proba = np.array([
            [0.0, 1.0],
            [0.1, 0.9],
            [1.0, 0.0],
            [0.55, 0.45],
            [0.4, 0.6],
        ])
        y_bin = y_proba.argmax(axis=1)
    
        # fit on binary labels
        clf = LogisticRegression(C=10, random_state=42)
        clf.fit(X, y_bin)
        y_pred = clf.predict_proba(X)[:,1]
        mae = mean_absolute_error(y_proba[:,1], y_pred)
        print(y_pred, mae)
    
        # fit on probabilities
        clf2 = LogisticRegression(C=10, random_state=42)
        fit_proba(clf2, X, y_proba, expand_factor=200)
        y_pred2 = clf2.predict_proba(X)[:,1]
        mae2 = mean_absolute_error(y_proba[:,1], y_pred2)
        print(y_pred2, mae2)
    
        assert mae2 * 1.2 < mae
    
        # let's get 3th example really right
        sample_weight = np.array([0.1, 0.1, 0.1, 10.0, 0.1])
        clf3 = LogisticRegression(C=10, random_state=42)
        fit_proba(clf3, X, y_proba, expand_factor=200, sample_weight=sample_weight)
        y_pred3 = clf3.predict_proba(X)[:,1]
        print(y_pred3)
    
        val = y_proba[3][1]
        assert abs(y_pred3[3] - val) * 1.5 < abs(y_pred2[3] - val)
>       assert abs(y_pred3[3] - val) * 1.5 < abs(y_pred[3] - val)
E       assert (0.077946544208881308 * 1.5) < 0.10327808741270417
E        +  where 0.077946544208881308 = abs((0.3720534557911187 - 0.45000000000000001))
E        +  and   0.10327808741270417 = abs((0.34672191258729584 - 0.45000000000000001))
tests/test_lime_utils.py:53: AssertionError
----------------------------- Captured stdout call -----------------------------
[ 0.92137462  0.87156298  0.26152978  0.34672191  0.49837953] 0.114698148448
[ 0.99854408  0.90620802  0.1122826   0.31398412  0.59140365] 0.0529117527887
[ 0.9862338   0.94839957  0.23016764  0.37205346  0.59652343]

show_weights with OneVsRestClassifier

Hi guys, I really like this tool! I have a pipeline, say:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(y_train)
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
clf = OneVsRestClassifier(LogisticRegressionCV())
pipeline = make_pipeline(vec, clf)
pipeline.fit(X_train, y_train)

show_prediction works neatly, but I run into 'LogisticRegressionCV' object has no attribute 'classes_' when calling eli5.show_weights(clf.estimator, vec=vec, target_names=mlb.classes_), or an 'unsupported class' error if I use clf directly.

Is it possible to work around this problem, or do you plan to add support for this soon?

Cheers!
Simon
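
One possible workaround, assuming the per-label sub-estimators are what you want to inspect, is to explain each fitted estimator from clf.estimators_ separately:

from IPython.display import display
import eli5

# after pipeline.fit, clf.estimators_ holds one fitted binary classifier per label
for label, est in zip(mlb.classes_, clf.estimators_):
    print(label)
    display(eli5.show_weights(est, vec=vec, top=10))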

defer generating dummy feature names in FeatureUnhasher

I think FeatureUnhasher.get_feature_names should have an option to use nan / None as feature names instead of generated FEATURE[%d] string names. Creating all these strings is the slowest part of this code, and it looks unnecessary because the printing/formatting code can easily generate missing feature names itself.
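
A hypothetical sketch of the idea (the generate_missing flag below is illustrative, not an existing argument):

def get_feature_names(unhashed, n_features, generate_missing=True):
    # unhashed: dict mapping column index -> recovered feature name
    if generate_missing:
        return [unhashed.get(i, 'FEATURE[%d]' % i) for i in range(n_features)]
    # leave unknown columns as None; the formatter can fill them in lazily
    return [unhashed.get(i) for i in range(n_features)]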

html features: preserve whitespaces

Features with leading whitespace get that whitespace removed in HTML.

Compare:

+2.837  spa 
+2.805   spa

and

(screenshot of the HTML rendering)

I think whitespaces should be replaced with &nbsp; for HTML display. It could also make sense to use a different background for the text, in order to make trailing whitespace visible.
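
A minimal sketch of the proposed replacement when emitting HTML:

import html

def html_feature(feature):
    # escape HTML special characters, then make leading/trailing spaces visible
    return html.escape(feature).replace(' ', '&nbsp;')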

Text highlighting: should we preserve density?

When highlighting a feature, we can highlight it regardless of its length (the current behaviour in master), or try to preserve density, coloring a longer feature with a less intense color. I tried the second approach in the preserve-density branch; here are some screenshots with the master behaviour on top (links to notebooks: https://github.com/TeamHG-Memex/eli5/blob/preserve-density/notebooks/explain_text_prediction.ipynb for words and https://github.com/TeamHG-Memex/eli5/blob/preserve-density/notebooks/explain_text_prediction_char.ipynb for chars).

(screenshots)
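
A rough sketch of the density idea: divide the weight by the character length of the highlighted span before mapping it to a color, so longer features get a proportionally less intense color.

def char_weight(weight, feature, preserve_density=False):
    # weight used for coloring one highlighted span (sketch)
    if preserve_density and len(feature):
        return weight / len(feature)
    return weight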
