TE2Rules

TE2Rules is a technique for explaining Tree Ensemble models (TE), such as XGBoost and Random Forest, trained on a binary classification task, using a rule list. The extracted rule list (RL) captures the necessary and sufficient conditions for classification by the Tree Ensemble. The algorithm used by TE2Rules is based on Apriori Rule Mining. For more details on the algorithm, please check out our paper.

TE2Rules provides a ModelExplainer which takes a trained TE model and training data to extract rules. The training data is used for extracting rules with relevant combinations of input features. Without data, an explainer would have to extract rules for all possible combinations of input features, including combinations that are extremely rare in the data.

Installation

The TE2Rules package is available on PyPI and can be installed with pip:

pip install te2rules

Documentation

The official documentation of TE2Rules can be found at te2rules.readthedocs.io.

TE2Rules contains a ModelExplainer class with an explain() method which returns a rule list corresponding to the positive class prediction of the tree ensemble. While using the rule list, any data instance that does not trigger any of the extracted rules is to be interpreted as belonging to the negative class. The explain() method has tunable parameters to control the interpretability, faithfulness, runtime and coverage of the extracted rules. These are:

  • min_precision: min_precision controls the minimum precision of extracted rules. Setting it to a smaller threshold allows extracting shorter (more interpretable, but less faithful) rules. By default, the algorithm uses a minimum precision threshold of 0.95.

  • num_stages: The algorithm runs in stages, starting from stage 1 and proceeding up to stage n, where n is the number of trees in the ensemble. Stopping the algorithm at an early stage results in a few short rules (with quicker runtime, but less coverage in data). It is recommended to try running 2 or 3 stages. By default, the algorithm explores all stages before terminating.

  • jaccard_threshold: This parameter (between 0 and 1) controls how rules from different node combinations are combined. As the algorithm proceeds in stages, combining two similar rules from the previous stage results in yet another similar rule in the next stage, which is not useful for finding new rules that explain the model. Setting jaccard_threshold to a smaller value speeds up the algorithm by only combining dissimilar rules from the previous stages. By default, the algorithm uses a jaccard threshold of 0.20.

For evaluating the performance of the extracted rule list, the ModelExplainer provides a method get_fidelity(), which returns the fraction of data on which the rule list agrees with the tree ensemble, reported on positives, on negatives, and overall.

Usage

The following notebook shows a typical use case of TE2Rules on the Adult Income data. The notebook can be found here. Let us start by importing te2rules and other relevant libraries.

Let us load the training and testing data. All the data used in this notebook is already preprocessed and can be found here. The data can also be generated by running python3 data_prep/data_prep_adult.py.
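The preprocessing step can be sketched as follows. The toy rows and column names below are hypothetical stand-ins for the Adult Income schema; the real pipeline lives in data_prep/data_prep_adult.py.

```python
import pandas as pd

# Hypothetical miniature of the Adult Income data: one numeric column,
# one categorical column, and a binary label.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Self-emp", "Private", "Private"],
    "label": [0, 0, 1, 1],
})

# One-hot encode categorical columns so the tree ensemble (and the
# extracted rules) operate on numeric indicator features.
df = pd.get_dummies(df, columns=["workclass"])

X = df.drop(columns=["label"]).to_numpy()
y = df["label"].to_numpy()
print(df.columns.tolist())
```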

The tree ensemble model used in this notebook is an XGBoost model with 10 trees.
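A minimal sketch of training a 10-tree ensemble: since TE2Rules targets scikit-learn estimators, this uses scikit-learn's GradientBoostingClassifier on synthetic data as a stand-in for the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the Adult Income data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Gradient-boosted ensemble with 10 trees, matching the notebook's setup.
model = GradientBoostingClassifier(n_estimators=10, random_state=0)
model.fit(X, y)
print("train accuracy:", model.score(X, y))
```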

Let us use the TE2Rules ModelExplainer to explain the positive class predictions of the XGBoost model. We observe that TE2Rules extracts 5 rules to explain more than 99% of the positive class predictions of the tree ensemble model. In this usage, we use the default values of min_precision (0.95) and num_stages (10), since the algorithm runs quickly for a tree ensemble with 10 trees.

This is not the only possible way to explain the model's predictions. TE2Rules generates all possible explanations that can be extracted from the tree ensemble model, and then selects a small subset of rules that together explain most of the positive class predictions of the model.

However, we need not settle for the subset of rules selected by TE2Rules. To see all the possible explanations, we can use longer_rules. From this set of rules, a domain expert can select their own small subset of rules that explains most of the positive model predictions and also aligns with the decision-making process commonly used in that domain. In this way, TE2Rules offers the flexibility to choose the explanations that most closely align with the human decision-making process.

The longer set of all possible explanations has significant overlap among the rules. For a given input to the model, multiple rules that the input satisfies can be used to explain it. The explain_instance_with_rules method provides one way to show all the possible rules that can explain each input instance. Again, a domain expert can pick the rule that is most suitable for explaining that instance.

Some popular applications of TE2Rules

Here is a list of some applications of TE2Rules in high-stakes domains like healthcare, where understanding why an ML model made a particular decision is critical for developing trust in the system.

  • An Explainable Artificial Intelligence model in the assessment of Brain MRI Lesions in Multiple Sclerosis using Amplitude Modulation – Frequency Modulation multi-scale feature sets, By Andria Nicolaou, Antonis Kakas et al, In 2023 24th International Conference on Digital Signal Processing (DSP), https://ieeexplore.ieee.org/abstract/document/10167888
  • Emergency Department Triage Hospitalization Prediction Based on Machine Learning and Rule Extraction, By Waqar A. Sulaiman, Andria Nicolaou et al, In 2023 IEEE EMBS Special Topic Conference on Data Science and Engineering in Healthcare, Medicine and Biology, https://ieeexplore.ieee.org/abstract/document/10405176
  • An Explainable AI model in the assessment of Multiple Sclerosis using clinical data and Brain MRI lesion texture features, By A. Nicolaou, M. Pantzaris et al, In 2023 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), https://ieeexplore.ieee.org/abstract/document/10313379
  • A Comparative Study of Explainable AI models in the Assessment of Multiple Sclerosis, By Andria Nicolaou, Nicoletta Prentzas et al, In 2023 Computer Analysis of Images and Patterns (CAIP 2023), https://link.springer.com/chapter/10.1007/978-3-031-44240-7_14

For reproducing results in the paper

Run the following Python scripts to generate the results in the paper:

python3 demo/demo/run_te2rules.py
python3 demo/demo/run_defrag.py
python3 demo/demo/run_intrees.py
python3 plot_performance.py
python3 plot_scalability.py

License

Creative Commons Attribution-NonCommercial 4.0 International Public License, see LICENSE for more details.

Citation

Please cite TE2Rules in your publications if it helps your research:

@article{te2rules2022,
  title={TE2Rules: Explaining Tree Ensembles using Rules},
  author={Lal, G Roshan and Chen, Xiaotong and Mithal, Varun},
  journal={arXiv preprint arXiv:2206.14359},
  year={2022}
}

Contributors

groshanlal, lelomau, xtchen64

Issues

The limitation of feature-names

Hello dears,

I am a data scientist working on extracting rules from models such as Random Forest or XGBoost.
Fortunately, I found this repo and can now develop my system.

But there are some issues:
for example, some of our feature names are like 'F01-F02', and some contain characters outside the ASCII set.
If possible, remove the restriction to ASCII-only characters in feature names, or let me do that myself.

Another thing I've faced is speed!
I am extracting rules from many models, thousands of times.
Is there any way to make the process faster?

Imbalanced dataset

Hi, I am really interested in this algorithm.
I have experimented with this repo on our real-world problem and found some interesting insights.

However, there is a case where the output rules are not optimal; I hope to get your advice.

rules = model_explainer.explain(
    X=model['preprocessor'].transform(X_train), y=y_train_pred,
    num_stages = 10,
    min_precision = 0.95
)

print(str(len(rules)) + " rules found")
for i in range(len(rules)):
    print("Rule " + str(i) + ": " + str(rules[i]))

2 rules found
Rule 0: AIDMM2 <= 0.5
Rule 1: AIDMM2 > 0.5

AIDMM2 is a categorical feature that has been transformed into a numerical value (only 0 and 1).
Our dataset is extremely imbalanced, so the output rules might end up like this =))

Extracting Rules

To Whom It May Concern,

I have been facing some issues with extracting rules. I am working on a binary classification problem in which I classify emergency department length of stay (LOS) as short or long. I am able to extract rules for the long stay, but when I try to flip the signs (change the positive class to short stay), I run into a "ZeroDivisionError". Kindly help me with this issue. The error details are below.

Kind Regards,

Waqar Aziz

(screenshots of the error traceback were attached)

Support for XGBClassifier

Hello guys, first of all, great work on the current package!

I'm using it within my company and I'm finding it very useful.

The main limitation that I see right now is that it is not applicable to the most common XGBoost implementation (i.e. the xgboost package's XGBClassifier), but only to the scikit-learn one.

Do you see value in extending it? If so, I can submit a PR to you.

I've read your paper and quickly looked at your code, and I guess I should work on creating the proper adapter (https://github.com/linkedin/TE2Rules/blob/main/te2rules/adapter.py).

Let me know your feedback!

KeyError

hello,

I'm running the te2rules demo code in Colab, and the call

model_explainer.explain(X=X_train, y=y_train_pred)

keeps raising KeyError: 0. Do you know any solution?

KeyError: 0

When I run the demo code

from te2rules.explainer import ModelExplainer
model_explainer = ModelExplainer(model=model, feature_names=feature_names)
rules = model_explainer.explain(X=x_train, y=y_train_pred)

it causes the error KeyError: 0.
