modeloriented / modelstudio

πŸ“ Interactive Studio for Explanatory Model Analysis

Home Page: https://doi.org/10.1007/s10618-023-00924-w

License: GNU General Public License v3.0

Languages: R 82.74%, JavaScript 15.63%, CSS 1.63%
Topics: xai, iml, interpretability, interpretable-machine-learning, explanatory-model-analysis, explainable-ai, explainable-machine-learning, model-visualization, learning, machine

modelstudio's Introduction

Interactive Studio for Explanatory Model Analysis

[Badges: CRAN status · R build status · Codecov test coverage · JOSS status]

Overview

The modelStudio package automates the explanatory analysis of machine learning predictive models. With a single line of code, it generates advanced interactive model explanations in the form of a serverless HTML site. The tool is model-agnostic and therefore compatible with most black-box predictive models and frameworks (e.g. mlr/mlr3, xgboost, caret, h2o, parsnip, tidymodels, scikit-learn, lightgbm, keras/tensorflow).

The main modelStudio() function computes various (instance- and model-level) explanations and produces a customisable dashboard consisting of multiple panels for plots with their short descriptions. The dashboard can easily be saved and shared with others. Tools for Explanatory Model Analysis unite with tools for Exploratory Data Analysis to give a broad overview of the model behavior.
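For example, the panel grid and plot options can be tweaked through modelStudio() arguments. A minimal sketch, assuming explainer is a DALEX explainer like the one created in the Simple demo below (the facet_dim and margin values here are illustrative):

library(modelStudio)

# a 2 x 3 grid of panels with a wider left margin for long axis labels
modelStudio(explainer,
            facet_dim = c(2, 3),
            options = ms_options(margin_left = 150))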

Links: explain COVID-19 · R & Python examples · More resources · Interactive EMA

The modelStudio package is a part of the DrWhy.AI universe.

Installation

# Install from CRAN:
install.packages("modelStudio")

# Install the development version from GitHub:
devtools::install_github("ModelOriented/modelStudio")

Simple demo

library("DALEX")
library("ranger")
library("modelStudio")

# fit a model
model <- ranger(score ~., data = happiness_train)

# create an explainer for the model    
explainer <- explain(model,
                     data = happiness_test,
                     y = happiness_test$score,
                     label = "Random Forest")

# make a studio for the model
modelStudio(explainer)

Save the output in the form of an HTML file - Demo Dashboard.

R & Python examples


The modelStudio() function uses DALEX explainers created with DALEX::explain() or DALEXtra::explain_*().

# packages for the explainer objects
install.packages("DALEX")
install.packages("DALEXtra")

Make a studio for the regression ranger model on the apartments data.

# load packages and data
library(mlr)
library(DALEXtra)
library(modelStudio)

data <- DALEX::apartments

# split the data
index <- sample(1:nrow(data), 0.7*nrow(data))
train <- data[index,]
test <- data[-index,]

# fit a model
task <- makeRegrTask(id = "apartments", data = train, target = "m2.price")
learner <- makeLearner("regr.ranger", predict.type = "response")
model <- train(learner, task)

# create an explainer for the model
explainer <- explain_mlr(model,
                         data = test,
                         y = test$m2.price,
                         label = "mlr")

# pick observations
new_observation <- test[1:2,]
rownames(new_observation) <- c("id1", "id2")

# make a studio for the model
modelStudio(explainer, new_observation)

xgboost dashboard

Make a studio for the classification xgboost model on the titanic data.

# load packages and data
library(xgboost)
library(DALEX)
library(modelStudio)

data <- DALEX::titanic_imputed

# split the data
index <- sample(1:nrow(data), 0.7*nrow(data))
train <- data[index,]
test <- data[-index,]

train_matrix <- model.matrix(survived ~.-1, train)
test_matrix <- model.matrix(survived ~.-1, test)

# fit a model
xgb_matrix <- xgb.DMatrix(train_matrix, label = train$survived)
params <- list(max_depth = 3, objective = "binary:logistic", eval_metric = "auc")
model <- xgb.train(params, xgb_matrix, nrounds = 500)

# create an explainer for the model
explainer <- explain(model,
                     data = test_matrix,
                     y = test$survived,
                     type = "classification",
                     label = "xgboost")

# pick observations
new_observation <- test_matrix[1:2, , drop=FALSE]
rownames(new_observation) <- c("id1", "id2")

# make a studio for the model
modelStudio(explainer, new_observation)

The modelStudio() function uses dalex explainers created with dalex.Explainer().

:: package for the Explainer object
pip install dalex -U

Use the pickle Python module and the reticulate R package to easily make a studio for a model.

# package for pickle load
install.packages("reticulate")

scikit-learn dashboard

Make a studio for the regression Pipeline SVR model on the fifa data.


First, use dalex in Python:

# load packages and data
import dalex as dx
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from numpy import log

data = dx.datasets.load_fifa()
X = data.drop(columns=['overall', 'potential', 'value_eur', 'wage_eur', 'nationality'])
y = log(data.value_eur)

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# fit a pipeline model
model = Pipeline([('scale', StandardScaler()), ('svm', SVR())])
model.fit(X_train, y_train)

# create an explainer for the model
explainer = dx.Explainer(model, data=X_test, y=y_test, label='scikit-learn')

# pack the explainer into a pickle file
explainer.dump(open('explainer_scikitlearn.pickle', 'wb'))

Then, use modelStudio in R:

# load the explainer from the pickle file
library(reticulate)
explainer <- py_load_object("explainer_scikitlearn.pickle", pickle = "pickle")

# make a studio for the model
library(modelStudio)
modelStudio(explainer, B = 5)

lightgbm dashboard

Make a studio for the classification Pipeline LGBMClassifier model on the titanic data.


First, use dalex in Python:

# load packages and data
import dalex as dx
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from lightgbm import LGBMClassifier

data = dx.datasets.load_titanic()
X = data.drop(columns='survived')
y = data.survived

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# fit a pipeline model
numerical_features = ['age', 'fare', 'sibsp', 'parch']
numerical_transformer = Pipeline(
  steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
  ]
)
categorical_features = ['gender', 'class', 'embarked']
categorical_transformer = Pipeline(
  steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
  ]
)

preprocessor = ColumnTransformer(
  transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
  ]
)

classifier = LGBMClassifier(n_estimators=300)

model = Pipeline(
  steps=[
    ('preprocessor', preprocessor),
    ('classifier', classifier)
  ]
)
model.fit(X_train, y_train)

# create an explainer for the model
explainer = dx.Explainer(model, data=X_test, y=y_test, label='lightgbm')

# pack the explainer into a pickle file
explainer.dump(open('explainer_lightgbm.pickle', 'wb')) 

Then, use modelStudio in R:

# load the explainer from the pickle file
library(reticulate)
explainer <- py_load_object("explainer_lightgbm.pickle", pickle = "pickle")

# make a studio for the model
library(modelStudio)
modelStudio(explainer)

Save & share

Save modelStudio as an HTML file using the buttons at the top of the RStudio Viewer or with r2d3::save_d3_html().
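A minimal sketch, assuming the explainer from the Simple demo above (the file name is arbitrary):

library(modelStudio)

# compute the dashboard, then write it to a standalone HTML file
ms <- modelStudio(explainer)
r2d3::save_d3_html(ms, file = "modelstudio_dashboard.html")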

Citations

If you use modelStudio, please cite our JOSS article:

@article{baniecki2019modelstudio,
  title   = {{modelStudio: Interactive Studio with Explanations for ML Predictive Models}},
  author  = {Hubert Baniecki and Przemyslaw Biecek},
  journal = {Journal of Open Source Software},
  year    = {2019},
  volume  = {4},
  number  = {43},
  pages   = {1798},
  url     = {https://doi.org/10.21105/joss.01798}
}

For a description and evaluation of the Interactive EMA process, refer to our DAMI article:

@article{baniecki2023grammar,
  title   = {The grammar of interactive explanatory model analysis},
  author  = {Hubert Baniecki and Dariusz Parzych and Przemyslaw Biecek},
  journal = {Data Mining and Knowledge Discovery},
  year    = {2023},
  pages   = {1--37},
  url     = {https://doi.org/10.1007/s10618-023-00924-w}
}
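When the package is installed, the citation entries should also be available from within R:

citation("modelStudio")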

More resources

Acknowledgments

Work on this package was financially supported by the National Science Centre (Poland) grant 2016/21/B/ST6/02176 and National Centre for Research and Development grant POIR.01.01.01-00-0328/17.

modelstudio's People

Contributors: hbaniecki, kyleniemeyer, pbiecek, piotrpiatyszek


modelstudio's Issues

Display feature of interest in plots

Hi,

I would like to know how to display the features of interest on a modelStudio plot. It looks like modelStudio chooses the first feature in the data frame by default, and information on the rest of the features is only made available by hovering over the plots.

Example from modelStudio website:

library("DALEX")
library("modelStudio")

# fit a model
model <- glm(survived ~., data = titanic_imputed, family = "binomial")

# create an explainer for the model    
explainer <- explain(model,
                     data = titanic_imputed,
                     y = titanic_imputed$survived,
                     label = "Titanic GLM")

# make a studio for the model
modelStudio(explainer)

The only feature displayed on the plot is gender, which is the first column in titanic_imputed.

Unless I'm missing something, it appears that there is no mention in the manual about how to change this. There's also no option for changing this in the actual plot.
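One workaround suggested by the default behavior described above would be to reorder the columns so the feature of interest comes first; an untested sketch:

# untested sketch: put the feature of interest (here: age) first,
# since modelStudio seems to display the first column by default
cols <- c("age", setdiff(colnames(titanic_imputed), "age"))
titanic_reordered <- titanic_imputed[, cols]

model <- glm(survived ~ ., data = titanic_reordered, family = "binomial")
explainer <- explain(model,
                     data = titanic_reordered,
                     y = titanic_reordered$survived,
                     label = "Titanic GLM")
modelStudio(explainer)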

Thanks.

problem in describe

breakpoint_description <- ifelse(multiple_breakpoints,
  paste0("Breakpoints are identified at (", variables, " = ", cut_name,
         " and ", variables, " = ", round(df[cutpoint_additional, variables], 3), ")."),
  paste0("Breakpoint is identified at (", variables, " = ", cut_name, ")."))

Browse[2]> prefix <- paste0("The highest prediction occurs for (", variables, " = ", max_name, "),",
  " while the lowest for (", variables, " = ", min_name, ").\n", breakpoint_description)

Browse[2]> cutpoint <- ifelse(multiple_breakpoints, cutpoint_additional, cutpoint)

Browse[2]> sufix <- describe_numeric_variable(original_x = attr(x, "observations"),
  df = df, cutpoint = cutpoint, variables = variables)

Browse[2]> description <- paste(introduction, prefix, sufix, sep = "\n\n")
Browse[2]> description

Error in eval(predvars, data, env) : object 'parch' not found

I can't get your demonstration example to run. I also tried installing the newest version of modelStudio and ingredients using devtools, but I still get this error:


This is the output of sessionInfo():

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_AT.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] modelStudio_0.1.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2         pillar_1.4.2       compiler_3.6.1     remotes_2.1.0      prettyunits_1.0.2 
 [6] ingredients_0.3.10 iterators_1.0.12   tools_3.6.1        testthat_2.2.1     digest_0.6.22     
[11] pkgbuild_1.0.6     pkgload_1.0.2      memoise_1.1.0      tibble_2.1.3       gtable_0.3.0      
[16] lattice_0.20-38    pkgconfig_2.0.3    rlang_0.4.1        Matrix_1.2-17      foreach_1.4.7     
[21] cli_1.1.0          rstudioapi_0.10    curl_4.2           withr_2.1.2        fs_1.3.1          
[26] desc_1.2.0         devtools_2.2.1     rprojroot_1.3-2    glmnet_2.0-18      grid_3.6.1        
[31] glue_1.3.1         R6_2.4.0           processx_3.4.1     DALEX_0.4.7        sessioninfo_1.1.1 
[36] ggplot2_3.2.1      callr_3.3.2        magrittr_1.5       usethis_1.5.1      backports_1.1.5   
[41] scales_1.0.0       codetools_0.2-16   ps_1.3.0           ellipsis_0.3.0     assertthat_0.2.1  
[46] colorspace_1.4-1   lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4   

Am I doing something wrong?

v1.0.2 release checklist

  • update fifa20
  • unify python pipelines with dalex notebook
  • update gifs
  • update dashboards
  • test examples
  • test vignettes
  • rhub::check_for_cran()
  • rhub::check_with_rdevel()
  • usethis::use_cran_comments()
  • devtools::submit_cran()
  • accept the mail
  • tag release on GitHub

Missing parts in documentation

Hi, I am one of the reviewers for your JOSS submission. I thought I'd put the things I miss in the documentation and the corresponding review checklist items here:

  • A statement of need: It is described what the software should solve, but I somehow miss what the target audience is. Is it researchers, machine learning practitioners, anyone interested in interpretable machine learning...?
  • Installation instructions: (this might be because I haven't used R much in the past year - as stated before I started the review). When I installed your package (on a Manjaro machine), I had issues because it also installed glmnet, which requires gcc-fortran (which I had to install using my package manager). First, I am wondering why it knew that it had to install glmnet - it is not mentioned in this library's DESCRIPTION (I assume it is a dependency of one of the other packages?). I am also not sure whether your README should mention that one might need to install gcc-fortran (because it is not directly used by your package). Just wanted to let you know that this might be an issue :)
  • Automated tests: The reviewing check list asks "Are there automated tests or manual steps described so that the functionality of the software can be verified?" I can't find such a thing, maybe you can point me to it.
  • Community guidelines: The reviewing check list asks "Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support" I can't find such a thing, maybe you can point me to it.

Update documentation and DESCRIPTION

For example, the description of modelStudio() is outdated:
"The main goal of this function is to connect two local model explainers: Ceteris Paribus and Break Down. It also shows global explainers for your model such as Partial Dependency and Feature Importance."
The 'Description' field in the DESCRIPTION file needs to be updated as well.

FAQ & Troubleshooting

modelStudio FAQ & Troubleshooting

Most of the information is covered in the documentation: https://modelstudio.drwhy.ai/


Please submit a new issue when dealing with potential bugs. Thanks!


  • Error occurred during the modelStudio() computation
  • foo plot doesn't show up on the dashboard
  1. Read the console output of DALEX::explain(). There could be a warning message pointing to the solution of this problem.
  2. Read the console output of modelStudio(). There could be an error message (printed as a warning) pointing to the origin and solution of this problem.
  3. Make sure to update these R packages to their latest versions: DALEX, ingredients, iBreakDown.
  • modelStudio() output shows up as a white window in the RStudio Viewer

Solve this by updating RStudio. Please check if the output shows up properly in the browser (e.g. use the viewer = "browser" argument in modelStudio()).

  • y-axis labels go outside of the plot

Use modelStudio(..., options = ms_options(margin_left = 200)).
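A sketch combining both fixes above, assuming explainer already exists:

modelStudio(explainer,
            viewer = "browser",                       # render in the browser
            options = ms_options(margin_left = 200))  # widen the left margin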

new plot: scatterplot [EDA]

It would be great to have a new plot in the dashboard: a scatterplot for EDA.
In the FIFA example, I would like to see the relation between Player Value and Age.
This would nicely supplement the PDP for the model.

TODO

  • change mlr? example to the regression model on DALEX::apartments
  • change sklearn? example to the regression model on dalex fifa
  • use Explainer.dump() in python examples
  • use ranger instead of randomForest (everywhere)
  • pip install dalex console chunk
  • update gifs
  • add parsnip example
  • use macos devel in gh-actions
  • citation
  • change default B = 10, N = 300 to support "fast feedback loop" process
  • add N/n_samples to feature_importance calculation
  • remove d3 from DESC and README
  • remove txtProgressBar import
  • remove covr from suggests
  • fix wrong vignette indexEntry p&r
  • write blog about IEMA

DALEXverse 0.19.8 release summer 2019

Integration

  • readability: vignettes
  • readability: NEWS
  • readability: DESCRIPTION
  • consistency: pkgdown website
  • consistency: entry at DrWhy.AI webpage

assigned: @pbiecek

Code review

  • consistency: names of functions
  • consistency: names of files
  • consistency: names of variables in functions (local and global)
  • length: functions
  • readability: code (comments, constructions)

assigned: @maksymiuks

Feature review

  • readability: documentation (title, description, details)
  • readability: examples (relevant, complete, with comments)
  • reproducibility: tests (code coverage)
  • links to functions: \code

assigned: @WojciechKretowicz

Add NEWS file

to track changes in consecutive versions of the package
(see an example in DALEX or archivist)

modelStudio(), explainer_mlr3() and NAs

Hi,

There's a glitch in modelStudio when using mlr3 pipelines on data with missing values.

It looks like modelStudio() doesn't know how to impute missing data before crunching the numbers, even when the user has incorporated a pipe operator for missing values in the mlr3 pipeline. In fact, modelStudio() does not even recognize mlr3 learners if their class is other than [1] "LearnerClassifRanger" "LearnerClassif" "Learner" "R6" (e.g. try class(learner) for a Random Forest learner). If you have a pipeline whose class is [1] "GraphLearner" "Learner" "R6", modelStudio() doesn't know how to handle it.

The DALEXtra package's explainer_mlr3() suffers from the same issue, although this can be dealt with by providing custom functions for the predict_function and residual_function arguments.

Below is an example of a pipeline that imputes missing data and then balances classes. Note that it works fine when there are no missing data, but returns an error otherwise.

Example 1: no missing data

library(tidyverse)
library(data.table)
library(tidymodels)
library(paradox)
library(mlr3) # NOTE: install mlr3 packages from GitHub, not CRAN, as they differ in a few things, e.g. with GitHub you tune the pipeline with $optimize() but with CRAN with $tune()
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(DALEXtra)
library(modelStudio)

# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))

# Ratio values for class-balancing pipe operators
class_counts <- table(task$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] / 
  class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio

# Pipe operators for class-balancing
# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor", 
  reference = "minor", shuffle = FALSE, ratio = upsample_ratio)

# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major", 
  reference = "major", shuffle = FALSE, ratio = downsample_ratio)

# Handle missing values
features_with_nas <- sort(task$missings() / task$nrow, decreasing = TRUE)
features_with_nas <- features_with_nas[features_with_nas != 0]

# Imputes values based on histogram
hist_imp <- po("imputehist", param_vals = 
  list(affect_columns = selector_name(names(features_with_nas))))

# Add an indicator column for each feature with missing values
# One-hot encode these new categorical columns, and then remove the categorical versions of them
miss_ind <- po("missind") %>>% 
  po("encode") %>>%
  po("select", 
     selector = selector_invert(selector_type("factor")), 
     id = 'dummy_encoding')

impute_data <- po("copy", 2) %>>%
  gunion(list(hist_imp, miss_ind)) %>>%
  po("featureunion")

impute_data$plot() # This is the Graph we'll add to the pipeline
impute_data$plot(html = TRUE)

# Random Forest learner with up- and down-balancing
rf <- lrn("classif.ranger", predict_type = "prob")

rf_up <- GraphLearner$new(
  po_over %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob'
)

rf_down <- GraphLearner$new(
  po_under %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob')

# All learners (Random Forest with up- and down-balancing)
learners <- list(
  rf_up,
  rf_down
)
names(learners) <- sapply(learners, function(x) x$id)

# Our pipeline
graph <- 
  impute_data %>>%
  po("branch", names(learners)) %>>% 
  gunion(unname(learners)) %>>%
  po("unbranch")

graph$plot() # Plot pipeline
graph$plot(html = TRUE) # Plot pipeline

pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # We want to predict probabilities and not classes.

param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone()))
))

# Set up tuning instance
instance <- TuningInstance$new(
  task = task,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr('classif.bbrier'),
  param_set,
  terminator = term("evals", n_evals = 3), 
  store_models = TRUE)
tuner <- TunerRandomSearch$new()

# Tune pipe learner to find best-performing branch
tuner$optimize(instance)

# Take a look at the results
instance$result
print(instance$result$tune_x$branch.selection) # Best model

# Train pipeline
pipe$train(task)

################################################################################################
# DALEXextra and modelStudio stuff
################################################################################################

# First create custom functions for predictions and residuals
# We need custom functions because explain_mlr3() doesn't recognize the Graph Learner class of mlr3
predict_function_custom <- function(model, data) {
  pr <- model$
    predict_newdata(data)$
    data$
    prob[, 1]
  
  return(pr)
}

residual_function_custom <- function(model, data, y) {
  pr <- model$
    predict_newdata(data)
  
  y_hat <- pr$
    data$
    prob[, 1]
  
  return(as.integer(y == 0) - y_hat)
}

# Run explainer - works fine with the above functions
explainer <- explain_mlr3(model = pipe,
  data = task$data()[, -1],
  y = as.integer(task$data()[, 1] == 'M'),
  predict_function = predict_function_custom,
  residual_function = residual_function_custom,
  label = "mlr3")

# HOWEVER: we have a classification task, but explainer thinks it's regression!
explainer$model_info

# Let's run modelStudio. You'll need to wait for a while
modelStudio(
  explainer, 
  new_observation = task$data()[6, -1]
)

# Ignore the warning about data format. Argument `new_observation` is a `data.table`, so its class is
# `[1] "data.table" "data.frame"`, which is essentially a data frame. The class has two elements,
# but the condition only looks at the first one.

Working just fine.

Example 2: missing data

The setup is identical to Example 1, except that missing values are introduced right after loading and subsetting the task:

# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))

# Create some missing data
data <- task$data()
data$V1[1:5] <- NA
task <- TaskClassif$new(data, id = 'sonar', target = 'Class')

# ... the rest of the code (class-balancing, imputation pipeline, tuning,
# training, custom predict/residual functions, explain_mlr3(), and the
# modelStudio() call) is identical to Example 1 ...

We get errors and no plot:

Calculating ... 
  Calculating ingredients::feature_importance 
  Calculating ingredients::partial_dependence (numerical) 
  Calculating ingredients::accumulated_dependence (numerical) 
    Elapsed time: 00:01:01 ETA...Error in seq.default(min(x[, name]), max(x[, name]), length.out = nbins) : 
  'from' must be a finite number
In addition: Warning messages:
1: In value[[3L]](cond) : 
Error occurred in ingredients::partial_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE
2: In value[[3L]](cond) : 
Error occurred in ingredients::accumulated_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE

Is there a way to pass imputed data from explainer_mlr3() to modelStudio() just like you can pass predictions and residuals with arguments predict_function and residual_function respectively? Any chances of implementing this please?

Thanks

v1.1.0 release checklist

  • add ms_update_options() and ms_update_observations() to the perks vignette
  • test vignettes
  • update dashboards
  • rhub::check_for_cran()
  • rhub::check_with_rdevel()
  • usethis::use_cran_comments()
  • devtools::submit_cran()
  • accept the mail
  • tag release on GitHub

error in the example

I was trying to execute an example for modelStudio

library("dime")
library("DALEX")

titanic <- na.omit(titanic)
set.seed(1313)
titanic_small <- titanic[sample(1:nrow(titanic), 500), c(1,2,3,6,7,9)]

model_titanic_glm <- glm(survived == "yes" ~ gender + age + fare + class + sibsp,
                         data = titanic_small, family = "binomial")

explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_small[,-6],
                               y = titanic_small$survived == "yes",
                               label = "glm")

new_observation <- titanic_small[1:10,-6]

modelStudio(explain_titanic_glm, new_observation[1,])

but this ends with

> modelStudio(explain_titanic_glm, new_observation[1,])
  |                                                                        |   0%Error in ceteris_paribus.default(x, data, predict_function = predict_function,  : 
  promise already under evaluation: recursive default argument reference or earlier problems?

Enter a frame number, or 0 to exit   

1: modelStudio(explain_titanic_glm, new_observation[1, ])
2: modelStudio.explainer(explain_titanic_glm, new_observation[1, ])
3: modelStudio.default(x = x$model, new_observation = new_observation, facet_dim
4: ingredients::accumulated_dependency(x, data, predict_function, only_numerical
5: accumulated_dependency.R#51: accumulated_dependency.default(x, data, predict_
6: accumulated_dependency.R#91: ceteris_paribus.default(x, data, predict_functio
