modeloriented / modelstudio

πŸ“ Interactive Studio for Explanatory Model Analysis

Home Page: https://doi.org/10.1007/s10618-023-00924-w

License: GNU General Public License v3.0

Languages: R 82.74%, JavaScript 15.63%, CSS 1.63%
Topics: xai, iml, interpretability, interpretable-machine-learning, explanatory-model-analysis, explainable-ai, explainable-machine-learning, model-visualization, learning, machine

modelstudio's Introduction

Interactive Studio for Explanatory Model Analysis

[Badges: CRAN status · R build status · Codecov test coverage · JOSS status]

Overview

The modelStudio package automates the explanatory analysis of machine learning predictive models. With a single line of code, it generates advanced interactive model explanations in the form of a serverless HTML site. The tool is model-agnostic and therefore compatible with most black-box predictive models and frameworks (e.g. mlr/mlr3, xgboost, caret, h2o, parsnip, tidymodels, scikit-learn, lightgbm, keras/tensorflow).

The main modelStudio() function computes various (instance- and model-level) explanations and produces a customisable dashboard consisting of multiple panels for plots with their short descriptions. The dashboard can easily be saved and shared with others. Tools for Explanatory Model Analysis unite with tools for Exploratory Data Analysis to give a broad overview of the model behavior.
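For example, the panel grid and plot options can be tweaked through modelStudio() arguments. A minimal sketch, assuming explainer is a DALEX explainer like the one created in the Simple demo below (the facet_dim and margin values here are illustrative):

library(modelStudio)

# a 2 x 3 grid of panels with a wider left margin for long axis labels
modelStudio(explainer,
            facet_dim = c(2, 3),
            options = ms_options(margin_left = 150))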

Links: explain COVID-19 · R & Python examples · More resources · Interactive EMA

The modelStudio package is a part of the DrWhy.AI universe.

Installation

# Install from CRAN:
install.packages("modelStudio")

# Install the development version from GitHub:
devtools::install_github("ModelOriented/modelStudio")

Simple demo

library("DALEX")
library("ranger")
library("modelStudio")

# fit a model
model <- ranger(score ~., data = happiness_train)

# create an explainer for the model    
explainer <- explain(model,
                     data = happiness_test,
                     y = happiness_test$score,
                     label = "Random Forest")

# make a studio for the model
modelStudio(explainer)

Save the output in the form of an HTML file - Demo Dashboard.

R & Python examples


The modelStudio() function uses DALEX explainers created with DALEX::explain() or DALEXtra::explain_*().

# packages for the explainer objects
install.packages("DALEX")
install.packages("DALEXtra")

Make a studio for the regression ranger model on the apartments data.

# load packages and data
library(mlr)
library(DALEXtra)
library(modelStudio)

data <- DALEX::apartments

# split the data
index <- sample(1:nrow(data), 0.7*nrow(data))
train <- data[index,]
test <- data[-index,]

# fit a model
task <- makeRegrTask(id = "apartments", data = train, target = "m2.price")
learner <- makeLearner("regr.ranger", predict.type = "response")
model <- train(learner, task)

# create an explainer for the model
explainer <- explain_mlr(model,
                         data = test,
                         y = test$m2.price,
                         label = "mlr")

# pick observations
new_observation <- test[1:2,]
rownames(new_observation) <- c("id1", "id2")

# make a studio for the model
modelStudio(explainer, new_observation)

xgboost dashboard

Make a studio for the classification xgboost model on the titanic data.

# load packages and data
library(xgboost)
library(DALEX)
library(modelStudio)

data <- DALEX::titanic_imputed

# split the data
index <- sample(1:nrow(data), 0.7*nrow(data))
train <- data[index,]
test <- data[-index,]

train_matrix <- model.matrix(survived ~.-1, train)
test_matrix <- model.matrix(survived ~.-1, test)

# fit a model
xgb_matrix <- xgb.DMatrix(train_matrix, label = train$survived)
params <- list(max_depth = 3, objective = "binary:logistic", eval_metric = "auc")
model <- xgb.train(params, xgb_matrix, nrounds = 500)

# create an explainer for the model
explainer <- explain(model,
                     data = test_matrix,
                     y = test$survived,
                     type = "classification",
                     label = "xgboost")

# pick observations
new_observation <- test_matrix[1:2, , drop=FALSE]
rownames(new_observation) <- c("id1", "id2")

# make a studio for the model
modelStudio(explainer, new_observation)

The modelStudio() function uses dalex explainers created with dalex.Explainer().

:: package for the Explainer object
pip install dalex -U

Use the pickle Python module and the reticulate R package to easily make a studio for a model.

# package for pickle load
install.packages("reticulate")

scikit-learn dashboard

Make a studio for the regression Pipeline SVR model on the fifa data.


First, use dalex in Python:

# load packages and data
import dalex as dx
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from numpy import log

data = dx.datasets.load_fifa()
X = data.drop(columns=['overall', 'potential', 'value_eur', 'wage_eur', 'nationality'])
y = log(data.value_eur)

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# fit a pipeline model
model = Pipeline([('scale', StandardScaler()), ('svm', SVR())])
model.fit(X_train, y_train)

# create an explainer for the model
explainer = dx.Explainer(model, data=X_test, y=y_test, label='scikit-learn')

# pack the explainer into a pickle file
explainer.dump(open('explainer_scikitlearn.pickle', 'wb'))

Then, use modelStudio in R:

# load the explainer from the pickle file
library(reticulate)
explainer <- py_load_object("explainer_scikitlearn.pickle", pickle = "pickle")

# make a studio for the model
library(modelStudio)
modelStudio(explainer, B = 5)

lightgbm dashboard

Make a studio for the classification Pipeline LGBMClassifier model on the titanic data.


First, use dalex in Python:

# load packages and data
import dalex as dx
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from lightgbm import LGBMClassifier

data = dx.datasets.load_titanic()
X = data.drop(columns='survived')
y = data.survived

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# fit a pipeline model
numerical_features = ['age', 'fare', 'sibsp', 'parch']
numerical_transformer = Pipeline(
  steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
  ]
)
categorical_features = ['gender', 'class', 'embarked']
categorical_transformer = Pipeline(
  steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
  ]
)

preprocessor = ColumnTransformer(
  transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
  ]
)

classifier = LGBMClassifier(n_estimators=300)

model = Pipeline(
  steps=[
    ('preprocessor', preprocessor),
    ('classifier', classifier)
  ]
)
model.fit(X_train, y_train)

# create an explainer for the model
explainer = dx.Explainer(model, data=X_test, y=y_test, label='lightgbm')

# pack the explainer into a pickle file
explainer.dump(open('explainer_lightgbm.pickle', 'wb')) 

Then, use modelStudio in R:

# load the explainer from the pickle file
library(reticulate)
explainer <- py_load_object("explainer_lightgbm.pickle", pickle = "pickle")

# make a studio for the model
library(modelStudio)
modelStudio(explainer)

Save & share

Save modelStudio as an HTML file using the buttons at the top of the RStudio Viewer or with r2d3::save_d3_html().
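A minimal sketch, assuming the explainer from the Simple demo above (the file name is arbitrary):

library(modelStudio)

# compute the dashboard, then write it to a standalone HTML file
ms <- modelStudio(explainer)
r2d3::save_d3_html(ms, file = "modelstudio_dashboard.html")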

Citations

If you use modelStudio, please cite our JOSS article:

@article{baniecki2019modelstudio,
  title   = {{modelStudio: Interactive Studio with Explanations for ML Predictive Models}},
  author  = {Hubert Baniecki and Przemyslaw Biecek},
  journal = {Journal of Open Source Software},
  year    = {2019},
  volume  = {4},
  number  = {43},
  pages   = {1798},
  url     = {https://doi.org/10.21105/joss.01798}
}

For a description and evaluation of the Interactive EMA process, refer to our DAMI article:

@article{baniecki2023grammar,
  title   = {The grammar of interactive explanatory model analysis},
  author  = {Hubert Baniecki and Dariusz Parzych and Przemyslaw Biecek},
  journal = {Data Mining and Knowledge Discovery},
  year    = {2023},
  pages   = {1--37},
  url     = {https://doi.org/10.1007/s10618-023-00924-w}
}
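When the package is installed, the citation entries should also be available from within R:

citation("modelStudio")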

More resources

Acknowledgments

Work on this package was financially supported by the National Science Centre (Poland) grant 2016/21/B/ST6/02176 and National Centre for Research and Development grant POIR.01.01.01-00-0328/17.

modelstudio's People

Contributors: hbaniecki, kyleniemeyer, pbiecek, piotrpiatyszek


modelstudio's Issues

Display feature of interest in plots

Hi,

I would like to know how to display the features of interest on a modelStudio plot. It looks like modelStudio chooses the first feature in the data frame by default, and information on the rest of the features is only made available by hovering over the plots.

Example from modelStudio website:

library("DALEX")
library("modelStudio")

# fit a model
model <- glm(survived ~., data = titanic_imputed, family = "binomial")

# create an explainer for the model    
explainer <- explain(model,
                     data = titanic_imputed,
                     y = titanic_imputed$survived,
                     label = "Titanic GLM")

# make a studio for the model
modelStudio(explainer)

The only feature displayed on the plot is gender, which is the first column in titanic_imputed.

Unless I'm missing something, it appears that there is no mention in the manual about how to change this. There's also no option for changing this in the actual plot.
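One workaround suggested by the default behavior described above would be to reorder the columns so the feature of interest comes first; an untested sketch:

# untested sketch: put the feature of interest (here: age) first,
# since modelStudio seems to display the first column by default
cols <- c("age", setdiff(colnames(titanic_imputed), "age"))
titanic_reordered <- titanic_imputed[, cols]

model <- glm(survived ~ ., data = titanic_reordered, family = "binomial")
explainer <- explain(model,
                     data = titanic_reordered,
                     y = titanic_reordered$survived,
                     label = "Titanic GLM")
modelStudio(explainer)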

Thanks.

problem in describe

breakpoint_description <- ifelse(multiple_breakpoints,
  paste0("Breakpoints are identified at (", variables, " = ", cut_name,
         " and ", variables, " = ", round(df[cutpoint_additional, variables], 3), ")."),
  paste0("Breakpoint is identified at (", variables, " = ", cut_name, ")."))

Browse[2]> prefix <- paste0("The highest prediction occurs for (", variables, " = ", max_name, "),",
  " while the lowest for (", variables, " = ", min_name, ").\n", breakpoint_description)

Browse[2]> cutpoint <- ifelse(multiple_breakpoints, cutpoint_additional, cutpoint)

Browse[2]> sufix <- describe_numeric_variable(original_x = attr(x, "observations"),
  df = df, cutpoint = cutpoint, variables = variables)

Browse[2]> description <- paste(introduction, prefix, sufix, sep = "\n\n")
Browse[2]> description

Error in eval(predvars, data, env) : object 'parch' not found

I can't get your demonstration example to run. I also tried installing the newest version of modelStudio and ingredients using devtools, but I still get this error:


This is the output of sessionInfo():

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_AT.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] modelStudio_0.1.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2         pillar_1.4.2       compiler_3.6.1     remotes_2.1.0      prettyunits_1.0.2 
 [6] ingredients_0.3.10 iterators_1.0.12   tools_3.6.1        testthat_2.2.1     digest_0.6.22     
[11] pkgbuild_1.0.6     pkgload_1.0.2      memoise_1.1.0      tibble_2.1.3       gtable_0.3.0      
[16] lattice_0.20-38    pkgconfig_2.0.3    rlang_0.4.1        Matrix_1.2-17      foreach_1.4.7     
[21] cli_1.1.0          rstudioapi_0.10    curl_4.2           withr_2.1.2        fs_1.3.1          
[26] desc_1.2.0         devtools_2.2.1     rprojroot_1.3-2    glmnet_2.0-18      grid_3.6.1        
[31] glue_1.3.1         R6_2.4.0           processx_3.4.1     DALEX_0.4.7        sessioninfo_1.1.1 
[36] ggplot2_3.2.1      callr_3.3.2        magrittr_1.5       usethis_1.5.1      backports_1.1.5   
[41] scales_1.0.0       codetools_0.2-16   ps_1.3.0           ellipsis_0.3.0     assertthat_0.2.1  
[46] colorspace_1.4-1   lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4   

Am I doing something wrong?

v1.0.2 release checklist

  • update fifa20
  • unify python pipelines with dalex notebook
  • update gifs
  • update dashboards
  • test examples
  • test vignettes
  • rhub::check_for_cran()
  • rhub::check_with_rdevel()
  • usethis::use_cran_comments()
  • devtools::submit_cran()
  • accept the mail
  • tag release on GitHub

Missing parts in documentation

Hi, I am one of the reviewers for your JOSS submission. I thought I'd put the things I miss in the documentation and the corresponding review checklist items here:

  • A statement of need: It is described what the software should solve, but I somehow miss what the target audience is. Is it researchers, machine learning practitioners, anyone interested in interpretable machine learning...?
  • Installation instructions: (this might be because I haven't used R much in the past year - as stated before I started the review). When I installed your package (on a Manjaro machine), I had issues because it also installed glmnet, which requires gcc-fortran (which I had to install using my package manager). First, I am wondering why it knew that it had to install glmnet - it is not mentioned in this library's DESCRIPTION (I assume it is a dependency of one of the other packages?). I am also not sure whether your README should mention that one might need to install gcc-fortran (because it is not directly used by your package). Just wanted to let you know that this might be an issue :)
  • Automated tests: The reviewing check list asks "Are there automated tests or manual steps described so that the functionality of the software can be verified?" I can't find such a thing, maybe you can point me to it.
  • Community guidelines: The reviewing check list asks "Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support" I can't find such a thing, maybe you can point me to it.

Update documentation and DESCRIPTION

For example, the description of modelStudio() is outdated:
"The main goal of this function is to connect two local model explainers: Ceteris Paribus and Break Down. It also shows global explainers for your model such as Partial Dependency and Feature Importance."
The 'Description' field in the DESCRIPTION file needs to be updated as well.

FAQ & Troubleshooting

modelStudio FAQ & Troubleshooting

Most of the information is covered in the documentation: https://modelstudio.drwhy.ai/


Please submit a new issue when dealing with potential bugs. Thanks!


  • Error occurred during the modelStudio() computation
  • foo plot doesn't show up on the dashboard
  1. Read the console output of DALEX::explain(). There could be a warning message pointing to the solution of this problem.
  2. Read the console output of modelStudio(). There could be an error message (printed as a warning) pointing to the origin and solution of this problem.
  3. Make sure to update these R packages to their latest versions: DALEX, ingredients, iBreakDown.
  • modelStudio() output shows up as a white window in the RStudio Viewer

Solve this by updating RStudio. Please check if the output shows up properly in the browser (e.g. use the viewer = "browser" argument in modelStudio()).

  • y-axis labels go outside of the plot

Use modelStudio(..., options = ms_options(margin_left = 200)).
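A sketch combining both fixes above, assuming explainer already exists:

modelStudio(explainer,
            viewer = "browser",                       # render in the browser
            options = ms_options(margin_left = 200))  # widen the left margin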

new plot: scatterplot [EDA]

It would be great to have a new plot in the dashboard: a scatterplot for EDA.
In the FIFA example, I would like to see the relation between Player Value and Age.
This would nicely supplement the PDP for the model.

TODO

  • change mlr? example to the regression model on DALEX::apartments
  • change sklearn? example to the regression model on dalex fifa
  • use Explainer.dump() in python examples
  • use ranger instead of randomForest (everywhere)
  • pip install dalex console chunk
  • update gifs
  • add parsnip example
  • use macos devel in gh-actions
  • citation
  • change default B = 10, N = 300 to support "fast feedback loop" process
  • add N/n_samples to feature_importance calculation
  • remove d3 from DESC and README
  • remove txtProgressBar import
  • remove covr from suggests
  • fix wrong vignette indexEntry p&r
  • write blog about IEMA

DALEXverse 0.19.8 release summer 2019

Integration

  • readability: vignettes
  • readability: NEWS
  • readability: DESCRIPTION
  • consistency: pkgdown website
  • consistency: entry at DrWhy.AI webpage

assigned: @pbiecek

Code review

  • consistency: names of functions
  • consistency: names of files
  • consistency: names of variables in functions (local and global)
  • length: functions
  • readability: code (comments, constructions)

assigned: @maksymiuks

Feature review

  • readability: documentation (title, description, details)
  • readability: examples (relevant, complete, with comments)
  • reproducibility: tests (code coverage)
  • links to functions: \code

assigned: @WojciechKretowicz

Add NEWS file

to track changes in consecutive versions of the package
(see an example in DALEX or archivist)

modelStudio(), explainer_mlr3() and NAs

Hi,

There's a glitch in modelStudio when using mlr3 pipelines on data with missing values.

It looks like modelStudio() doesn't know how to impute missing data before crunching the numbers, even when the user has incorporated a pipe operator for missing values in the mlr3 pipeline. In fact, modelStudio() does not even recognize mlr3 learners if their class is other than [1] "LearnerClassifRanger" "LearnerClassif" "Learner" "R6" (e.g. try class(learner) for a Random Forest learner). If you have a pipeline whose class is [1] "GraphLearner" "Learner" "R6", modelStudio() doesn't know how to handle it.

The DALEXtra package's explainer_mlr3() suffers from the same issue, although this can be dealt with by providing custom functions for the predict_function and residual_function arguments.

Below is an example of a pipeline that imputes missing data and then balances classes. Note that it works fine when there are no missing data, but returns an error otherwise.

Example 1: no missing data

library(tidyverse)
library(data.table)
library(tidymodels)
library(paradox)
library(mlr3) # NOTE: install mlr3 packages from GitHub, not CRAN, as they differ in a few things, e.g. with GitHub you tune the pipeline with $optimize() but with CRAN with $tune()
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(DALEXtra)
library(modelStudio)

# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))

# Ratio values for class-balancing pipe operators
class_counts <- table(task$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] / 
  class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio

# Pipe operators for class-balancing
# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor", 
  reference = "minor", shuffle = FALSE, ratio = upsample_ratio)

# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major", 
  reference = "major", shuffle = FALSE, ratio = downsample_ratio)

# Handle missing values
features_with_nas <- sort(task$missings() / task$nrow, decreasing = TRUE)
features_with_nas <- features_with_nas[features_with_nas != 0]

# Imputes values based on histogram
hist_imp <- po("imputehist", param_vals = 
  list(affect_columns = selector_name(names(features_with_nas))))

# Add an indicator column for each feature with missing values
# One-hot encode these new categorical columns, and then remove the categorical versions of them
miss_ind <- po("missind") %>>% 
  po("encode") %>>%
  po("select", 
     selector = selector_invert(selector_type("factor")), 
     id = 'dummy_encoding')

impute_data <- po("copy", 2) %>>%
  gunion(list(hist_imp, miss_ind)) %>>%
  po("featureunion")

impute_data$plot() # This is the Graph we'll add to the pipeline
impute_data$plot(html = TRUE)

# Random Forest learner with up- and down-balancing
rf <- lrn("classif.ranger", predict_type = "prob")

rf_up <- GraphLearner$new(
  po_over %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob'
)

rf_down <- GraphLearner$new(
  po_under %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob')

# All learners (Random Forest with up- and down-balancing)
learners <- list(
  rf_up,
  rf_down
)
names(learners) <- sapply(learners, function(x) x$id)

# Our pipeline
graph <- 
  impute_data %>>%
  po("branch", names(learners)) %>>% 
  gunion(unname(learners)) %>>%
  po("unbranch")

graph$plot() # Plot pipeline
graph$plot(html = TRUE) # Plot pipeline

pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # We want to predict probabilities and not classes.

param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone()))
))

# Set up tuning instance
instance <- TuningInstance$new(
  task = task,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr('classif.bbrier'),
  param_set,
  terminator = term("evals", n_evals = 3), 
  store_models = TRUE)
tuner <- TunerRandomSearch$new()

# Tune pipe learner to find best-performing branch
tuner$optimize(instance)

# Take a look at the results
instance$result
print(instance$result$tune_x$branch.selection) # Best model

# Train pipeline
pipe$train(task)

################################################################################################
# DALEXextra and modelStudio stuff
################################################################################################

# First create custom functions for predictions and residuals
# We need custom functions because explain_mlr3() doesn't recognize the Graph Learner class of mlr3
predict_function_custom <- function(model, data) {
  pr <- model$
    predict_newdata(data)$
    data$
    prob[, 1]
  
  return(pr)
}

residual_function_custom <- function(model, data, y) {
  pr <- model$
    predict_newdata(data)
  
  y_hat <- pr$
    data$
    prob[, 1]
  
  return(as.integer(y == 0) - y_hat)
}

# Run explainer - works fine with the above functions
explainer <- explain_mlr3(model = pipe,
  data = task$data()[, -1],
  y = as.integer(task$data()[, 1] == 'M'),
  predict_function = predict_function_custom,
  residual_function = residual_function_custom,
  label = "mlr3")

# HOWEVER: we have a classification task, but explainer thinks it's regression!
explainer$model_info

# Let's run modelStudio. You'll need to wait for a while
modelStudio(
  explainer, 
  new_observation = task$data()[6, -1]
)

# Ignore the warning about data format. Argument `new_observation` is a `data.table`, so its class is
# `[1] "data.table" "data.frame"`, which is essentially a data frame. The class has two elements,
# but the condition only looks at the first one.

Working just fine.

Example 2: missing data

The setup is identical to Example 1, except that missing values are introduced right after loading and subsetting the task:

# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))

# Create some missing data
data <- task$data()
data$V1[1:5] <- NA
task <- TaskClassif$new(data, id = 'sonar', target = 'Class')

# ... the rest of the code (class-balancing, imputation pipeline, tuning,
# training, custom predict/residual functions, explain_mlr3(), and the
# modelStudio() call) is identical to Example 1 ...

We get errors and no plot:

Calculating ... 
  Calculating ingredients::feature_importance 
  Calculating ingredients::partial_dependence (numerical) 
  Calculating ingredients::accumulated_dependence (numerical) 
    Elapsed time: 00:01:01 ETA...Error in seq.default(min(x[, name]), max(x[, name]), length.out = nbins) : 
  'from' must be a finite number
In addition: Warning messages:
1: In value[[3L]](cond) : 
Error occurred in ingredients::partial_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE
2: In value[[3L]](cond) : 
Error occurred in ingredients::accumulated_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE

Is there a way to pass imputed data from explainer_mlr3() to modelStudio() just like you can pass predictions and residuals with arguments predict_function and residual_function respectively? Any chances of implementing this please?

Thanks

v1.1.0 release checklist

  • add ms_update_options() and ms_update_observations() to the perks vignette
  • test vignettes
  • update dashboards
  • rhub::check_for_cran()
  • rhub::check_with_rdevel()
  • usethis::use_cran_comments()
  • devtools::submit_cran()
  • accept the mail
  • tag release on GitHub

error in the example

I was trying to execute an example for modelStudio

library("dime")
library("DALEX")

titanic <- na.omit(titanic)
set.seed(1313)
titanic_small <- titanic[sample(1:nrow(titanic), 500), c(1,2,3,6,7,9)]

model_titanic_glm <- glm(survived == "yes" ~ gender + age + fare + class + sibsp,
                         data = titanic_small, family = "binomial")

explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_small[,-6],
                               y = titanic_small$survived == "yes",
                               label = "glm")

new_observation <- titanic_small[1:10,-6]

modelStudio(explain_titanic_glm, new_observation[1,])

but this ends with

> modelStudio(explain_titanic_glm, new_observation[1,])
  |                                                                        |   0%Error in ceteris_paribus.default(x, data, predict_function = predict_function,  : 
  promise already under evaluation: recursive default argument reference or earlier problems?

Enter a frame number, or 0 to exit   

1: modelStudio(explain_titanic_glm, new_observation[1, ])
2: modelStudio.explainer(explain_titanic_glm, new_observation[1, ])
3: modelStudio.default(x = x$model, new_observation = new_observation, facet_dim
4: ingredients::accumulated_dependency(x, data, predict_function, only_numerical
5: accumulated_dependency.R#51: accumulated_dependency.default(x, data, predict_
6: accumulated_dependency.R#91: ceteris_paribus.default(x, data, predict_functio
