modeloriented / dalex
moDel Agnostic Language for Exploration and eXplanation
Home Page: https://dalex.drwhy.ai
License: GNU General Public License v3.0
This issue is related to a previous one: #4
I get an n.trees error even when passing the n.trees argument to variable_response() with an explainer created on a gbm model object.
library(gbm)
library(DALEX)
library(breakDown)
# create a gbm model
model <- gbm(quality ~ pH + residual.sugar + sulphates + alcohol, data = wine,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 4,
shrinkage = 0.01,
n.minobsinnode = 10,
verbose = FALSE)
# make an explainer for the model
explainer_gbm <- explain(model, data = wine)
# single variable
exp_sgn <- variable_response(explainer_gbm, variable = "alcohol", n.trees = 1000)
Error in paste("Using", n.trees, "trees...\n") :
  argument "n.trees" is missing, with no default
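A common workaround (a sketch, assuming the gbm model from the snippet above) is to bake n.trees into a custom predict_function when building the explainer, so downstream DALEX functions never need to forward the argument to gbm's predict method:

```r
library(DALEX)

# Wrap predict() so n.trees is always supplied; predict.gbm requires it.
predict_gbm <- function(m, newdata) {
  predict(m, newdata = newdata, n.trees = 1000)
}

explainer_gbm <- explain(model, data = wine,
                         predict_function = predict_gbm)

# variable_response() now calls predict_gbm internally, so no n.trees
# argument needs to reach gbm directly.
exp_sgn <- variable_response(explainer_gbm, variable = "alcohol")
```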
Hi,
just wondering how easy it would be to allow prediction_breakdown
to work with a linear model when you use a spline term in the predictor. Here is my example:
apart.lm <- lm(m2.price ~ ns(construction.year, df=5) + surface + floor + no.rooms + district,
data=apartments)
aplm.ex <- explain(apart.lm, data=apartmentsTest[, 2:6], y = apartmentsTest$m2.price)
new_apartment <- apartmentsTest[1,]
aplm.bd <- prediction_breakdown(aplm.ex, observation=new_apartment)
Gives the error:
Error in `[.data.frame`(new_observation, colnames(ny)) :
undefined columns selected
I guess I could take the ns part out and use the generated basis functions, but it would be convenient if you didn't have to, particularly if you wanted to compare, e.g., models with different basis functions.
Robert
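The workaround mentioned above can be sketched like this (precomputing the spline basis so the model formula contains only plain columns; the cy_ns* names are made up for illustration):

```r
library(splines)

# Precompute the natural-spline basis and bind its columns to the data,
# so prediction_breakdown() sees ordinary numeric predictors.
ns_basis <- ns(apartments$construction.year, df = 5)
colnames(ns_basis) <- paste0("cy_ns", 1:5)
apartments2 <- cbind(apartments, ns_basis)

apart.lm2 <- lm(m2.price ~ cy_ns1 + cy_ns2 + cy_ns3 + cy_ns4 + cy_ns5 +
                  surface + floor + no.rooms + district, data = apartments2)

# Note: test observations must be expanded with the *same* basis, via
# predict() on the fitted basis object:
# new_basis <- predict(ns_basis, newx = apartmentsTest$construction.year)
```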
When plotting from the model_performance function, would it be possible to add functionality to limit the x-axis values, as well as to facet by some model factors, to drill down into the specific factors that drive the overall residuals?
Apologies in advance if these functionalities already exist. #Beginnerhere
Hi,
Firstly, thank you very much for the package and the extensive tutorials.
In the library iml
you are able to interrogate your model for interactions based on the amount of variance explained. I would be interested in being able to review an ALE plot in relation to these interactions with respect to the model output. I see in the vignette on page 10 for the package ALEPlot
that they seem to have implemented this but I have been unsuccessful in getting it to function consistently with a 2 class classification problem using caret.
Are there plans to implement anything similar in DALEX?
Thank you very much for your time.
I use the following code :
wineLmModel <- lm(quality ~ pH + residual.sugar + sulphates + alcohol, data = wine)
wineLmExplainer <- explain(wineLmModel)
Error in UseMethod("explain") :
no applicable method for 'explain' applied to an object of class "lm"
Hi,
Is there a way to enforce constraints in the feature contribution calculation resulting from prediction_breakdown, for example to force some features to have a positive contribution?
Please note I'm already using monotonicity constraints in the xgboost training.
Thanks
It would be useful to have a predict.explainer() function that calls predict_function() from the explainer.
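A minimal sketch of such a method (hypothetical, not part of DALEX) could simply delegate to the predict_function and data stored in the explainer object:

```r
# Hypothetical S3 method: predict() for objects of class "explainer".
# Falls back to the explainer's own data when no newdata is given.
predict.explainer <- function(object, newdata = NULL, ...) {
  if (is.null(newdata)) newdata <- object$data
  object$predict_function(object$model, newdata, ...)
}
```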
The description of the argument data in explain() function is:
data - data.frame or matrix - data that was used for fitting. If not provided then will be extracted from model fit
But if it is not provided, explainer$data is NULL,
and then the single_variable function gives an error.
Example code:
library(DALEX)
input <- mtcars[,c("am","cyl","hp","wt")]
model.glm = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
explainer_glm <- explain(model.glm)
expl_glm <- single_variable(explainer_glm, variable = "wt", type="pdp")
The code above returns the error:
Error in partial.default(explainer$model, pred.var = variable, train = explainer$data, :
wt not found in the training data.
But it works with provided argument data.
explainer_glm <- explain(model.glm, data = input)
expl_glm <- single_variable(explainer_glm, variable = "wt", type="pdp")
When the variable is categorical, plotting it with single_variable becomes very slow!
Thank you for your excellent work.
I am currently thinking about using it / recommending it to clients / contributing to it, but the fact that no explicit licence is chosen in this repo yet makes that organisationally more difficult for me (and other companies) than it could be.
Clients also have issues (reasonable and unreasonable ones) with the GPL licence. So maybe it could be something else?
https://help.github.com/articles/adding-a-license-to-a-repository/
Best regards,
Frank
No matter which predict_function was passed to explain(), the results of the PDP plots in variable_response are the same.
library(breakDown)
library(randomForest)
data(HR_data)
HR_rf_model <- randomForest(factor(left)~., data = breakDown::HR_data, ntree = 100)
explainer_rf <- explain(HR_rf_model, data = HR_data,
predict_function = function(model, x) predict(model, x, type = "prob")[,2])
expl_rf <- variable_response(explainer_rf, variable = "satisfaction_level", type = "pdp")
plot(expl_rf)
explainer_rf_constant <- explain(HR_rf_model, data = HR_data,
predict_function = function(model, x) return(0.5))
expl_rf_constant <- variable_response(explainer_rf_constant, variable = "satisfaction_level", type = "pdp")
plot(expl_rf_constant)
Plots are the same, but should be different for different predict functions.
I think the solution is passing predict_function to pdp::partial() via the pred.fun parameter.
Same problem for ALE plots.
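The fix could look roughly like this (a sketch; the argument names follow pdp::partial's documented pred.fun interface, which expects function(object, newdata)):

```r
library(pdp)

# Forward the explainer's predict_function to pdp::partial() through
# pred.fun, so a custom predict function is actually honored.
pd <- partial(explainer_rf$model,
              pred.var = "satisfaction_level",
              train    = explainer_rf$data,
              pred.fun = function(object, newdata) {
                # One averaged prediction per grid point -> a PDP curve.
                mean(explainer_rf$predict_function(object, newdata))
              })
```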
Hi,
Following the example on https://pbiecek.github.io/DALEX/reference/plot.model_performance_explainer.html , if you rearrange the order of arguments from plot(mp_rf, mp_glm, mp_lm, geom = "boxplot", show_outliers = 1) to plot(mp_glm, mp_lm, mp_rf, geom = "boxplot", show_outliers = 1), you will get a graph where the outliers don't match the model.
It seems the models have to be passed in order from best to worst root mean square of residuals for the outliers' labels to match the models.
Are there any scientific (peer-reviewed) publications, conference proceedings or books about DALEX and related packages of the same authors that I could cite in a scientific work? Is there a place where I could find a comprehensive list of those publications?
It would be nice if this list was also included on the website of DALEX.
Hi,
I tried to install both the CRAN and the GitHub version of DALEX, but I keep getting the following error
Error : .onLoad failed in loadNamespace() for 'sf', details:
call: get(genname, envir = envir)
error: object 'group_map' not found
The error seems to appear when installing the factorMerger library.
Any help debugging this is appreciated.
I have a problem understanding a variable_response plot when the explanatory variable is a factor with 2 levels.
library(DALEX)
library(carData)
library(randomForest)
data("Leinhardt")
library(dplyr)  # needed for %>% and select()
df <- Leinhardt %>% select(infant, income, region, oil)
df <- na.omit(df)
rf2 <- randomForest(infant ~ income + region + oil, data = df)
rf2_exp <- DALEX::explain(rf2, data = df, y = df$infant, label = "rf")
rf2_rv <- variable_response(rf2_exp, variable = "oil", type = "factor")
plot(rf2_rv)
I get an n.trees error when trying to use the variable_importance() function on an explainer for a gbm model created with the gbm package:
library(gbm)
library(DALEX)
mod <- gbm(m2.price~.,data = apartments, distribution = "gaussian")
exp.mod <- explain(mod, data = apartmentsTest[,2:6],
y = apartmentsTest$m2.price)
vi <- variable_importance(exp.mod, loss_function = loss_root_mean_square)
Error in paste("Using", n.trees, "trees...\n") :
argument "n.trees" is missing, with no default
vi <- variable_importance(exp.mod, loss_function = loss_root_mean_square, n.trees = 2000)
Error in paste("Using", n.trees, "trees...\n") :
argument "n.trees" is missing, with no default
New to this so please let me know if I'm misinterpreting the functionality!
Currently, the baseline argument to broken() is hardcoded to be "Intercept" and there is no way to modify it. This parameter should be exposed; the default may lead to confusion because one would expect "final_prognosis" to equal the prediction, at least for models with an identity link function. Also, the plot method doesn't tell you what the baseline is, so it's difficult to tell the story of how we got to the prediction (since we don't see the prediction on the plot).
DALEX should support cases in which the predict function returns more than a single column.
Think about multi-class classification or multivariate regression.
It should be handled in the same way as multiple models.
Add a CONTRIBUTING.md file (maybe based on survxai), as in the ingredients and iBreakDown packages.
Maybe adding small barplots under PDP curve indicating how many observations for a given x value are in the dataset would be useful?
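A rough mock-up of the idea with ggplot2 (hypothetical code, not from DALEX; a rug strip stands in for the proposed mini-barplots):

```r
library(ggplot2)

# A response curve with a rug underneath marking where observations
# actually lie along x, analogous to the suggested barplots.
set.seed(1)
df <- data.frame(x = sort(rnorm(200)))
df$y <- sin(df$x)

ggplot(df, aes(x, y)) +
  geom_line() +
  geom_rug(sides = "b", alpha = 0.3)
```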
0.2.4 will go to CRAN since the HR data is required for ceterisParibus plots.
Should any other fixes go with this version?
There are single_variable(), single_prediction(), and variable_dropout() instead of variable_response(), prediction_breakdown(), and variable_importance().
I got an error message when trying to extract the variable response for factor with a xgboost model.
library(DALEX)
library(breakDown)
library(xgboost)
data(HR_data)
model_matrix_train <- model.matrix(left ~ . - 1, HR_data)
data_train <- xgb.DMatrix(model_matrix_train, label = HR_data$left)
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2,
objective = "binary:logistic", eval_metric = "auc")
HR_xgb_model <- xgb.train(param, data_train, nrounds = 50)
HR_xgb_model
predict_logit <- function(model, x) {
raw_x <- predict(model, x)
exp(raw_x)/(1 + exp(raw_x))
}
logit <- function(x) exp(x)/(1+exp(x))
explainer_xgb <- explain(HR_xgb_model,
data = model_matrix_train,
y = HR_data$left,
predict_function = predict_logit,
link = logit,
label = "xgboost")
explainer_xgb
x_rv <- variable_response(explainer_xgb, variable = "salary", type = "factor")
Error in explainer$data[, variable] : subscript out of bounds
Did the prescribed:
devtools::install_github("pbiecek/DALEX")
but Rstudio (latest version under Linux),
will not complete the DALEX install.
After the above command,
R just hangs and nothing happens.
Waited for 5 minutes,
then suspended the installation.
What am I missing in order to install DALEX?
(All other R packages have installed with no problems.)
Thanks!
Current names are chaotic.
Here are propositions for new names. Old names will stay as deprecated.
variable_dropout() -> variable_importance()
single_variable() -> variable_response()
single_prediction() -> prediction_breakdown()
New names are more consistent with planned: outlier_detection(), model_performance()
There is an error while running the single_prediction() function with an xgboost model:
Error in new_observation[rep(1, nrow(data)), ] : incorrect number of dimensions
Does this function support xgb.Booster model types?
I have DALEX_0.1.8 and breakDown_0.1.4.
with the use of the factorMerger package!
Consider following changes in names
single_variable -> variable_response
single_prediction -> prediction_decomposition
variable_importance -> variable_leverage
How do we interpret the baseline in variable importance? Noticed that when the variable importances for different models are plotted simultaneously, the baseline numbers don't agree, even with type = "ratio", n_sample = -1.
I use macOS High Sierra, version 10.13.3
RStudio: Version 1.1.4.23
library(breakDown) Version 0.1.5
library(DALEX) Version 0.1.1
if I run your example:
library("breakDown")
new.wine <- data.frame(citric.acid = 0.35,sulphates = 0.6,alcohol = 12.5,pH = 3.36, residual.sugar = 4.8)
wine_lm_model4 <- lm(quality ~ pH + residual.sugar + sulphates + alcohol, data = wine)
wine_lm_explainer4 <- explain(wine_lm_model4, data = wine, label = "model_4v")
wine_lm_predict4 <- single_prediction(wine_lm_explainer4, observation = new.wine)
plot(wine_lm_predict4)
all works fine.
But if I use my data
....
a=read.xls('new_longitudinali.xls',sheet=2)
BG=as.numeric(a$Modello)
b=data.frame(a[2:12],BG)
attach(b)
new.b <- data.frame(ArGoMe <=129.6,
CoGo<=47.4,
GoGn <=72.3,
Nme <=104.5,
NSAr <= 121.5 ,
PPMP <= 27.1,
PPSN <= 8.7,
Sar <= 29.2,
SN <= 63.8,
SNA <= 79.6,
SNB <=79.9,
)
b_lm_model<- lm(BG ~ArGoMe+CoGo+GoGn+Nme+NSAr+PPMP+PPSN+Sar+SN+SNA+SNB , data = b)
b_lm_explainer <- explain(b_lm_model, data = b, label = "model_4v")
b_lm_predict <- single_prediction(b_lm_explainer, observation = new.b)
the “single_prediction” function gives me back the following error:
Error in `[.data.frame`(new_observation, colnames(ny)) :
  undefined columns selected
How can I solve it ?
Thanks in advance for your help.
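A likely cause (an assumption based on the code shown, not a confirmed diagnosis) is the use of <= instead of = inside data.frame(), which produces logical columns with mangled names that don't match the model's variables. The observation should probably be built like this:

```r
# Use `=` (not `<=`) so the columns get the same names as the model terms,
# and drop the trailing comma, which is itself a syntax error in R.
new.b <- data.frame(ArGoMe = 129.6,
                    CoGo   = 47.4,
                    GoGn   = 72.3,
                    Nme    = 104.5,
                    NSAr   = 121.5,
                    PPMP   = 27.1,
                    PPSN   = 8.7,
                    Sar    = 29.2,
                    SN     = 63.8,
                    SNA    = 79.6,
                    SNB    = 79.9)
```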
The package is great for teaching purposes. Sadly, it seems (a priori!) that the single_variable() function doesn't work with a neuralnet() model.
Here is a reproducible example taken from the vignette, adding a neuralnet() model:
set.seed(13)
N <- 250
X1 <- runif(N)
X2 <- runif(N)
X3 <- runif(N)
X4 <- runif(N)
X5 <- runif(N)
f <- function(x1, x2, x3, x4, x5) {
  ((x1-0.5)*2)^2 - 0.5 + sin(x2*10) + x3^6 + (x4-0.5)*2 + abs(2*x5-1)
}
y <- f(X1, X2, X3, X4, X5)
library(randomForest)
library(DALEX)
library(e1071)
library(rms)
library(neuralnet)
df <- data.frame(y, X1, X2, X3, X4, X5)
model_rf<-randomForest(y~., df)
model_svm<-svm(y~., df)
model_lm<-lm(y~., df)
model_nn<-neuralnet(y~X1+X2+X3+X4+X5,df,hidden=1)
dd <- datadist(df)
options(datadist="dd")
model_rms <- ols(y ~ rcs(X1) + rcs(X2) + rcs(X3) + rcs(X4) + rcs(X5), df)
ex_rf<-explain(model_rf)
ex_svm<-explain(model_svm)
ex_lm<-explain(model_lm)
ex_nn<-explain(model_nn)
ex_rms<-explain(model_rms, label = "rms", data = df[, -1], y = df$y)
ex_tr<-explain(model_lm, data = df[,-1],
predict_function = function(m, x) f(x[,1], x[,2], x[,3], x[,4], x[,5]),
label = "True Model")
library(ggplot2)
plot(single_variable(ex_rf, "X1"),
single_variable(ex_svm, "X1"),
single_variable(ex_lm, "X1"),
single_variable(ex_nn, "X1"),
single_variable(ex_rms, "X1"),
single_variable(ex_tr, "X1")) +
ggtitle("Responses for X1. Truth: y ~ (2*x1 - 1)^2")
For measures that are maximized, like AUC.
parsnip is a new model factory:
https://tidymodels.github.io/parsnip/index.html
It would be nice to have a vignette that shows how to use DALEX with parsnip models.
Move these packages to Suggests.
The explain() method also exists in the dplyr package as a generic function.
DALEX will behave differently depending on which package is loaded first (DALEX / dplyr).
It is not clear how to solve this.
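Until this is resolved, callers can sidestep the conflict by qualifying the call with the package namespace (standard R practice; `model` and `training_data` are placeholders):

```r
# Explicitly qualify the generic so it doesn't matter whether dplyr
# was attached before or after DALEX.
explainer <- DALEX::explain(model, data = training_data)
```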
While passing a wrong parameter value, raw drop losses are calculated.
It would be helpful if variable_importance() returned an error, or a warning stating that type = "raw" was used instead.
library(randomForest)
model_regr_rf <- randomForest(m2.price~., data = apartments, ntree = 50)
explainer_regr_rf <- explain(model_regr_rf, data = apartmentsTest[1:1000, ], y = apartmentsTest$m2.price[1:1000])
variable_importance(explainer_regr_rf, type="anything")
Is DALEX going to support h2o models?
Currently, H2OFrame is not supported when running single_variable.
I have fitted an xgboost model.
Using either the object or its fit field in DALEX's explain() function is not possible.
xgb= boost_tree(mode = "regression") %>%
set_engine(engine = "xgboost") %>%
fit(formula = mpg ~ ., mtcars)
expl= explain(xgb, data = mtcars, y = mtcars$mpg)
variable_importance(expl)
Error in xgb.DMatrix(newdata, missing = missing) :
  xgb.DMatrix does not support construction from list
expl= explain(xgb$fit, data = mtcars, y = mtcars$mpg)
variable_importance(expl)
Error in xgb.DMatrix(newdata, missing = missing) :
xgb.DMatrix does not support construction from list
Any tips to achieve interop?
(tidymodels/parsnip#127)
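One possible workaround (a sketch, untested against parsnip's evolving API) is to supply a predict_function that coerces the incoming data.frame to the numeric matrix xgboost expects:

```r
library(xgboost)

# Assumes `xgb` is the parsnip fit from above; xgb$fit is the raw
# xgb.Booster. Coerce newdata to a matrix before calling predict,
# since xgb.DMatrix cannot be built from a list/data.frame directly.
predict_xgb <- function(model, newdata) {
  predict(model, as.matrix(newdata))
}

expl <- explain(xgb$fit, data = mtcars[, -1], y = mtcars$mpg,
                predict_function = predict_xgb)
variable_importance(expl)
```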
Hi.
If the data parameter is not passed to the explain function, it is extracted by default from the model as model.frame(model) (if possible). The assumption here is that data should be the training data used by the model.
In some cases this is not true. Consider
model <- glm(log(qsec) ~ exp(drat) + hp, data = mtcars)
Here, model.frame(model) stores the transformed variables, i.e.
colnames(model.frame(model)) == c('log(qsec)', 'exp(drat)', 'hp')
I think the best way is to use by default:
eval(stats::getCall(model)$data)
As it uses envir = parent.frame() by default, it should source the training data that was used in the model call.
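A short illustration of the difference (using the glm call above):

```r
model <- glm(log(qsec) ~ exp(drat) + hp, data = mtcars)

# model.frame() stores the transformed variables:
colnames(model.frame(model))
# "log(qsec)" "exp(drat)" "hp"

# eval(getCall(model)$data) recovers the original, untransformed
# training data referenced in the model call:
head(eval(stats::getCall(model)$data))
```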
If you have the test frame as a tibble (easy to get when using the tidyverse), the following problems occur.
For the calculation of variable importance, you get the same values for full_model and all variables except baseline, which is obviously wrong.
variable dropout_loss label
1 full_model 284.9159 lm
2 construction.year 284.9159 lm
3 surface 284.9159 lm
4 floor 284.9159 lm
5 no.rooms 284.9159 lm
6 district 284.9159 lm
7 baseline 1261.6643 lm
For single_variable calculations you get only the following warning; however, the output is of limited value.
Warning message:
In if (class(explainer$data[, variable]) == "factor" & type != "factor") { :
the condition has length > 1 and only the first element will be used
Casting tibble to regular data.frame solves the issue. Having training data as tibble seems not to have an impact on calculations at all.
apartmentsTest_tibble <- apartmentsTest %>% as_tibble()
model_liniowy <- lm(m2.price ~ construction.year + surface + floor + no.rooms + district, data = apartments)
explainer_lm <- explain(model_liniowy, data = apartmentsTest_tibble[,2:6], y = apartmentsTest_tibble$m2.price)
vi_lm <- variable_importance(explainer_lm, loss_function = loss_root_mean_square)
vi_lm
sv_lm <- single_variable(explainer_lm, variable = "construction.year", type = "pdp")
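The cast mentioned above is a one-liner (a sketch using the objects from the snippet):

```r
# Casting the tibble to a plain data.frame before explain() restores
# the base-R `[` behaviour (drop = TRUE) that these checks rely on.
explainer_lm <- explain(model_liniowy,
                        data = as.data.frame(apartmentsTest_tibble[, 2:6]),
                        y    = apartmentsTest_tibble$m2.price)
```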
Advantages:
Disadvantages:
In the article https://pbiecek.github.io/DALEX/articles/DALEX_and_xgboost.html
there is a misspelling:
model_martix_train
should be model_matrix_train.
Do we need to fix/add anything before this will be submitted?
Shapley values will get to DALEX in the next version.
Support for mlr as well.
Wouldn't it be good to let the "observation" argument be a list or a data frame of observations to explain?
When I try to run the vignette examples for the single_prediction() function I see the following error for the random forest model:
Error in UseMethod("broken") :
no applicable method for 'broken' applied to an object of class "c('randomForest.formula', 'randomForest')"
I have DALEX package version 0.1 and breakDown 0.1.3
Hi,
Great package and a real contribution to the understanding of statistical / ML models. Would you be interested in including some examples that use the rms family of models? The rms family (e.g. ols) and its associated predict methods simplify the process of integrating basis function expansions (e.g. restricted cubic basis functions) into linear models.
Here is an adaptation of one of your vignettes. Note that with the default number of rcs terms (4 knots), the linear model predictions are 99% identical to the source data.
library(randomForest)
library(DALEX)
library(e1071)
library(rms)
library(ggplot2)
set.seed(13)
N <- 250
X1 <- runif(N)
X2 <- runif(N)
X3 <- runif(N)
X4 <- runif(N)
X5 <- runif(N)
f <- function(x1, x2, x3, x4, x5) {
res <- ((x1-0.5)*2)^2-0.5 + sin(x2*10) + x3^6 + (x4-0.5)*2 + abs(2*x5-1)
return(res)
}
y <- f(X1, X2, X3, X4, X5)
df <- data.frame(y, X1, X2, X3, X4, X5)
## important setup step required for use of rms functions
dd <- datadist(df)
options(datadist="dd")
model_rf <- randomForest(y~., df)
model_svm <- svm(y ~ ., df)
## add rcs terms to linear model
## this is a very convenient, objective way to account for non-linearity
## still a "linear" model because terms are linear combinations (additive)
model_lm <- ols(y ~ rcs(X1) + rcs(X2) + rcs(X3) + rcs(X4) + rcs(X5), df)
ex_rf <- explain(model_rf)
ex_svm <- explain(model_svm)
ex_tr <- explain(model_lm, data = df[,-1],
predict_function = function(m, x) f(x[,1], x[,2], x[,3], x[,4], x[,5]),
label = "True Model")
## seems that the `y` argument is required here
ex_lm <- explain(model_lm, data = df[, -1], y = df$y)
plot(single_variable(ex_rf, "X1"),
single_variable(ex_svm, "X1"),
single_variable(ex_lm, "X1"),
single_variable(ex_tr, "X1")) +
ggtitle("Responses for X1. Truth: y ~ (2*x1 - 1)^2")
plot(single_variable(ex_rf, "X2"),
single_variable(ex_svm, "X2"),
single_variable(ex_lm, "X2"),
single_variable(ex_tr, "X2")) +
ggtitle("Responses for X2. Truth: y ~ sin(10 * x2)")
plot(single_variable(ex_rf, "X3"),
single_variable(ex_svm, "X3"),
single_variable(ex_lm, "X3"),
single_variable(ex_tr, "X3")) +
ggtitle("Responses for X3. Truth: y ~ x3^6")
plot(single_variable(ex_rf, "X4"),
single_variable(ex_svm, "X4"),
single_variable(ex_lm, "X4"),
single_variable(ex_tr, "X4")) +
ggtitle("Responses for X4. Truth: y ~ (2 * x4 - 1)")
plot(single_variable(ex_rf, "X5"),
single_variable(ex_svm, "X5"),
single_variable(ex_lm, "X5"),
single_variable(ex_tr, "X5")) +
ggtitle("Responses for X5. Truth: y ~ |2 * x5 - 1|")