
mlr3book's Issues

Tutorial request: threshold tuning

Hi again,

threshold tuning is a very important part of binary classification workflows, yet in the book it is still on the "yet to be added" list. I haven't found anything about it in the documentation either. Could you post a brief example of how to perform it in mlr3?

Thanks,

Milan

unrelated EDIT1: I am thrilled with the database backends. They were much needed.
unrelated EDIT2: Not so thrilled that parallelization levels cannot be chosen as in mlr. In my opinion this offered some nice benefits, especially when using some sort of guided hyperparameter search with nested resampling.
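
Until the book section exists, here is a minimal sketch of manual threshold adjustment via PredictionClassif$set_threshold() (tuning proper would compare several cutoffs on held-out predictions; the 0.3 cutoff is purely illustrative):

library(mlr3)

task = tsk("sonar")
learner = lrn("classif.rpart", predict_type = "prob")
learner$train(task)

pred = learner$predict(task)
pred$score(msr("classif.ce"))  # error at the default 0.5 cutoff
pred$set_threshold(0.3)        # relabel predictions with a custom cutoff
pred$score(msr("classif.ce"))  # error at the new cutoff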

Broken link

In 03-pipelines.Rmd:

[technical introduction](#extending-pipes)

Train/Test set specification method

Hi mlr3 team,

I am new to the mlr3 package and I am trying to follow the online book and learn. I noticed that the book suggests using sample() and setdiff() to generate training and test sets (https://mlr3book.mlr-org.com/train-predict.html). I am wondering whether the team is considering a more advanced method that generates train/test sets based on the class ratio of the target, similar to the createDataPartition() function in the caret package? I believe it would be useful for many ML applications. Many thanks.

Chuan
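
In the meantime, a minimal sketch of a stratified holdout split via the task's stratum column role (assuming that, as in mlr3, setting this role makes resampling preserve the target's class ratio):

library(mlr3)

task = tsk("sonar")
# stratify all splits on the class distribution of the target
task$col_roles$stratum = task$target_names

holdout = rsmp("holdout", ratio = 2/3)
holdout$instantiate(task)
train_set = holdout$train_set(1)
test_set = holdout$test_set(1)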

How to set a cost parameter for SVM learner?

Hi, I am new to mlr3 and am trying to tune an SVM's cost parameter. Before using AutoTuner, I tried to set the cost parameter at a fixed level, but I don't understand the error message.

When I try:
lrn_svm = lrn("classif.svm", cost=10)

I get an error message:
Error in (function (xs) :
Assertion on 'xs' failed: Condition for 'cost' not ok: type equal C-classification; instead: type=.

Could you help me with this?
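
For reference, the condition in the message suggests that cost is only valid when the SVM type is "C-classification", so both parameters have to be set together; a sketch that should pass the assertion:

library(mlr3)
library(mlr3learners)

# cost is only defined for the C-classification SVM type
lrn_svm = lrn("classif.svm", type = "C-classification", cost = 10)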

Structure: Group extensions in their own chapter

Rather than having a top-level chapter for each extension like "Survival" and friends, I suggest grouping them under a heading called "mlr3 Extensions".
Candidates are:

  • Survival
  • Forecasting
  • Spatial
  • Costsens
  • Ordinal

This keeps the structure clean and aligns with the "Extension" principle of the packages.

@mllg

fix unclear sentence about resampling with multiple learners

The following sentences are in 2.5.3:

Note that if you want to compare multiple learners, you should ensure to that each learner operates on the same resampling instance by manually instantiating beforehand.
This reduces the variance of the performance estimation.

The first sentence is not grammatically correct ("...ensure to that..." is not right), but I wasn't sure how to fix it because I wasn't actually sure what it is saying. Can it be rephrased and made clearer?
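
For context, the sentence presumably means that the resampling object should be instantiated once and then passed to every resample() call, so that all learners are evaluated on identical splits; a minimal sketch:

library(mlr3)

task = tsk("iris")
cv = rsmp("cv", folds = 5)
cv$instantiate(task)  # fix the train/test splits once

# both learners now see exactly the same folds
rr_rpart = resample(task, lrn("classif.rpart"), cv)
rr_featureless = resample(task, lrn("classif.featureless"), cv)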

PDF: Consider using pinp output format

I've just added basic support for the pinp PDF output format to the book:

make pinp

Overall it looks much more professional than bookdown::pdf_book(), though it needs some fine-tuning. Also, figures are oversized right now.

Maybe @eddelbuettel could take a look and provide some pointers based on his expertise? :)

Add mlr3pipelines section

The mlr3pipelines section is still in the attic.
Since pipelines is now on CRAN, we might want to add it.

I guess we have to iterate on this section again to account for changes made in mlr3 since then; other than that, we should be able to merge.

add more details about resampling strategies

I suggest adding, in section 2.5.1, not just the dictionary lookup of mlr_resamplings that is currently there, but also the full output of

> as.data.table(mlr_resamplings)
           key        params iters
1:   bootstrap repeats,ratio    30
2:      custom                   0
3:          cv         folds    10
4:     holdout         ratio     1
5: repeated_cv repeats,folds   100
6: subsampling repeats,ratio    30

as this provides quite a bit more useful information to the reader. I ended up digging around for a few minutes and running this command myself before the lightbulb went off. It's also helpful to have the underlying structures reinforced here, along with the idea that each of these things can be accessed separately.
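
To reinforce that last point, each entry of the dictionary can also be retrieved and inspected on its own, e.g.:

library(mlr3)

cv = rsmp("cv", folds = 10)  # equivalent to mlr_resamplings$get("cv")
cv$param_set                 # the folds parameter and its bounds/default
cv$iters                     # number of resampling iterations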

Cheatsheet

Should be automated and NOT written manually (e.g. PowerPoint).

Optimize images in PDF

  • use .eps/.svg

  • Find out which command is best for inserting images in both the HTML and PDF versions

description of row ids not clear

In the Tasks section (2.2) there is a passage that is unclear about how rows are indexed.
It says:

In `r mlr_pkg("mlr3")`, each row (observation) has a unique identifier, stored as an `intger()`.
These can be passed as arguments to the `$data()` method to select specific rows:

However, it's not clear to me what an intger() is. Maybe this is supposed to be an integer, but then in the next example, the task row_ids are characters. Maybe rephrase to something like this?

In `r mlr_pkg("mlr3")`, each row (observation) has a unique identifier. Sometimes these unique identifiers are characters and sometimes integers.

Caching?

It takes a long time to render the book -- why is caching turned off?

Tutorial: tuning over whole model spaces like with mlr::makeModelMultiplexer

First I'd like to share my appreciation for mlr3.

As an mlr user I was a bit skeptical about mlr3, since the whole R6 class thing was a nuisance I had to learn in order to transition effectively.

After a few days of playing with mlr3 I must say I love it.

To the point.

I was very fond of mlr::makeModelMultiplexer and could not find tutorials explaining how to recreate it within mlr3, so I decided to create one. The code seems to do what I intended, so I figured it might help other mlr3 users. The tutorial is below; if you like it and think it might benefit mlr3 users, you have my permission to modify it and share it any way you deem fit.

An example of tuning over whole model spaces with mlr3.

This example consists of tuning the filter method and the number of features selected, as well as the type of learner along with its hyperparameters.

library(mlr3) 
library(mlr3pipelines)
library(mlr3learners)
library(mlr3tuning)
library(mlr3filters)
library(visNetwork)
library(paradox)

define filters to be used

filt_null <- mlr_pipeops$get("nop")

filt_ig <- mlr_pipeops$get("filter",
                           mlr_filters$get("information_gain"))

filt_mrmr <- mlr_pipeops$get("filter",
                             mlr_filters$get("mrmr"))


filt_names <- c("nop",
                "information_gain",
                "mrmr")

create a branch to these filters

graph <- mlr_pipeops$get("branch", filt_names, id = "branch1") %>>%
  gunion(list(
    filt_null,
    filt_ig,
    filt_mrmr
  ))

unbranch

graph <- graph %>>% 
  mlr_pipeops$get("unbranch",
                  filt_names,
                  id = "unbranch1") 

define learners to be used

rpart_lrn <- mlr_pipeops$get("learner",
                             learner = mlr_learners$get("classif.rpart"))
rpart_lrn$learner$predict_type <- "prob"


ranger_lrn <- mlr_pipeops$get("learner",
                              learner = mlr_learners$get("classif.ranger"))
ranger_lrn$learner$predict_type <- "prob"

lrn_names <- c("classif.rpart",
               "classif.ranger")

add learners to the graph

graph <- graph %>>% 
  mlr_pipeops$get("branch", lrn_names, id = "branch2") %>>% 
  gunion(list(
    rpart_lrn,
    ranger_lrn)) %>>% 
  mlr_pipeops$get("unbranch", lrn_names, id = "unbranch2") 

how does it look?

graph$plot(html = TRUE)

create a parameter set of all the things we would like tuned

ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 0.05),
  ParamInt$new("classif.ranger.mtry", lower = 1L, upper = 20L),
  ParamFct$new("classif.ranger.splitrule", levels = c("extratrees",
                                                      "gini")),
  ParamDbl$new("classif.ranger.sample.fraction", lower = 0.3, upper = 1),
  ParamInt$new("classif.ranger.num.trees", lower = 100L, upper = 2000L),
  ParamInt$new("classif.ranger.num.random.splits", lower = 1L, upper = 20L),
  ParamInt$new("information_gain.filter.nfeat", lower = 20L, upper = 60L),
  ParamFct$new("information_gain.type", levels = c("infogain", "symuncert")),
  ParamInt$new("mrmr.filter.nfeat", lower = 20L, upper = 60L),
  ParamFct$new("branch1.selection", levels = filt_names),
  ParamFct$new("branch2.selection", levels = lrn_names )
  
))

define dependencies in the parameter set

ps$add_dep("classif.rpart.cp",
           "branch2.selection", CondEqual$new("classif.rpart"))

ps$add_dep("classif.ranger.mtry",
           "branch2.selection", CondEqual$new("classif.ranger"))
ps$add_dep("classif.ranger.splitrule",
           "branch2.selection", CondEqual$new("classif.ranger"))
ps$add_dep("classif.ranger.sample.fraction",
           "branch2.selection", CondEqual$new("classif.ranger"))
ps$add_dep("classif.ranger.num.trees",
           "branch2.selection", CondEqual$new("classif.ranger"))
ps$add_dep("classif.ranger.num.random.splits",
           "classif.ranger.splitrule", CondEqual$new("extratrees"))

ps$add_dep("information_gain.filter.nfeat",
           "branch1.selection", CondEqual$new("information_gain"))
ps$add_dep("information_gain.type",
           "branch1.selection", CondEqual$new("information_gain"))
ps$add_dep("mrmr.filter.nfeat",
           "branch1.selection", CondEqual$new("mrmr"))

convert graph to learner and define resampling, measures and random search

glrn <- GraphLearner$new(graph) 

glrn$predict_type <- "prob"

cv5 <- rsmp("cv", folds = 5)

tsk <- mlr_tasks$get("sonar")

stratified sampling based on target

tsk$col_roles$stratum <- tsk$target_names

create instance and tune

instance <- TuningInstance$new(
  task = tsk,
  learner = glrn,
  resampling = cv5,
  measures = msr("classif.auc"),
  param_set = ps,
  terminator = term("evals", n_evals = 20)
)


tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result

Kind regards,

Milan

Nested resampling description contains an error

First of all, thank you for creating the mlr3 documentation. It helps a lot in understanding the breaking changes compared to the older mlr version. While checking the tuning part of the book (https://mlr3book.mlr-org.com/tuning.html), I noticed an error.
In section 4.1.5 you state: "Now the optimized hyperparameters can be used to create a new Learner and train it on the full dataset." Doing this step would only result in an overfitted model.
I tried it out to confirm the overfitting using a regression model, and indeed the classification error of this model was then 0 while the last resampled model was at 0.03.

Suggestion:
After having defined a new learner with the optimized parameters, add
learner$train(task, row_ids = train_set) to train the best model on the train set only, preventing overfitting.
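
A hypothetical sketch of that suggestion (assuming train_set/test_set row-id vectors from the book's example and tuned values stored in instance$result):

# hypothetical: apply the tuned values, then train on the training rows only
learner = lrn("classif.rpart")
learner$param_set$values = instance$result$params
learner$train(task, row_ids = train_set)
learner$predict(task, row_ids = test_set)$score(msr("classif.ce"))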

Pages not rendering locally

When I build and serve the book locally (pkgload::load_all(); serve_mlr3book()), I'm getting error pages for everything except the index page. In particular, the error is

This page contains the following errors:
error on line 123 at column 8: Opening and ending tag mismatch: link line 0 and head
Below is a rendering of the page up to the first error.

There is nothing below the error message. A quick look at the HTML shows that there are some link tags in the head section that are not terminated, but I don't know where those are generated from.

Showcase bigger benchmarks on HPC / Multicore Systems

mlr3 offers great flexibility for writing down and executing even bigger benchmarks
on HPC systems through future and future.batchtools.

This is as easy as writing the code below, but we need a good template / example on how to do this.

Additionally, future seems to expose several parameters that have to be set correctly (?) for things to work reliably.

library("mlr3")
library("batchtools")
library("future.batchtools")

plan(batchtools_slurm, template = "~/slurm.tmpl")

design = benchmark_grid(
  tasks = tsk("iris"),
  learners = list(lrn("classif.rpart"), lrn("classif.featureless")),
  resamplings = rsmp("cv")
)
benchmark(design)

General comments:

# insure against segfaults by encapsulating train/predict in a separate R session
learner$encapsulate = c(train = "callr", predict = "callr")

Things I would like to see:

  • Some intro into how mlr3 unnests the benchmark, and how this works if tuning is involved.
  • An explanation of the cases in which we get nested futures etc. (see the sketch after this list).
  • Can I restart failed jobs, and how?
  • What happens if I see, e.g., that a Learner used in the benchmark was misconfigured?
    Can I fix the learner and restart the jobs?
    Or more specifically, what is the point at which I should switch back to batchtools?
  • Which configuration parameters of future are relevant here?
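
On the nested-futures point, a small sketch of a two-level topology using future's list-based plan() (the outer level submits to SLURM, the inner level stays sequential; the template path is illustrative):

library(future)
library(future.batchtools)

# outer level: resampling iterations become SLURM jobs; inner level: sequential
plan(list(
  tweak(batchtools_slurm, template = "~/slurm.tmpl"),
  sequential
))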

describe column roles more clearly

The description of column roles is not sufficient when I first encounter it in section 2.2 (Tasks). There are a few lines of code showing how to change a role (although at first read it's not clear which of the last two of the following lines actually makes the change):

task$feature_names
## [1] "cyl"  "disp" "rn"

# working with a list of column vectors
task$col_roles$name = "rn"
task$col_roles$feature = setdiff(task$col_roles$feature, "rn")

And it appears, upon closer examination, that there are quite a number of different roles that columns can have. Maybe this is explained in more detail somewhere else and a link could be provided here? For example, I found lots of other potential roles:

> task$col_roles
$feature
[1] "cyl"  "disp" "rn"  

$target
[1] "mpg"

$name
character(0)

$order
character(0)

$stratum
character(0)

$group
character(0)

$weight
character(0)

`make pdf` broken

Ubuntu 19.10, everything current, make clean done, remotes update done. Yet

[...]
label: 98-appendix-004 (with options) 
List of 1
 $ echo: logi FALSE


   inline R code fragments


output file: mlr3book.knit.md

/home/edd/bin/pandoc +RTS -K512m -RTS mlr3book.utf8.md --to latex --from markdown+autolink_bare_uris+tex_math_single_backslash --output mlr3book.tex --self-contained --table-of-contents --toc-depth 4 --number-sections --highlight-style haddock --pdf-engine xelatex --natbib --include-in-header preamble.tex --variable graphics --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmd/lua/pagebreak.lua --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmd/lua/latex-div.lua --wrap preserve --include-in-header /tmp/RtmpXmq6Jr/rmarkdown-str1e8b5f04364e.html --variable tables=yes --standalone 
! Argument of \@sect has an extra }.
<inserted text> 
                \par 
l.3768 ...Impute}}\label{imputation-pipeopimpute}}

Error: Failed to compile mlr3book.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips. See mlr3book.log for more info.
In addition: There were 14 warnings (use warnings() to see them)
Execution halted
make: *** [Makefile:29: pdf] Error 1
edd@rob:~/git/mlr3book(master)$ 

with

(Font)              scaled to size 8.00085pt on input line 3768.
LaTeX Font Info:    Font shape `TU/latinmodern-math.otf(3)/m/n' will be
(Font)              scaled to size 11.99872pt on input line 3768.
LaTeX Font Info:    Font shape `TU/latinmodern-math.otf(3)/m/n' will be
(Font)              scaled to size 7.99915pt on input line 3768.
! Argument of \@sect has an extra }.
<inserted text> 
                \par 
l.3768 ...Impute}}\label{imputation-pipeopimpute}}
                                                   
Here is how much of TeX's memory you used:
 33949 strings out of 494862
 621471 string characters out of 6175651
 1000110 words of memory out of 5000000
 37450 multiletter control sequences out of 15000+600000
 543804 words of font info for 126 fonts, out of 8000000 for 9000
 14 hyphenation exceptions out of 8191
 60i,9n,118p,9107b,601s stack positions out of 5000i,500n,10000p,200000b,80000s

Output written on mlr3book.pdf (129 pages).

edd@rob:~/git/mlr3book(master)$ tail -20 bookdown/mlr3book.log 

Happy to run any diagnostics. I poked around a little to no avail last evening and didn't manage to fix it.

Edit: Accidentally had a make html log at first, now showing make pdf. The actual TeX logfile was accurate.

2.7.2 Binary classification - Threshold Tuning incomplete

Section needs to be extended.

current content:
"When we are interested in class labels based on scores or probabilities, we can set the classification threshold according to our target performance measure.
This threshold however can also be tuned, since the optimal threshold might differ for different (custom) measures or in situations like const-sensitive classification.

This can be also done with mlr3."
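
As a possible starting point for extending the section, a sketch using PipeOpTuneThreshold from mlr3pipelines, which optimizes the threshold on cross-validated predictions (the measure and learner are illustrative):

library(mlr3)
library(mlr3pipelines)

task = tsk("sonar")
# cross-validated probabilities feed the threshold optimization
graph = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>%
  po("tunethreshold", measure = msr("classif.ce"))
glrn = GraphLearner$new(graph)
resample(task, glrn, rsmp("cv", folds = 3))$aggregate(msr("classif.ce"))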

Fix texlive installation

Tinytex is not working (the unicode-math package required for xelatex is missing, and tlmgr is not working properly). This could be a temporary problem; I haven't found any bug reports.

Either switch back to tinytex, switch back to xelatex, or switch to circle-ci.

Cannot serve / build the book on Mac

I get this error when trying to serve or build the book on a Mac:

Quitting from lines 759-774 (mlr3book.Rmd) 
Error in ref("expand_grid()") : 
  Could not find help page for topic 'expand_grid()'
Calls: <Anonymous> ... inline_exec -> hook_eval -> withVisible -> eval -> eval -> ref
Execution halted
