mlr3pipelines's Introduction

mlr3pipelines

Package website: release | dev

Dataflow Programming for Machine Learning in R.

What is mlr3pipelines?

Watch our “WhyR 2020” Webinar Presentation on YouTube for an introduction! Find the slides here.

mlr3pipelines is a dataflow programming toolkit for machine learning in R utilising the mlr3 package. Machine learning workflows can be written as directed “Graphs” that represent data flows between preprocessing, model fitting, and ensemble learning units in an expressive and intuitive language. Using methods from the mlr3tuning package, it is even possible to simultaneously optimize parameters of multiple processing units.

In principle, mlr3pipelines is about defining singular data and model manipulation steps as “PipeOps”:

pca        = po("pca")
filter     = po("filter", filter = mlr3filters::flt("variance"), filter.frac = 0.5)
learner_po = po("learner", learner = lrn("classif.rpart"))

These PipeOps can then be combined to define machine learning pipelines, which can be wrapped in a GraphLearner that behaves like any other Learner in mlr3.

graph = pca %>>% filter %>>% learner_po
glrn = GraphLearner$new(graph)

This learner can be used for resampling, benchmarking, and even tuning.

resample(tsk("iris"), glrn, rsmp("cv"))
#> <ResampleResult> of 10 iterations
#> * Task: iris
#> * Learner: pca.variance.classif.rpart
#> * Warnings: 0 in 0 iterations
#> * Errors: 0 in 0 iterations
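
For example, tuning can jointly optimize preprocessing and learner hyperparameters. A minimal sketch, assuming the mlr3tuning and paradox APIs (parameter ids are prefixed by the PipeOp ids in the graph above):

library(mlr3tuning)
library(paradox)

# search over the filter fraction and the rpart complexity jointly;
# ids follow the op ids in the graph ("variance", "classif.rpart")
search_space = ps(
  variance.filter.frac = p_dbl(0.25, 1),
  classif.rpart.cp     = p_dbl(0.001, 0.1)
)

instance = tune(
  tuner        = tnr("random_search"),
  task         = tsk("iris"),
  learner      = glrn,
  resampling   = rsmp("cv", folds = 3),
  search_space = search_space,
  term_evals   = 10
)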

Feature Overview

Single computational steps can be represented as so-called PipeOps, which can then be connected with directed edges in a Graph. The scope of mlr3pipelines is still growing; currently supported features are:

  • Simple data manipulation and preprocessing operations, e.g. PCA, feature filtering
  • Task subsampling for speed and outcome class imbalance handling
  • mlr3 Learner operations for prediction and stacking
  • Simultaneous path branching (data going both ways)
  • Alternative path branching (data going one specific way, controlled by hyperparameters)
  • Ensemble methods and aggregation of predictions

Documentation

A good way to get into mlr3pipelines is via the following two vignettes:

Bugs, Questions, Feedback

mlr3pipelines is a free and open source software project that encourages participation and feedback. If you have any issues, questions, suggestions or feedback, please do not hesitate to open an “issue” about it on the GitHub page!

In case of problems / bugs, it is often helpful if you provide a “minimum working example” that showcases the behaviour (but don’t worry about this if the bug is obvious).

Please understand that the resources of the project are limited: response may sometimes be delayed by a few days, and some feature suggestions may be rejected if they are deemed too tangential to the vision behind the project.

Citing mlr3pipelines

If you use mlr3pipelines, please cite our JMLR article:

@Article{mlr3pipelines,
  title = {{mlr3pipelines} - Flexible Machine Learning Pipelines in R},
  author = {Martin Binder and Florian Pfisterer and Michel Lang and Lennart Schneider and Lars Kotthoff and Bernd Bischl},
  journal = {Journal of Machine Learning Research},
  year = {2021},
  volume = {22},
  number = {184},
  pages = {1-7},
  url = {https://jmlr.org/papers/v22/21-0281.html},
}

Similar Projects

A predecessor of this package is the mlrCPO package, which works with mlr 2.x. Other packages that provide, to varying degrees, preprocessing functionality or a machine-learning domain-specific language are the caret package, the related recipes project, and the dplyr package.

mlr3pipelines's Issues

we need a good printer for graphs

  1. The graphical plotter should plot the graph as is.

  2. The printer needs to show a compact form on the console;
    we could implement this by topological sorting.

By that I mean we sort the graph into layers, then show it like this:

[ids, 1] >> [ids, 3] >> [ids, 2]

The number is the number of elements in the layer; ids is a concatenated string of the ids.
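
A minimal sketch of this layered printing in plain R, assuming the graph is given as a named list mapping each node id to the ids of its successors (and assuming a DAG):

compact_print = function(edges) {
  ids = names(edges)
  # in-degree of every node
  indeg = setNames(integer(length(ids)), ids)
  for (succ in edges) indeg[succ] = indeg[succ] + 1
  layers = character(0)
  while (length(ids)) {
    # current layer: all remaining nodes with no remaining predecessors
    layer = ids[indeg[ids] == 0]
    layers[[length(layers) + 1]] =
      sprintf("[%s, %d]", paste(layer, collapse = ","), length(layer))
    # remove the layer and decrement in-degrees of its successors
    for (id in layer) indeg[edges[[id]]] = indeg[edges[[id]]] - 1
    ids = setdiff(ids, layer)
  }
  paste(layers, collapse = " >> ")
}

compact_print(list(pca = "rpart", scale = "rpart", rpart = character(0)))
#> "[pca,scale, 2] >> [rpart, 1]"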

Repository name

After we have created mlr3learners and mlr3tuning, I'd suggest mlr3pipelines or mlr3pipes here.

Test usecases with mlr3

Test whether they would work hypothetically with the current API design.
Later on, verify in code.

define how the "chunk / partition data" pattern can be written with the current graph language

Michel came up with this use case today.

Imagine you break the data set up into k (let's say 3) partitions. Then we train a model on each partition and do model averaging. This is a common, very old "trick" for simple handling of large data. But more importantly: if our new "language" is well designed, we should be able to represent this pattern.

So, one variant is this:

op1 = PipeOpDownsample(rate=0.6)
op2 = PipeOpLearner("classif.rpart")
g = rep(3, op1 %>>% op2)

Why does this work? Because the downsamplings are independent of each other; in the partitioning example above they are not.
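
For reference, a sketch of how this pattern can be written with the released API (assuming ppl("greplicate") and po("classifavg")):

library(mlr3)
library(mlr3pipelines)

# one path: subsample the task, then fit a tree on the subsample
single_path = po("subsample", frac = 0.6) %>>% po("learner", lrn("classif.rpart"))

# replicate the path 3 times and average the resulting predictions
graph = ppl("greplicate", single_path, n = 3) %>>% po("classifavg", innum = 3)
graph$train(tsk("iris"))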

GraphNode: has_no_nexts has_no_prevs has_next

I am unsure if we need these, but maybe they make code more readable.

But we don't use them consistently.
Also, it's weird to have both "positive" and "negative" forms.

How about we have

is_lhs
is_rhs

instead? That can also be easily negated, and this seems to be the context in which it is used most often.

Graph: map function in parallel?

Should this actually be parallelizable? Currently I think it isn't, because of the way we traverse the graph. Can we do that smarter?

Should we collect all nodes in a list and then map the function at the end?
Actually, being able to treat the graph as a list might be nice anyway.

class Multiplexer

The concept.md file describes a class called Multiplexer. I think it might be an interesting thing, so I just wanted to define here some of its use cases.

Feel free to extend this. I think it will be much easier to define the Multiplexer once we see how it can be used.

Case 1:

## Select one transformation from a few alternatives
      B1
A --> B2  - -> C --> ... -> result
      B3

Case 2:

## Select one algorithm from a few alternatives
## C1-C3 are different algorithms
## useful for finding the best type of model
## (e.g. C1 is glmnet, C2 a random forest, C3 something else)

             C1
A --> B  --> C2 --> result
             C3 

Possible problems?

  • make sure that each alternative returns the same object? (is it a real issue?)
  • parameterization (defaults, etc.?), important for tuning.
    • how to parameterize the multiplexer for tuning? E.g., if A1 is selected, use tuning params a, b, c; for A2, d, f; and so on. Do we need to be able to define something like this? (A sketch follows below.)
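
For reference, this is roughly what later became branching; a sketch assuming the released po("branch")/po("unbranch") API, where the selected path is an ordinary hyperparameter of the graph and can therefore be tuned (with dependencies making each alternative's params active only when its branch is selected):

library(mlr3pipelines)

graph = po("branch", options = c("pca", "scale")) %>>%
  gunion(list(po("pca"), po("scale"))) %>>%
  po("unbranch", options = c("pca", "scale"))

# the active alternative is just a hyperparameter of the graph:
graph$param_set$values$branch.selection = "pca"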

Naming: train and train2

Currently, we have:

  • train: Internal train function used within pipeOp
  • train2: Function obtained from the concrete Op's like pipeOpScaler()

In order to make this clearer, rename as follows?

  • train_internal: Internal train function used within pipeOp
  • train: Function obtained from the concrete Op's like pipeOpScaler()

we really have to define how data objects flow through the pipeline.

Currently I can only see this:

  • each PO takes a task, transforms it, and returns a task
  • we have subclasses of POs to make the transformation easier (a sketch of the contract follows below):
    -- PipeOpFeatureTransformer: takes a dt and returns a dt;
    the contract is that dt is the feature table, and the number of rows and the targets are not changed.
    -- PipeOpTask: takes a task, returns a task
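
A sketch of what such a subclass contract could look like, in plain R6 (not the actual package API):

library(R6)
library(data.table)

PipeOpFeatureTransformer = R6Class("PipeOpFeatureTransformer",
  public = list(
    train = function(dt) {
      stopifnot(is.data.table(dt))
      out = private$transform(dt)
      stopifnot(nrow(out) == nrow(dt))  # enforce the contract: rows unchanged
      out
    }
  ),
  private = list(
    # overridden by concrete ops, e.g. a scaler
    transform = function(dt) dt
  )
)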

Graph$source_node or Graph$source_nodes

Right now the Graph class can only have one source node. I'm not sure if we want to stick with this, or whether we want to change it to a list of nodes.

Right now we can't define a graph list(a, b) %>>% c. In such a case we need to add a PipeOpNull at the beginning: PipeOpNull$new() %>>% list(a, b) %>>% c. We could add an S3 method %>>%.list that creates a graph with a PipeOpNull at the beginning, but to me that would be a quite ugly solution.
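
For reference, the released package went the multiple-sources route; a sketch assuming gunion() and po("featureunion"):

library(mlr3pipelines)

# a graph with two source nodes whose outputs are combined downstream
graph = gunion(list(po("pca"), po("ica"))) %>>% po("featureunion")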

Preproc + ParamSetRanges

A current problem in mlr that needs to be tackled in mlr3:

# Ranger + PCA preprocessing
lrn = makePreprocWrapperCaret("classif.ranger", ppc.pca = TRUE, ppc.thresh = 0.9)
# We set the max of mtry to "p"
p = getTaskNFeats(some_task)
ps = makeParamSet(
  makeIntegerParam("mtry", lower = 1, upper = p),
  makeNumericParam("ppc.thresh", lower = 0.5, upper = 1)
)
lrn = makeTuneWrapper(lrn, ps, ...)
resample(lrn, some_task, makeResampleDesc("CV"))

Current problem:
After PCA, only p* << p features remain. We tune over a largely invalid space.

  • @jakob-r Could you quickly explain where exactly the earlier problems w.r.t. symbols in the param set are?
  • We should probably be able to define additional parameters that then overwrite/bind to other params during training (a paradox-based sketch follows below).
    One example could be:
makeNumericParam("mtry.perc", lower = 0.1, upper = 1, overwrites = "mtry", trafo = function(x) x * p)

Graph: map function seems weird

The whole function just maps a function over all nodes, but it is implemented in a pretty complicated way, either as a breadth-first or a depth-first search.

Also, don't we simply need an iterator over the nodes? Isn't that the best way to go about this, especially if map does not care about the order?

Caching

After dealing with the idea of caching in mlr recently, I think this is an important topic for mlr3.
It would be a core base feature and should be integrated right from the start.

While in mlr I just did it for caching filter values for now, we should think of implementing it as a package option and make it available for all calls (resample, train, tuning, filtering, etc.).

Most calls (dataset, learner, hyperpars) are unique, so caching won't have as much of an effect as for filtering (for which the call that generates the filter values is always the same, and the subsetting happens afterwards).

However, it can also have a positive effect on "normal" train/test calls:

  • If a run (resample, tuneParams, benchmark) errors and a seed is set, the user can just rerun and profit from the cached calls
  • For tuning methods like grid search, settings may be evaluated redundantly more often, and the user can profit from caching
  • Most often it will apply to simple train/test calls without tuning

I've added the functions delete_cache() and get_cache_dir() in my mlr PR to make cache handling more convenient. We could think about an own Cache class for such things.
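
A sketch of the filter-value case using the memoise package (an assumption about the mechanism, not what the mlr PR actually does):

library(memoise)

compute_filter_values = function(task_hash, filter_name) {
  # ... expensive filter computation ...
}

# cache results on disk, keyed by the function arguments
compute_filter_values = memoise(
  compute_filter_values,
  cache = cachem::cache_disk("~/.cache/mlr_filters")
)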

Please share your opinions.

Implement try(Catch) pipeop

                   tryTrain+---------+
                +--------->+         |
                |tryPredict| nextOpA |
          +--------------->+         |
+---------+tryOp|          +---------+
          +-----+
                +--------->----------+
                |catchTrain|         |
                +--------->+ nextOpB |
               catchPredict|         |
                           +---------+

Cases

  • Train of nextOpA works: we can use nextOpA for prediction.
  • Train of nextOpA fails: do something with the error (or not) and train nextOpB; we then have to use nextOpB for prediction.
  • Train of nextOpA works, so we can use nextOpA for prediction, but this prediction fails. Now what? Actually you would like to go back in time and train nextOpB, but this is not possible during training.
    --> So we need a tryPredict pipeop as well, or a 3rd edge in tryOp for predictFail? Do we have such pipeops? (A train-time sketch follows below.)
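
A sketch of the train-time half in plain R (train_try_op is a hypothetical helper, not an existing PipeOp):

train_try_op = function(primary, fallback, task) {
  # the returned state records which op was trained, so prediction
  # can later be routed to the same op
  tryCatch(
    list(which = "primary", model = primary$train(task)),
    error = function(e) {
      message("primary failed, falling back: ", conditionMessage(e))
      list(which = "fallback", model = fallback$train(task))
    }
  )
}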

PipelineLearner

I think that after creation, the graph should be stored inside a PipelineLearner class.

As we discussed, it will be required that the last node is a Learner, so its parameters like task_type and predict_types can be copied to the PipelineLearner. Other parameters like packages will be gathered during the initialization of the object.

The object will be created by passing the first node of the graph, or by passing a list of PipeOps: PipelineLearner$new(list(op1, op2, op3)).

The train method will call the trainGraph function.

I'm still thinking about how to manage the parameters for each node.
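
For reference, this idea eventually materialized as GraphLearner (see the README above), which exposes the parameters of all nodes through a single param_set; a quick sketch:

library(mlr3)
library(mlr3pipelines)

glrn = GraphLearner$new(po("pca") %>>% po("learner", lrn("classif.rpart")))
# node parameters are addressed as "<op id>.<param id>":
glrn$param_set$values$classif.rpart.cp = 0.05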

Use case: Bagging

k = 100
op1 = PipeOpNULL$new()
op2 = PipeOpDownSample$new(rate = 0.6)
ops2 = repop(k, op2) # Auto-set Ids? # replicate with s3?
op3 = PipeOpLearner$new("classif.rpart")
ops3 = repop(k, op3)
op4 = PipeOpEnsembleAverage$new() # this one just has to know that it only takes learners as input

g1 = GraphNode$new(op1)
gs2 = lapply(ops2, GraphNode$new)
gs3 = lapply(ops3, GraphNode$new)
g4 = GraphNode$new(op4)

g1$set_next(gs2)
for (i in 1:k)
  gs2[[i]]$set_next(gs3[[i]])
g4$set_prev(gs3)

Can we write the above in a shorter, better way?

op1 = PipeOpNULL$new()
op2 = PipeOpDownSample$new(rate = 0.6)
op3 = PipeOpLearner$new("classif.rpart")
op4 = PipeOpEnsembleAverage$new()

Pipeline$new(list(op1, rep(k, op2), rep(k, op3), op4))

My comments:

I think this is very important stuff for the whole project. I bet that making bagging easy to define using the pipeline will make the whole functionality easy to use.

Let's start with Pipeline$new(list(op1, rep(k, op2), rep(k, op3), op4)). I think it would be better to rephrase this as:

p1 <- Pipeline$new(list(op2, op3))
Pipeline$new(list(op1, rep(k, p1), op4))

# or using some sugar
op1 %>>% rep(k, op2 %>>% op3) %>>% op4

It might be a bit easier to reason about because we know which part will be replicated, and we don't need to worry about the sizes of op2 and op3.

     B1->C1 \
A -> B2->C2 -> D
     B3->C3 /

Probably we need to define what happens when a node has multiple predecessors. For me, the most natural way is to bind their results together and send them to the next node (but what happens when one of the previous nodes returns a SparseMatrix and the second a data.frame? I don't know yet).

So

A \
B --> D
C /

So in that case, D gets all the results from all previous nodes.

The more interesting problem is when there are multiple successors.

A \   X
B --> Y
C /   Z

Probably the easiest solution is that each of the nodes X, Y, Z gets as input all results from the previous nodes. So when two lists of nodes are concatenated, all nodes from the first list are set as predecessors of the nodes from the second list.

Probably the method name set_prev should be renamed to add_prev, because each node will be able to have multiple predecessors.

So we can rephrase the previous example in pseudocode as:

p1 <- list(A, B, C)
p2 <- list(X, Y, Z)
op3 <- ...

Pipeline$new(list(p1, p2, op3))
## It will cause the following calls:
X$add_prev(A); X$add_prev(B); X$add_prev(C)
Y$add_prev(A); Y$add_prev(B); Y$add_prev(C)
Z$add_prev(A); Z$add_prev(B); Z$add_prev(C)
... operations for op3

# when the size of p1 is not equal to p2
# it works the same way
p1 <- list(A, B, C)
p2 <- list(X, Y)
op3 <- ...

Pipeline$new(list(p1, p2, op3))
X$add_prev(A); X$add_prev(B); X$add_prev(C)
Y$add_prev(A); Y$add_prev(B); Y$add_prev(C)
... operations for op3

Reinstantiating classes

So as I understand it, we have to instantiate a new task after manipulating the data in the PipeOps.
I guess it would make sense to have helper functions for this, as in most cases task type and target etc. will stay the same.

Does anybody see a better alternative?
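
A sketch of such a helper, assuming the mlr3 task constructors ("retask" is a hypothetical name):

library(mlr3)

retask = function(task, new_features) {
  # keep the original target column alongside the new feature table
  backend = cbind(new_features, task$data(cols = task$target_names))
  constructor = switch(task$task_type,
    classif = TaskClassif$new,
    regr = TaskRegr$new
  )
  # rebuild the task, keeping id, type, and target
  constructor(task$id, backend, target = task$target_names)
}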
