mlr3pipelines's Introduction

mlr3pipelines

Package website: release | dev

Dataflow Programming for Machine Learning in R.

What is mlr3pipelines?

Watch our “WhyR 2020” Webinar Presentation on YouTube for an introduction! Find the slides here.

mlr3pipelines is a dataflow programming toolkit for machine learning in R utilising the mlr3 package. Machine learning workflows can be written as directed “Graphs” that represent data flows between preprocessing, model fitting, and ensemble learning units in an expressive and intuitive language. Using methods from the mlr3tuning package, it is even possible to simultaneously optimize parameters of multiple processing units.

In principle, mlr3pipelines is about defining singular data and model manipulation steps as “PipeOps”:

pca        = po("pca")
filter     = po("filter", filter = mlr3filters::flt("variance"), filter.frac = 0.5)
learner_po = po("learner", learner = lrn("classif.rpart"))

These PipeOps can then be combined to define machine learning pipelines, which can be wrapped in a GraphLearner that behaves like any other Learner in mlr3.

graph = pca %>>% filter %>>% learner_po
glrn = GraphLearner$new(graph)

This learner can be used for resampling, benchmarking, and even tuning.

resample(tsk("iris"), glrn, rsmp("cv"))
#> <ResampleResult> of 10 iterations
#> * Task: iris
#> * Learner: pca.variance.classif.rpart
#> * Warnings: 0 in 0 iterations
#> * Errors: 0 in 0 iterations
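
For example, tuning can jointly optimize preprocessing and learner hyperparameters. A minimal sketch, assuming the mlr3tuning and paradox APIs (parameter ids are prefixed by the PipeOp ids in the graph above):

library(mlr3tuning)
library(paradox)

# search over the filter fraction and the rpart complexity jointly;
# ids follow the op ids in the graph ("variance", "classif.rpart")
search_space = ps(
  variance.filter.frac = p_dbl(0.25, 1),
  classif.rpart.cp     = p_dbl(0.001, 0.1)
)

instance = tune(
  tuner        = tnr("random_search"),
  task         = tsk("iris"),
  learner      = glrn,
  resampling   = rsmp("cv", folds = 3),
  search_space = search_space,
  term_evals   = 10
)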

Feature Overview

Single computational steps can be represented as so-called PipeOps, which can then be connected with directed edges in a Graph. The scope of mlr3pipelines is still growing; currently supported features are:

  • Simple data manipulation and preprocessing operations, e.g. PCA, feature filtering
  • Task subsampling for speed and outcome class imbalance handling
  • mlr3 Learner operations for prediction and stacking
  • Simultaneous path branching (data going both ways)
  • Alternative path branching (data going one specific way, controlled by hyperparameters)
  • Ensemble methods and aggregation of predictions

Documentation

A good way to get into mlr3pipelines is via the following two vignettes:

Bugs, Questions, Feedback

mlr3pipelines is a free and open source software project that encourages participation and feedback. If you have any issues, questions, suggestions or feedback, please do not hesitate to open an “issue” about it on the GitHub page!

In case of problems / bugs, it is often helpful if you provide a “minimum working example” that showcases the behaviour (but don’t worry about this if the bug is obvious).

Please understand that the resources of the project are limited: response may sometimes be delayed by a few days, and some feature suggestions may be rejected if they are deemed too tangential to the vision behind the project.

Citing mlr3pipelines

If you use mlr3pipelines, please cite our JMLR article:

@Article{mlr3pipelines,
  title = {{mlr3pipelines} - Flexible Machine Learning Pipelines in R},
  author = {Martin Binder and Florian Pfisterer and Michel Lang and Lennart Schneider and Lars Kotthoff and Bernd Bischl},
  journal = {Journal of Machine Learning Research},
  year = {2021},
  volume = {22},
  number = {184},
  pages = {1-7},
  url = {https://jmlr.org/papers/v22/21-0281.html},
}

Similar Projects

A predecessor of this package is the mlrCPO package, which works with mlr 2.x. Other packages that provide, to varying degrees, preprocessing functionality or a machine-learning domain-specific language are the caret package, the related recipes project, and the dplyr package.

mlr3pipelines's Issues

we need a good printer for graphs

  1. The graphical plotter should plot the graph as is.

  2. The printer needs to show a compact form on the console;
    we could implement this by topological sorting.

By that I mean we sort the graph into layers, then show it like this:

[ids, 1] >> [ids, 3] >> [ids, 2]

The number is the number of elements in the layer; ids is a concatenated string of the ids.
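
A minimal sketch of this layered printing in plain R, assuming the graph is given as a named list mapping each node id to the ids of its successors (and assuming a DAG):

compact_print = function(edges) {
  ids = names(edges)
  # in-degree of every node
  indeg = setNames(integer(length(ids)), ids)
  for (succ in edges) indeg[succ] = indeg[succ] + 1
  layers = character(0)
  while (length(ids)) {
    # current layer: all remaining nodes with no remaining predecessors
    layer = ids[indeg[ids] == 0]
    layers[[length(layers) + 1]] =
      sprintf("[%s, %d]", paste(layer, collapse = ","), length(layer))
    # remove the layer and decrement in-degrees of its successors
    for (id in layer) indeg[edges[[id]]] = indeg[edges[[id]]] - 1
    ids = setdiff(ids, layer)
  }
  paste(layers, collapse = " >> ")
}

compact_print(list(pca = "rpart", scale = "rpart", rpart = character(0)))
#> "[pca,scale, 2] >> [rpart, 1]"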

Repository name

After we have created mlr3learners and mlr3tuning, I'd suggest mlr3pipelines or mlr3pipes here.

Test usecases with mlr3

Test whether they would work hypothetically with the current API design.
Later on, verify in code.

define how the "chunk / partition data" pattern can be written with the current graph language

Michel came up with this use case today.

Imagine you break the data set up into k (let's say 3) partitions. Then we train a model on each partition and do model averaging. This is a common, very old "trick" for simple handling of large data. But more importantly: if our new "language" is well designed, we should be able to represent this pattern.

So, one variant is this:

op1 = PipeOpDownsample(rate=0.6)
op2 = PipeOpLearner("classif.rpart")
g = rep(3, op1 %>>% op2)

Why does this work? Because the downsamplings are independent of each other; in the partitioning example above they are not.
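
For reference, a sketch of how this pattern can be written with the released API (assuming ppl("greplicate") and po("classifavg")):

library(mlr3)
library(mlr3pipelines)

# one path: subsample the task, then fit a tree on the subsample
single_path = po("subsample", frac = 0.6) %>>% po("learner", lrn("classif.rpart"))

# replicate the path 3 times and average the resulting predictions
graph = ppl("greplicate", single_path, n = 3) %>>% po("classifavg", innum = 3)
graph$train(tsk("iris"))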

GraphNode: has_no_nexts has_no_prevs has_next

I am unsure if we need these, but maybe they make code more readable.

But we don't use them consistently.
Also, it's weird to have both "positive" and "negative" forms.

How about we have

is_lhs
is_rhs

instead? That can also be easily negated, and this seems to be the context in which it is used most often.

Graph: map function in parallel?

Should this actually be parallelizable? Currently I think it isn't, because of the way we traverse the graph. Can we do that smarter?

Should we collect all nodes in a list and then map the function at the end?
Actually, being able to treat the graph as a list might be nice anyway.

class Multiplexer

The concept.md file describes a class called Multiplexer. I think it might be an interesting thing, so I just wanted to define here some of its use cases.

Feel free to extend this. I think it will be much easier to define the Multiplexer once we see how it can be used.

Case 1:

## Select one transformation from a few alternatives
      B1
A --> B2  - -> C --> ... -> result
      B3

Case 2:

## Select one algorithm from a few alternatives
## C1-C3 are different algorithms
## useful for finding the best type of model
## (e.g. C1 is glmnet, C2 a random forest, C3 something else)

             C1
A --> B  --> C2 --> result
             C3 

Possible problems?

  • make sure that each alternative returns the same object? (is it a real issue?)
  • parameterization (defaults, etc.?), important for tuning.
    • how to parameterize the multiplexer for tuning? E.g., if A1 is selected, use tuning params a, b, c; for A2, d, f; and so on. Do we need to be able to define something like this? (A sketch follows below.)
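
For reference, this is roughly what later became branching; a sketch assuming the released po("branch")/po("unbranch") API, where the selected path is an ordinary hyperparameter of the graph and can therefore be tuned (with dependencies making each alternative's params active only when its branch is selected):

library(mlr3pipelines)

graph = po("branch", options = c("pca", "scale")) %>>%
  gunion(list(po("pca"), po("scale"))) %>>%
  po("unbranch", options = c("pca", "scale"))

# the active alternative is just a hyperparameter of the graph:
graph$param_set$values$branch.selection = "pca"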

Naming: train and train2

Currently, we have:

  • train: Internal train function used within pipeOp
  • train2: Function obtained from the concrete Op's like pipeOpScaler()

In order to make this clearer, rename as follows?

  • train_internal: Internal train function used within pipeOp
  • train: Function obtained from the concrete Op's like pipeOpScaler()

we really have to define how data objects flow through the pipeline.

Currently I can only see this:

  • each PO takes a task, transforms it, and returns a task
  • we have subclasses of POs to make the transformation easier (a sketch of the contract follows below):
    -- PipeOpFeatureTransformer: takes a dt and returns a dt;
    the contract is that dt is the feature table, and the number of rows and the targets are not changed.
    -- PipeOpTask: takes a task, returns a task
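
A sketch of what such a subclass contract could look like, in plain R6 (not the actual package API):

library(R6)
library(data.table)

PipeOpFeatureTransformer = R6Class("PipeOpFeatureTransformer",
  public = list(
    train = function(dt) {
      stopifnot(is.data.table(dt))
      out = private$transform(dt)
      stopifnot(nrow(out) == nrow(dt))  # enforce the contract: rows unchanged
      out
    }
  ),
  private = list(
    # overridden by concrete ops, e.g. a scaler
    transform = function(dt) dt
  )
)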

Graph$source_node or Graph$source_nodes

Right now the Graph class can only have one source node. I'm not sure if we want to stick with this, or whether we want to change it to a list of nodes.

Right now we can't define a graph list(a, b) %>>% c. In such a case we need to add a PipeOpNull at the beginning: PipeOpNull$new() %>>% list(a, b) %>>% c. We could add an S3 method %>>%.list that creates a graph with a PipeOpNull at the beginning, but to me that would be a quite ugly solution.
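
For reference, the released package went the multiple-sources route; a sketch assuming gunion() and po("featureunion"):

library(mlr3pipelines)

# a graph with two source nodes whose outputs are combined downstream
graph = gunion(list(po("pca"), po("ica"))) %>>% po("featureunion")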

Preproc + ParamSetRanges

A current problem in mlr that needs to be tackled in mlr3:

# Ranger + PCA preprocessing
lrn = makePreprocWrapperCaret("classif.ranger", ppc.pca = TRUE, ppc.thresh = 0.9)
# We set the max of mtry to "p"
p = getTaskNFeats(some_task)
ps = makeParamSet(
  makeIntegerParam("mtry", lower = 1, upper = p),
  makeNumericParam("ppc.thresh", lower = 0.5, upper = 1)
)
lrn = makeTuneWrapper(lrn, ps, ...)
resample(lrn, some_task, makeResampleDesc("CV"))

Current problem:
After PCA, only p* << p features remain. We tune over a largely invalid space.

  • @jakob-r Could you quickly explain where exactly the earlier problems w.r.t. symbols in the param set are?
  • We should probably be able to define additional parameters that then overwrite/bind to other params during training (a paradox-based sketch follows below).
    One example could be:
makeNumericParam("mtry.perc", lower = 0.1, upper = 1, overwrites = "mtry", trafo = function(x) x * p)

Graph: map function seems weird

The whole function just maps a function over all nodes, but it is implemented in a pretty complicated way, either as a breadth-first or a depth-first search.

Also, don't we simply need an iterator over the nodes? Isn't that the best way to go about this, especially if map does not care about the order?

Caching

After dealing with the idea of caching in mlr recently, I think this is an important topic for mlr3.
It would be a core base feature and should be integrated right from the start.

While in mlr I just did it for caching filter values for now, we should think of implementing it as a package option and make it available for all calls (resample, train, tuning, filtering, etc.).

Most calls (dataset, learner, hyperpars) are unique, so caching won't have as much of an effect as for filtering (for which the call that generates the filter values is always the same, and the subsetting happens afterwards).

However, it can also have a positive effect on "normal" train/test calls:

  • If a run (resample, tuneParams, benchmark) errors and a seed is set, the user can just rerun and profit from the cached calls
  • For tuning methods like grid search, settings may be evaluated redundantly more often, and the user can profit from caching
  • Most often it will apply to simple train/test calls without tuning

I've added the functions delete_cache() and get_cache_dir() in my mlr PR to make cache handling more convenient. We could think about an own Cache class for such things.
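
A sketch of the filter-value case using the memoise package (an assumption about the mechanism, not what the mlr PR actually does):

library(memoise)

compute_filter_values = function(task_hash, filter_name) {
  # ... expensive filter computation ...
}

# cache results on disk, keyed by the function arguments
compute_filter_values = memoise(
  compute_filter_values,
  cache = cachem::cache_disk("~/.cache/mlr_filters")
)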

Please share your opinions.

Implement try(Catch) pipeop

                   tryTrain+---------+
                +--------->+         |
                |tryPredict| nextOpA |
          +--------------->+         |
+---------+tryOp|          +---------+
          +-----+
                +--------->----------+
                |catchTrain|         |
                +--------->+ nextOpB |
               catchPredict|         |
                           +---------+

Cases

  • Train of nextOpA works: we can use nextOpA for prediction.
  • Train of nextOpA fails: do something with the error (or not) and train nextOpB; we then have to use nextOpB for prediction.
  • Train of nextOpA works, so we can use nextOpA for prediction, but this prediction fails. Now what? Actually you would like to go back in time and train nextOpB, but this is not possible during training.
    --> So we need a tryPredict pipeop as well, or a 3rd edge in tryOp for predictFail? Do we have such pipeops? (A train-time sketch follows below.)
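
A sketch of the train-time half in plain R (train_try_op is a hypothetical helper, not an existing PipeOp):

train_try_op = function(primary, fallback, task) {
  # the returned state records which op was trained, so prediction
  # can later be routed to the same op
  tryCatch(
    list(which = "primary", model = primary$train(task)),
    error = function(e) {
      message("primary failed, falling back: ", conditionMessage(e))
      list(which = "fallback", model = fallback$train(task))
    }
  )
}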

PipelineLearner

I think that after creation, the graph should be stored inside a PipelineLearner class.

As we discussed, it will be required that the last node is a Learner, so its parameters like task_type and predict_types can be copied to the PipelineLearner. Other parameters like packages will be gathered during the initialization of the object.

The object will be created by passing the first node of the graph, or by passing a list of PipeOps: PipelineLearner$new(list(op1, op2, op3)).

The train method will call the trainGraph function.

I'm still thinking about how to manage the parameters for each node.
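
For reference, this idea eventually materialized as GraphLearner (see the README above), which exposes the parameters of all nodes through a single param_set; a quick sketch:

library(mlr3)
library(mlr3pipelines)

glrn = GraphLearner$new(po("pca") %>>% po("learner", lrn("classif.rpart")))
# node parameters are addressed as "<op id>.<param id>":
glrn$param_set$values$classif.rpart.cp = 0.05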

Use case: Bagging

k = 100
op1 = PipeOpNULL$new()
op2 = PipeOpDownSample$new(rate = 0.6)
ops2 = repop(k, op2) # Auto-set Ids? # replicate with s3?
op3 = PipeOpLearner$new("classif.rpart")
ops3 = repop(k, op3)
op4 = PipeOpEnsembleAverage$new() # this one just has to know that it only takes learners as input

g1 = GraphNode$new(op1)
gs2 = lapply(ops2, GraphNode$new)
gs3 = lapply(ops3, GraphNode$new)
g4 = GraphNode$new(op4)

g1$set_next(gs2)
for (i in 1:k)
  gs2[[i]]$set_next(gs3[[i]])
g4$set_prev(gs3)

Can we write the above in a shorter, better way?

op1 = PipeOpNULL$new()
op2 = PipeOpDownSample$new(rate = 0.6)
op3 = PipeOpLearner$new("classif.rpart")
op4 = PipeOpEnsembleAverage$new()

Pipeline$new(list(op1, rep(k, op2), rep(k, op3), op4))

My comments:

I think this is very important stuff for the whole project. I bet that making bagging easy to define using the pipeline will make the whole functionality easy to use.

Let's start with Pipeline$new(list(op1, rep(k, op2), rep(k, op3), op4)). I think it would be better to rephrase this as:

p1 <- Pipeline$new(list(op2, op3))
Pipeline$new(list(op1, rep(k, p1), op4))

# or using some sugar
op1 %>>% rep(k, op2 %>>% op3) %>>% op4

It might be a bit easier to reason about because we know which part will be replicated, and we don't need to worry about the sizes of op2 and op3.

     B1->C1 \
A -> B2->C2 -> D
     B3->C3 /

Probably we need to define what happens when a node has multiple predecessors. For me, the most natural way is to bind their results together and send them to the next node (but what happens when one of the previous nodes returns a SparseMatrix and the second a data.frame? I don't know yet).

So

A \
B --> D
C /

So in that case, D gets all the results from all previous nodes.

The more interesting problem is when there are multiple successors.

A \   X
B --> Y
C /   Z

Probably the easiest solution is that each of the nodes X, Y, Z gets as input all results from the previous nodes. So when two lists of nodes are concatenated, all nodes from the first list are set as predecessors of the nodes from the second list.

Probably the method name set_prev should be renamed to add_prev, because each node will be able to have multiple predecessors.

So we can rephrase the previous example in pseudocode as:

p1 <- list(A, B, C)
p2 <- list(X, Y, Z)
op3 <- ...

Pipeline$new(list(p1, p2, op3))
## It will cause the following calls:
X$add_prev(A); X$add_prev(B); X$add_prev(C)
Y$add_prev(A); Y$add_prev(B); Y$add_prev(C)
Z$add_prev(A); Z$add_prev(B); Z$add_prev(C)
... operations for op3

# when the size of p1 is not equal to p2
# it works the same way
p1 <- list(A, B, C)
p2 <- list(X, Y)
op3 <- ...

Pipeline$new(list(p1, p2, op3))
X$add_prev(A); X$add_prev(B); X$add_prev(C)
Y$add_prev(A); Y$add_prev(B); Y$add_prev(C)
... operations for op3

Reinstantiating classes

So as I understand it, we have to instantiate a new task after manipulating the data in the PipeOps.
I guess it would make sense to have helper functions for this, as in most cases task type and target etc. will stay the same.

Does anybody see a better alternative?
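
A sketch of such a helper, assuming the mlr3 task constructors ("retask" is a hypothetical name):

library(mlr3)

retask = function(task, new_features) {
  # keep the original target column alongside the new feature table
  backend = cbind(new_features, task$data(cols = task$target_names))
  constructor = switch(task$task_type,
    classif = TaskClassif$new,
    regr = TaskRegr$new
  )
  # rebuild the task, keeping id, type, and target
  constructor(task$id, backend, target = task$target_names)
}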
