
drake



Data frames in R for Make

Drake is a workflow manager and build system for

  1. Reproducibility.
  2. High-performance computing.

Organize your work in a data frame.

library(drake)
load_basic_example()
my_plan
##                    target                                      command
## 1             'report.md'             knit('report.Rmd', quiet = TRUE)
## 2                   small                                  simulate(5)
## 3                   large                                 simulate(50)
## 4       regression1_small                                  reg1(small)
## 5       regression1_large                                  reg1(large)
## 6       regression2_small                                  reg2(small)
## 7       regression2_large                                  reg2(large)
## 8  summ_regression1_small suppressWarnings(summary(regression1_small))
## 9  summ_regression1_large suppressWarnings(summary(regression1_large))
## 10 summ_regression2_small suppressWarnings(summary(regression2_small))
## 11 summ_regression2_large suppressWarnings(summary(regression2_large))
## 12 coef_regression1_small              coefficients(regression1_small)
## 13 coef_regression1_large              coefficients(regression1_large)
## 14 coef_regression2_small              coefficients(regression2_small)
## 15 coef_regression2_large              coefficients(regression2_large)

Then make() it to build all your targets.

make(my_plan)

If a target fails, diagnose it.

failed()                 # Targets that failed in the most recent `make()`
diagnose()               # Targets that failed in any previous `make()`
error <- diagnose(large) # Most recent verbose error log of `large`
str(error)               # Object of class "error"
error$calls              # Call stack / traceback

Installation

You can choose among different versions of drake:

install.packages("drake")                                  # Latest CRAN release.
install.packages("devtools")                               # For installing from GitHub.
library(devtools)
install_github("wlandau-lilly/[email protected]", build = TRUE) # Choose a GitHub tag/release.
install_github("wlandau-lilly/drake", build = TRUE)        # Development version.
  • You must properly install drake using install.packages(), devtools::install_github(), or similar. It is not enough to use devtools::load_all(), particularly for the parallel computing functionality, in which multiple R sessions initialize and then try to require(drake).
  • For make(..., parallelism = "Makefile"), Windows users need to download and install Rtools.

Quickstart

library(drake)
load_basic_example()     # Also (over)writes report.Rmd.
vis_drake_graph(my_plan) # Click, drag, pan, hover. See arguments 'from' and 'to'.
outdated(my_plan)        # Which targets need to be (re)built?
missed(my_plan)          # Are you missing anything from your workspace?
check_plan(my_plan)      # Are you missing files? Is your workflow plan okay?
make(my_plan)            # Run the workflow.
diagnose(large)          # View error info if the target "large" failed to build.
outdated(my_plan)        # Everything is up to date.
vis_drake_graph(my_plan) # The graph also shows what is up to date.

Dive deeper into the built-in examples.

drake_example("basic") # Write the code files of the canonical tutorial.
drake_examples()       # List the other examples.
vignette("quickstart") # https://cran.r-project.org/package=drake/vignettes/quickstart.html

Useful functions

make(), workplan(), failed(), and diagnose() are the most important functions. Beyond that, there are functions to learn about drake,

load_basic_example()
drake_tip()
drake_examples()
drake_example()

set up your workflow plan data frame,

workplan()
plan_analyses()
plan_summaries()
evaluate_plan()
expand_plan()
gather_plan()
wildcard() # From the wildcard package.
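As a sketch, you can assemble a small plan of your own with workplan(). The target names and commands below are illustrative, not part of the built-in basic example, and simulate() stands in for a function you would define yourself.

```r
library(drake)
# Hypothetical plan: each named argument becomes a target/command row.
my_small_plan <- workplan(
  data  = simulate(100),          # Assumes you have defined simulate().
  model = lm(y ~ x, data = data)  # `data` refers to the target above.
)
my_small_plan # A data frame with columns `target` and `command`.
```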

explore the dependency network,

outdated()
missed()
vis_drake_graph() # Same as drake_graph().
dataframes_graph()
render_drake_graph()
read_drake_graph()
deps()
knitr_deps()
tracked()

interact with the cache,

clean()
drake_gc()
cached()
imported()
built()
readd()
loadd()
find_project()
find_cache()

make use of recorded build times,

build_times()
predict_runtime()
rate_limiting_times()

speed up your project with parallel computing,

make() # with jobs > 2
max_useful_jobs()
parallelism_choices()
shell_file()

finely tune the caching and hashing,

available_hash_algos()
cache_path()
cache_types()
configure_cache()
default_long_hash_algo()
default_short_hash_algo()
long_hash()
short_hash()
new_cache()
recover_cache()
this_cache()
type_of_cache()
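As a sketch of the hashing options (exact defaults may vary by version):

```r
library(drake)
available_hash_algos()     # Hash algorithms drake can use.
default_short_hash_algo()  # Algorithm used to name cache files.
default_long_hash_algo()   # Algorithm used to fingerprint large objects.
cache <- new_cache()       # Start a fresh cache with default settings.
```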

and debug your work.

check_plan()
drake_config()
read_drake_config()
diagnose()
dependency_profile()
in_progress()
progress()
rescue_cache()
drake_session()

Reproducibility

There is room to improve the conversation and the landscape of reproducibility in the R and Statistics communities. At a more basic level than scientific replicability, literate programming, and version control, reproducibility carries an implicit promise that alleged computational results really do match the generating code. To reinforce this promise, drake fingerprints and watches dependencies and output, skipping computations that are already up to date.

library(drake)
load_basic_example()
outdated(my_plan)        # Which targets need to be (re)built?
make(my_plan)            # Build what needs to be built.
outdated(my_plan)        # Everything is up to date.
reg2 <- function(d){     # Change one of your functions.
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
outdated(my_plan)        # Some targets depend on reg2().
vis_drake_graph(my_plan) # See arguments 'from' and 'to'.
make(my_plan)            # Rebuild just the outdated targets.
outdated(my_plan)        # Everything is up to date again.
vis_drake_graph(my_plan) # The colors changed in the graph.

Similarly to imported functions like reg2(), drake reacts to changes in

  1. Other imported functions, whether user-defined or from packages.
  2. For imported functions from your environment, any nested functions also in your environment or from packages.
  3. Commands in your workflow plan data frame.
  4. Global variables mentioned in the commands or imported functions.
  5. Upstream targets.
  6. For dynamic knitr reports (with knit('your_report.Rmd') as a command in your workflow plan data frame), targets and imports mentioned in calls to readd() and loadd() in the code chunks to be evaluated. Drake treats these targets and imports as dependencies of the compiled output target (say, 'report.md').
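For example, a code chunk inside report.Rmd might read from the cache, and drake will detect it (a sketch; the chunk contents below are illustrative):

```r
# Inside a code chunk of report.Rmd:
library(drake)
loadd(small)                  # `small` becomes a dependency of 'report.md'.
readd(coef_regression2_small) # So does `coef_regression2_small`.
```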

With alternate triggers and the option to skip imports, you can sacrifice reproducibility to gain speed. However, these options throw the dependency network out of sync. You should only use them for testing and debugging, never for production.

make(..., skip_imports = TRUE, trigger = "missing")

Using different tools, you can enhance reproducibility beyond the scope of drake. Packrat creates a tightly-controlled local library of packages to extend the shelf life of your project. And with Docker, you can execute your project on a virtual machine to ensure platform independence. Together, packrat and Docker can help others reproduce your work even if they have different software and hardware.

High-performance computing

Similarly to Make, drake arranges the intermediate steps of your workflow in a dependency web. This network is the key to drake's parallel computing. For example, consider the network graph of the basic example.

library(drake)
load_basic_example()
make(my_plan, jobs = 2) # See also max_useful_jobs(my_plan).
# Change a dependency.
reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
# Run vis_drake_graph() yourself for interactivity.
# Then hover, click, drag, pan, and zoom.
vis_drake_graph(my_plan, width = "100%")

When you call make(my_plan, jobs = 4), the work proceeds in chronological order from left to right. The items are built or imported column by column in sequence, and up-to-date targets are skipped. Within each column, the targets/objects are all independent of each other conditional on the previous steps, so they are distributed over the 4 available parallel jobs/workers. Assuming the targets are rate-limiting (as opposed to imported objects), the next make(..., jobs = 4) should be faster than make(..., jobs = 1), but it would be superfluous to use more than 4 jobs. The max_useful_jobs() function suggests an appropriate number of jobs, taking into account which targets are already up to date.
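For example, you might ask drake to suggest a job count before building:

```r
library(drake)
load_basic_example()
max_useful_jobs(my_plan) # Upper bound on jobs that can run concurrently.
make(my_plan, jobs = 2)  # Deploy up to 2 parallel workers.
```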

As for the implementation, you can choose from a vast arsenal of parallel backends, from local multicore computing to future.batchtools-powered distributed computing compatible with most formal job schedulers. Please see the parallelism vignette for details.

vignette("parallelism") 

Acknowledgements and related work

Many thanks to these people for contributing amazing ideas and code patches early in the development of drake and its predecessors parallelRemake and remakeGenerator.

Special thanks to Jarad, my advisor from graduate school, for first introducing me to the idea of Makefiles for research. It took several months to convince me, and I am glad he succeeded.

The original idea of a time-saving reproducible build system extends back decades to GNU Make, which today helps data scientists as well as the original user base of compiled-language programmers. More recently, Rich FitzJohn created remake, a breakthrough reimagining of Make for R and the most important inspiration for drake. Drake is a fresh reinterpretation of some of remake's pioneering fundamental concepts, scaled up for computationally demanding workflows. There are many other pipeline toolkits, but few are R-focused.

In the sphere of reproducibility, drake and remake are examples of non-literate programming tools (as opposed to literate programming tools such as knitr). Counterparts include R.cache, archivist, trackr, and memoise. See the reproducible research CRAN task view for a more comprehensive list. Drake differentiates itself from these tools with its ability to track the relationships among cached objects and its extensive high-performance computing functionality.

Documentation

The CRAN page links to multiple rendered vignettes.

vignette(package = "drake") # List the vignettes.
vignette("caution")         # Avoid common pitfalls.
vignette("debug")           # Debugging and testing.
vignette("drake")           # High-level intro.
vignette("graph")           # Visualize the workflow graph.
vignette("quickstart")      # Walk through a simple example.
vignette("parallelism")     # High-performance computing.
vignette("storage")         # Learn how drake stores your stuff.
vignette("timing")          # Build times and runtime predictions.

Help and troubleshooting

Please refer to TROUBLESHOOTING.md on the GitHub page for instructions.

Contributing

Bug reports, suggestions, and code are welcome. Please see .github/CONTRIBUTING.md. Maintainers and contributors must follow this repository's code of conduct.

