richfitz / remake Goto Github PK

Make-like declarative workflows in R

License: Other

Makefile 0.25% R 99.75%

remake's Introduction

👋 Hi, I'm Rich

I work at the MRC Centre for Global Infectious Disease Analysis, Imperial College London, as head of the RESIDE research software engineering research.

Most projects in my personal namespace are either historical and/or personal projects that I wrote for fun (e.g. rainbowrite, rfiglet or stegasaur. Actively maintained projects here include redux and thor.

Most of my current software can be found in the mrc-ide, vimc and reside-ic organisations (among others)

(Profile photo from a particularly wet ascent of Cresent Climb, Pavey Ark in December 2019)

remake's People

Contributors

Stargazers

Watchers

remake's Issues

use_remake

As suggested by @aammd, generate a skeleton with some helpful starting points.

Dependency plot for specified node

diagram() function is awesome but currently only makes plot for all targets. Would it be hard to modify this to plot for specified target?

Port from rainbowrite to gaborcsardi/crayon

It's up on CRAN and therefore easier to depend on.

style <- make_style("orange") # can use R's colour names

Passing strings to functions

Error in .remake_add_targets_implied(obj) : 
  Implicitly created targets must all be files:
 - PET: (in pet.part.res) -- did you mean: pet.part.res

I'm just trying to pass a string "PET" to the function. Works fine outside of remake.

Not sure if this is a bug or if I'm not using remake properly.

Facility to always run some commands

Similar in some ways to make's .PHONY targets, some commands might want to always run (but after establishing that the dependencies are met).

Refactor knitr support

Make the functions that corral things into place for knitr more obvious and exported so it's not such a pain to keep up to date.

See

#41
#14
output script support (?)

Better warning when source fails

If one of sources in maker.yml has incorrect formatting (e.g. missing bracket) such that it fails to source this causes maker to choke. But there's little pointing to actual cause in current error message

Error in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) : 
  6:1: unexpected '}'
5:     dest)
6: }

Warn when some packages not installed?

Since maker.yml can include the list of packages required, and m$install_packages() installs the missing ones, would it be useful to have a warning like,

> m <- maker()
Warning: some required packages not installed. Run m$install_packages() or proceed at your own risk.

Or even as an argument:

m <- maker(install_packages=TRUE)

New package name, as maker won't be allowed on CRAN

The package needs a new name 😢

* checking CRAN incoming feasibility ... ERROR
Maintainer: ‘"Rich FitzJohn" <[email protected]>’
New submission
Conflicting package names (submitted: maker, existing: makeR [CRAN archive])
Components with restrictions and base license permitting such:
  BSD_2_clause + file LICENSE
File 'LICENSE':
  YEAR: 2014
  COPYRIGHT HOLDER: Richard G. FitzJohn

Here's the conflicting package.

Other conflicts:

Of course, Hadley pointed out that this was likely to be a problem a while ago, but I'd not run R CMD check with --as-cran. So, new name ideas.

This should only change the name of the actual package, main object/function/script and repo. References to make will probably stay as-is.

Ideas @mwpennell, @dfalster, @aammd? Something that a punny logo can be made from?

Plotting black boxes

This is for the remake object from https://github.com/wcornwell/TaxonLookup

Not sure that black box is optimal readability

remake doesn't treat functional sequences the same as functions

I finally managed to reproduce my bug! It turns out that changes in functional sequences composed with %>% don't trigger a rebuild of the target, but a true function will.

I've created a repository to hold my minimal reproducible example. in this commit (aammd/remake_fun_seq@9ee1161) i ran remake and got the expected result; in aammd/remake_fun_seq@0bf133a i changed a value in a functional sequence and ran remake::make(). This didn't trigger a rebuild of the target data3.

This might be an enhancement, not a bug, because you have other things to do besides pander to fashion victims like myself.

On the other hand, I'd argue that the function composition technique encouraged by magrittr is a very remake-ish way to code: there is only one input and one output, and it is easier to type (no need to write function(x) every time you want to simply transform an object).

Depend on package versions

I've removed this feature for now, because

It would require rebuilding after every upgrade of R (because tools, utils, etc will change version number)
If packages are loaded by a target, the version of those packages is not included in the list (because the version list is determined at maker startup). That then leads to random rebuilds if make is run a second time because now there is a new package loaded! (this interacts with #12).
Package version information is not granular enough for rapidly changing packages (e.g. if you are developing a package as part of a research problem then you'll want to depend on the code or the git version of that package)

A better interface might be to declare which packages to depend on in different ways (code inspection, number version, git version).

Support adding additional repos in remake_sources to support drat

Did you mean?

It's quite common to make spelling mistakes when writing maker files. The current error messages are OK, but could usefully suggest spelling as one of possible problems.

Better yet, you could add a "Did you mean" function, which compares the incorrect name against actual names and makes suggestions.

@RemkoDuursma mentioned a recent R package that might be helpful here, perhaps Rdym?

Script export function misses target-specific packages

As in this example this example, sometimes we have a maker file in which packages are loaded for specific target.

However, when I run

m <- maker::maker()
m$script()

those packages do not make it into the script, so that script fails to run.

Add rocco docs?

You can use this package to document your code.
Your code is already written in rocco compliant way! Sounds like an easy victory and makes reading your code easier.

Check out http://robertzk.github.io/cachemeifyoucan/ for an example

Let target commands take non-target arguments

e.g.

manuscript.html:
  command: markdown::render(knit_manuscript.md, output_format="html_document")

Variables for object names and filenames

Is it possible to use variables in place of object names and filenames? This was one aspect of make that seemed quite convenient.

For example, say I have 3 .csv files in the folder razzle/ : dazzle.csv, fiddle.csv and faddle.csv. I want to read them all into objects of the same name.

Must I write:

  dazzle:
    command: read.csv(file="razzle/dazzle.csv",stringsAsFactors=FALSE)

  fiddle:
    command: read.csv(file="razzle/fiddle.csv",stringsAsFactors=FALSE)

  faddle:
    command: read.csv(file="razzle/faddle.csv",stringsAsFactors=FALSE)

rather than:

 any_name:
    command: read.csv(any_name)
``

use cases

@richfitz don't know if you're looking for user cases to test with, but this could be one (recent paper by Matt Helmus etal. ): https://github.com/mrhelmus/Helmus_Nature_2014_AnthIB

Script/source based targets?

This may be against the spirit of maker, but in struggling to move from a workflow with a series of scripts/chunks to maker, where everything is a function. I would rather more easily re-use short scripts.

I think it may make sense for a target type to be source or environment, in which rather than a command, the target has a source command, and the target is a saved environment which is returned after the file is sourced. Targets depending on this target would then load that environment. e.g.

targets:
  processed_data:    
    source: processdata.R # The process_data target would be a saved environment

  firstplot:
    source: makeplot.R
    depends: processed_data # The process_data environment would be attached before running makeplot.R

  anobject:
    command: my_cmd()
    depends: process_data  # my_cmd() would be run in an environment with processed_data loaded. 
                            # If my_cmd takes non-target arguments (See #17), these should be able to 
                            # draw on objects in the processed_data environment.

Option to save exported script to file

Is it possible to add a feature to

m <- maker::maker()
m$script()

so that can save script directly to file, similar to writeLines

writeLines <- function (text, con = stdout(), ...) 
{
    if (is.character(con)) {
        con <- file(con, "w")
        on.exit(close(con))
    }
 ...
}

Download file fails if downloader not installed

See: traitecoevo/phyndr-ms#19 (comment)

Error message is very poor.

Would be solved if remake just Imported downloader rather than suggests.

cc: @mwpennell

More refinement in print levels

Currently we have:

verbose: always/never runs print_message
quiet / target_quiet: suppresses target use of message, print and cat.

Some more refinement will eventually be useful:

Only print targets that are actually built (i.e., skip targets showing "OK")
Skip printing for fake targets
Suppress warnings from targets

For big projects, it won't really be useful to see a billion OK's roll past; perhaps these should be pruned to branches that have had anything remade on them?

There are bound to be others, and I'll add them here until I have an idea for the scope of possibilities before trying to define an interface around it.

Flag some packages as eligable for code hashing

This will be the case when using code for a package under development: code hashes should descend into that package.

Active binding functions cause havok with dependency checking

Because of trying to avoid hitting active binding functions and triggering builds, we miss function active bindings, which are fine, and the full dependency set of functions is missed. This is not good.

Determine rule/args from command

Replace things like

  growth_mortality_traits:
    depends: download/wright_2010.txt
    rule: process_wright_2010

with

 growth_mortality_traits:
    command: process_wright_2010("download/wright_2010.txt")

or more generally

target:
  depends: [foo, bar]
  rule: func

with

target:
  command: func(foo, bar)

Retain, at least for now the current depends: / rule: interface. An explicit depends will be needed for fake targets anyway.

The inverse problem to target_argument_name appears: how to pass in the name of the target into the function. Doing that manually is of course possible:

target.csv:
  command: f("target.csv")

But better would be to allow . (following dplyr) and/or more explicitly:

target.csv:
  command: f(target_name)

language independent use cases?

It looks like it would be straightforward to use maker for working with non-R scripts as well; e.g. using commandline tool version (have you looked at docopt for that btw? It has an R package too). If/when this works with other languages, might be nice to add an example or link to the README so that folks don't think it's just for R?

list_targets to ignore all target files

I'm trying to debug something with my knitr file and I need to run knit interactively.

To make sure all my targets are available, I'd like to do something like:

e <- make_environment(list_targets())

This however doesn't quite work because list_targets also returns the target files.

Would it make sense to add an argument to list_targets so it only returns R object targets?

I'm happy to submit a PR if you think it would be useful.

download rules

Downloading files to start an analysis is likely to be fairly common.

@sckott found a cool makefile rule that uses curl to check if a URL has likely changed content: https://ryanfb.github.io/etc/2015/02/25/url_dependencies_with_make.html

Implementing this in R should be easier than the convolutions in make, especially as we have the RCurl package and things like httr. Ideally it would be possible to write

targets:
  filename:
    download: http://whatever.com/path

and apply the recipe above. In addition, we should probably not freak out if the network is not available (I am currently in Australia), and just give a warning that the check was not done.

Plot targets don't rebuild if options change

Auto generate document describing workflow from remake script.

Here's an idea, let's say you have a maker script and you want to write a document describing your analysis. Wouldn't it be nice if we could somehow generate that directly from the maker script.

It might include

names and type of objects created (vector, data.frame, list, file)
names of functions called
descriptions of either objects or functions, taken from roxygen description of function, or by parsing any commented preceding function
characteristics of standard object types, e.g. for data.frame report dimensions, names of columns, for lists report length.
dependency diagrams

The benefit of this approach would be that your documentation goes write along side your code, in form of comments in maker file or sourced R files. Being in the same file as the code, it's more likely to stay current.

Happy to brain storm some more about this if you think it's worth pursuing.

Tabular dependencies

I would like to define a map-like dependency so that the map function is only called on new input.

As an example, say we have code to get some tabular data:

load_table <- function(i) data.frame(idx=1:i, lower=letters[1:i])

and we have dependencies that perform a calculation on each row:

input: load_table(4)
output: df %>% transmute(upper=toupper(lower))

I'd like to be able to specify this dependency so that if the input changes

input: load_table(5)

our transformation is only performed on the new rows (only one call to toupper with "e" as input in this example). Is there a way to accomplish this with remake?

Templating within a remakefile

This is something suggested by @cboettig, and which I immediately ran into on dfalster/tree-p#9 (private repo currently). We have a makerfile that includes targets like:

  output/leaf_traits.csv:
    depends: leaf_traits
    rule: export_csv
    target_argument_name: filename

  output/growth_mortality_traits.csv:
    depends: growth_mortality_traits
    rule: export_csv
    target_argument_name: filename

It might be nice to be able to remove the duplication in a couple of places.

Most simply, the output directory (ideally there won't be that many file targets in a maker workflow, but repetition is bad form). The simplest option I can think of there is to use whisker, so that we'd have:

  {{output_dir}}/leaf_traits.csv:
    depends: leaf_traits
    rule: export_csv
    target_argument_name: filename

  {{output_dir}}/growth_mortality_traits.csv:
    depends: growth_mortality_traits
    rule: export_csv
    target_argument_name: filename

and then another section in the makerfile:

variables:
  output_dir: output

The only real sticking point here is that will fail miserably if a whisker variable is missing because the mustache spec says by default that missing variables result in the empty string. This issue suggests that throwing errors might be a possibility.

A more complicated form of templating (which is then more prone to odd corner cases) would be to define whole template rules. So we'd have:

output/{{filename}}.csv:
    depends: {{object}}
    rule: export_csv
    target_argument_name: filename

somehow fill that in for the two cases above.

Of course, it should be fairly easy for users to manually template their own files prior to running maker, so the simplest solution might be to trial some forms of templating outside the package and incorporate what works.

Quiet targets should print all output on error perhaps

Would address concerns brought up by @aammd here and by @dfalster here

Interactive interface (avoiding yaml)

It should be possible to run maker directly from an .R file, rather than going through yaml: all the yaml is doing is holding lists of things. Something like

m$add_target(foo <- command(bar))

maker::add_target(m, foo <- command(bar))

should work fairly well and don't look too horrible (both include the assignment symbol which I think is probably important). Doing this could also allow for loops over files, etc, but that will require some extra glue to do variable substitution.

We could also arrange to do this in "global mode" where creating a target will also create either a delayed assignment or active binding function.

List/array targets

Implement parts of the split/apply/merge pattern.

Simplest to imagine for object -> object targets. Start with an object that has a length and apply a rule to each element in the list. Merge goes the other way.

For object -> file, each element could become a file within a directory, but getting the file naming is going to be tricky.

For file -> object we could go file in directory to object, fairly easily.

For file -> file we could go from one directory to another.

Problem when sourcing entire directories in export script

if your'e maker file contains something like

sources:
  - R

and then you run

m <- maker::maker()
m$script()

the output contains

source("R")

which doesn't work. Instead need something like

tmp <- lapply(list.files("R", full.names=TRUE), source)

or use a source_dir command

allow knitr targets to depend on target aliases

Maybe I'm not thinking about this correctly, but It seems that it would be useful to support dependencies "aliases" in the knitr targets to avoid repetition. See processed_data in the example below.

sources:
  - code.R

targets:
  all:
    depends: plot.pdf

  processed_data:
     depends:
        - processed_data_1
        - processed_data_2
        - processed_data_3

  data.csv:
    command: download_data(target_name)

  processed_data_1:
    command: process_data("data1.csv")

  processed_data_2:
    command: process_data("data2.csv")

  processed_data_3:
    command: processed_data_alt("data1.csv")

  plot.pdf:
    command: myplot(processed)
    plot: true

  report.md:
    depends:
      - processed_data
    knitr: true

LaTeX target via pandoc

See here (private repo)

segfault bug for linux builds

https://travis-ci.org/USGS-R/geoknife/builds/63045728

Add function (remake_source) that loads sources and objects into global environment

Should be pretty straightforward.

File targets can't be directories

This is currently by design, but will need to be dealt with. Need to distinguish between cases where directories represent list targets (#8) to when they are something like an unpacked shapefile and pointing at a single file makes no sense.

For the first case, just need to implement list targets :). For the second case, need to recursively hash the whole directory I think. That has the potential to be very slow though, so probably worth holding off until different hash options are available.

better warning when yaml not indented properly

If one of lines in maker.yml has incorrect indenting this causes an error. But there's little pointing to actual cause in current error message, unless you know what you're looking for

Error in yaml::yaml.load(string, handlers = handlers) : 
  Parser error: while parsing a block mapping at line 4, column 1did not find expected key at line 30, column 2
Calls: <Anonymous> ... read_maker_file -> yaml_read -> yaml_load -> <Anonymous> -> .Call
Execution halted

Garbage collection

Needs adding or the .remake directory gets kind of large.

Code changes don't trigger downstream targets

So maybe I've got the wrong idea about how this is supposed to work, but here's my situation. I'm building up a remakefile by writing functions (rules) as I go, checking my results and debugging with

m <- remake(envir=.GlobalEnv)

But when I decide to edit a function that is farther upstream, the downstream targets don't get remade. For example:

sources:
  - funs.R

targets:

  targ1:
    command: fun1()

  targ2:
    command: fun2(targ1)

In this example, fun1() and fun2() are defined in funs.R. I've found that if I run remake(), edit fun1 and then rerun remake(), the target targ1 is in the same state as it was before my edit to fun1().

From this, I'd infer that the functions and/or the scripts that are sourced to create them are not treated as dependencies of the targets they create. I worked around this with a fake target:

  code: 
    depends: funs.R

But maybe that is silly?

make_script is generating script that does not load packages

dependencies for specific tasks

this may not be important but one issue to consider is whether dependencies should only be required for the tasks for which they are needed.

for example, in the model adequacy project we may want to build the dataset

m <- maker$new()
m$make("data")

however, this requires that the packages gridExtra be installed to run even though this package is only used for the plotting functions downstream and not required to actually build the data. not sure how to fix it but it might be a nice feature so that the workflow is more modularized.

Print output only on error

Related to #26, have an option for targets to print only on error. Perhaps that's the same thing?

This now exists in callr for system calls, but as we move to use callr we could use the same approach perhaps. Or a logging approach would work.

Better rendering of dependency plots ("$diagram()")

The current code is just a stub, really and won't work for real networks. For example here is an issue from a nontrivial project (dfalster/tree-p -- currently private)

and here's another from richfitz/modeladequacy:

I think the idea is sound, but dealing with networks that are a reasonable size in a moderately automatic way is going to be hard! Especially as the target names for filenames can be quite long.

More verbose output of `is_current`

In the case where an object is not current, allow reporting of why is not current.

richfitz / remake Goto Github PK

remake's Introduction

👋 Hi, I'm Rich

remake's People

Contributors

Stargazers

Watchers

Forkers

remake's Issues

Recommend Projects

Recommend Topics

Recommend Org