factual / drake Goto Github PK

View Code? Open in Web Editor NEW

1.5K 1.5K 112.0 827 KB

Data workflow tool, like a "Make for data"

License: Other

Clojure 88.37% D 0.37% Makefile 2.82% Shell 8.42% DTrace 0.03%

drake's People

Contributors

Stargazers

Watchers

Forkers

souravzzz c0mpsc1 christopherdebeer chrismetcalf jinshen-cn githubbridge branky lemonhall reckbo thethirdwheel aboytsov guoyunsky brianm poemcao dmnpignaud howech civitaslearning dchapsky iamedwardshen nadai biocowboy guillaume bluelava stanistan marshallshen gtuckerkellogg derenrich ronswanson64 arowla bluegnu ash211 nvdnkpr morrifeldman hellcoderz suzker wwwtravel trentonstrong nivertech chauncey-garrett spencerx calfzhou amalloy yilab jbn manboubird randomeffect vectart-com andrew-christianson lichia kcandrews imclab sjackman chen-factual williamlao nitorcreations asmunduhreinn yoshw silky bobohuang bahulneel alfredp fxcebx snazz2001 xxpanda omaranwarny seifer08ms dalloliogm fadibashir arinbasu yixf-self jpetimar kranjeto seregasheypak pariyat glowdb nchase vishalinvincible brandonliang kevinwkc artemzi shayrynen yencarnacion robtheoceanographer connectthefuture msmoving corlin zcdr kiraxie17 gcdr kseniakruglova tresata-opensource sesas nickengland parzivalwins teresy jguhlin hwestbro kai1819 mmmika dirten

drake's Issues

Consider a self-install script

...similar to lein, for example.

http://news.ycombinator.com/item?id=5118551

Handle conflicting command-line options

Some options conflict with each other and cannot be used together. Should be easy to handle by adding something like:

(def confiicting-options [  
  ;; scalar - neither of those can be used together
  #{:preview :print}                                      
  ;; tuple - neither one on the left can be used with one on the riht
  [ #{:help :version} #{:branch :vars :auto} ]          
])

output file is left standing on an error'd step

This is a problem if Drake is rerun after a step hard crashes in the middle of writing output. Drake will think the error'd step actually completed (since there's a recent output file).

Best solution may be that, for all in-process output files, we write to a temp output file, then mv it to final output file when full success.

Make sure filenames do not contain '#'

...as well as branch names. Handle gracefully.

support off-base files, BASE override

And also make sure trailing '/' in BASE doesn't matter for steps or for target matching.

Cygwin/Windows support

Perhaps I'm the first one to try running on Windows? It looks like filename processing isn't going well when I run under Cygwin:

% cat `which drake`
#!/usr/bin/bash
java -cp `cygpath -w ~/Downloads/drake.jar` drake.core $@

% cygpath -w $PWD
C:\Users\me\mydirectory

% drake --version
Drake Version 0.1.0

% cat workflow.d
startdat.csv <- [R]
  x <- runif(10)
  write.csv(data.frame(x=x))

% drake
The following steps will be run, in order:
  1: startdat.csv <-  [missing output]
Confirm? [y/n] y
Running 1 steps...
Invalid filename: file:C:\Users\me\mydirectory\startdat.csv

According to http://blogs.msdn.com/b/ie/archive/2006/12/06/file-uris-in-windows.aspx, the proper URI for that path would be file:///C:/Users/me/mydirectory/startdat.csv.

Individual $BASE for every filesystem

We should probably extend the notion of $BASE for every filesystem. It's convenient to have a separate working directory on the local filesystem as well as on HDFS, for example.
Something like that maybe:

Global BASE can be specified along with filesystem-specific BASEs
If the filename starts with /, BASE is not used, regardless of whether the filesystem prefix is given. If no filesystem prefix is given, the default filesystem is used (now local, but we can also add a command-line flag to specify which).
Otherwise:
1. If the filename has a filesystem prefix, global BASE is ignored and only filesystem-specific BASE is looked for. If not given, it's an error for filesystems which don't have the notion of current directory (e.g. HDFS). For local filesystem, the file is relative to the directory of the master workflow file.
2. If the filesystem prefix is not given, either global BASE or the default filesystem's BASE is used. If both are specified, it's an error.

Example:

hdfs:BASE=/tmp
file:BASE=/tmp

hdfs:a <- file:b       ; hdfs:/tmp/a <- /tmp/b
hdfs:a <- b            ; hdfs:/tmp/a <- /tmp/b
a <- b                 ; /tmp/b <- /tmp/a

file:BASE=
BASE=s3:/tmp
a <- b                 ; s3:/tmp/a <- s3:/tmp/b
a <- /b                ; s3:/tmp/a <- /b

hdfs:/a <- s3:/a       ; hdfs:/a <- s3:/a
hdfs:/a <- /a          ; hdfs:/a <- /a

file:BASE=/tmp
a <- b                 ; Error, ambiguous: s3:/tmp/a or file:/tmp/a?
/a <- /b               ; /a <- /b

support for wildcard inputs and outputs

Feature request.

Defining BASE with := is ignored.

;This works
BASE=hdfs://user/alexr/resolve-ml

;This appears to be completely ignored.
BASE:=hdfs://user/alexr/resolve-ml

The user manual recommends doing the second method, so it can be overridden from the command line.

StackOverflowError

When I run w/ my large workflow.d file I get a stack overflow exception. If need be, I can send/post the workflow.d file.

Exception in thread "main" java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.flatland.drip.Main.invoke(Main.java:117)
  at org.flatland.drip.Main.start(Main.java:88)
  at org.flatland.drip.Main.main(Main.java:64)
Caused by: java.lang.StackOverflowError
  at clojure.core$concat$fn__3804.invoke(core.clj:662)
  at clojure.lang.LazySeq.sval(LazySeq.java:42)
  at clojure.lang.LazySeq.seq(LazySeq.java:60)
  at clojure.lang.RT.seq(RT.java:473)
  at clojure.core$seq.invoke(core.clj:133)
  at clojure.core$concat$fn__3804.invoke(core.clj:662)

...

Unable to compile jar from cloned repo

I encounter the following error message of a missing dependency when I try to run lein uberjar, as instructed in the README:

: Missing:
----------
1) com.google.oauth-client:google-oauth-client:jar:${project.oauth.version}

  Try downloading the file manually from the project website.

  Then, install it using the command:
      mvn install:install-file -DgroupId=com.google.oauth-client -DartifactId=google-oauth-client -Dversion=${project.oauth.version} -Dpackaging=jar -Dfile=/path/to/file
...
  Path to dependency:
        1) org.apache.maven:super-pom:jar:2.0
        2) com.google.api-client:google-api-client:jar:1.8.0-beta
        3) com.google.oauth-client:google-oauth-client:jar:${project.oauth.version}

I've tried to manually download the jar from here, but to no avail. Any ideas how I might solve this issue?

Thanks

Hook to detect dataset changes (MD5, etc...)

Rather than forcing an update when you know there is a change to a data source, it would be nice if it could use a hook to detect that automatically.

If using a database, you could use various methods to detect changes -- with postgres, perhaps using the WAL position would be enough, and that has no overhead.

Dot graph generated from workflow files

If a dot graph could be generated from a workflow file, then, using graphwiz/gephi, we could get an intuitive image of the workflow

Automatic filename generation and making Drake even more cool

Would like to hear everyone's thought on this one.

Design, spec out and implement automated filename generation for cases where filenames are not important. We can use _ symbol to specify it. The filenames would still be persistent - they should be a function of information in the step, for example (probably in that order), the method used, other (named) outputs, tags used, or step's numeric position (worse). Even though these scheme can never guarantee changing the workflow wouldn't change the filenames, we should try to minimize these cases. Example:

_ <- input
  grep -v BAD_ENTRY $INPUT > $OUTPUT

_ <- _
  sort $INPUT > $OUTPUT

output <- _
  uniq $INPUT > $OUTPUT

Or in combination with methods:

filter()
  grep -v BAD_ENTRY $INPUT > $OUTPUT

sort()
  sort $INPUT > $OUTPUT

uniq()
  uniq $INPUT > $OUTPUT

_ <- input            filter()
_ <- _ [retries:5]    sort()
_ <- _ [my-option:66] uniq()
output <- _           filter()          ; can be used several times, why not?

It is mostly useful for very simple relationship (single input, single output), but can be used in a more complicated context as well:

output1, _ <- input        ; two outputs, don't much care about naming of the second one
   ....

_ <- _
   ....

result <- output1, _       ; referring to the output1 directly
   ....

We could even add a special symbol (+) as a shortcut for (_ <- _):

+
  grep -v BAD_ENTRY $INPUT > $OUTPUT

+ 
  sort $INPUT > $OUTPUT

+ 
  unique $INPUT > $OUTPUT

And if we relax requirement for each step to begin with a new line (which is only important when the body is defined), in combination with methods we could arrive at the following equivalent:

filter()
  grep -v BAD_ENTRY $INPUT > $OUTPUT

sort()
  sort $INPUT > $OUTPUT

unique()
  unique $INPUT > $OUTPUT

+ filter + sort + unique

And we can also introduce some rules that the very first input _ is replaced with $in environment variable, and the very last output _ with (optional) $out environment variable, then the script above could be invoked as:

drake -v in=my_input,out=my_output

and we can use Drake to create quick ad-hoc data processing pipelines without caring about naming intermediate data files.

For truly temporary files that should be deleted, we can use _?. The benefits of this is less obvious, because if the file is truly temporary, Drake will always run steps linked through such files together (there would never be a state where only one of them is up-to-date). It could still be convenient if you want a temporary file anyway, just want something else (Drake) to take care of its creation and deletion.

+1 if you like. Your feedback is appreciated.

"#" in branch hdfs path gets encoded if path already exists, causing hadoop rm command to fail

When using the --branch option and hdfs paths, it looks like the "#" symbol is encoded under the shell protocol if the path already exists, causing the hadoop rm command to fail.

Given the following workflow.d:

BASE=hdfs:/user/$[USER]/tmp

input <- !in.csv
  hadoop fs -rm -r $OUTPUT
  hadoop fs -copyFromLocal $INPUT $OUTPUT

run:
$ drake --auto --branch test +...

Running 1 steps...

--- 0. Running (forced): hdfs:/user/raronson/tmp/input#test <- in.csv
rm: `hdfs:/user/raronson/tmp/input#test': No such file or directory
--- done in 2.89s

Done (1 steps run).

$ drake --auto --branch test +...

Running 1 steps...

--- 0. Running (forced): hdfs:/user/raronson/tmp/input#test <- in.csv
rm: File does not exist: hdfs://namenode/user/raronson/tmp/input%23test
copyFromLocal: `hdfs:///user/raronson/tmp/input%23test': File exists
drake: shell command failed with exit code 1

Confirm that we need all the dependencies we pull

On the first look, it seems like we're pulling quite a lot. Do we really need all of those or some are just dead weight?

revisit filename parsing

escaping, whitespace, all the fun stuff that no one cares about... until someone does

include build information when uberjar'ing

Should include things like timestamp, branch, user, etc.

This is useful for tracking versions, e.g. print it out via --version

Consider using () for method invocation

Need your feedback, guys. I'm thinking we should have an alternative way for invoking methods just by using () similar to how we match methods in target selection from the command-line, e.g.:

my-method() [eval]
  ...

; Not just this:

output <- input [method:my-method some-option:55]

; But also this:

output <- input [my-method() some-option:55]

Why not?

revisit no-output targets and -check

Artem: "No-output targets are not run by default unless [-check] is used. This is logically consistent, but a bit inconvenient. We might need to rethink it, if a lot of people are confused with that. The default running behavior of no-input and no-output steps are described in the spec. Would love to hear your thoughts on it."

BUG: Long step defintion -> filename too long (linux fc 13)

As an example, this step fails for me:

dwi_found, t1_found, t2_found, flair_found, swi_found  <- find_dicom_folders

with

BASE=/tmp/case_id_020_20130104_1_23_2013_14_20_6.zip.dir

java.io complains "(File name too long)" when trying to write to .drake/.

(2.6.34.9-69.fc13.x86_64 #1 SMP Tue May 3 09:23:03 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux)

Add a step option to require force-rebuild

We should probably have an option that would require a force-rebuild, i.e.:

output <- input [+force]
   ...

which could be equivalent to specifying "force" evaluator (see #11):

output <- input [evaluator:force]
  ...

Also, it seems like timecheck and check option names could be a bit confusing, since one might assume they accomplish exactly that.

Can't pull dependencies

I'm new to Drake and Clojure and Leiningen, so I'm not sure how to troubleshoot this. Here's what happens when I try to build Drake:

% lein deps
Could not transfer artifact clj-logging-config:clj-logging-config:pom:1.9.6 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact fs:fs:pom:1.3.2 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact factual:jlk-time:pom:0.1 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact digest:digest:pom:1.4.0 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact slingshot:slingshot:pom:0.10.2 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact factual:fnparse:pom:2.3.0 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact factual:sosueme:pom:0.0.15 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact factual:c4:pom:0.0.8 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact hdfs-clj:hdfs-clj:pom:0.1.0 from/to clojars (https://clojars.org/repo/): peer not authenticated
This could be due to a typo in :dependencies or network issues.
Could not resolve dependencies

Does this just mean that https://clojars.org is down? If so, is there anything I can do about it, e.g. grab stuff from another server?

File-level evaluators

Related to #30, #11.

We might consider specifying evaluators on individual file (group) level rather than the step level. Use cases:

Some file should be included as step input, but should not be evaluated
Same about output - some side-effect output is created (report?) but should be ignored as far as the evaluation goes

These can be solved by excluding the files from the list of inputs/outputs and hardcoding their names into the step's body, but it complicates workflow management and goes against Drake's philosophy. Also:

Combination of timestamp and MD5 evaluators: should rebuild if the output is older than the input OR the input's checksum has changed

Proposal:

Specify evaluators for any combination of named inputs/outputs (#39).
Specify evaluator groups by using prefixes - all filenames starting with this prefix would share the same evaluator group and the same evaluator.
Files can be part of multiple evaluators.
The default evaluator is applied to the remaining (not named) files.
The end result is an OR of all evaluators used.

Example:

a, b <- c [eval:timestamp] 
  ; Standard ("timestamp") evaluator is called on 2 outputs and 1 input

a, b <- c [eval:md5]
  ; MD5 evaluator is called on 2 outputs (which it ignores) and 1 input which it 
  ; verifies for MD5 change

a, b <- c, d(x) [eval:ignore(x)]
  echo $x        # "d"
  ; Built-in "ignore" evaluator, which always returns false, is called with 0 outputs 
  ;   and 1 input
  ; Standard ("timestamp") evaluator is called on remaining 2 outputs and 1 input

a, b <- c, d(x1), e(x2) [eval:ignore,md5(x)]
  echo $x1    # "d"
  echo $x2    # "e"
  ; MD5 evaluator is called on 0 outputs and 2 inputs
  ; Remaining 2 outputs and 1 input are processed through "ignore" evaluator 
  ;   and ignored

a(t) <- b(t), c(t,x) [eval:timestamp(t),md5(x)]
  echo $x      # "c"
  echo $t      # "a b c"
  ; MD5 evaluator is called on 0 outputs and 1 input
  ; Timestamp-based evaluator is called on 1 output and 2 input
  ; The step will run either if c's checksum has changed, or if b, c or both are 
  ;   fresher than a

We can also add syntactic sugar to specify evaluators directly in filenames without assigning them variables:

a <- b, c(eval:m5)
  ; MD5 will be run on c, a and b will be compared by timestamps

I'm sure there's more to it and I've just scratched the surface. For example, options that alter behavior of evaluators (check, timecheck, and #38) should be applied to groups instead. One idea is to get rid of these options altogether, but specify different evaluator flavors instead, which could be more consistent.

--version should halt

One user reported this behavior:

drake --version
Drake Version 0.1.0
Target not found: ...

This is weird. --version should not try to run any tagets.

stdout and stderr of task is being interleaved

This is what it should look like (when I run it standalone):

Hist('p0iso_tot') ph_analysis_isolation[0] / 1e3
Traceback (most recent call last):
  File "./histograms.py", line 69, in <module>
    main()
  File "./histograms.py", line 66, in main
    make_iso_plots(t, 0, sel)
  File "./histograms.py", line 35, in make_iso_plots
    print hist, "mgg", sel(*s)
NameError: global name 'hist' is not defined

This is what I get from drake (it's reproducible):

--- 0. Running (missing output): histograms.root <- data.root, histograms.py
�T[r?a1c0e3b4ahcHki s(tm(o'spt0 irseoc_etnott ')c aplhl_ alnaasltys):i
s _ isFoillaet i"o.n/[h0i]s t/o g1rea3m
s.py", line 69, in <module>
    main()
  File "./histograms.py", line 66, in main
    make_iso_plots(t, 0, sel)
  File "./histograms.py", line 35, in make_iso_plots
    print hist, "mgg", sel(*s)
NameError: global name 'hist' is not defined
drake: shell command failed with exit code 1

Somehow the output is getting interleaved. I also see unprintable characters on the terminal.

resources/regtest/test_jar/ is missing from Drake repository

Probably disappeared when we migrated the repo from D to Drake. It is required by regtest_protocol_eval.

Aaron, could you also please re-run the whole regtest suite with run-all.sh? I don't have HDFS access.

address slow startup time

Nailgun won't work for multiple runs, unless we use --auto to avoid cli interaction. Related to how we're dealing with stdin.

java.lang.NullPointerException
at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:296)
at d.core$user_confirms_QMARK_.invoke(core.clj:54)

That's where we call read-line-stdin to get user confirmation. The first run under nailgun works fine. Any next run will immediately get a nil returned from read-line-stdin. As a test, i tried calling read-line-stdin multiple times at that stage, and all calls get a nil back immediately, w/o the user ever having the chance to enter anything.

Update the spec with --print and --preview

Not documented now.

Cannot get drake to build due to missing jlk/time dependency

I am trying to build drake behind a firewall that has a Sonatype nexus proxying Clojars, Central, et al. For some reason, I can build the uberjars and run drake fine, but it barfs when I try to start a repl:

Could not find metadata jlk:time:0.1-SNAPSHOT/maven-metadata.xml in clojars-snapshots (http://hsdgrnbrg.XXXX/nexus/content/repositories/clojars-snapshots)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in Internal central (http://hsdgrnbrg.XXXX/nexus/content/repositories/central)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in Internal clojars (http://hsdgrnbrg.XXXX/nexus/content/repositories/clojars)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in clojars-snapshots (http://hsdgrnbrg.XXXX/nexus/content/repositories/clojars-snapshots)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in internal-nexus (http://hsdgrnbrg.XXXX/nexus/content/repositories/releases)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in foursquareapijava (http://foursquare-api-java.googlecode.com/svn/repository)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in Internal central (http://hsdgrnbrg.XXXX/nexus/content/repositories/central)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in Internal clojars (http://hsdgrnbrg.XXXX/nexus/content/repositories/clojars)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in clojars-snapshots (http://hsdgrnbrg.XXXXnexus/content/repositories/clojars-snapshots)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in internal-nexus (http://hsdgrnbrg.XXXX/nexus/content/repositories/releases)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in foursquareapijava (http://foursquare-api-java.googlecode.com/svn/repository)
Check :dependencies and :repositories for typos.
It's possible the specified jar is not in any repository.
If so, see "Free-floating Jars" under http://j.mp/repeatability
Exception in thread "Thread-1" clojure.lang.ExceptionInfo: Could not resolve dependencies {:exit-code 1}
    at clojure.core$ex_info.invoke(core.clj:4227)
    at leiningen.core.classpath$get_dependencies.doInvoke(classpath.clj:128)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
    at clojure.lang.AFn.applyToHelper(AFn.java:163)
    at clojure.lang.RestFn.applyTo(RestFn.java:132)
    at clojure.core$apply.invoke(core.clj:605)
    at leiningen.core.classpath$resolve_dependencies.doInvoke(classpath.clj:144)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
    at leiningen.core.eval$prep.invoke(eval.clj:60)
    at leiningen.core.eval$eval_in_project.invoke(eval.clj:220)
    at leiningen.repl$start_server.doInvoke(repl.clj:65)
    at clojure.lang.RestFn.invoke(RestFn.java:470)
    at leiningen.repl$repl$fn__1788.invoke(repl.clj:145)
    at clojure.lang.AFn.applyToHelper(AFn.java:159)
    at clojure.lang.AFn.applyTo(AFn.java:151)
    at clojure.core$apply.invoke(core.clj:601)
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1771)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
    at clojure.lang.AFn.applyToHelper(AFn.java:163)
    at clojure.lang.RestFn.applyTo(RestFn.java:132)
    at clojure.core$apply.invoke(core.clj:605)
    at clojure.core$bound_fn_STAR_$fn__3984.doInvoke(core.clj:1793)
    at clojure.lang.RestFn.invoke(RestFn.java:397)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:722)

Named input and output files

The way it stands now, multiple inputs are put into INPUT1, INPUT2, ... etc. which is convenient for simple steps, but can get complicated with more than a couple inputs, and also makes editing and re-using steps' code harder. It would be nice if users were able to give steps' files names. Named files will be excluded from automatic (INPUTX) variables and be put in separate environment variables. Several steps can share the same name - then they're concatenated via space and put into one variable. Example:

a(y), b(x), c, d <- e(y), f(z), g(z), h
  echo $INPUTN       ;; "1"
  echo $INPUT1       ;; "h"
  echo $OUTPUTN      ;; "2"
  echo $OUTPUT1      ;; "c"
  echo $OUTPUT2      ;; "d"
  echo $x            ;; "b"
  echo $y            ;; "a e"
  echo $z            ;; "f g"

Parse error when filenames contain equals character

Here's a minimal example:

spain.today <- spain.dt=2013-01-29.in
    cat $INPUT >$OUTPUT

You might think having an equals sign in a filename is rare, but it's how Hive names each folder for a partition.

I couldn't find a way to escape the equals character. My natural guesses, using a backslash or enclosing the whole filename in quotes, didn't work.

No blank lines allowed in code blocks

I've been working with Drake on a few projects, and ran into this issue. This works:

%hello <-
    x=1
    echo $x

but not this

%hello <-
    x=1

    echo $x

Adding that newline gives a largish syntax error when we run drake -a %hello

java.lang.IllegalStateException: drake parse error at line 4, column 1: Illegal syntax starting with "EOF" for workflow
    at drake.parser_utils$throw_parse_error.invoke(parser_utils.clj:47)
    at drake.parser_utils$illegal_syntax_error_fn$fn__3010.invoke(parser_utils.clj:66)
    at drake.parser$parse_state$fn__787.invoke(parser.clj:594)
    at name.choi.joshua.fnparse$rule_match.invoke(fnparse.clj:433)
    at drake.parser$parse_state.invoke(parser.clj:590)
    at drake.parser$parse_str.invoke(parser.clj:600)
    at drake.parser$parse_file.invoke(parser.clj:605)
    at drake.core$with_workflow_file.invoke(core.clj:456)
    at drake.core$_main.doInvoke(core.clj:659)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at drake.core.main(Unknown Source)
java.lang.IllegalStateException: drake parse error at line 4, column 1: Illegal syntax starting with "EOF" for workflow
    at drake.parser_utils$throw_parse_error.invoke(parser_utils.clj:47)
    at drake.parser_utils$illegal_syntax_error_fn$fn__3010.invoke(parser_utils.clj:66)
    at drake.parser$parse_state$fn__787.invoke(parser.clj:594)
    at name.choi.joshua.fnparse$rule_match.invoke(fnparse.clj:433)
    at drake.parser$parse_state.invoke(parser.clj:590)
    at drake.parser$parse_str.invoke(parser.clj:600)
    at drake.parser$parse_file.invoke(parser.clj:605)
    at drake.core$with_workflow_file.invoke(core.clj:456)
    at drake.core$_main.doInvoke(core.clj:659)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at drake.core.main(Unknown Source)

If it's not too hard to support, it'd be great to allow newlines in the code blocks, since they help so much with readability whenever you have a block that's more than a few lines.

factor c4 out of Drake

better suited as an external lib?

Support output and input directories, not just part-????? files

In the form described in the spec, or some other form. Maybe it should just be the default behavior if a directory is specified instead of a file. See also the outstanding comment in the Filenames section.

I'm not sure I have the bandwidth to take it on right now. Any takers? I will gladly review the code.

This feature seems to be required for using Drake with Hive.

HDFS targets show in confirmation even if not needed

Terminal output here: http://pastebin.com/J08GAk1Y

drake has already run once, to completion. No files have been modified. drake correctly notices this as skips all the steps. Why does it still say it is going to do the steps?

All the steps involve at least one hdfs location. A very similar workflow that was all local didn't exhibit this same behavior.

Check for filesystem desynchronization

When running on more than one file system, check that they are sychronized (within some margin). That margin should probably be --step-delay. See #15.

Support passing in "--" args in run-intepreter

We have some interpreters where we'd like to pass several options to them before the script to configure them for running a script off disk, and several options after a -- to allow the user to pass extra options to their script.

I think that adding an :args key to the step map in run-interpreter would be a fine way to communicate this, and then change the (apply shell... command to look like:

    (apply shell (concat [interpreter]
                         args
                         [script-filename]
                         (:args step) ;CHANGE HERE
                          [:env vars
                          :die true
                          :out [System/out (writer (log-file step "stdout"))]
                          :err [System/err (writer (log-file step "stderr"))]]))

Then, it would be up to the particular handler for the language to decide how to support args, whether it needs to include a -- form, and any other of that kind of decision.

Create a JAR for download, update the docs

Drake doesn't need double slashes when referring to HDFS

Our documentation suggests it does. Aaron, can you confirm that it needs it? I'm looking at the regression tests for HDFS and they don't use double slash, but they still run OK.

What's going on here? We either need to fix Drake or fix the docs.

Support parameter passing to methods

At the moment, the only way to get variables/parameters into a method is something like:

my_method() [shell]
  echo $FOO

FOO=bar
my_output <- my_input [method:my_method]

This is really non-obvious and it would be much cleaner to do this in a more standard way such as:

my_method(arg0) [shell]
  echo $$arg0

my_output <- my_input [method:my_method("foo")]

...in this case using $$ to denote a method parameter and not a variable. I think I saw some mention of this in the Google Doc spec.

Support make "suffix rules" (aka template rules)

Assume a bunch of files in a directory whose names all follow the same pattern (for example "[0-9]+.html"). Each file name is basically a string of characters, followed by some dot delimited suffix.

I want to run the same set of steps for all the . delimited files (ie like a metaworkflow). In Make there is the concept of a suffix rule, where you can use a suffix to define a general set of actions to run. For example:
.cc.o:
$(CXX) $(CXXFLAGS) -c $<

tells make how to build .cc.o files (where "$<" is a special macro which stands for ".cc").

Something similar for Drake would be useful.

Checksum-based dependency evaluation

There was a request for this.

Related to #11 (support general evaluator hooks).

It could be baked into Drake directly. This evaluator would ignore the timestamps of the input and output files, and only re-run the step if MD5 of the step's inputs have changed since the last run. The MD5's would probably have to be created alongside the files (input.md5-drake or something like that), and will need to be moved/renamed with the files when branching, backups, etc. Forced rebuild should probably update the MD5s?

HDFS file existence check failing

I'm using Hadoop 1.0.3 on an AWS Elastic-MapReduce cluster. I compiled drake with [org.apache.hadoop/hadoop-core "1.0.3"] set in project.clj.

When I set about trying to write this minimal recipe to copy a file to HDFS, it doesn't recognise that the file exists after the first successful run.

hdfs:///user/hadoop/myfile.txt <- myfile.txt
    if hadoop fs -test -e $OUTPUT; then
        hadoop fs -rm $OUTPUT
    fi
    hadoop fs -put $INPUT $OUTPUT

So, no matter how many times this is run, each run gives:

The following steps will be run, in order:
  1: hdfs:///user/hadoop/myfile.txt <- myfile.txt [missing output]
Confirm? [y/n]

Even though the output is there. This looks related to #15

Timestamps appear to be reliable and in sync. That is, HDFS is reporting the timestamp of the output as being fresher than the input (at least, on the command-line).

hadoop@hadoop-master:~$ hadoop fs -ls myfile.txt
Found 1 items
-rw-r--r--   2 hadoop supergroup          6 2013-01-30 01:16 /user/hadoop/myfile.txt
hadoop@hadoop-master:~$ ls -l myfile.txt
-rw-r--r-- 1 hadoop hadoop 6 Jan 30 01:05 myfile.txt
hadoop@hadoop-master:~$ drake -a
Running 1 steps...

--- 0. Running (missing output): hdfs:///user/hadoop/myfile.txt <- myfile.txt
Deleted hdfs://10.117.143.22:9000/user/hadoop/myfile.txt
Step Duration Secs: 11

Done (1 steps run).

Any suggestions for a workaround?

Support wildcard inputs & outputs (globbing)

An example is if you had a directory structure like logs/year/month/part-files and wanted to only process jan - mar from every year. the pattern would be logs/*/0[1-3]

Include should work on lexical level

See description at the end of the designdoc. Same applies to $(...).

Switch to 1ms timestamp resolution?

It seems that at least on OS X, we're using 1s timestamp resolution. We're requesting in milliseconds, but getting back only thousands in return (see all numbers end with 000):

Timestamp checking, inputs: [{:path "/tmp/drake-test/hdfs_1", :mod-time 1359528216000, :directory false} {:path "/tmp/drake-test/hdfs_2", :mod-time 1359528237000, :directory false}], outputs: [{:path "/tmp/drake-test/merged_hdfs", :mod-time 1359528242000, :directory false}]
Newest input: 1359528237000, oldest output: 1359528242000
Running 2 steps...
Timestamp checking, inputs: [{:path "/Users/artem/drake/resources/regtest/local_1", :mod-time 1359528216000, :directory false} {:path "/Users/artem/drake/resources/regtest/local_2", :mod-time 1359528245000, :directory false}], outputs: [{:path "/Users/artem/drake/resources/regtest/merged_local", :mod-time 1359528224000, :directory false}]
Newest input: 1359528245000, oldest output: 1359528224000

I'm pretty sure HFS plus is capable of much higher resolution, so I'm not sure what's going on.

I've added --step-delay flag in feature/vvv branch (ee833c5), to make regression tests pass.

filter-with-grep() [eval]
  grep -v "$CODE" $INPUT > $OUTPUT

output <- input [method:filter-with-grep method-mode:append]
  regexp-matching-bad-entries-to-be-removed

But it should be possible to do this:

filter-with-grep() [eval method-mode:append]
  grep -v "$CODE" $INPUT > $OUTPUT

output <- input [method:filter-with-grep]
  regexp-matching-bad-entries-to-be-removed

In other words, all the step's checks should happen after options and variables are merged with the method's definition, not before.