snowplow / factotum
A system to programmatically run data pipelines
Home Page: http://snowplowanalytics.com/blog/2016/04/09/introducing-factotum-data-pipeline-runner/
I think the underlying graph crate may support this out of the box
We need this so we can start migrating makefiles over to Factotum.
$ factotum job make2factotum FILE
where FILE is a minimalistic Makefile, containing only variable assignments, targets with dependencies, and shell commands.
Example of a minimalistic Makefile that will be convertible:
pipeline=acme

done: run-web-sql
  /notify-v0.2.0.sh $(pipeline) "Completed successfully"

start:
  /notify-v0.2.0.sh $(pipeline) "Started"

check-lock: start
  /check-lock.sh $(pipeline)

emr-etl-runner: check-lock
  /r73/emr-etl-runner-r73-rc2.sh $(pipeline) && /notify-v0.2.0.sh $(pipeline) "Ran EmrEtlRunner"

storage-loader: emr-etl-runner
  /storage-loader-r73-rc2.sh $(pipeline) && /notify-v0.2.0.sh $(pipeline) "Ran StorageLoader"

run-dedupe-sql: storage-loader
  /sql-runner-0.2.0-lock.sh $(pipeline) dedupe

run-web-sql: run-dedupe-sql
  /sql-runner-0.2.0-lock.sh $(pipeline) web
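A minimal sketch (in Rust, with illustrative names such as Rule) of the parse step such a converter could start from, assuming the restricted dialect above:

use std::collections::HashMap;

struct Rule {
    target: String,
    depends_on: Vec<String>,
    command: String,
}

fn parse(makefile: &str) -> (HashMap<String, String>, Vec<Rule>) {
    let mut vars = HashMap::new();
    let mut rules: Vec<Rule> = Vec::new();
    for raw in makefile.lines() {
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        if raw.starts_with(char::is_whitespace) {
            // an indented line is the command for the most recent target
            if let Some(rule) = rules.last_mut() {
                rule.command = line.to_string();
            }
        } else if let Some((target, deps)) = line.split_once(':') {
            // a rule header, e.g. "storage-loader: emr-etl-runner"
            rules.push(Rule {
                target: target.trim().to_string(),
                depends_on: deps.split_whitespace().map(str::to_string).collect(),
                command: String::new(),
            });
        } else if let Some((name, value)) = line.split_once('=') {
            // a variable assignment, e.g. "pipeline=acme"
            vars.insert(name.trim().to_string(), value.trim().to_string());
        }
    }
    (vars, rules)
}

Each Rule then maps naturally onto a factfile step: the target becomes the step name, the dependencies become dependsOn, and the command line becomes the shell command.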
This should display the current binary version (and should handle release candidates too)
At least to start with, we want to exit with code 0 in the case of a no-op. My question is more: in the case of a no-op, do we behave make -k-style and process all other possible items before exiting ("lazy quit"), or do we just complete anything that was already running and quit ("eager quit")? I suppose the third option is that we just kill anything else that is running ("rage quit").
Do we make this configurable on a per-job basis?
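For reference, the three strategies could be modelled as a simple per-job setting; a sketch with illustrative names only:

// The three candidate no-op termination strategies (names illustrative).
#[allow(dead_code)]
enum NoOpPolicy {
    LazyQuit,  // like make -k: process everything still runnable, then exit
    EagerQuit, // let in-flight steps finish, but schedule nothing new
    RageQuit,  // kill anything still running and exit immediately
}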
Need to come up with a JSON Schema for this
This wasn't a very popular feature when we mocked it up; +1 this ticket if you want it...
After having a think about this, there are a couple of options for how to provision a job:
add a "provision" step at the root of the project which checks a box is adequately provisioned at the start of a run (a list of ansible playbooks perhaps)
After thinking about 2), I think we could just as easily specify an optional Docker container at the top level of the factfile. Factotum would then download this container and run the job's steps from inside it.
Docker containers would be pretty cool, since I could send a factfile to anyone with an internet connection (within reason) and they could run the job locally by having Factotum download (and cache) the Docker image. Provisioning would be done by building the image; it could also bake in a prebuilt Factotum at the right version (much the way sbt works). Docker also allows resource constraints (such as allowing only 1 CPU, X MB of memory, etc.), which would stop us having to replicate the same controls somewhere else later.
This ticket is a placeholder - I don't think it's quite ready to roadmap yet.
Currently a step can only depend on a step which the parser has previously seen. For example:
step a
step b
step c <-- can only depend on steps b or a, not d
step d
This simplifies the parser logic, but restricts the file layout somewhat. Factotum should also not panic if this kind of forward reference occurs (which as of 0.1.0 it does).
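A minimal sketch of a two-pass check that would lift the restriction and report an error instead of panicking (the step representation here is illustrative):

use std::collections::HashSet;

// Collect every step name first, then validate dependency references,
// so a step may depend on one declared later in the file.
fn check_dependencies(steps: &[(String, Vec<String>)]) -> Result<(), String> {
    let known: HashSet<&str> = steps.iter().map(|(name, _)| name.as_str()).collect();
    for (name, deps) in steps {
        for dep in deps {
            if !known.contains(dep.as_str()) {
                return Err(format!("step '{}' depends on unknown step '{}'", name, dep));
            }
        }
    }
    Ok(())
}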
As it's open source, it's probably prudent to have a cursory look at how we'll do this.
I think some of the good options are:
1. apt-get install factotum
2. cargo install factotum
I think 1. is the gold standard where possible, but it carries a bit of overhead and is very Linux-specific.
Resumption should be stateless - the operator will need to specify which step to resume from.
In the case that a single step is ambiguous, the operator must provide a comma-separated list of steps.
A single step will be ambiguous if that step is on an execution branch which has one or more sibling execution branches. It is ambiguous because Factotum cannot know what progress was made on the sibling branch(es).
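A minimal sketch of that ambiguity rule, assuming steps are held as a name-to-dependsOn map (all names illustrative):

use std::collections::{HashMap, HashSet};

// A resume step is unambiguous only if every other step is either upstream
// of it (assumed complete) or downstream of it (about to run). Anything
// else sits on a sibling branch whose progress Factotum cannot know.
fn is_ambiguous(deps: &HashMap<String, Vec<String>>, resume: &str) -> bool {
    let upstream = reachable(deps, resume, true);
    let downstream = reachable(deps, resume, false);
    deps.keys().any(|step| {
        step.as_str() != resume
            && !upstream.contains(step.as_str())
            && !downstream.contains(step.as_str())
    })
}

// Transitive dependencies (up = true) or dependents (up = false) of start.
fn reachable(deps: &HashMap<String, Vec<String>>, start: &str, up: bool) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut stack = vec![start.to_string()];
    while let Some(step) = stack.pop() {
        let next: Vec<String> = if up {
            deps.get(&step).cloned().unwrap_or_default()
        } else {
            deps.iter()
                .filter(|(_, ds)| ds.contains(&step))
                .map(|(name, _)| name.clone())
                .collect()
        };
        for n in next {
            if seen.insert(n.clone()) {
                stack.push(n);
            }
        }
    }
    seen
}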
This functionality is in support of:
Per the above tickets, some command-line tools will provide a command-line option to yield an inner Factotum DAG that can then be run from inside a parent Factotum DAG. This is a key plank of rule 2 of the Zen of Factotum:
A job must be composable from other jobs
Current thinking (feedback welcome) is as follows: the wrapped tool should write the inner Factotum DAG to stdout and return 0.
In the integration tests for this feature, we should write a shell script which emits a simple Factotum DAG for inclusion in a parent DAG, to validate that this is working correctly.
See also #40 for the sibling ticket where the simpler case of an inner DAG being retrievable from a static file is considered.
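For reference, a sketch of the retrieval side, assuming the convention above (inner DAG on stdout, exit code 0):

use std::process::Command;

// Run the wrapped tool; if it exits 0, treat its stdout as the inner factfile.
fn fetch_inner_dag(program: &str, args: &[&str]) -> Result<String, String> {
    let output = Command::new(program)
        .args(args)
        .output()
        .map_err(|e| format!("failed to start '{}': {}", program, e))?;
    if !output.status.success() {
        return Err(format!("'{}' exited with {}", program, output.status));
    }
    String::from_utf8(output.stdout).map_err(|e| format!("stdout was not UTF-8: {}", e))
}

The integration test then only needs a script that prints a valid factfile and exits 0.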
Support deploys to Bintray using vagrant push
Depends on https://github.com/snowplow/makefile-rs
See other projects
Standard Vagrant setup similar to other projects
This is split off from #25 as it requires a bit of investigation and isn't critical
The idea builds on #45, but instead of manually specifying boxes, you would just let all boxes attempt to run all jobs they find in their cron, and there would be a distributed lock in e.g. Consul so that only one box is "elected" to run the job.
Taking a look at #2, each step seems to be a shell execution. Looking at Airflow, one of the nice things it has is the ability to execute different types of jobs. Gradle, Maven, SBT and the like support this via plugins. It'd be pretty cool to have a plugin-style system that could express steps in a more straightforward way.
Windows will require a bit of thought, as the shell runner (sh) doesn't exist unless running in Cygwin or similar.
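One possible shape, a sketch only, assuming cmd /C is an acceptable stand-in for sh -c on Windows:

// Pick the shell invocation per platform at compile time.
#[cfg(windows)]
fn shell_invocation() -> (&'static str, &'static str) {
    ("cmd", "/C")
}

#[cfg(not(windows))]
fn shell_invocation() -> (&'static str, &'static str) {
    ("sh", "-c")
}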
It's quite nice how they've done this with serde, using travis-cargo. Both Travis and Coveralls are free for open-source projects.
So as soon as one run of the job has finished, start the next one.
Need to figure out what should happen on a failure - keep looping or die?
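A minimal sketch of the loop, with the failure behaviour left as a hypothetical stop_on_failure setting:

// run_job stands in for a full Factotum execution; stop_on_failure covers
// the open question above ("keep looping or die").
fn run_in_a_loop(mut run_job: impl FnMut() -> bool, stop_on_failure: bool) {
    loop {
        let succeeded = run_job();
        if !succeeded && stop_on_failure {
            break; // the "die" option; otherwise keep looping
        }
    }
}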
I think @ninjabear and I discussed this last week and agreed that v0.1.0 didn't need this (because I can start testing this with pipelines with hardcoded variables)...
Current working idea:
{
  "schema": "iglu:com.snowplowanalytics.factotum/job/jsonschema/1-0-0",
  "data": {
    "name": "My First DAG",
    "arguments": [ "clientTag" ],
    "assignments": {
      "configDir": "/opt/mt-configs2/{{ $.arguments.clientTag }}",
      "scriptDir": "/opt/mt-scripts/acme"
    },
    "steps": [
      {
        "name": "EmrEtlRunner",
        "type": "shell",
        "command": "{{ $.assignments.scriptDir }}/acme-emr-etl-runner.sh",
        "arguments": [ "{{ $.assignments.configDir }}" ],
        "dependsOn": [],
        "response": {
          "noOp": [ 3 ]
        }
      },
      {
        "name": "StorageLoader",
        "type": "shell",
        "command": "{{ $.assignments.scriptDir }}/acme-storage-loader.sh",
        "arguments": [ "{{ $.assignments.configDir }}" ],
        "dependsOn": [ "EmrEtlRunner" ]
      },
      {
        "name": "SQL Runner",
        "type": "shell",
        "command": "/opt/sql-runner-0.2.0/sql-runner",
        "arguments": [ "--playbook", "{{ $.assignments.configDir }}/sql-runner/playbooks/stage-1.yml", "--sqlroot", "{{ $.assignments.configDir }}/sql-runner/sql" ],
        "dependsOn": [ "StorageLoader" ]
      }
    ]
  }
}
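The {{ $.arguments.x }} and {{ $.assignments.x }} references could be resolved with plain string substitution; a minimal sketch, assuming no nesting or escaping is needed:

use std::collections::HashMap;

// Replace each "{{ $.<scope>.<name> }}" placeholder with its value.
fn render(template: &str, scope: &str, values: &HashMap<String, String>) -> String {
    let mut out = template.to_string();
    for (name, value) in values {
        let placeholder = format!("{{{{ $.{}.{} }}}}", scope, name);
        out = out.replace(&placeholder, value);
    }
    out
}

Given the assignments above, render("{{ $.assignments.scriptDir }}/acme-storage-loader.sh", "assignments", &assignments) would yield /opt/mt-scripts/acme/acme-storage-loader.sh.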
Imagine this Factfile step:
TODO
And imagine this is the referenced SQL Runner playbook:
TODO
Then Factotum would pre-process this into the following inner DAG:
TODO
BINTRAY_SNOWPLOW_GENERIC_USER
BINTRAY_SNOWPLOW_GENERIC_API_KEY
Working with rustc-serialize is a bit tedious and has a few limitations. We should move to serde once the good codegen support lands in stable Rust.
Use Vault as the back-end?
As jobspec sounds like it could be a test (spec) for a job...
This can wait till next release. Here's the Scala/SBT equivalent:
https://github.com/snowplow/huskimo/blob/master/vagrant/push.bash#L54
This will need some code as it's inexpressible at the JSON Schema level.
The above image on the right displays how Factotum executes steps; the numbers and colours indicate what is run with what. In this case the total run-time could be decreased considerably if 4 didn't wait for 2 to complete. This happens because 2 and 3 are in the same running batch, which waits for both to finish.
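To make the gain concrete, here is a sketch that computes dependency-driven start times (durations hypothetical): each step starts as soon as its own dependencies finish, so with the diagram's shape, 4 would begin when 3 completes even while 2 is still running.

use std::collections::HashMap;

// A step starts at the latest finish time among its own dependencies,
// with no batch barrier. Assumes an acyclic graph in which every step
// (including every dependency) appears as a key.
fn start_times(
    deps: &HashMap<&str, Vec<&str>>,
    duration: &HashMap<&str, u32>,
) -> HashMap<String, (u32, u32)> {
    fn finish(
        step: &str,
        deps: &HashMap<&str, Vec<&str>>,
        duration: &HashMap<&str, u32>,
        times: &mut HashMap<String, (u32, u32)>,
    ) -> u32 {
        if let Some(&(_, done)) = times.get(step) {
            return done;
        }
        let start = deps[step]
            .iter()
            .map(|&d| finish(d, deps, duration, &mut *times))
            .max()
            .unwrap_or(0);
        let done = start + duration[step];
        times.insert(step.to_string(), (start, done));
        done
    }
    let mut times = HashMap::new();
    let steps: Vec<&str> = deps.keys().copied().collect();
    for step in steps {
        finish(step, deps, duration, &mut times);
    }
    times
}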
Including printing task errors to stderr, not stdout
In self-describing JSON format?
This is the simpler case of #39 - instead of having to execute a shell command to retrieve an inner DAG, we just retrieve it from a factfile on local disk.
This is the idea that Factotum should only execute if a "run constraint" specified on the CLI resolves to true.
This is a powerful feature (taken from Snowplow's internal executor), which lets you have the same Factotum invocation in the cron on multiple boxes, but only one box will execute the job.
Example cronfile:
BOX1=box1.acme.internal
BOX2=box2.acme.internal
/opt/factotum-0.4.0/factotum first-dag.factotum --constraint "host,${BOX1}"
/opt/factotum-0.4.0/factotum second-dag.factotum --constraint "host,${BOX2}"
To start with, we only allow one constraint; the only allowed constraint is host, and the implied check is ==.
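A sketch of the check itself, assuming the "host,<value>" syntax above and shelling out to the hostname binary to identify the box:

use std::process::Command;

// Returns Ok(true) only if this box's hostname equals the constraint value.
fn constraint_satisfied(constraint: &str) -> Result<bool, String> {
    let (key, expected) = constraint
        .split_once(',')
        .ok_or("constraint must look like 'host,box1.acme.internal'")?;
    if key != "host" {
        return Err(format!("unsupported constraint '{}'", key));
    }
    let output = Command::new("hostname")
        .output()
        .map_err(|e| format!("could not run hostname: {}", e))?;
    let actual = String::from_utf8_lossy(&output.stdout);
    // the only check implied for now is equality
    Ok(actual.trim() == expected)
}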
Should Factotum factfiles be:
Relates to #36