snowplow / factotum
A system to programmatically run data pipelines
Home Page: http://snowplowanalytics.com/blog/2016/04/09/introducing-factotum-data-pipeline-runner/
I think the underlying graph crate may support this out of the box
We need this so we can start migrating makefiles over to Factotum.
$ factotum job make2factotum FILE
where FILE is a minimalistic Makefile, containing only variable assignments, targets with dependencies, and shell commands.
Example of a minimalistic Makefile that will be convertible:
pipeline=acme

done: run-web-sql
  /notify-v0.2.0.sh $(pipeline) "Completed successfully"

start:
  /notify-v0.2.0.sh $(pipeline) "Started"

check-lock: start
  /check-lock.sh $(pipeline)

emr-etl-runner: check-lock
  /r73/emr-etl-runner-r73-rc2.sh $(pipeline) && /notify-v0.2.0.sh $(pipeline) "Ran EmrEtlRunner"

storage-loader: emr-etl-runner
  /storage-loader-r73-rc2.sh $(pipeline) && /notify-v0.2.0.sh $(pipeline) "Ran StorageLoader"

run-dedupe-sql: storage-loader
  /sql-runner-0.2.0-lock.sh $(pipeline) dedupe

run-web-sql: run-dedupe-sql
  /sql-runner-0.2.0-lock.sh $(pipeline) web
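A minimal sketch (in Rust, with illustrative names such as Rule) of the parse step such a converter could start from, assuming the restricted dialect above:

use std::collections::HashMap;

struct Rule {
    target: String,
    depends_on: Vec<String>,
    command: String,
}

fn parse(makefile: &str) -> (HashMap<String, String>, Vec<Rule>) {
    let mut vars = HashMap::new();
    let mut rules: Vec<Rule> = Vec::new();
    for raw in makefile.lines() {
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        if raw.starts_with(char::is_whitespace) {
            // an indented line is the command for the most recent target
            if let Some(rule) = rules.last_mut() {
                rule.command = line.to_string();
            }
        } else if let Some((target, deps)) = line.split_once(':') {
            // a rule header, e.g. "storage-loader: emr-etl-runner"
            rules.push(Rule {
                target: target.trim().to_string(),
                depends_on: deps.split_whitespace().map(str::to_string).collect(),
                command: String::new(),
            });
        } else if let Some((name, value)) = line.split_once('=') {
            // a variable assignment, e.g. "pipeline=acme"
            vars.insert(name.trim().to_string(), value.trim().to_string());
        }
    }
    (vars, rules)
}

Each Rule then maps naturally onto a factfile step: the target becomes the step name, the dependencies become dependsOn, and the command line becomes the shell command.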
This should display the current binary version (and should handle release candidates too)
At least to start with, we want to exit with code 0 in the case of a no-op. My question is more: in the case of a no-op, do we behave make -k-style and process all other possible items before exiting ("lazy quit"), or do we just complete anything that was already running and quit ("eager quit")? I suppose the third option is that we just kill anything else that is running ("rage quit").
Do we make this configurable on a per-job basis?
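For reference, the three strategies could be modelled as a simple per-job setting; a sketch with illustrative names only:

// The three candidate no-op termination strategies (names illustrative).
#[allow(dead_code)]
enum NoOpPolicy {
    LazyQuit,  // like make -k: process everything still runnable, then exit
    EagerQuit, // let in-flight steps finish, but schedule nothing new
    RageQuit,  // kill anything still running and exit immediately
}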
Need to come up with a JSON Schema for this
This wasn't a very popular feature when we mocked it up; +1 this ticket if you want it...
After having a think about this, there are a couple of options for how to provision a job:
add a "provision" step at the root of the project which checks a box is adequately provisioned at the start of a run (a list of ansible playbooks perhaps)
After thinking about 2), I think we could just as easily specify an optional Docker container at the top level of the factfile. Factotum would then download this container and run the job's steps from inside it.
Docker containers would be pretty cool, since I could send a factfile to anyone with an internet connection (within reason) and they could run the job locally by having Factotum download (and cache) the Docker image. Provisioning would be done by building the image; it could also bake in a prebuilt Factotum at the right version (much the way sbt works). Docker also allows resource constraints (such as allowing only 1 CPU, X MB of memory, etc.), which would stop us having to replicate the same controls somewhere else later.
This ticket is a placeholder - I don't think it's quite ready to roadmap yet.
Currently a step can only depend on a step which the parser has previously seen. For example:
step a
step b
step c <-- can only depend on steps b or a, not d
step d
This simplifies the parser logic, but restricts the file layout somewhat. Factotum should also not panic if this kind of forward reference occurs (which as of 0.1.0 it does).
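A minimal sketch of a two-pass check that would lift the restriction and report an error instead of panicking (the step representation here is illustrative):

use std::collections::HashSet;

// Collect every step name first, then validate dependency references,
// so a step may depend on one declared later in the file.
fn check_dependencies(steps: &[(String, Vec<String>)]) -> Result<(), String> {
    let known: HashSet<&str> = steps.iter().map(|(name, _)| name.as_str()).collect();
    for (name, deps) in steps {
        for dep in deps {
            if !known.contains(dep.as_str()) {
                return Err(format!("step '{}' depends on unknown step '{}'", name, dep));
            }
        }
    }
    Ok(())
}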
As it's open source, it's probably prudent to have a cursory look at how we'll do this.
I think some of the good options are:
1. apt-get install factotum
2. cargo install factotum
I think 1. is the gold standard where possible, but it carries a bit of overhead and is very Linux-specific.
Resumption should be stateless - the operator will need to specify which step to resume from.
In the case that a single step is ambiguous, the operator must provide a comma-separated list of steps.
A single step will be ambiguous if that step is on an execution branch which has one or more sibling execution branches. It is ambiguous because Factotum cannot know what progress was made on the sibling branch(es).
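A minimal sketch of that ambiguity rule, assuming steps are held as a name-to-dependsOn map (all names illustrative):

use std::collections::{HashMap, HashSet};

// A resume step is unambiguous only if every other step is either upstream
// of it (assumed complete) or downstream of it (about to run). Anything
// else sits on a sibling branch whose progress Factotum cannot know.
fn is_ambiguous(deps: &HashMap<String, Vec<String>>, resume: &str) -> bool {
    let upstream = reachable(deps, resume, true);
    let downstream = reachable(deps, resume, false);
    deps.keys().any(|step| {
        step.as_str() != resume
            && !upstream.contains(step.as_str())
            && !downstream.contains(step.as_str())
    })
}

// Transitive dependencies (up = true) or dependents (up = false) of start.
fn reachable(deps: &HashMap<String, Vec<String>>, start: &str, up: bool) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut stack = vec![start.to_string()];
    while let Some(step) = stack.pop() {
        let next: Vec<String> = if up {
            deps.get(&step).cloned().unwrap_or_default()
        } else {
            deps.iter()
                .filter(|(_, ds)| ds.contains(&step))
                .map(|(name, _)| name.clone())
                .collect()
        };
        for n in next {
            if seen.insert(n.clone()) {
                stack.push(n);
            }
        }
    }
    seen
}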
This functionality is in support of:
Per the above tickets, some command-line tools will provide a command-line option to yield an inner Factotum DAG that can then be run from inside a parent Factotum DAG. This is a key plank of rule 2 of the Zen of Factotum:
A job must be composable from other jobs
Current thinking (feedback welcome) is as follows: the wrapped tool should write the inner Factotum DAG to stdout and return 0.
In the integration tests for this feature, we should write a shell script which emits a simple Factotum DAG for inclusion in a parent DAG, to validate that this is working correctly.
See also #40 for the sibling ticket where the simpler case of an inner DAG being retrievable from a static file is considered.
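For reference, a sketch of the retrieval side, assuming the convention above (inner DAG on stdout, exit code 0):

use std::process::Command;

// Run the wrapped tool; if it exits 0, treat its stdout as the inner factfile.
fn fetch_inner_dag(program: &str, args: &[&str]) -> Result<String, String> {
    let output = Command::new(program)
        .args(args)
        .output()
        .map_err(|e| format!("failed to start '{}': {}", program, e))?;
    if !output.status.success() {
        return Err(format!("'{}' exited with {}", program, output.status));
    }
    String::from_utf8(output.stdout).map_err(|e| format!("stdout was not UTF-8: {}", e))
}

The integration test then only needs a script that prints a valid factfile and exits 0.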
Support deploys to Bintray using vagrant push
Depends on https://github.com/snowplow/makefile-rs
See other projects
Standard Vagrant setup similar to other projects
This is split off from #25 as it requires a bit of investigation and isn't critical
The idea builds on #45, but instead of manually specifying boxes, you would just let all boxes attempt to run all jobs they find in their cron, and there would be a distributed lock in e.g. Consul so that only one box is "elected" to run the job.
Taking a look at #2, each step seems to be a shell execution. Looking at Airflow, one of the nice things it has is the ability to execute different types of jobs. Gradle, Maven, SBT and the like support this via plugins. It'd be pretty cool to have a plugin-style system that could express steps in a more straightforward way.
Windows will require a bit of thought, as the shell runner (sh) doesn't exist unless running in Cygwin or similar.
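One possible shape, a sketch only, assuming cmd /C is an acceptable stand-in for sh -c on Windows:

// Pick the shell invocation per platform at compile time.
#[cfg(windows)]
fn shell_invocation() -> (&'static str, &'static str) {
    ("cmd", "/C")
}

#[cfg(not(windows))]
fn shell_invocation() -> (&'static str, &'static str) {
    ("sh", "-c")
}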
It's quite nice how they've done this with serde, using travis-cargo. Both Travis and Coveralls are free for open-source projects.
So as soon as one run of the job has finished, start the next one.
Need to figure out what should happen on a failure - keep looping or die?
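A minimal sketch of the loop, with the failure behaviour left as a hypothetical stop_on_failure setting:

// run_job stands in for a full Factotum execution; stop_on_failure covers
// the open question above ("keep looping or die").
fn run_in_a_loop(mut run_job: impl FnMut() -> bool, stop_on_failure: bool) {
    loop {
        let succeeded = run_job();
        if !succeeded && stop_on_failure {
            break; // the "die" option; otherwise keep looping
        }
    }
}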
I think @ninjabear and I discussed this last week and agreed that v0.1.0 didn't need this (because I can start testing this with pipelines with hardcoded variables)...
Current working idea:
{
  "schema": "iglu:com.snowplowanalytics.factotum/job/jsonschema/1-0-0",
  "data": {
    "name": "My First DAG",
    "arguments": [ "clientTag" ],
    "assignments": {
      "configDir": "/opt/mt-configs2/{{ $.arguments.clientTag }}",
      "scriptDir": "/opt/mt-scripts/acme"
    },
    "steps": [
      {
        "name": "EmrEtlRunner",
        "type": "shell",
        "command": "{{ $.assignments.scriptDir }}/acme-emr-etl-runner.sh",
        "arguments": [ "{{ $.assignments.configDir }}" ],
        "dependsOn": [],
        "response": {
          "noOp": [ 3 ]
        }
      },
      {
        "name": "StorageLoader",
        "type": "shell",
        "command": "{{ $.assignments.scriptDir }}/acme-storage-loader.sh",
        "arguments": [ "{{ $.assignments.configDir }}" ],
        "dependsOn": [ "EmrEtlRunner" ]
      },
      {
        "name": "SQL Runner",
        "type": "shell",
        "command": "/opt/sql-runner-0.2.0/sql-runner",
        "arguments": [ "--playbook", "{{ $.assignments.configDir }}/sql-runner/playbooks/stage-1.yml", "--sqlroot", "{{ $.assignments.configDir }}/sql-runner/sql" ],
        "dependsOn": [ "StorageLoader" ]
      }
    ]
  }
}
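The {{ $.arguments.x }} and {{ $.assignments.x }} references could be resolved with plain string substitution; a minimal sketch, assuming no nesting or escaping is needed:

use std::collections::HashMap;

// Replace each "{{ $.<scope>.<name> }}" placeholder with its value.
fn render(template: &str, scope: &str, values: &HashMap<String, String>) -> String {
    let mut out = template.to_string();
    for (name, value) in values {
        let placeholder = format!("{{{{ $.{}.{} }}}}", scope, name);
        out = out.replace(&placeholder, value);
    }
    out
}

Given the assignments above, render("{{ $.assignments.scriptDir }}/acme-storage-loader.sh", "assignments", &assignments) would yield /opt/mt-scripts/acme/acme-storage-loader.sh.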
Imagine this Factfile step:
TODO
And imagine this is the referenced SQL Runner playbook:
TODO
Then Factotum would pre-process this into the following inner DAG:
TODO
BINTRAY_SNOWPLOW_GENERIC_USER
BINTRAY_SNOWPLOW_GENERIC_API_KEY
Working with rustc-serialize is a bit tedious and has a few limitations. We should move to serde once the good codegen support lands in stable Rust.
Use Vault as the back-end?
As jobspec sounds like it could be a test (spec) for a job...
This can wait till next release. Here's the Scala/SBT equivalent:
https://github.com/snowplow/huskimo/blob/master/vagrant/push.bash#L54
This will need some code as it's inexpressible at the JSON Schema level.
The above image on the right displays how Factotum executes steps; the numbers and colours indicate what is run with what. In this case the total run-time could be decreased considerably if 4 didn't wait for 2 to complete. This happens because 2 and 3 are in the same running batch, which waits for both to finish.
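To make the gain concrete, here is a sketch that computes dependency-driven start times (durations hypothetical): each step starts as soon as its own dependencies finish, so with the diagram's shape, 4 would begin when 3 completes even while 2 is still running.

use std::collections::HashMap;

// A step starts at the latest finish time among its own dependencies,
// with no batch barrier. Assumes an acyclic graph in which every step
// (including every dependency) appears as a key.
fn start_times(
    deps: &HashMap<&str, Vec<&str>>,
    duration: &HashMap<&str, u32>,
) -> HashMap<String, (u32, u32)> {
    fn finish(
        step: &str,
        deps: &HashMap<&str, Vec<&str>>,
        duration: &HashMap<&str, u32>,
        times: &mut HashMap<String, (u32, u32)>,
    ) -> u32 {
        if let Some(&(_, done)) = times.get(step) {
            return done;
        }
        let start = deps[step]
            .iter()
            .map(|&d| finish(d, deps, duration, &mut *times))
            .max()
            .unwrap_or(0);
        let done = start + duration[step];
        times.insert(step.to_string(), (start, done));
        done
    }
    let mut times = HashMap::new();
    let steps: Vec<&str> = deps.keys().copied().collect();
    for step in steps {
        finish(step, deps, duration, &mut times);
    }
    times
}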
Including printing task errors to stderr, not stdout
In self-describing JSON format?
This is the simpler case of #39 - instead of having to execute a shell command to retrieve an inner DAG, we just retrieve it from a factfile on local disk.
This is the idea that Factotum should only execute if a "run constraint" specified on the CLI resolves to true.
This is a powerful feature (taken from Snowplow's internal executor), which lets you have the same Factotum invocation in the cron on multiple boxes, but only one box will execute the job.
Example cronfile:
BOX1=box1.acme.internal
BOX2=box2.acme.internal
/opt/factotum-0.4.0/factotum first-dag.factotum --constraint "host,${BOX1}"
/opt/factotum-0.4.0/factotum second-dag.factotum --constraint "host,${BOX2}"
To start with, we only allow one constraint; the only allowed constraint is host, and the implied check is ==.
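A sketch of the check itself, assuming the "host,<value>" syntax above and shelling out to the hostname binary to identify the box:

use std::process::Command;

// Returns Ok(true) only if this box's hostname equals the constraint value.
fn constraint_satisfied(constraint: &str) -> Result<bool, String> {
    let (key, expected) = constraint
        .split_once(',')
        .ok_or("constraint must look like 'host,box1.acme.internal'")?;
    if key != "host" {
        return Err(format!("unsupported constraint '{}'", key));
    }
    let output = Command::new("hostname")
        .output()
        .map_err(|e| format!("could not run hostname: {}", e))?;
    let actual = String::from_utf8_lossy(&output.stdout);
    // the only check implied for now is equality
    Ok(actual.trim() == expected)
}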
Should Factotum factfiles be:
Relates to #36