
factotum's People

Contributors

alexanderdean, jbeemster, ninjabear


factotum's Issues

Add converter from Makefile to a Factotum factfile

We need this so we can start migrating makefiles over to Factotum.

$ factotum job make2factotum FILE

where FILE is a minimalistic Makefile, containing only the following elements:

  • Variables
  • Rules

An example of a minimalistic Makefile that will be convertible:

pipeline=acme

done: run-web-sql
    /notify-v0.2.0.sh $(pipeline) "Completed successfully"

start:
    /notify-v0.2.0.sh $(pipeline) "Started"
check-lock: start
    /check-lock.sh $(pipeline)
emr-etl-runner: check-lock
    /r73/emr-etl-runner-r73-rc2.sh $(pipeline) && /notify-v0.2.0.sh $(pipeline) "Ran EmrEtlRunner"
storage-loader: emr-etl-runner
    /storage-loader-r73-rc2.sh $(pipeline) && /notify-v0.2.0.sh $(pipeline) "Ran StorageLoader"
run-dedupe-sql: storage-loader
    /sql-runner-0.2.0-lock.sh $(pipeline) dedupe
run-web-sql: run-dedupe-sql
    /sql-runner-0.2.0-lock.sh $(pipeline) web
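
As a sketch, the Makefile above might convert to a factfile along these lines, using the schema proposed in the "Define job schema" ticket below. The variables-to-assignments mapping and the step layout are assumptions, and only the first two rules are shown:

```json
{
  "schema": "iglu:com.snowplowanalytics.factotum/job/jsonschema/1-0-0",
  "data": {
    "name": "acme",
    "assignments": { "pipeline": "acme" },
    "steps": [
      {
        "name": "start",
        "type": "shell",
        "command": "/notify-v0.2.0.sh",
        "arguments": [ "{{ $.assignments.pipeline }}", "Started" ],
        "dependsOn": []
      },
      {
        "name": "check-lock",
        "type": "shell",
        "command": "/check-lock.sh",
        "arguments": [ "{{ $.assignments.pipeline }}" ],
        "dependsOn": [ "start" ]
      }
    ]
  }
}
```

Each remaining rule would follow the same pattern: the target becomes the step name, and each prerequisite becomes an entry in that step's dependsOn.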

Decide how to handle a no-op step

At least to start with, we want to exit with code 0 in the case of a no-op. My question is more: in the case of a no-op, do we behave make -k-style and process all other possible items before exiting ("lazy quit"), or do we just complete anything that was already running and then quit ("eager quit")? I suppose the third option is that we kill anything else that is running ("rage quit").

Do we make this configurable on a per job basis?

Containers / Job provisioning

After having a think about this, there are a couple of options for how to provision a job:

  1. Include the provisioning as a prerequisite task in the DAG
    • this will work with no changes, though it's a bit clunky
  2. Add a "provision" step at the root of the project which checks that a box is adequately provisioned at the start of a run (a list of Ansible playbooks, perhaps)
  3. After thinking about 2), I think we could just as easily specify an optional Docker container at the top level of the factfile. Factotum would then download this container and run the jobs from inside it.

Docker containers would be pretty cool, since I could send a factfile to anyone with an internet connection (within reason) and they could run the job locally by having Factotum download (and cache) the Docker image. Provisioning would be done when building the image - it could also have a prebuilt Factotum at the right version baked in (much the way sbt works). Docker also allows resource constraints (such as allowing only 1 CPU, X MB of memory, etc.), which would save us from replicating the same controls somewhere else later.

This ticket is a placeholder - I don't think it's quite ready to roadmap yet.

Support "forward references" of steps in factfiles

Currently a step can only depend on a step which the parser has previously seen. For example:

step a
step b 
step c <-- can only depend on steps b or a, not d
step d   

This simplifies the parser logic, but restricts the file layout somewhat. Factotum should also not panic when this kind of forward reference occurs (which, as of 0.1.0, it does).
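
For illustration, here is a steps fragment that the 0.1.0 parser rejects, because "c" references "d" before it is defined. Field names follow the "Define job schema" ticket; this is only a sketch:

```json
"steps": [
  { "name": "c", "type": "shell", "command": "echo", "arguments": [ "c" ], "dependsOn": [ "d" ] },
  { "name": "d", "type": "shell", "command": "echo", "arguments": [ "d" ], "dependsOn": [] }
]
```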

Deployment and setup

As it's open source, it's probably prudent to have a cursory look at how we'll do this.

I think some of the good options are:

  1. deb/rpm and a repository / ppa
    • apt-get install factotum
  2. bintray / zip of binaries
  3. use cargo / distribute via source
    • cargo install factotum

I think 1. is the gold standard where possible, but it carries a bit of overhead and is very Linux-specific.

Add ability to resume a failed job

Resumption should be stateless - the operator will need to specify which step to resume from.

In the case that a single step is ambiguous, the operator must provide a comma-separated list of steps.

A single step will be ambiguous if that step is on an execution branch which has one or more sibling execution branches. It is ambiguous because Factotum cannot know what progress was made on the sibling branch(es).

Placeholder for shell tasks which expand into inner DAGs

This functionality is in support of:

Per the above tickets, some command-line tools will provide a command-line option to yield an inner Factotum DAG that can then be run from inside a parent Factotum DAG. This is a key plank of rule 2 of the Zen of Factotum:

A job must be composable from other jobs

Current thinking (feedback welcome) is as follows:

  • Add a new DAG "task"(?) type - this needs further discussion, as Factotum 0.1.0 just assumes that all tasks are plain tasks (not inner DAGs)
  • Add support for the DAG task type to be populated at runtime using a shell command (we should futureproof the schema design so that we can also support the simpler use case where an inner DAG is held in a file, see #40, or KV-store)
  • When Factotum starts to run the job, it will evaluate the shell command which will yield the inner DAG in Factotum DAG format via stdout and return 0
  • Factotum will validate the inner DAG to ensure that it's a valid Factotum DAG
  • Factotum will then expand the parent DAG to include the inner DAG at the correct point

In the integration tests for this feature, we should write a shell script which emits a simple Factotum DAG for expansion into a parent DAG, to validate that this is working correctly.

See also #40 for the sibling ticket where the simpler case of an inner DAG being retrievable from a static file is considered.
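
The integration-test helper described above could look something like the sketch below: a shell command that yields an inner Factotum DAG on stdout and exits 0. The schema URI and field names follow the "Define job schema" ticket; everything else is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: emit a minimal inner Factotum DAG on stdout.
# A parent DAG task would invoke this and expand the result in place.
emit_inner_dag() {
  cat <<'EOF'
{
  "schema": "iglu:com.snowplowanalytics.factotum/job/jsonschema/1-0-0",
  "data": {
    "name": "inner-dag",
    "steps": [
      {
        "name": "hello",
        "type": "shell",
        "command": "echo",
        "arguments": [ "hello from the inner DAG" ],
        "dependsOn": []
      }
    ]
  }
}
EOF
}

# Yield the inner DAG via stdout and return 0, as the ticket proposes.
emit_inner_dag
```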

Define actions on steps

Taking a look at #2, each step seems to be a shell execution. Looking at Airflow, one of the nice things they have is the ability to execute different types of jobs. Gradle, Maven, SBT and the like support this via plugins. It'd be pretty cool to have a plugin-style system that could express steps in a more straightforward way.

Support Windows

Windows will require a bit of thought, as the shell runner (sh) doesn't exist unless running in Cygwin or similar.

Define job schema

Current working idea:

{
  "schema": "iglu:com.snowplowanalytics.factotum/job/jsonschema/1-0-0",
  "data": {
    "name": "My First DAG",
    "arguments": [ "clientTag" ],
    "assignments": {
      "configDir": "/opt/mt-configs2/{{ $.arguments.clientTag }}",
      "scriptDir": "/opt/mt-scripts/acme"
    },
    "steps": [
      {
        "name": "EmrEtlRunner",
        "type": "shell",
        "command": "{{ $.assignments.scriptDir }}/acme-emr-etl-runner.sh",
        "arguments": [ "{{ $.assignments.configDir }}" ],
        "dependsOn": [],
        "response": {
          "noOp": [ 3 ]
        }
      },
      {
        "name": "StorageLoader",
        "type": "shell",
        "command": "{{ $.assignments.scriptDir }}/acme-storage-loader.sh",
        "arguments": [ "{{ $.assignments.configDir }}" ],
        "dependsOn": [ "EmrEtlRunner" ]
      },
      {
        "name": "SQL Runner",
        "type": "shell",
        "command": "/opt/sql-runner-0.2.0/sql-runner",
        "arguments": [ "--playbook", "{{ $.assignments.configDir }}/sql-runner/playbooks/stage-1.yml", "--sqlroot", "{{ $.assignments.configDir }}/sql-runner/sql" ],
        "dependsOn": [ "StorageLoader" ]
      }
    ]
  }
}
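
The {{ $.assignments.* }} placeholders suggest a simple string substitution before each step runs. A minimal sketch of that expansion, under the assumption that it is plain text replacement (real Factotum would use a proper template engine; the sed-based expansion below is only illustrative):

```shell
#!/bin/sh
# Hypothetical sketch of resolving a {{ $.assignments.* }} placeholder.
clientTag="acme"                            # would come from "arguments"
configDir="/opt/mt-configs2/${clientTag}"   # an "assignments" entry built from an argument
scriptDir="/opt/mt-scripts/acme"            # a literal "assignments" entry

command='{{ $.assignments.scriptDir }}/acme-emr-etl-runner.sh'
resolved=$(printf '%s' "$command" | sed "s|{{ \$\.assignments\.scriptDir }}|$scriptDir|")
echo "$resolved"   # -> /opt/mt-scripts/acme/acme-emr-etl-runner.sh
```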

Improve efficiency of execution order

[Image: factotum_tree_inefficient - a dependency tree annotated with execution batches]

The image shows how Factotum executes steps: the numbers and colours indicate which steps are run together. In this case the total run time could be decreased considerably if step 4 did not have to wait for step 2 to complete. This happens because steps 2 and 3 are in the same running batch, which waits for both to finish.

Add support for run constraints

This is the idea that Factotum should only execute if a "run constraint" specified on the CLI resolves to true.

This is a powerful feature (taken from Snowplow's internal executor), which lets you have the same Factotum invocation in the cron on multiple boxes, but only one box will execute the job.

Example cronfile:

BOX1=box1.acme.internal
BOX2=box2.acme.internal

/opt/factotum-0.4.0/factotum first-dag.factotum --constraint "host,${BOX1}"
/opt/factotum-0.4.0/factotum second-dag.factotum --constraint "host,${BOX2}"

To start with, we only allow one constraint; the only allowed constraint key is host, and the implied check is ==.
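
The evaluation could look something like the sketch below. The --constraint flag and the "host,<hostname>" format come from this ticket; the function name and error handling are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of evaluating a "host" run constraint.
check_constraint() {
  key="${1%%,*}"        # part before the first comma, e.g. "host"
  expected="${1#*,}"    # part after the first comma, e.g. "box1.acme.internal"
  if [ "$key" != "host" ]; then
    echo "unsupported constraint key: $key" >&2
    return 2
  fi
  # The implied check is ==: only run when this box is the named host.
  [ "$(hostname)" = "$expected" ]
}

if check_constraint "host,$(hostname)"; then
  echo "constraint satisfied: running job"
fi
```

With this shape, the same cron entry can sit on every box: only the box whose hostname matches the constraint actually runs the job; the others exit quietly.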
