Introduction

⚠️ Cook Scheduler Development Has Ceased

After seven years of developing Cook Scheduler we have made the decision to archive the project. Cook will remain available on GitHub in archive mode but no further development will occur.

When Cook was open sourced it solved difficult problems in on-premises, capacity-constrained data centers. Today, however, the embrace of the public cloud has changed the problems that need to be solved. This shift is also reflected in slowing community contribution to Cook and the emergence of many other open source projects in this space. Given this, it no longer makes sense for us to maintain Cook as an open source project.

We are thankful for the opportunity to have shared Cook with the community and grateful for your contributions. Two Sigma remains committed to supporting open source software. You can find out more about our other projects and contributions here: https://www.twosigma.com/open-source/.

Cook Scheduler

Welcome to Two Sigma's Cook Scheduler!

What is Cook?

  • Cook is a powerful batch scheduler, specifically designed to provide a great user experience when there are more jobs to run than your cluster has capacity for.
  • Cook is able to intelligently preempt jobs to ensure that no user ever waits long for answers, while simultaneously helping you achieve 90%+ utilization for massive workloads.
  • Cook has been battle-hardened to automatically recover after dozens of classes of cluster failures.
  • Cook can act as a Spark scheduler, and it comes with a REST API, Java client, Python client, and CLI.

Core concepts is a good place to start to learn more.
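As a concrete illustration of the REST API mentioned above, a minimal job description might be built like this. The field names are taken from the example job JSON later on this page; the payload envelope is an assumption, so consult the scheduler subproject's docs for the real endpoint details.

```python
import json
import uuid

def make_job(command, cpus=0.5, mem=32):
    """Build a minimal Cook job description (sketch).

    Field names follow the example job JSON elsewhere on this page;
    defaults here are illustrative.
    """
    return {
        "uuid": str(uuid.uuid4()),
        "name": "cookjob",
        "command": command,
        "cpus": cpus,
        "mem": mem,               # MiB
        "max_retries": 3,
        "max_runtime": 86400000,  # milliseconds
        "priority": 50,
        "env": {},
    }

# The Java/Python clients wrap this kind of payload construction for you.
payload = json.dumps({"jobs": [make_job("echo hello")]})
```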

Releases

Check the changelog for release info.

Subproject Summary

In this repository, you'll find several subprojects, each of which has its own documentation.

  • scheduler - This is the actual Mesos framework, Cook. It comes with a JSON REST API.
  • jobclient - This includes the Java and Python APIs for Cook, both of which use the REST API under the hood.
  • spark - This contains the patch to Spark to enable Cook as a backend.

Please visit the scheduler subproject first to get started.

Quickstart

Using Google Kubernetes Engine (GKE)

The quickest way to get Cook running locally against GKE is with Vagrant.

  1. Install Vagrant
  2. Install Virtualbox
  3. Clone down this repo
  4. Run GCP_PROJECT_NAME=<gcp_project_name> PGPASSWORD=<random_string> vagrant up --provider=virtualbox to create the dev environment
  5. Run vagrant ssh to ssh into the dev environment

In your Vagrant dev environment

  1. Run gcloud auth login to login to Google cloud
  2. Run bin/make-gke-test-clusters to create GKE clusters
  3. Run bin/start-datomic.sh to start Datomic (Cook database) (Wait until "System started datomic:free://0.0.0.0:4334/, storing data in: data")
  4. Run lein exec -p datomic/data/seed_k8s_pools.clj $COOK_DATOMIC_URI to seed some Cook pools in the database
  5. Run bin/run-local-kubernetes.sh to start the Cook scheduler
  6. Cook should now be listening locally on port 12321

To test a simple job submission:

  1. Run cs submit --pool k8s-alpha --cpu 0.5 --mem 32 --docker-image gcr.io/google-containers/alpine-with-bash:1.0 ls to submit a simple job
  2. Run cs show <job_uuid> to show the status of your job (it should eventually show Success)
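The cs CLI wraps the REST API; the status check above could be sketched roughly as follows. The endpoint path and the response shape are assumptions based on the example job JSON elsewhere on this page, not a documented contract.

```python
def job_status_url(base, job_uuid):
    # Hypothetical endpoint path; check the scheduler subproject's
    # REST API docs for the real one.
    return f"{base}/rawscheduler?job={job_uuid}"

def summarize(job):
    # Collapse a job document (shaped like the example JSON on this
    # page) into a one-line status string.
    statuses = [i["status"] for i in job.get("instances", [])]
    return f"{job['uuid']}: {job['status']} (instances: {', '.join(statuses) or 'none'})"
```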

To run automated tests:

  1. Run lein test :all-but-benchmark to run unit tests
  2. Run cd ../integration && pytest -m 'not cli' to run integration tests
  3. Run cd ../integration && pytest tests/cook/test_basic.py -k test_basic_submit -n 0 -s to run a particular integration test

Using Mesos

The quickest way to get Mesos and Cook running locally is with docker and minimesos.

  1. Install docker
  2. Clone down this repo
  3. cd scheduler
  4. Run bin/build-docker-image.sh to build the Cook scheduler image
  5. Run ../travis/minimesos up to start Mesos and ZooKeeper using minimesos
  6. Run bin/run-docker.sh to start the Cook scheduler
  7. Cook should now be listening locally on port 12321

Contributing

In order to accept your code contributions, please fill out the appropriate Contributor License Agreement in the cla folder and submit it to [email protected].

Disclaimer

Apache Mesos is a trademark of The Apache Software Foundation. The Apache Software Foundation is not affiliated, endorsed, connected, sponsored or otherwise associated in any way to Two Sigma, Cook, or this website in any manner.

© Two Sigma Open Source, LLC

Cook's People

Contributors

ahaysx, bolina, brianbao, calebhar12, cge0516, daowen, dependabot[bot], dgrnbrg, diegoalbertotorres, dposada, gerrymanoim, icexelloss, jhn, kathryn-zhou, laurameng, leifwalsh, lewisheadden, mayurjpatel, mforsyth, nsinkov, pschorf, rmanyari, samincheva, scrosby, shamsimam, sophaskins, sradack, wenbozhao, wyegelwel, yueri


Cook's Issues

Move to Metrics library in mesos/monitor.clj to standardize metrics reporting

In mesos/monitor.clj, we currently report metrics on users' waiting/running jobs/cpus/mem by sending Riemann events directly. Instead, we should:

(1) Have a chime process query the database and store the results in atoms/async channels
(2) Have a go-loop that watches the atoms/async channels and registers/deregisters gauges accordingly

Per discussion with @dgrnbrg

Add Spark parameters for configuring Cook binding

Besides setting the CPUs and memory for each executor, we should be able to specify additional URIs or environment variables for the executor to retrieve, and the minimum number of running executors to wait for before computation starts.

Cannot start cook with dev/prod datomic

The issue is that dev/prod Datomic needs the metatransaction jar, but currently metatransaction lives inside the scheduler project, so I cannot compile a standalone metatransaction jar.

Add support for other databases

The first step here is determining the types of queries we do. This issue should be updated with the current list:

  • Find all jobs of a particular status
  • Find all non-terminal instances
  • Query a particular job or instance by ID

The status-related queries require secondary indices, but we could change instance IDs to be [jobid instanceid] pairs so that we only need to implement lookup by job ID; then we'd just store a "document" with the full job & instance state.
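A toy sketch of that layout: instances are embedded in the job document and addressed by (job id, instance id), so a single primary-key lookup serves both job and instance queries. All names here are illustrative, not Cook's actual API.

```python
# job_id -> job document, the only primary index this scheme requires
store = {}

def put_job(job_id, doc):
    store[job_id] = doc

def get_job(job_id):
    return store[job_id]

def get_instance(job_id, instance_id):
    # Instance lookup reuses the job lookup; no secondary index needed,
    # because the instance ID is the (job_id, instance_id) pair.
    job = store[job_id]
    return next(i for i in job["instances"] if i["id"] == instance_id)
```

The status queries ("all jobs of a particular status", "all non-terminal instances") are the ones that would still need a secondary index on top of this.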

Still to be analyzed:

  • What's the impact on the metrics reporter's user stats?
  • How critical are the transaction functions? Could we change them to run locally, or all be CAS-based?
  • Can we refactor the use of the Datomic txn log tailer to be totally local, core.async, and per-process?

Document where libmesos is

libmesos.so / libmesos.dylib (Linux/Mac) are usually in /usr/lib or /usr/local/lib, but in my case they were in $MESOS_BUILD_DIR/src/.libs/ -- we need to understand why and document each edge case.

Unable to start server from checkout

I checked out a8e1c67 and tried to run lein run dev-config.edn, but it failed because of missing dependencies. It seems the tags referenced on line 318 of components are not defined anywhere. I commented out that expression, but then got another error about cook.reporter on line 324 and commented that expression out as well. After those changes I was able to run.

I was able to get it all running in less than 30 minutes, including pulling dependencies and debugging this. Thanks for making it easy.

Cook shouldn't change instance status to failed unless it knows the task has failed

Instance status should reflect reality. However, we currently change instance status to failed in order to kill a task. This is not ideal: when a user sees instance status = failed in Cook, the task should actually have failed, i.e. Cook received task-failed/task-error (or maybe task-lost in some cases) from Mesos.

The places we currently change instance status to failed are:
(1) To preempt a task
(2) To kill a task due to heartbeat timeout

@dgrnbrg let me know what you think

Provide an HTTP based job tracking endpoint

Users would like to be able to query Cook for the list of their running and waiting jobs. We've discussed this at length internally, but I'd like to bring this to the open source community for design, review, and implementation.

Benchmark time to schedule a workload

This will give us an idea of how long it should take to start some number of jobs, of various sizes.

The motivation is to understand how long it should take to launch a Spark cluster, so that we can figure out how multitenancy affects this, and if something special is needed.

Running job status not updated in mesos 0.23 and Cook

I submitted a job via Cook to a Mesos 0.23 cluster. Everything seems to have worked fine, but instances[0].status and framework_id are not getting set. On the Mesos page, I do see the job as running and the Cook scheduler as a registered framework.

[
  {
    "mem": 16,
    "max_retries": 3,
    "max_runtime": 86400000,
    "name": "cookjob",
    "command": "while [ true ]; do echo hello cook I am \"$(whoami)\" and MY_VAR=\"${MY_VAR}\"; sleep 10; done",
    "env": {
      "MY_VAR": "foo1"
    },
    "framework_id": null,
    "instances": [
      {
        "start_time": 1444169356373,
        "task_id": "cd66e79b-9272-4d54-bbd3-e89cff8c78c0",
        "hostname": "some.host.domain.com",
        "slave_id": "20151006-201511-738201772-5050-93146-S8",
        "executor_id": "cd66e79b-9272-4d54-bbd3-e89cff8c78c0",
        "status": "unknown"
      }
    ],
    "priority": 50,
    "status": "waiting",
    "uuid": "f76aa5bd-e4bb-4ef3-9ad4-5b2938efc0fd",
    "uris": null,
    "cpus": 0.5
  }
]

Document Scheduler configuration

Make sure that we have a sample dev config (should work out of the box) & prod config (should have comments to explain some choices).

This also should have details on the recommended production JVM options, and why to use them (Datomic using extra heap as cache, debugging GC pauses, etc).

All options should be documented in the asciidoc.

Zookeeper needed for dev-config

In running lein run dev-config.edn I get

2015-09-21 19:21:17,472:22178(0x116b06000):ZOO_ERROR@handle_socket_error_msg@1697: Socket [::1:2181] zk retcode=-4, errno=61(Connection refused): server refused to accept the client
2015-09-21 19:21:17,472:22178(0x116b06000):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:2181] zk retcode=-4, errno=61(Connection refused): server refused to accept the client

which resolves once I start running a Zookeeper locally

2015-09-21 19:21:20,806:22178(0x116b06000):ZOO_INFO@check_events@1703: initiated connection to server [fe80::1:2181]
2015-09-21 19:21:21,191:22178(0x116b06000):ZOO_INFO@check_events@1750: session establishment complete on server [fe80::1:2181], sessionId=0x14ff236287a0000, negotiated timeout=10000
I0921 19:21:21.191696 327958528 group.cpp:313] Group process (group(1)@127.0.0.1:56667) connected to ZooKeeper
I0921 19:21:21.191776 327958528 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0921 19:21:21.191836 327958528 group.cpp:385] Trying to create path '/mesos' in ZooKeeper

Is this expected? From the documentation:

"Cook is written in Clojure. To develop Cook, all you need is a JVM and Mesos installed and configured. Cook will automatically start embedded copies of the rest of its dependencies."

I thought I would not need any dependencies when running in dev mode.

Add federation to Cook REST API

Here's an example of what the config file could look like:

 :federation {:remotes ["http://localhost:12322"]
              :priviledged-principal "admin"
              :threads 4
              :circuit-breaker {:failure-threshold 0
                                :lifetime-ms 60000
                                :response-timeout-ms 60000
                                :reset-timeout-ms 60000
                                :failure-logger-size 10000}}

lein uberjar from scheduler subdir failed

ljin@hsljin:~/ws/github/Cook/scheduler$ lein uberjar
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 5555; nested exception is:
java.net.BindException: Address already in use
Compilation failed: Subprocess failed

Update spark to latest build

This includes using the new Cook environment variable and URI APIs, integrating the latest code into Spark 1.5, and documenting the instructions for building off Spark 1.5.

This should also add support so that the URI uses either Basic auth or Kerberos, depending on whether the URI is of the form cook://user:pass@host:port or simply cook://host:port.
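The URI-based dispatch could be sketched as follows, assuming the cook:// form follows standard URL syntax (credentials present means Basic auth, absent means Kerberos):

```python
from urllib.parse import urlparse

def auth_mode(uri):
    """Pick an auth scheme from a cook:// URI (sketch).

    cook://user:pass@host:port -> HTTP Basic with those credentials
    cook://host:port           -> Kerberos
    """
    parsed = urlparse(uri)
    if parsed.username and parsed.password:
        return ("basic", parsed.username, parsed.password)
    return ("kerberos",)
```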

Add support for host constraints

This should be for things like "only on hosts w/ a specific attribute". This will enable things like GPU or machine class aware scheduling.

This will need to be added to the client-facing API, as well as to the scheduler & db.

Document how to build cook with datomic pro

Currently, Datomic free edition jars are available in public Maven repos, so lein is happy building against them. But to use Datomic Pro, one has to mvn install the licensed jars into the local Maven repository before building. The documentation on that whole process is a little sparse; I found http://aan.io/datomic-pro-and-leiningen/ useful after googling around.

The current documentation suggests that switching to Datomic Pro is as simple as s/datomic-free/datomic-pro/g in project.clj.

Add support for terminal task failure

Currently the Cook scheduler always retries a job when it fails. However, sometimes the executor can determine that a job has failed permanently, in which case there is no point in retrying; we should allow the executor to tell the Cook scheduler not to retry the job.

To implement this, we can leverage the data field in TaskStatus: start including metadata (a JSON map, maybe) along with TaskStatus, which would just be a "terminal-failure": "true" entry.

The Cook scheduler can then simply set the job state to complete when it sees "terminal-failure": "true" in a task status.
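The retry decision could look roughly like this, assuming the metadata arrives as a JSON string in TaskStatus.data as proposed (function and parameter names are illustrative):

```python
import json

def should_retry(task_status_data, attempts, max_retries):
    """Decide whether to retry a failed task (sketch).

    Honors the proposed "terminal-failure" flag carried as a JSON map
    in TaskStatus.data; malformed or missing data falls back to the
    usual retry-count check.
    """
    try:
        meta = json.loads(task_status_data or "{}")
    except ValueError:
        meta = {}
    if meta.get("terminal-failure") == "true":
        return False  # executor says the failure is permanent
    return attempts < max_retries
```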

Change the way we load Mesos in travis to enable moving to travis container infra

This will require submitting a request to here: https://github.com/travis-ci/apt-source-whitelist

Alternatively, we can download and install/unpack/build Mesos (or grab binaries) ourselves.

But this has a downside: to get packages added to the whitelist, they seem to need source packages, and to use the cache (necessary for building the package), we'd need to be a paying Travis customer.

This is trickier than I initially thought.

Test protobuf <-> datomic roundtrips

This is meant to test that we can submit some JSON through the rest api, see it hit Datomic, then convert that to a protobuf, then follow the whole roundtrip back. This could catch potentially unknown serialization/format munging bugs, since we represent job data as Clojure datastructures, Mesos protobufs, Datomic datoms, and JSON objects.
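The JSON leg of such a roundtrip can be property-tested in isolation; a minimal sketch is below (the protobuf and Datomic legs would plug into the same shape, with their own to/from converters):

```python
import json

def to_json(job):
    # One leg of the roundtrip; sort_keys makes the output deterministic.
    return json.dumps(job, sort_keys=True)

def from_json(s):
    return json.loads(s)

def roundtrips(job):
    """Check that a job survives a serialize/deserialize cycle unchanged.

    The real test would chain the REST, Datomic, and protobuf
    representations and compare the result to the original.
    """
    return from_json(to_json(job)) == job
```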
