Introduction

⚠️ Cook Scheduler Development Has Ceased

After seven years of developing Cook Scheduler we have made the decision to archive the project. Cook will remain available on GitHub in archive mode but no further development will occur.

When Cook was open sourced it solved difficult problems in on-premises, capacity-constrained data centers. Today, however, the embrace of the public cloud has changed the problems that need to be solved. This shift is also reflected in slowing community contribution to Cook and the emergence of many other open source projects in this space. Given this, it no longer makes sense for us to maintain Cook as an open source project.

We are thankful for the opportunity to have shared Cook with the community and grateful for your contributions. Two Sigma remains committed to supporting open source software. You can find out more about our other projects and contributions here: https://www.twosigma.com/open-source/.

Cook Scheduler

Welcome to Two Sigma's Cook Scheduler!

What is Cook?

  • Cook is a powerful batch scheduler, specifically designed to provide a great user experience when there are more jobs to run than your cluster has capacity for.
  • Cook is able to intelligently preempt jobs to ensure that no user ever waits long for answers, while simultaneously helping you achieve 90%+ utilization for massive workloads.
  • Cook has been battle-hardened to automatically recover after dozens of classes of cluster failures.
  • Cook can act as a Spark scheduler, and it comes with a REST API, Java client, Python client, and CLI.

Core concepts is a good place to start to learn more.
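As a concrete illustration of the REST API mentioned above, a minimal job description might be built like this. The field names are taken from the example job JSON later on this page; the payload envelope is an assumption, so consult the scheduler subproject's docs for the real endpoint details.

```python
import json
import uuid

def make_job(command, cpus=0.5, mem=32):
    """Build a minimal Cook job description (sketch).

    Field names follow the example job JSON elsewhere on this page;
    defaults here are illustrative.
    """
    return {
        "uuid": str(uuid.uuid4()),
        "name": "cookjob",
        "command": command,
        "cpus": cpus,
        "mem": mem,               # MiB
        "max_retries": 3,
        "max_runtime": 86400000,  # milliseconds
        "priority": 50,
        "env": {},
    }

# The Java/Python clients wrap this kind of payload construction for you.
payload = json.dumps({"jobs": [make_job("echo hello")]})
```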

Releases

Check the changelog for release info.

Subproject Summary

In this repository, you'll find several subprojects, each of which has its own documentation.

  • scheduler - This is the actual Mesos framework, Cook. It comes with a JSON REST API.
  • jobclient - This includes the Java and Python APIs for Cook, both of which use the REST API under the hood.
  • spark - This contains the patch to Spark to enable Cook as a backend.

Please visit the scheduler subproject first to get started.

Quickstart

Using Google Kubernetes Engine (GKE)

The quickest way to get Cook running locally against GKE is with Vagrant.

  1. Install Vagrant
  2. Install Virtualbox
  3. Clone down this repo
  4. Run GCP_PROJECT_NAME=<gcp_project_name> PGPASSWORD=<random_string> vagrant up --provider=virtualbox to create the dev environment
  5. Run vagrant ssh to ssh into the dev environment

In your Vagrant dev environment

  1. Run gcloud auth login to login to Google cloud
  2. Run bin/make-gke-test-clusters to create GKE clusters
  3. Run bin/start-datomic.sh to start Datomic (Cook database) (Wait until "System started datomic:free://0.0.0.0:4334/, storing data in: data")
  4. Run lein exec -p datomic/data/seed_k8s_pools.clj $COOK_DATOMIC_URI to seed some Cook pools in the database
  5. Run bin/run-local-kubernetes.sh to start the Cook scheduler
  6. Cook should now be listening locally on port 12321

To test a simple job submission:

  1. Run cs submit --pool k8s-alpha --cpu 0.5 --mem 32 --docker-image gcr.io/google-containers/alpine-with-bash:1.0 ls to submit a simple job
  2. Run cs show <job_uuid> to show the status of your job (it should eventually show Success)
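The cs CLI wraps the REST API; the status check above could be sketched roughly as follows. The endpoint path and the response shape are assumptions based on the example job JSON elsewhere on this page, not a documented contract.

```python
def job_status_url(base, job_uuid):
    # Hypothetical endpoint path; check the scheduler subproject's
    # REST API docs for the real one.
    return f"{base}/rawscheduler?job={job_uuid}"

def summarize(job):
    # Collapse a job document (shaped like the example JSON on this
    # page) into a one-line status string.
    statuses = [i["status"] for i in job.get("instances", [])]
    return f"{job['uuid']}: {job['status']} (instances: {', '.join(statuses) or 'none'})"
```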

To run automated tests:

  1. Run lein test :all-but-benchmark to run unit tests
  2. Run cd ../integration && pytest -m 'not cli' to run integration tests
  3. Run cd ../integration && pytest tests/cook/test_basic.py -k test_basic_submit -n 0 -s to run a particular integration test

Using Mesos

The quickest way to get Mesos and Cook running locally is with docker and minimesos.

  1. Install docker
  2. Clone down this repo
  3. cd scheduler
  4. Run bin/build-docker-image.sh to build the Cook scheduler image
  5. Run ../travis/minimesos up to start Mesos and ZooKeeper using minimesos
  6. Run bin/run-docker.sh to start the Cook scheduler
  7. Cook should now be listening locally on port 12321

Contributing

In order to accept your code contributions, please fill out the appropriate Contributor License Agreement in the cla folder and submit it to [email protected].

Disclaimer

Apache Mesos is a trademark of The Apache Software Foundation. The Apache Software Foundation is not affiliated, endorsed, connected, sponsored or otherwise associated in any way to Two Sigma, Cook, or this website in any manner.

© Two Sigma Open Source, LLC

Cook's People

Contributors

ahaysx, bolina, brianbao, calebhar12, cge0516, daowen, dependabot[bot], dgrnbrg, diegoalbertotorres, dposada, gerrymanoim, icexelloss, jhn, kathryn-zhou, laurameng, leifwalsh, lewisheadden, mayurjpatel, mforsyth, nsinkov, pschorf, rmanyari, samincheva, scrosby, shamsimam, sophaskins, sradack, wenbozhao, wyegelwel, yueri


Cook's Issues

Move to Metrics library in mesos/monitor.clj to standardize metrics reporting

In mesos/monitor.clj, we currently report metrics on users' waiting/running jobs/cpus/mem by sending Riemann events directly. Instead, we should:

(1) Have a chime process query the database and store the results in atoms/async channels
(2) Have a go-loop that watches the atoms/async channels and registers/deregisters gauges accordingly

Per discussion with @dgrnbrg

Add Spark parameters for configuring Cook binding

Besides setting the CPUs and memory for each executor, we should be able to specify additional URIs or environment variables for the executor to retrieve, and the minimum number of running executors to wait for before computation starts.

Cannot start cook with dev/prod datomic

The issue is that dev/prod Datomic needs the metatransaction jar, but currently metatransaction lives inside the scheduler project, so I cannot compile a standalone metatransaction jar.

Add support for other databases

The first step here is determining the types of queries we do. This issue should be updated with the current list:

  • Find all jobs of a particular status
  • Find all non-terminal instances
  • Query a particular job or instance by ID

The status-related queries require secondary indices, but we could change instance IDs to be [jobid instanceid] pairs so that we only need to implement lookup by job ID; then we'd just store a "document" with the full job & instance state.
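A toy sketch of that layout: instances are embedded in the job document and addressed by (job id, instance id), so a single primary-key lookup serves both job and instance queries. All names here are illustrative, not Cook's actual API.

```python
# job_id -> job document, the only primary index this scheme requires
store = {}

def put_job(job_id, doc):
    store[job_id] = doc

def get_job(job_id):
    return store[job_id]

def get_instance(job_id, instance_id):
    # Instance lookup reuses the job lookup; no secondary index needed,
    # because the instance ID is the (job_id, instance_id) pair.
    job = store[job_id]
    return next(i for i in job["instances"] if i["id"] == instance_id)
```

The status queries ("all jobs of a particular status", "all non-terminal instances") are the ones that would still need a secondary index on top of this.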

Still to be analyzed:

  • What's the impact on the metrics reporter's user stats?
  • How critical are the transaction functions? Could we change them to run locally, or all be CAS-based?
  • Can we refactor the use of the Datomic txn log tailer to be totally local, core.async, and per-process?

Document where libmesos is

libmesos.so / libmesos.dylib (Linux/Mac) are usually in /usr/lib or /usr/local/lib, but in my case they were in $MESOS_BUILD_DIR/src/.libs/ -- we need to understand why and document each edge case.

Unable to start server from checkout

I checked out a8e1c67 and tried to run lein run dev-config.edn, but it failed because of missing dependencies. It seems the tags referenced on line 318 of components are not defined anywhere. I commented out that expression, but then got another error about cook.reporter on line 324 and commented that expression out as well. After those changes I was able to run.

I was able to get it all running in less than 30 minutes, including pulling dependencies and debugging this. Thanks for making it easy.

Cook shouldn't change instance status to failed unless it knows the task has failed

Instance status should reflect reality. However, we currently change instance status to failed in order to kill a task. This is not ideal: when a user sees instance status = failed in Cook, the task should actually have failed, i.e. Cook received task-failed/task-error (or maybe task-lost in some cases) from Mesos.

The places we currently change instance status to failed are:
(1) To preempt a task
(2) To kill a task due to heartbeat timeout

@dgrnbrg let me know what you think

Provide an HTTP based job tracking endpoint

Users would like to be able to query Cook for the list of their running and waiting jobs. We've discussed this at length internally, but I'd like to bring this to the open source community for design, review, and implementation.

Benchmark time to schedule a workload

This will give us an idea of how long it should take to start some number of jobs, of various sizes.

The motivation is to understand how long it should take to launch a Spark cluster, so that we can figure out how multitenancy affects this, and if something special is needed.

Running job status not updated in mesos 0.23 and Cook

I submitted a job via Cook to a Mesos 0.23 cluster. Everything seems to have worked fine, but instances[0].status and framework_id are not getting set. On the Mesos page, I do see the job as running and the Cook scheduler as a registered framework.

[
  {
    "mem": 16,
    "max_retries": 3,
    "max_runtime": 86400000,
    "name": "cookjob",
    "command": "while [ true ]; do echo hello cook I am \"$(whoami)\" and MY_VAR=\"${MY_VAR}\"; sleep 10; done",
    "env": {
      "MY_VAR": "foo1"
    },
    "framework_id": null,
    "instances": [
      {
        "start_time": 1444169356373,
        "task_id": "cd66e79b-9272-4d54-bbd3-e89cff8c78c0",
        "hostname": "some.host.domain.com",
        "slave_id": "20151006-201511-738201772-5050-93146-S8",
        "executor_id": "cd66e79b-9272-4d54-bbd3-e89cff8c78c0",
        "status": "unknown"
      }
    ],
    "priority": 50,
    "status": "waiting",
    "uuid": "f76aa5bd-e4bb-4ef3-9ad4-5b2938efc0fd",
    "uris": null,
    "cpus": 0.5
  }
]

Document Scheduler configuration

Make sure that we have a sample dev config (should work out of the box) & prod config (should have comments to explain some choices).

This also should have details on the recommended production JVM options, and why to use them (Datomic using extra heap as cache, debugging GC pauses, etc).

All options should be documented in the asciidoc.

Zookeeper needed for dev-config

In running lein run dev-config.edn I get

2015-09-21 19:21:17,472:22178(0x116b06000):ZOO_ERROR@handle_socket_error_msg@1697: Socket [::1:2181] zk retcode=-4, errno=61(Connection refused): server refused to accept the client
2015-09-21 19:21:17,472:22178(0x116b06000):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:2181] zk retcode=-4, errno=61(Connection refused): server refused to accept the client

which resolves once I start running a Zookeeper locally

2015-09-21 19:21:20,806:22178(0x116b06000):ZOO_INFO@check_events@1703: initiated connection to server [fe80::1:2181]
2015-09-21 19:21:21,191:22178(0x116b06000):ZOO_INFO@check_events@1750: session establishment complete on server [fe80::1:2181], sessionId=0x14ff236287a0000, negotiated timeout=10000
I0921 19:21:21.191696 327958528 group.cpp:313] Group process (group(1)@127.0.0.1:56667) connected to ZooKeeper
I0921 19:21:21.191776 327958528 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0921 19:21:21.191836 327958528 group.cpp:385] Trying to create path '/mesos' in ZooKeeper

Is this expected? From the documentation:

"Cook is written in Clojure. To develop Cook, all you need is a JVM and Mesos installed and configured. Cook will automatically start embedded copies of the rest of its dependencies."

I thought I would not need any dependencies when running in dev mode.

Add federation to Cook REST API

Here's an example of what the config file could look like:

 :federation {:remotes ["http://localhost:12322"]
              :priviledged-principal "admin"
              :threads 4
              :circuit-breaker {:failure-threshold 0
                                :lifetime-ms 60000
                                :response-timeout-ms 60000
                                :reset-timeout-ms 60000
                                :failure-logger-size 10000}}

lein uberjar from scheduler subdir failed

ljin@hsljin:~/ws/github/Cook/scheduler$ lein uberjar
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 5555; nested exception is:
java.net.BindException: Address already in use
Compilation failed: Subprocess failed

Update spark to latest build

This includes using the new Cook environment variable and URI APIs, integrating the latest code into Spark 1.5, and documenting the instructions for building off Spark 1.5.

This should also add support so that the URI uses either Basic auth or Kerberos, depending on whether the URI is of the form cook://user:pass@host:port or simply cook://host:port.
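The URI-based dispatch could be sketched as follows, assuming the cook:// form follows standard URL syntax (credentials present means Basic auth, absent means Kerberos):

```python
from urllib.parse import urlparse

def auth_mode(uri):
    """Pick an auth scheme from a cook:// URI (sketch).

    cook://user:pass@host:port -> HTTP Basic with those credentials
    cook://host:port           -> Kerberos
    """
    parsed = urlparse(uri)
    if parsed.username and parsed.password:
        return ("basic", parsed.username, parsed.password)
    return ("kerberos",)
```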

Add support for host constraints

This should be for things like "only on hosts w/ a specific attribute". This will enable things like GPU or machine class aware scheduling.

This will need to be added to the client-facing API, as well as to the scheduler & db.

Document how to build cook with datomic pro

Currently, Datomic free edition jars are available in public Maven repos, so lein is happy building against them. But to use Datomic Pro, one has to mvn install the licensed jars into the local Maven repository before building. The documentation on that whole process is a little sparse; I found http://aan.io/datomic-pro-and-leiningen/ useful after googling around.

The current documentation suggests that switching to Datomic Pro is as simple as s/datomic-free/datomic-pro/g in project.clj.

Add support for terminal task failure

Currently the Cook scheduler always retries a job when it fails. However, sometimes the executor can determine that a job has failed permanently, in which case there is no point in retrying; we should allow the executor to tell the Cook scheduler not to retry the job.

To implement this, we can leverage the data field in TaskStatus: start including metadata (a JSON map, maybe) along with TaskStatus, which would just be a "terminal-failure": "true" entry.

The Cook scheduler can then simply set the job state to complete when it sees "terminal-failure": "true" in a task status.
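The retry decision could look roughly like this, assuming the metadata arrives as a JSON string in TaskStatus.data as proposed (function and parameter names are illustrative):

```python
import json

def should_retry(task_status_data, attempts, max_retries):
    """Decide whether to retry a failed task (sketch).

    Honors the proposed "terminal-failure" flag carried as a JSON map
    in TaskStatus.data; malformed or missing data falls back to the
    usual retry-count check.
    """
    try:
        meta = json.loads(task_status_data or "{}")
    except ValueError:
        meta = {}
    if meta.get("terminal-failure") == "true":
        return False  # executor says the failure is permanent
    return attempts < max_retries
```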

Change the way we load Mesos in travis to enable moving to travis container infra

This will require submitting a request to here: https://github.com/travis-ci/apt-source-whitelist

Alternatively, we can download and install/unpack/build Mesos (or grab binaries) ourselves.

But this has a downside: to get packages added to the whitelist, they seem to need source packages, and to use the cache (necessary for building the package), we'd need to be a paying Travis customer.

This is trickier than I initially thought.

Test protobuf <-> datomic roundtrips

This is meant to test that we can submit some JSON through the rest api, see it hit Datomic, then convert that to a protobuf, then follow the whole roundtrip back. This could catch potentially unknown serialization/format munging bugs, since we represent job data as Clojure datastructures, Mesos protobufs, Datomic datoms, and JSON objects.
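The JSON leg of such a roundtrip can be property-tested in isolation; a minimal sketch is below (the protobuf and Datomic legs would plug into the same shape, with their own to/from converters):

```python
import json

def to_json(job):
    # One leg of the roundtrip; sort_keys makes the output deterministic.
    return json.dumps(job, sort_keys=True)

def from_json(s):
    return json.loads(s)

def roundtrips(job):
    """Check that a job survives a serialize/deserialize cycle unchanged.

    The real test would chain the REST, Datomic, and protobuf
    representations and compare the result to the original.
    """
    return from_json(to_json(job)) == job
```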
