
Waiter's Introduction

We've decided to archive the Waiter project — it will remain available on GitHub in archive mode, but active development has ceased. We're pleased to have had the opportunity to share Waiter with the OSS community over the last five years, and are grateful for your contributions.

We continue to support other OSS programs. You can find out more about our other projects and contributions at https://www.twosigma.com/open-source/.

Waiter


Welcome to Two Sigma's Waiter project!

Waiter is a web service platform that runs, manages, and automatically scales services without requiring human intervention.

Waiter Design is a good place to start if you want to learn more.

Subproject Summary

In this repository, you'll find two subprojects, each with its own documentation.

  • waiter - This is the actual web service platform, Waiter. It comes with a JSON REST API.
  • kitchen - This is the kitchen application, a test app used by the Waiter integration tests.

Please visit the waiter subproject first to get started.

Quickstart

The quickest way to get Mesos, Marathon, and Waiter running locally is with docker and minimesos.

  1. Install docker
  2. Install minimesos
  3. Clone down this repo
  4. Run containers/bin/build-docker-images.sh to build the minimesos agent image with kitchen and other test apps baked in
  5. cd waiter
  6. Run minimesos up to start Mesos, ZooKeeper, and Marathon
  7. Run bin/run-using-minimesos.sh to start Waiter
  8. Waiter should now be listening locally on port 9091

Quickstart (local shell scheduling)

Waiter can also be run without Mesos and Marathon, using the "shell scheduler". Note that this scheduler should only be used for testing purposes, not in production.

  1. Ensure setsid is installed on your system and on your path
  2. Clone down this repo
  3. cd waiter
  4. Run bin/run-using-shell-scheduler.sh to start Waiter
  5. Waiter should now be listening locally on port 9091

Contributing

In order to accept your code contributions, please fill out the appropriate Contributor License Agreement in the cla folder and submit it to [email protected].

Disclaimer

Apache Mesos is a trademark of The Apache Software Foundation. The Apache Software Foundation is not affiliated, endorsed, connected, sponsored, or otherwise associated in any way with Two Sigma, Waiter, or this website in any manner.

© 2017-2020 Two Sigma Open Source, LLC

Waiter's People

Contributors

amaheshwari25, blevz, chzou, daowen, dependabot[bot], derik01, dmed256, dposada, geofft, jmeinwald, kevo1ution, mokshjawa, napple, nsinkov, pschorf, scrosby, shamsimam, sradack, twosigmajab, zaurbek


Waiter's Issues

Missing Python 3.6 in Travis-CI container

The Kitchen tests are no longer working because the pyenv setup is failing. It looks like Travis has removed Python 3.6 from the container image we're using (trusty deprecated-2017Q4):

$ cd kitchen && ./bin/ci/setup.sh
+pyenv global 3.6
pyenv: version `3.6' not installed
+python --version
Python 2.7.6
+python3 --version
Python 3.4.3

204 No Content with non-zero Content-Length causes 500 in Waiter

The following service causes Waiter to return a 500 error on the client side; however, the request log shows a 204 status code, and nothing about the error pops up in the normal waiter log.

{
 "name": "nc204-app",
 "cpus": 0.1,
 "mem": 128,
 "ports": 1,
 "version": "v4",
 "cmd": "ncat -klp $PORT0 -c 'printf \"HTTP/1.1 204 No Content\r\nContent-Type: text/plain\r\nContent-Length: 1\r\n\r\nX\"'",
 "health-check-url": "/status",
 "token": "nc204-app",
 "permitted-user": "*",
 "idle-timeout-mins": 30,
 "instance-expiry-mins": 10,
 "health-check-interval-secs": 5
}

We should consider either finding a way to pass this through (maybe after dropping the content) or logging the error and the actual status code.
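
A rough sketch of the first option, written as a generic Ring-style response transform (a hypothetical helper, not Waiter's actual proxy code):

;; When the backend answers 204, drop the body and any Content-Length header
;; before relaying, so the response sent to the client stays well-formed.
(defn sanitize-204-response
  [{:keys [status] :as response}]
  (if (= 204 status)
    (-> response
        (assoc :body nil)
        (update :headers dissoc "Content-Length" "content-length"))
    response))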

Do not consume K8s pod restart counts

The restartCount value on a pod's containers is not reliable. According to the current API documentation:

The number of times the container has been restarted, currently based on the number of dead containers that have not yet been removed. Note that this is calculated from dead containers. But those containers are subject to garbage collection. This value will get capped at 5 by GC.

The number 5 doesn't seem to be enforced in practice (probably because the API for setting that number was actually deprecated a while ago); however, it's still the case that the number is not reliable.

We should switch to using a more reliable value everywhere that we use the restart count. The container start time might be a good candidate.
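
For illustration, a hedged sketch of a start-time-based check (the function name is hypothetical; the key paths follow the v1 Pod API with keywordized keys):

;; Treat a change in the container's startedAt timestamp, rather than restartCount,
;; as evidence that the container restarted. `pod` is the parsed v1 Pod object.
(defn container-restarted?
  [previous-started-at pod]
  (let [started-at (get-in pod [:status :containerStatuses 0 :state :running :startedAt])]
    (boolean (and previous-started-at
                  started-at
                  (not= previous-started-at started-at)))))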

Split Authenticator to decouple orthogonal functions

Our current authenticator interface handles two orthogonal checks:

  1. HTTP client authentication (e.g., basic auth or kerberos)
  2. Permission-checking for service creation (e.g., checking for prestashed kerberos tickets in the target compute cluster)

These should be decoupled to make reuse of the client authentication portion easier when we need to change the permission-checking portion.
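
A minimal sketch of the split, with hypothetical protocol names (Waiter's actual interfaces may differ):

;; One protocol for authenticating the HTTP client, another for checking whether
;; the authenticated principal may create/run the requested service.
(defprotocol ClientAuthenticator
  (authenticate-request [this request]
    "Returns the request annotated with the authenticated principal, or an error response."))

(defprotocol RunPermissionChecker
  (can-run-service? [this principal service-description]
    "Returns true if the principal is allowed to create/run the described service."))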

waiter.metrics-test/test-service-counter is flaky

START:  waiter.metrics-test/test-service-counter
lein test :only waiter.metrics-test/test-service-counter
FAIL in (test-service-counter) (metrics_test.clj:65)
expected: (every? (fn* [p1__55911#] (str/starts-with? p1__55911# "services.service-id.")) (.getNames mc/default-registry))
  actual: (not (every? #object[waiter.metrics_test$fn__55912$fn__55915$fn__55916 0x79a6f498 "waiter.metrics_test$fn__55912$fn__55915$fn__55916@79a6f498"] #{"services.service-id.counters.fee.fie" "services.service-id.counters.foo" "services.service-id.counters.foo.bar" "waiter.interstitial.counters.promise.resolved" "waiter.interstitial.counters.resolution.interstitial-timeout"}))
lein test :only waiter.metrics-test/test-service-counter
FAIL in (test-service-counter) (metrics_test.clj:66)
expected: 3
  actual: 5
	 FINISH: waiter.metrics-test/test-service-counter 12ms {:test 21, :pass 93, :fail 2, :error 0, :running 0}

interstitial page unit tests updating metrics asynchronously?

I just found a very perplexing failure on Travis in our Waiter Unit Test logs:

lein test :only waiter.metrics-test/test-waiter-timer

FAIL in (test-waiter-timer) (metrics_test.clj:140)
expected: (every? (fn* [p1__56780#] (str/starts-with? p1__56780# "waiter.core.")) (.getNames mc/default-registry))
  actual: (not (every? #object[waiter.metrics_test$fn__56781$fn__56790$fn__56791 0x6aacbac9 "waiter.metrics_test$fn__56781$fn__56790$fn__56791@6aacbac9"] #{"waiter.core.timers.fee.fie" "waiter.core.timers.foo" "waiter.core.timers.foo.bar" "waiter.interstitial.counters.promise.resolved" "waiter.interstitial.counters.resolution.interstitial-timeout"}))

Here's the corresponding test code:

(deftest test-waiter-timer
  (let [all-metrics-match-filter (reify MetricFilter (matches [_ _ _] true))]
    (.removeMatching mc/default-registry all-metrics-match-filter)
    (waiter-timer "core" "foo")
    (waiter-timer "core" "foo" "bar")
    (waiter-timer "core" "fee" "fie")
    (is (every? #(str/starts-with? % "waiter.core.") (.getNames mc/default-registry)))
    (= 3 (count (.getTimers mc/default-registry all-metrics-match-filter)))
    (.removeMatching mc/default-registry all-metrics-match-filter)))

That code looks very predictable. The only scenario I can come up with where some "waiter.interstitial" metrics get injected between the call to .removeMatching and the call to .getTimers is if some previous unit test for interstitial pages started some asynchronous thread that kept running after the test completed, and then updated "waiter.interstitial.counters.promise.resolved" etc. asynchronously in the middle of this completely unrelated metrics test.
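
One way to make assertions like this robust against asynchronously injected counters (a sketch, not the current test code) is to scope the check to the prefix the test itself creates:

;; Only count metrics registered by this test, so interstitial counters updated
;; by a leftover async thread cannot affect the result.
(is (= 3 (->> (.getNames mc/default-registry)
              (filter #(str/starts-with? % "waiter.core."))
              count)))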

test-health-check-timed-out is flaky

The test-health-check-timed-out test has been flaking out. Shams said he's seen it fail many times. I saw it fail twice in a row today in the shell-slow integration suite.

I'm including part of the log below. I have the full log saved if needed.

lein parallel-test :only waiter.deployment-errors-test/test-health-check-timed-out

FAIL in (test-health-check-timed-out) (deployment_errors_test.clj:89)
test-health-check-timed-out
router->service-state:
{"2301af2d04-586356b53c4f6dcf"
 {:router-id "2301af2d04-586356b53c4f6dcf",
  :state
  {:scheduler-state
   {:id->instance
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:16:50.759Z",
      :port 10010,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :host "127.0.0.6",
      :killed? true,
      :pid 5907,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :flags ["never-passed-health-checks"],
      :failed? true},
     :waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:10.342Z",
      :port 10002,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :host "127.0.0.4",
      :killed? true,
      :pid 5949,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :flags ["never-passed-health-checks"],
      :failed? true},
     :waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:30.348Z",
      :process "java.lang.UNIXProcess@50093915",
      :port 10001,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :host "127.0.0.3",
      :pid 5988,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :flags []}},
    :instance-id->failed-health-check-count
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     1},
    :instance-id->tracked-failed-instance
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:16:50.759Z",
      :process "java.lang.UNIXProcess@72cfbadb",
      :port 10010,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :host "127.0.0.6",
      :pid 5907,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :flags
      ["has-responded" "has-connected" "never-passed-health-checks"]},
     :waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:10.342Z",
      :process "java.lang.UNIXProcess@4469b76e",
      :port 10002,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :host "127.0.0.4",
      :pid 5949,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :flags
      ["has-responded" "has-connected" "never-passed-health-checks"]}},
    :instance-id->unhealthy-instance
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:30.348Z",
      :process "java.lang.UNIXProcess@50093915",
      :port 10001,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :host "127.0.0.3",
      :pid 5988,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :flags ["has-connected"]}},
    :last-update-time "2018-04-05T19:17:35.367Z",
    :service
    {:environment
     {:HOME "/home/travis/build/twosigma/waiter/waiter/scheduler",
      :LOGNAME "travis",
      :USER "travis",
      :WAITER_CPUS "0.1",
      :WAITER_MEM_MB "256",
      :WAITER_PASSWORD "dec7c87f26cd47c06e4213271b80432a",
      :WAITER_USERNAME "waiter"},
     :id
     "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
     :instances 1,
     :service-description
     {:max-queue-length 1000000,
      :idle-timeout-mins 10,
      :mem 256,
      :min-instances 1,
      :interstitial-secs 0,
      :name "testhealthchecktimedouttravis341824",
      :grace-period-secs 15,
      :env {},
      :max-instances 500,
      :cmd-type "shell",
      :scale-up-factor 0.1,
      :concurrency-level 1,
      :permitted-user "travis",
      :run-as-user "travis",
      :ports 1,
      :health-check-max-consecutive-failures 1,
      :authentication "standard",
      :health-check-interval-secs 5,
      :scale-down-factor 0.001,
      :restart-backoff-factor 2,
      :instance-expiry-mins 7200,
      :cmd
      "/home/travis/build/twosigma/waiter/waiter/bin/ci/../../../kitchen/bin/run.sh -p $PORT0",
      :distribution-scheme "balanced",
      :scale-factor 1,
      :version "version-does-not-matter",
      :health-check-url "/sleep?sleep-ms=300000&status=400",
      :blacklist-on-503 true,
      :metadata {},
      :jitter-threshold 0.5,
      :expired-instance-restart-rate 0.1,
      :backend-proto "http",
      :metric-group "waiter_kitchen",
      :cpus 0.1},
     :task-count 1,
     :task-stats {:healthy 0, :running 1, :staged 0, :unhealthy 1},
     :mem 256}},
   :autoscaler-state
   {:expired-instances 0,
    :healthy-instances 0,
    :instances 1,
    :outstanding-requests 1,
    :scale-amount 0,
    :scale-to-instances 1,
    :target-instances 1.0,
    :task-count 1},
   :scheduler-services-gc-state {},
   :autoscaling-multiplexer-state "no-data-available",
   :app-maintainer-state
   {:last-state-update-time "2018-04-05T19:17:35.367Z",
    :maintainer-chan-available true},
   :responder-state
   {:deployment-error "invalid-health-check-response",
    :request-id->work-stealer {},
    :work-stealing-queue [],
    :instance-id->request-id->use-reason-map {},
    :instance-id->consecutive-failures {},
    :instance-id->blacklist-expiry-time {},
    :instance-id->state
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     {:slots-assigned 0,
      :slots-used 0,
      :status-tags ["starting" "unhealthy"]}},
    :id->instance
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:30.348Z",
      :process "java.lang.UNIXProcess@50093915",
      :port 10001,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :host "127.0.0.3",
      :pid 5988,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :flags ["has-connected"]}},
    :sorted-instance-ids
    ["waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f"]},
   :transient-metrics-gc-state
   {:last-modified-time "2018-04-05T19:16:55.250Z",
    :state {:alive? false, :outstanding 1, :total 1}},
   :local-usage {:last-request-time "2018-04-05T19:16:50.734Z"},
   :scheduler-broken-services-gc-state {},
   :interstitial-maintainer-state {:available true},
   :work-stealing-state
   {:iteration 399,
    :request-id->work-stealer {},
    :router-id->help-required {},
    :router-id->metrics
    {:2301af2d04-586356b53c4f6dcf
     {:last-request-time "2018-04-05T19:16:50.734Z",
      :outstanding 0,
      :slots-available 0,
      :slots-in-use 0,
      :slots-received 0,
      :total 1}},
    :slots {:offerable 0, :offered 0}}}}}

expected: (clojure.string/includes? body__43965__auto__ (waiter.deployment-errors-test/deployment-error->str waiter-url :health-check-timed-out))
  actual: (not (clojure.string/includes? "\n  Waiter Error 503\n  ================\n  \n    Deployment error: Health check returned an invalid response\n  \n  Request Info\n  ============\n  \n            Host: 127.0.0.1:9091\n            Path: /endpoint\n    Query String: \n          Method: POST\n             CID: test-health-check-timed-out-4c6fdb35ae-3e4ff9a8cd5fc42f\n            Time: 2018-04-05T19:16:50.734Z\n       Principal: travis\n      Service Id: waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4\n  \n  \n  Additional Info\n  ===============\n  \n    {:service-id\n     \"waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4\",\n     :status 503}\n    \n  Getting Help\n  ============\n  \n    Waiter on GitHub: http://github.com/twosigma/waiter \n  \n" "Deployment error: Health check timed out"))
	 FINISH: waiter.deployment-errors-test/test-health-check-timed-out 45s {:test 8, :pass 819, :fail 1, :error 0}
	 START:  waiter.deployment-errors-test/test-invalid-health-check-response
	 FINISH: waiter.request-timeout-test/test-request-queue-timeout-faulty-app 16s {:test 9, :pass 826, :fail 1, :error 0}

lein parallel-test waiter.busy-instance-test # on thread 0
	 START:  waiter.busy-instance-test/test-busy-instance-not-reserved
	 FINISH: waiter.deployment-errors-test/test-invalid-health-check-response 50s {:test 10, :pass 829, :fail 1, :error 0}
	 START:  waiter.deployment-errors-test/test-cannot-connect
	 FINISH: waiter.busy-instance-test/test-busy-instance-not-reserved 45s {:test 11, :pass 830, :fail 1, :error 0}

lein parallel-test waiter.new-app-test # on thread 0
	 START:  waiter.new-app-test/test-new-app-gc
	 FINISH: waiter.deployment-errors-test/test-cannot-connect 37s {:test 12, :pass 833, :fail 1, :error 0}

lein parallel-test waiter.work-stealing-integration-test # on thread 1
	 START:  waiter.work-stealing-integration-test/test-work-stealing-load-balancing
	 FINISH: waiter.new-app-test/test-new-app-gc 88s {:test 13, :pass 834, :fail 1, :error 0}
	 FINISH: waiter.work-stealing-integration-test/test-work-stealing-load-balancing 111s {:test 13, :pass 837, :fail 1, :error 0}

Longest running tests:
waiter.streaming-test/test-streaming-timeout-on-default-settings 169s
waiter.work-stealing-integration-test/test-work-stealing-load-balancing 111s
waiter.new-app-test/test-new-app-gc 88s
waiter.deployment-errors-test/test-invalid-health-check-response 50s
waiter.deployment-errors-test/test-health-check-timed-out 45s
waiter.busy-instance-test/test-busy-instance-not-reserved 45s
waiter.autoscaling-test/test-scaling-healthy-app 43s
waiter.deployment-errors-test/test-cannot-connect 37s
waiter.autoscaling-test/test-scaling-unhealthy-app 30s
waiter.instance-reservation-test/test-instance-reservation-for-concurrent-service 27s

Ran 13 tests containing 838 assertions.
1 failures, 0 errors.
Tests failed.
Error encountered performing task 'parallel-test' with profile(s): 'base,system,user,provided,dev,test-log'
Tests failed.

Invalid token header should report 400 instead of 500

We should validate the token using the waiter.token/valid-token-re regex, with a memoized function to minimize the overhead of this validation. Requests that fail validation should receive a 400 Client Error instead of a 500.

$ curl -i -H"x-waiter-token: invalid/token" http://localhost:9091/hello
HTTP/1.1 500 Server Error
Content-Type: text/plain
x-cid: e322bca34794-11337f17bb43b7fa
Transfer-Encoding: chunked

The internal error logs point to it being detected at the kv-store layer.

2018-09-12 15:16:13,663 ERROR waiter.util.utils [qtp413715958-77] - [CID=e322bca34794-11337f17bb43b7fa] #error {
 :cause Key may not contain '/'
 :data {:key invalid/token}
 :via
 [{:type clojure.lang.ExceptionInfo
   :message Internal error
   :data {:status 500}
   :at [clojure.core$ex_info invokeStatic core.clj 4617]}
  {:type clojure.lang.ExceptionInfo
   :message Key may not contain '/'
   :data {:key invalid/token}
   :at [clojure.core$ex_info invokeStatic core.clj 4617]}]
 :trace
 [[clojure.core$ex_info invokeStatic core.clj 4617]
  [clojure.core$ex_info invoke core.clj 4617]
  [waiter.kv$validate_zk_key invokeStatic kv.clj 97]
  [waiter.kv$validate_zk_key invoke kv.clj 92]
  [waiter.kv.ZooKeeperKeyValueStore retrieve kv.clj 122]
  [waiter.kv.EncryptedKeyValueStore retrieve kv.clj 167]
  [waiter.kv.CachedKeyValueStore$fn__23591 invoke kv.clj 202]
  [waiter.util.utils$atom_cache_get_or_load$fn__16884 invoke utils.clj 90]
  [clojure.lang.Delay deref Delay.java 37]
  [clojure.core$deref invokeStatic core.clj 2228]
  [clojure.core$deref invoke core.clj 2214]
  [waiter.util.utils$atom_cache_get_or_load$fn__16886$fn__16887 invoke utils.clj 91]
  [clojure.core.cache$through$fn__5673 invoke cache.clj 55]
  [clojure.core.cache$default_wrapper_fn invokeStatic cache.clj 42]
  [clojure.core.cache$default_wrapper_fn invoke cache.clj 42]
  [clojure.core.cache$through invokeStatic cache.clj 55]
  [clojure.core.cache$through invoke cache.clj 44]
  [clojure.core.cache$through invokeStatic cache.clj 51]
  [clojure.core.cache$through invoke cache.clj 44]
  [waiter.util.utils$atom_cache_get_or_load$fn__16886 invoke utils.clj 91]
  [clojure.lang.Atom swap Atom.java 37]
  [clojure.core$swap_BANG_ invokeStatic core.clj 2260]
  [clojure.core$swap_BANG_ invoke core.clj 2253]
  [waiter.util.utils$atom_cache_get_or_load invokeStatic utils.clj 91]
  [waiter.util.utils$atom_cache_get_or_load invoke utils.clj 85]
  [waiter.kv.CachedKeyValueStore retrieve kv.clj 202]
  [waiter.kv$fetch invokeStatic kv.clj 54]
  [waiter.kv$fetch doInvoke kv.clj 50]
  [clojure.lang.RestFn invoke RestFn.java 425]
  [waiter.service_description$token__GT_token_data invokeStatic service_description.clj 497]
  [waiter.service_description$token__GT_token_data invoke service_description.clj 494]
  [waiter.service_description$token__GT_service_parameter_template invokeStatic service_description.clj 531]
  [waiter.service_description$token__GT_service_parameter_template doInvoke service_description.clj 528]
  [clojure.lang.RestFn invoke RestFn.java 464]
  [waiter.core$fn__41705$fnk41702_positional__41706$wrap_auth_bypass_fn__41707$fn__41709 invoke core.clj 1355]
  [waiter.core$ring_handler_factory$http_handler__37736 invoke core.clj 138]
  [waiter.cors$wrap_cors_preflight$wrap_cors_preflight_fn__18040 invoke cors.clj 61]
  [waiter.core$wrap_error_handling$wrap_error_handling_fn__37851 invoke core.clj 224]
  [waiter.core$wrap_debug$wrap_debug_fn__37840 invoke core.clj 217]
  [waiter.request_log$wrap_log$fn__43546 invoke request_log.clj 76]
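
A minimal sketch of the proposed fix (the only name taken from the issue is waiter.token/valid-token-re; everything else is hypothetical):

;; Memoize a predicate over the token regex and reject bad tokens with a 400
;; before the lookup ever reaches the kv-store layer.
(def valid-token?
  (memoize (fn [token] (boolean (re-matches waiter.token/valid-token-re token)))))

(defn ensure-valid-token!
  [token]
  (when-not (valid-token? token)
    (throw (ex-info "Invalid token" {:status 400 :token token}))))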

test-remove-and-check-metrics-except-outstanding is flaky

Example:

     START:  waiter.metrics-test/test-remove-and-check-metrics-except-outstanding
lein test :only waiter.metrics-test/test-remove-and-check-metrics-except-outstanding
FAIL in (test-remove-and-check-metrics-except-outstanding) (metrics_test.clj:330)
Delete metrics for specified services
expected: 10
  actual: 12
     FINISH: waiter.metrics-test/test-remove-and-check-metrics-except-outstanding 28ms {:test 8, :pass 48, :fail 1, :error 0, :running 0}

Unit test running in REPL did not clean up resources and used up all disk space

After running unit tests in waiter/test/waiter/scheduler/composite_test.clj using the REPL in IntelliJ, some process kept logging to waiter/log/waiter.log.2018-10-18 until all disk space was used up.

tail of waiter/log/waiter.log:

...
2018-10-19 09:47:51,751 INFO  waiter.scheduler.composite [async-dispatch-17] -  sending 0 services along scheduler-state-chan
2018-10-19 09:47:51,751 INFO  waiter.scheduler.composite [async-dispatch-17] -  ipsum state chan has been closed
2018-10-19 09:47:51,751 INFO  waiter.scheduler.composite [async-dispatch-17] -  sending 0 services along scheduler-state-chan
2018-10-19 09:47:51,751 INFO  waiter.scheduler.composite [async-dispatch-17] -  ipsum state chan has been closed
...

/waiter-async endpoints should not propagate authorization header to backend

They currently pass along whatever authorization header was provided by the client. This was discovered when some of our tests attempted to pass the authorization header on /waiter-async routes, and the header was forwarded to kitchen, which then failed to authenticate because it expects the WAITER_USERNAME and WAITER_PASSWORD it was configured with.
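
A hedged sketch of one possible fix, written as a generic Ring-style middleware (the function name and route handling are illustrative, not Waiter's actual handler code):

(require '[clojure.string :as string])

;; Strip the client's authorization header before proxying /waiter-async requests,
;; so the backend only sees the credentials Waiter itself injects.
(defn wrap-strip-async-authorization
  [handler]
  (fn [{:keys [uri] :as request}]
    (handler
      (cond-> request
        (string/starts-with? uri "/waiter-async")
        (update :headers dissoc "authorization")))))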

`test-retry-failed-instances` is flaky

lein test :only waiter.scheduler.shell-test/test-retry-failed-instances

FAIL in (test-retry-failed-instances) (shell_test.clj:549)
expected: 1
  actual: 0

lein test :only waiter.scheduler.shell-test/test-retry-failed-instances

FAIL in (test-retry-failed-instances) (shell_test.clj:554)
expected: 2
  actual: 1

Look at events for messages about liveness probe failures

For now, we assume any SIGKILL (137) with the default "Error" reason was a livenessProbe kill.

This is used for setting the #{:never-passed-health-checks} flag.

(defn killed-by-k8s?
  ...
  ;; query the Kubernetes events for this instance's pod, filtered to reason=Unhealthy
  (api-request http-client "/api/v1/namespaces/<ns>/events?fieldSelector=involvedObject.namespace=<ns>,involvedObject.name=<instance-id>,reason=Unhealthy")
  ;; a liveness-probe kill shows up as an Unhealthy event whose message has this prefix
  (-> event :message (string/starts-with? "Liveness probe failed:")))

test-blacklisted-instance-not-reserved flaking

I've seen this multiple times today:

lein parallel-test :only waiter.killed-instance-test/test-blacklisted-instance-not-reserved

FAIL in (test-blacklisted-instance-not-reserved) (killed_instance_test.clj:58)
test-blacklisted-instance-not-reserved test-blacklisted-instance-not-reserved
[CID=unknown] Expected status: 503, actual:
 Body:
expected: (clojure.core/= 503 actual-status__17762__auto__)
  actual: (not (clojure.core/= 503 nil))

Log capture for Kubernetes tests in Travis

We currently have no way to capture and dump the output from Kubernetes pods running in our Travis test jobs. I found an example of a pretty simple setup (using fluentd) that would let us grab stdout/stderr on all running pods and cat them all into a single long-running system pod:

https://github.com/joatmon08/kubernetes-reference/tree/master/logging

We could then use kubectl logs to dump all the output to a file for debugging failures.

We should also ensure that Kitchen defaults to using stream=sys.stderr for logging when it's running in our test container on Kubernetes.

`test-request-queue-timeout-slow-start-app` is flaky with kubernetes scheduler

lein parallel-test :only waiter.request-timeout-test/test-request-queue-timeout-slow-start-app
FAIL in (test-request-queue-timeout-slow-start-app) (request_timeout_test.clj:55)
test-request-queue-timeout-slow-start-app
expected: (clojure.string/includes? body "Check that your service is able to start properly!")
  actual: (not (clojure.string/includes? "\n  Waiter Error 503\n  ================\n  \n    After 10 seconds, no instance available to handle request.\n  \n  Request Info\n  ============\n  \n            Host: 127.0.0.1:9091\n            Path: /req\n    Query String: \n          Method: POST\n             CID: test-request-queue-timeout-slow-start-app-b6a8186bb9-113a633c93c46be1\n            Time: 2019-01-20T16:49:28.869Z\n       Principal: travis\n      Service Id: waiter-service-testrequestqueuetimeoutslowstartapptravis1291530-f0735b4b12fa87ece3afa084afa0a2a3\n  \n  \n  Additional Info\n  ===============\n  \n    {:outstanding-requests 1,\n     :service-id\n     \"waiter-service-testrequestqueuetimeoutslowstartapptravis1291530-f0735b4b12fa87ece3afa084afa0a2a3\",\n     :work-stealing-offers-sent 0,\n     :work-stealing-offers-received 0,\n     :slots-assigned 0,\n     :slots-in-use 0,\n     :waiting-for-available-instance 1,\n     :status 503,\n     :slots-available 0,\n     :requests-waiting-to-stream 0}\n    \n  Getting Help\n  ============\n  \n    Waiter on GitHub: http://github.com/twosigma/waiter \n  \n" "Check that your service is able to start properly!"))

Alternative try-let syntax

We talked about using an alternate try-let syntax yesterday. Here's what I came up with:

(defrecord PassThru [caught-exception])

(defmacro try-let
  "Convenience macro for wrapping let binding expressions in a try/catch/finally block,
   while not catching exceptions thrown within the body of the let block.
   This construct adds optional :catch and :finally forms to the end of the let-binding vector.
   Each clause consists of a keyword-vector pair, similar to the :let clause in Clojure's `for` macro.
   The :catch vector must contain one or more forms,
   each of which should look like a `(catch ...)` clause, but without the `catch` keyword.
   The :finally vector's contents are simply wrapped in a `(finally ...)` block.
   At most one of each :catch and :finally may be provided,
   and if both are given, :catch must immediately precede :finally.

   Example:

   (try-let [x (throw (RuntimeException. \"Hi\"))
             :catch [(IllegalArgumentException e (comment Cannot happen))
                     (Exception e (println \"Caught exception:\" (.getMessage e)))]
             :finally [(println \"doing cleanup\")
                       (comment Free some resources)]] x)"
  [bindings-vec & body]
  (assert (even? (count bindings-vec))
          "Odd number of forms in try-let binding vector. Binding-expression pairs are required.")
  (let [[binding-pairs catch+finally] (->> bindings-vec
                                           (partition 2)
                                           (split-with (comp not keyword? first)))
        bindings' (apply concat binding-pairs)
        clauses (->> catch+finally (apply concat) (apply array-map))
        {catches-vec :catch finally-body :finally} clauses
        clause-count (count catch+finally)]
    (assert (and (<= 1 clause-count 2)
                 (== (count clauses) clause-count)
                 (or (nil? catches-vec) (= :catch (ffirst catch+finally))))
            "Bindings vector must end with a :catch clause, or a :finally clause, or a :catch clause followed by a :finally clause.")
    `(let [try-result# (try
                         (let [~@bindings']
                           (try
                             ~@body
                             (catch java.lang.Throwable t#
                               (PassThru. t#))))
                         ~@(map (partial list* `catch) catches-vec)
                         ~@(when finally-body
                             `((finally ~@finally-body))))]
       (if (instance? PassThru try-result#)
         (throw (:caught-exception try-result#))
         try-result#))))

Sample usage (copied from the docstring):

   (try-let [x (throw (RuntimeException. "Hi"))
             :catch [(IllegalArgumentException e (comment Cannot happen))
                     (Exception e (println "Caught exception:" (.getMessage e)))]
             :finally [(println "doing cleanup")
                       (comment Free some resources)]] x)

Advantages of this version of the macro:

  • There is very little additional overhead for this solution in the case that no exception is thrown in the try-let body. (The macro in the currently-used library has to create an auxiliary vector, and then it repeats the destructuring and binding logic in both the inner and outer scopes.)
  • The :catch/:finally definitions go in the bindings list. That syntax feels idiomatic since it resembles :when/:let in the for macro, and it clearly denotes the intended scope of the :catch clauses.
  • The macro definition is pretty straightforward, and trivially supports destructuring (since it just plops the binding forms unchanged into another let block).

Disadvantages:

  • Catches and re-throws exceptions coming from the try-let body. However, the overhead of creating the exception (especially constructing the stack trace) should completely dominate the catch/throw.
  • Adds the PassThru record definition.
  • Some people might think having the :catch and :finally pieces in the binding list is "ugly"...

Full directory-contents support for Waiter-K8s

We should expose the files inside the working directory of the pods via Waiter as we do with the Marathon scheduler implementation.

We could do this by starting a small Python file server in the home directory. Upon receiving a directory-contents request, we'd start the embedded file server with a reasonable idle timeout (e.g., 5 minutes); if no further requests arrive within that period, we kill the file-server process.
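
A rough sketch of the idle-timeout mechanism (a hypothetical helper; a real implementation would reset the timer on each directory-contents request):

;; Start Python's built-in http.server in the instance's home directory and
;; destroy the process if it is still running after the idle timeout.
(defn start-transient-file-server!
  [home-dir port timeout-ms]
  (let [process (-> (ProcessBuilder. ["python3" "-m" "http.server" (str port)])
                    (.directory (java.io.File. home-dir))
                    (.start))]
    (future
      (Thread/sleep timeout-ms)
      (when (.isAlive process)
        (.destroy process)))
    process))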

Using `binding` inside an `async/go` block is not generally safe

Consider the following example[1]:

(def ^:dynamic *foo* :original)

(defn -main [& _]
  (let [c (clojure.core.async/to-chan [42])]
    (clojure.core.async/go
      (binding [*foo* :rebound]
        (clojure.core.async/<! c)
        (println "done."))))

  (Thread/sleep 5000))

The go macro transforms its body to create a state machine that can be paused/resumed on parking operations. That state machine is then executed on a worker thread pool. Since the <! operation in the above example causes the state machine to park (i.e., yield), the worker thread that entered the binding block and pushed the new *foo* value onto its dynamic-bindings stack might not be the same thread that resumes the state machine (after the <! operation completes), which means that the dynamic-bindings pop operation at the end of the binding block can fail loudly:

Exception in thread "async-dispatch-3" java.lang.IllegalStateException: Pop without matching push

This is a known issue. The issue description says that it only applies to ahead-of-time-compiled code (e.g., apps run from an UberJAR)—but it's not clear to me why it wouldn't also apply to non-AOT-compiled code.

We've seen these errors cause Waiter to crash in our production environment. We should try our best to avoid using binding blocks within async/go blocks. If we can't avoid it, then we must ensure that there are no parking actions within the binding block.


[1]: Example adapted from the proof-of-concept in https://dev.clojure.org/jira/browse/ASYNC-170.
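
For reference, a sketch of one safe pattern (reusing the *foo* var from the example above): park on the channel first, and only then establish the binding, so no parking operation sits inside the binding block.

(defn -main [& _]
  (let [c (clojure.core.async/to-chan [42])]
    (clojure.core.async/go
      (let [v (clojure.core.async/<! c)]   ; park outside any binding
        (binding [*foo* :rebound]          ; no parking ops inside this block
          (println "done with" v *foo*)))))
  (Thread/sleep 5000))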

Add "Introduction to Waiter" to the docs

A rough outline for this page:

Batch Jobs Need Services

  • Many batch jobs repeating the same data loading / manipulation tasks
  • Desire to break out these tasks into shared services
  • The need for per-user services

Waiter Features

  • Simple service creation
  • Run-as-requestor
  • Scaling, up and down to zero
  • “Serverless”
  • Supports wide variety of request types and services
    • request duration
    • start-up time
    • concurrency guarantees

Waiter’s Value

  • Rough Usage Metrics
  • Long-lived services, e.g. BeakerX (websockets)

Add "docker" command type

We should add a docker command type in addition to our current shell command type.

The version string would be a docker image string (namespace/name:tag). We'd pull the image and run it via Marathon's docker support, or directly with the docker run command in the Shell scheduler.
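
For illustration only, a hedged sketch of how the shell scheduler might translate such a service description into a docker run invocation (the flag set and function name are assumptions, not an agreed design):

;; Interpret the service's version string as an image reference and build the
;; equivalent docker run command for the shell scheduler.
(defn docker-run-command
  [{:keys [version cmd]} port]
  ["docker" "run" "--rm"
   "-e" (str "PORT0=" port)
   "-p" (str port ":" port)
   version
   "/bin/sh" "-c" cmd])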

Gracefully handle release of non-existing instance

If a service/instance gets deleted, the release mechanism throws an error:

2019-03-14 16:15:44,030 INFO  waiter.scheduler.marathon [pool-1-thread-7] - [CID=test-request-parallel-streaming-dda564e72b-5e50d491a39bbd21] deleting service waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97

2019-03-14 16:20:02,913 ERROR waiter.service [async-dispatch-44] - [CID=test-request-parallel-streaming-a57d509d95-d9e244d9521fab3] Error while releasing instance #waiter.scheduler.ServiceInstance{:id waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97.d8c7641c-4673-11e9-b175-0242ac110005, :service-id waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97, :started-at #clj-time/date-time "2019-03-14T16:11:38.563Z", :healthy? true, :health-check-status nil, :flags #{}, :exit-code nil, :host 172.17.0.7, :port 31859, :extra-ports [], :protocol http, :log-directory nil, :message nil}

clojure.lang.ExceptionInfo: Unable to find release-chan. {:instance #waiter.scheduler.ServiceInstance{:id "waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97.d8c7641c-4673-11e9-b175-0242ac110005", :service-id "waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97", :started-at #clj-time/date-time "2019-03-14T16:11:38.563Z", :healthy? true, :health-check-status nil, :flags #{}, :exit-code nil, :host "172.17.0.7", :port 31859, :extra-ports [], :protocol "http", :log-directory nil, :message nil}}
	at waiter.service$release_instance_go$fn__25178$state_machine__10453__auto____25185$fn__25188.invoke(service.clj:175)
	at waiter.service$release_instance_go$fn__25178$state_machine__10453__auto____25185.invoke(service.clj:175)
	at clojure.core.async.impl.ioc_macros$run_state_machine.invokeStatic(ioc_macros.clj:973)
	at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:972)
	at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invokeStatic(ioc_macros.clj:977)
	at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:975)
	at clojure.core.async.impl.ioc_macros$take_BANG_$fn__10471.invoke(ioc_macros.clj:986)
	at clojure.core.async.impl.channels.ManyToManyChannel$fn__5480.invoke(channels.clj:265)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
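
A minimal sketch of the graceful path (a hypothetical function; the real release logic lives in waiter.service): when no release-chan is registered for the instance, log and move on instead of throwing.

;; If the instance's release-chan is gone (e.g., the service was already deleted),
;; treat the release as a no-op rather than an error.
(defn release-instance-safely!
  [instance-id->release-chan instance]
  (if-let [release-chan (get instance-id->release-chan (:id instance))]
    (clojure.core.async/>!! release-chan :released)
    (println "no release-chan for" (:id instance) "- instance was likely already deleted")))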
