
Waiter's Introduction

We've decided to archive the Waiter project — it will remain available on GitHub in archive mode, but active development has ceased. We're pleased to have had the opportunity to share Waiter with the OSS community over the last five years, and are grateful for your contributions.

We continue to support other OSS programs. You can find out more about our other projects and contributions at https://www.twosigma.com/open-source/.

Waiter


Welcome to Two Sigma's Waiter project!

Waiter is a web service platform that runs, manages, and automatically scales services without requiring human intervention.

Waiter Design is a good place to start if you want to learn more.

Subproject Summary

In this repository, you'll find two subprojects, each with its own documentation.

  • waiter - This is the actual web service platform, Waiter. It comes with a JSON REST API.
  • kitchen - This is the kitchen application, a test app used by the Waiter integration tests.

Please visit the waiter subproject first to get started.

Quickstart

The quickest way to get Mesos, Marathon, and Waiter running locally is with docker and minimesos.

  1. Install docker
  2. Install minimesos
  3. Clone down this repo
  4. Run containers/bin/build-docker-images.sh to build the minimesos agent image with kitchen and other test apps baked in
  5. cd waiter
  6. Run minimesos up to start Mesos, ZooKeeper, and Marathon
  7. Run bin/run-using-minimesos.sh to start Waiter
  8. Waiter should now be listening locally on port 9091

Quickstart (local shell scheduling)

Waiter can also be run without Mesos and Marathon, using the "shell scheduler". Note that this scheduler should only be used for testing purposes, not in production.

  1. Ensure setsid is installed on your system and on your path
  2. Clone down this repo
  3. cd waiter
  4. Run bin/run-using-shell-scheduler.sh to start Waiter
  5. Waiter should now be listening locally on port 9091

Contributing

In order to accept your code contributions, please fill out the appropriate Contributor License Agreement in the cla folder and submit it to [email protected].

Disclaimer

Apache Mesos is a trademark of The Apache Software Foundation. The Apache Software Foundation is not affiliated, endorsed, connected, sponsored, or otherwise associated in any way with Two Sigma, Waiter, or this website in any manner.

© 2017-2020 Two Sigma Open Source, LLC

Waiter's People

Contributors

amaheshwari25, blevz, chzou, daowen, dependabot[bot], derik01, dmed256, dposada, geofft, jmeinwald, kevo1ution, mokshjawa, napple, nsinkov, pschorf, scrosby, shamsimam, sradack, twosigmajab, zaurbek


Waiter's Issues

Missing Python 3.6 in Travis-CI container

The Kitchen tests are no longer working because the pyenv setup is failing. It looks like Travis has removed Python 3.6 from the container image we're using (trusty deprecated-2017Q4):

$ cd kitchen && ./bin/ci/setup.sh
+pyenv global 3.6
pyenv: version `3.6' not installed
+python --version
Python 2.7.6
+python3 --version
Python 3.4.3

204 No Content with non-zero Content-Length causes 500 in Waiter

The following service causes Waiter to return a 500 error on the client side; however, the request log shows a 204 status code, and nothing about the error pops up in the normal waiter log.

{
 "name": "nc204-app",
 "cpus": 0.1,
 "mem": 128,
 "ports": 1,
 "version": "v4",
 "cmd": "ncat -klp $PORT0 -c 'printf \"HTTP/1.1 204 No Content\r\nContent-Type: text/plain\r\nContent-Length: 1\r\n\r\nX\"'",
 "health-check-url": "/status",
 "token": "nc204-app",
 "permitted-user": "*",
 "idle-timeout-mins": 30,
 "instance-expiry-mins": 10,
 "health-check-interval-secs": 5
}

We should consider either finding a way to pass this through (maybe after dropping the content) or logging the error and the actual status code.
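
A rough sketch of the first option, written as a generic Ring-style response transform (a hypothetical helper, not Waiter's actual proxy code):

;; When the backend answers 204, drop the body and any Content-Length header
;; before relaying, so the response sent to the client stays well-formed.
(defn sanitize-204-response
  [{:keys [status] :as response}]
  (if (= 204 status)
    (-> response
        (assoc :body nil)
        (update :headers dissoc "Content-Length" "content-length"))
    response))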

Do not consume K8s pod restart counts

The restartCount value on a pod's containers is not reliable. According to the current API documentation:

The number of times the container has been restarted, currently based on the number of dead containers that have not yet been removed. Note that this is calculated from dead containers. But those containers are subject to garbage collection. This value will get capped at 5 by GC.

The number 5 doesn't seem to be enforced in practice (probably because the API for setting that number was actually deprecated a while ago); however, it's still the case that the number is not reliable.

We should switch to using a more reliable value everywhere that we use the restart count. The container start time might be a good candidate.
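
For illustration, a hedged sketch of a start-time-based check (the function name is hypothetical; the key paths follow the v1 Pod API with keywordized keys):

;; Treat a change in the container's startedAt timestamp, rather than restartCount,
;; as evidence that the container restarted. `pod` is the parsed v1 Pod object.
(defn container-restarted?
  [previous-started-at pod]
  (let [started-at (get-in pod [:status :containerStatuses 0 :state :running :startedAt])]
    (boolean (and previous-started-at
                  started-at
                  (not= previous-started-at started-at)))))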

Split Authenticator to decouple orthogonal functions

Our current authenticator interface handles two orthogonal checks:

  1. HTTP client authentication (e.g., basic auth or kerberos)
  2. Permission-checking for service creation (e.g., checking for prestashed kerberos tickets in the target compute cluster)

These should be decoupled to make reuse of the client authentication portion easier when we need to change the permission-checking portion.
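
A minimal sketch of the split, with hypothetical protocol names (Waiter's actual interfaces may differ):

;; One protocol for authenticating the HTTP client, another for checking whether
;; the authenticated principal may create/run the requested service.
(defprotocol ClientAuthenticator
  (authenticate-request [this request]
    "Returns the request annotated with the authenticated principal, or an error response."))

(defprotocol RunPermissionChecker
  (can-run-service? [this principal service-description]
    "Returns true if the principal is allowed to create/run the described service."))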

waiter.metrics-test/test-service-counter is flaky

START:  waiter.metrics-test/test-service-counter
lein test :only waiter.metrics-test/test-service-counter
FAIL in (test-service-counter) (metrics_test.clj:65)
expected: (every? (fn* [p1__55911#] (str/starts-with? p1__55911# "services.service-id.")) (.getNames mc/default-registry))
  actual: (not (every? #object[waiter.metrics_test$fn__55912$fn__55915$fn__55916 0x79a6f498 "waiter.metrics_test$fn__55912$fn__55915$fn__55916@79a6f498"] #{"services.service-id.counters.fee.fie" "services.service-id.counters.foo" "services.service-id.counters.foo.bar" "waiter.interstitial.counters.promise.resolved" "waiter.interstitial.counters.resolution.interstitial-timeout"}))
lein test :only waiter.metrics-test/test-service-counter
FAIL in (test-service-counter) (metrics_test.clj:66)
expected: 3
  actual: 5
	 FINISH: waiter.metrics-test/test-service-counter 12ms {:test 21, :pass 93, :fail 2, :error 0, :running 0}

interstitial page unit tests updating metrics asynchronously?

I just found a very perplexing failure on Travis in our Waiter Unit Test logs:

lein test :only waiter.metrics-test/test-waiter-timer

FAIL in (test-waiter-timer) (metrics_test.clj:140)
expected: (every? (fn* [p1__56780#] (str/starts-with? p1__56780# "waiter.core.")) (.getNames mc/default-registry))
  actual: (not (every? #object[waiter.metrics_test$fn__56781$fn__56790$fn__56791 0x6aacbac9 "waiter.metrics_test$fn__56781$fn__56790$fn__56791@6aacbac9"] #{"waiter.core.timers.fee.fie" "waiter.core.timers.foo" "waiter.core.timers.foo.bar" "waiter.interstitial.counters.promise.resolved" "waiter.interstitial.counters.resolution.interstitial-timeout"}))

Here's the corresponding test code:

(deftest test-waiter-timer
  (let [all-metrics-match-filter (reify MetricFilter (matches [_ _ _] true))]
    (.removeMatching mc/default-registry all-metrics-match-filter)
    (waiter-timer "core" "foo")
    (waiter-timer "core" "foo" "bar")
    (waiter-timer "core" "fee" "fie")
    (is (every? #(str/starts-with? % "waiter.core.") (.getNames mc/default-registry)))
    (= 3 (count (.getTimers mc/default-registry all-metrics-match-filter)))
    (.removeMatching mc/default-registry all-metrics-match-filter)))

That code looks very predictable. The only scenario I can come up with where some "waiter.interstitial" metrics get injected between the call to .removeMatching and the call to .getTimers is if some previous unit test for interstitial pages started some asynchronous thread that kept running after the test completed, and then updated "waiter.interstitial.counters.promise.resolved" etc. asynchronously in the middle of this completely unrelated metrics test.
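
One way to make assertions like this robust against asynchronously injected counters (a sketch, not the current test code) is to scope the check to the prefix the test itself creates:

;; Only count metrics registered by this test, so interstitial counters updated
;; by a leftover async thread cannot affect the result.
(is (= 3 (->> (.getNames mc/default-registry)
              (filter #(str/starts-with? % "waiter.core."))
              count)))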

test-health-check-timed-out is flaky

The test-health-check-timed-out test has been flaking out. Shams said he's seen it fail many times. I saw it fail twice in a row today in the shell-slow integration suite.

I'm including part of the log below. I have the full log saved if needed.

lein parallel-test :only waiter.deployment-errors-test/test-health-check-timed-out

FAIL in (test-health-check-timed-out) (deployment_errors_test.clj:89)
test-health-check-timed-out
router->service-state:
{"2301af2d04-586356b53c4f6dcf"
 {:router-id "2301af2d04-586356b53c4f6dcf",
  :state
  {:scheduler-state
   {:id->instance
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:16:50.759Z",
      :port 10010,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :host "127.0.0.6",
      :killed? true,
      :pid 5907,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :flags ["never-passed-health-checks"],
      :failed? true},
     :waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:10.342Z",
      :port 10002,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :host "127.0.0.4",
      :killed? true,
      :pid 5949,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :flags ["never-passed-health-checks"],
      :failed? true},
     :waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:30.348Z",
      :process "java.lang.UNIXProcess@50093915",
      :port 10001,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :host "127.0.0.3",
      :pid 5988,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :flags []}},
    :instance-id->failed-health-check-count
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     1},
    :instance-id->tracked-failed-instance
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:16:50.759Z",
      :process "java.lang.UNIXProcess@72cfbadb",
      :port 10010,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :host "127.0.0.6",
      :pid 5907,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.4c718e104a-5008eef843df6142",
      :flags
      ["has-responded" "has-connected" "never-passed-health-checks"]},
     :waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:10.342Z",
      :process "java.lang.UNIXProcess@4469b76e",
      :port 10002,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :host "127.0.0.4",
      :pid 5949,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.5100b51fc7-522c8546277e7abe",
      :flags
      ["has-responded" "has-connected" "never-passed-health-checks"]}},
    :instance-id->unhealthy-instance
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:30.348Z",
      :process "java.lang.UNIXProcess@50093915",
      :port 10001,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :host "127.0.0.3",
      :pid 5988,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :flags ["has-connected"]}},
    :last-update-time "2018-04-05T19:17:35.367Z",
    :service
    {:environment
     {:HOME "/home/travis/build/twosigma/waiter/waiter/scheduler",
      :LOGNAME "travis",
      :USER "travis",
      :WAITER_CPUS "0.1",
      :WAITER_MEM_MB "256",
      :WAITER_PASSWORD "dec7c87f26cd47c06e4213271b80432a",
      :WAITER_USERNAME "waiter"},
     :id
     "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
     :instances 1,
     :service-description
     {:max-queue-length 1000000,
      :idle-timeout-mins 10,
      :mem 256,
      :min-instances 1,
      :interstitial-secs 0,
      :name "testhealthchecktimedouttravis341824",
      :grace-period-secs 15,
      :env {},
      :max-instances 500,
      :cmd-type "shell",
      :scale-up-factor 0.1,
      :concurrency-level 1,
      :permitted-user "travis",
      :run-as-user "travis",
      :ports 1,
      :health-check-max-consecutive-failures 1,
      :authentication "standard",
      :health-check-interval-secs 5,
      :scale-down-factor 0.001,
      :restart-backoff-factor 2,
      :instance-expiry-mins 7200,
      :cmd
      "/home/travis/build/twosigma/waiter/waiter/bin/ci/../../../kitchen/bin/run.sh -p $PORT0",
      :distribution-scheme "balanced",
      :scale-factor 1,
      :version "version-does-not-matter",
      :health-check-url "/sleep?sleep-ms=300000&status=400",
      :blacklist-on-503 true,
      :metadata {},
      :jitter-threshold 0.5,
      :expired-instance-restart-rate 0.1,
      :backend-proto "http",
      :metric-group "waiter_kitchen",
      :cpus 0.1},
     :task-count 1,
     :task-stats {:healthy 0, :running 1, :staged 0, :unhealthy 1},
     :mem 256}},
   :autoscaler-state
   {:expired-instances 0,
    :healthy-instances 0,
    :instances 1,
    :outstanding-requests 1,
    :scale-amount 0,
    :scale-to-instances 1,
    :target-instances 1.0,
    :task-count 1},
   :scheduler-services-gc-state {},
   :autoscaling-multiplexer-state "no-data-available",
   :app-maintainer-state
   {:last-state-update-time "2018-04-05T19:17:35.367Z",
    :maintainer-chan-available true},
   :responder-state
   {:deployment-error "invalid-health-check-response",
    :request-id->work-stealer {},
    :work-stealing-queue [],
    :instance-id->request-id->use-reason-map {},
    :instance-id->consecutive-failures {},
    :instance-id->blacklist-expiry-time {},
    :instance-id->state
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     {:slots-assigned 0,
      :slots-used 0,
      :status-tags ["starting" "unhealthy"]}},
    :id->instance
    {:waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f
     {:extra-ports [],
      :service-id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4",
      :protocol "http",
      :started-at "2018-04-05T19:17:30.348Z",
      :process "java.lang.UNIXProcess@50093915",
      :port 10001,
      :log-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :host "127.0.0.3",
      :pid 5988,
      :id
      "waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :healthy? false,
      :working-directory
      "/home/travis/build/twosigma/waiter/waiter/scheduler/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4/waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f",
      :flags ["has-connected"]}},
    :sorted-instance-ids
    ["waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4.55a915066a-5c47a4c1dc5b114f"]},
   :transient-metrics-gc-state
   {:last-modified-time "2018-04-05T19:16:55.250Z",
    :state {:alive? false, :outstanding 1, :total 1}},
   :local-usage {:last-request-time "2018-04-05T19:16:50.734Z"},
   :scheduler-broken-services-gc-state {},
   :interstitial-maintainer-state {:available true},
   :work-stealing-state
   {:iteration 399,
    :request-id->work-stealer {},
    :router-id->help-required {},
    :router-id->metrics
    {:2301af2d04-586356b53c4f6dcf
     {:last-request-time "2018-04-05T19:16:50.734Z",
      :outstanding 0,
      :slots-available 0,
      :slots-in-use 0,
      :slots-received 0,
      :total 1}},
    :slots {:offerable 0, :offered 0}}}}}

expected: (clojure.string/includes? body__43965__auto__ (waiter.deployment-errors-test/deployment-error->str waiter-url :health-check-timed-out))
  actual: (not (clojure.string/includes? "\n  Waiter Error 503\n  ================\n  \n    Deployment error: Health check returned an invalid response\n  \n  Request Info\n  ============\n  \n            Host: 127.0.0.1:9091\n            Path: /endpoint\n    Query String: \n          Method: POST\n             CID: test-health-check-timed-out-4c6fdb35ae-3e4ff9a8cd5fc42f\n            Time: 2018-04-05T19:16:50.734Z\n       Principal: travis\n      Service Id: waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4\n  \n  \n  Additional Info\n  ===============\n  \n    {:service-id\n     \"waiter-service-testhealthchecktimedouttravis341824-ab0f694fb3515294a9df919e1be7b7a4\",\n     :status 503}\n    \n  Getting Help\n  ============\n  \n    Waiter on GitHub: http://github.com/twosigma/waiter \n  \n" "Deployment error: Health check timed out"))
	 FINISH: waiter.deployment-errors-test/test-health-check-timed-out 45s {:test 8, :pass 819, :fail 1, :error 0}
	 START:  waiter.deployment-errors-test/test-invalid-health-check-response
	 FINISH: waiter.request-timeout-test/test-request-queue-timeout-faulty-app 16s {:test 9, :pass 826, :fail 1, :error 0}

lein parallel-test waiter.busy-instance-test # on thread 0
	 START:  waiter.busy-instance-test/test-busy-instance-not-reserved
	 FINISH: waiter.deployment-errors-test/test-invalid-health-check-response 50s {:test 10, :pass 829, :fail 1, :error 0}
	 START:  waiter.deployment-errors-test/test-cannot-connect
	 FINISH: waiter.busy-instance-test/test-busy-instance-not-reserved 45s {:test 11, :pass 830, :fail 1, :error 0}

lein parallel-test waiter.new-app-test # on thread 0
	 START:  waiter.new-app-test/test-new-app-gc
	 FINISH: waiter.deployment-errors-test/test-cannot-connect 37s {:test 12, :pass 833, :fail 1, :error 0}

lein parallel-test waiter.work-stealing-integration-test # on thread 1
	 START:  waiter.work-stealing-integration-test/test-work-stealing-load-balancing
	 FINISH: waiter.new-app-test/test-new-app-gc 88s {:test 13, :pass 834, :fail 1, :error 0}
	 FINISH: waiter.work-stealing-integration-test/test-work-stealing-load-balancing 111s {:test 13, :pass 837, :fail 1, :error 0}

Longest running tests:
waiter.streaming-test/test-streaming-timeout-on-default-settings 169s
waiter.work-stealing-integration-test/test-work-stealing-load-balancing 111s
waiter.new-app-test/test-new-app-gc 88s
waiter.deployment-errors-test/test-invalid-health-check-response 50s
waiter.deployment-errors-test/test-health-check-timed-out 45s
waiter.busy-instance-test/test-busy-instance-not-reserved 45s
waiter.autoscaling-test/test-scaling-healthy-app 43s
waiter.deployment-errors-test/test-cannot-connect 37s
waiter.autoscaling-test/test-scaling-unhealthy-app 30s
waiter.instance-reservation-test/test-instance-reservation-for-concurrent-service 27s

Ran 13 tests containing 838 assertions.
1 failures, 0 errors.
Tests failed.
Error encountered performing task 'parallel-test' with profile(s): 'base,system,user,provided,dev,test-log'
Tests failed.

Invalid token header should report 400 instead of 500

We should validate the token using the waiter.token/valid-token-re regex, with a memoized function to minimize the overhead of this validation. Requests that fail validation should receive a 400 Client Error instead of a 500.

$ curl -i -H"x-waiter-token: invalid/token" http://localhost:9091/hello
HTTP/1.1 500 Server Error
Content-Type: text/plain
x-cid: e322bca34794-11337f17bb43b7fa
Transfer-Encoding: chunked

The internal error logs point to it being detected at the kv-store layer.

2018-09-12 15:16:13,663 ERROR waiter.util.utils [qtp413715958-77] - [CID=e322bca34794-11337f17bb43b7fa] #error {
 :cause Key may not contain '/'
 :data {:key invalid/token}
 :via
 [{:type clojure.lang.ExceptionInfo
   :message Internal error
   :data {:status 500}
   :at [clojure.core$ex_info invokeStatic core.clj 4617]}
  {:type clojure.lang.ExceptionInfo
   :message Key may not contain '/'
   :data {:key invalid/token}
   :at [clojure.core$ex_info invokeStatic core.clj 4617]}]
 :trace
 [[clojure.core$ex_info invokeStatic core.clj 4617]
  [clojure.core$ex_info invoke core.clj 4617]
  [waiter.kv$validate_zk_key invokeStatic kv.clj 97]
  [waiter.kv$validate_zk_key invoke kv.clj 92]
  [waiter.kv.ZooKeeperKeyValueStore retrieve kv.clj 122]
  [waiter.kv.EncryptedKeyValueStore retrieve kv.clj 167]
  [waiter.kv.CachedKeyValueStore$fn__23591 invoke kv.clj 202]
  [waiter.util.utils$atom_cache_get_or_load$fn__16884 invoke utils.clj 90]
  [clojure.lang.Delay deref Delay.java 37]
  [clojure.core$deref invokeStatic core.clj 2228]
  [clojure.core$deref invoke core.clj 2214]
  [waiter.util.utils$atom_cache_get_or_load$fn__16886$fn__16887 invoke utils.clj 91]
  [clojure.core.cache$through$fn__5673 invoke cache.clj 55]
  [clojure.core.cache$default_wrapper_fn invokeStatic cache.clj 42]
  [clojure.core.cache$default_wrapper_fn invoke cache.clj 42]
  [clojure.core.cache$through invokeStatic cache.clj 55]
  [clojure.core.cache$through invoke cache.clj 44]
  [clojure.core.cache$through invokeStatic cache.clj 51]
  [clojure.core.cache$through invoke cache.clj 44]
  [waiter.util.utils$atom_cache_get_or_load$fn__16886 invoke utils.clj 91]
  [clojure.lang.Atom swap Atom.java 37]
  [clojure.core$swap_BANG_ invokeStatic core.clj 2260]
  [clojure.core$swap_BANG_ invoke core.clj 2253]
  [waiter.util.utils$atom_cache_get_or_load invokeStatic utils.clj 91]
  [waiter.util.utils$atom_cache_get_or_load invoke utils.clj 85]
  [waiter.kv.CachedKeyValueStore retrieve kv.clj 202]
  [waiter.kv$fetch invokeStatic kv.clj 54]
  [waiter.kv$fetch doInvoke kv.clj 50]
  [clojure.lang.RestFn invoke RestFn.java 425]
  [waiter.service_description$token__GT_token_data invokeStatic service_description.clj 497]
  [waiter.service_description$token__GT_token_data invoke service_description.clj 494]
  [waiter.service_description$token__GT_service_parameter_template invokeStatic service_description.clj 531]
  [waiter.service_description$token__GT_service_parameter_template doInvoke service_description.clj 528]
  [clojure.lang.RestFn invoke RestFn.java 464]
  [waiter.core$fn__41705$fnk41702_positional__41706$wrap_auth_bypass_fn__41707$fn__41709 invoke core.clj 1355]
  [waiter.core$ring_handler_factory$http_handler__37736 invoke core.clj 138]
  [waiter.cors$wrap_cors_preflight$wrap_cors_preflight_fn__18040 invoke cors.clj 61]
  [waiter.core$wrap_error_handling$wrap_error_handling_fn__37851 invoke core.clj 224]
  [waiter.core$wrap_debug$wrap_debug_fn__37840 invoke core.clj 217]
  [waiter.request_log$wrap_log$fn__43546 invoke request_log.clj 76]
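
A minimal sketch of the proposed fix (the only name taken from the issue is waiter.token/valid-token-re; everything else is hypothetical):

;; Memoize a predicate over the token regex and reject bad tokens with a 400
;; before the lookup ever reaches the kv-store layer.
(def valid-token?
  (memoize (fn [token] (boolean (re-matches waiter.token/valid-token-re token)))))

(defn ensure-valid-token!
  [token]
  (when-not (valid-token? token)
    (throw (ex-info "Invalid token" {:status 400 :token token}))))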

test-remove-and-check-metrics-except-outstanding is flaky

Example:

     START:  waiter.metrics-test/test-remove-and-check-metrics-except-outstanding
lein test :only waiter.metrics-test/test-remove-and-check-metrics-except-outstanding
FAIL in (test-remove-and-check-metrics-except-outstanding) (metrics_test.clj:330)
Delete metrics for specified services
expected: 10
  actual: 12
     FINISH: waiter.metrics-test/test-remove-and-check-metrics-except-outstanding 28ms {:test 8, :pass 48, :fail 1, :error 0, :running 0}

Unit test running in REPL did not clean up resources and used up all disk space

After running unit tests in waiter/test/waiter/scheduler/composite_test.clj using the REPL in IntelliJ, some process kept logging to waiter/log/waiter.log.2018-10-18 until all disk space was used up.

tail of waiter/log/waiter.log:

...
2018-10-19 09:47:51,751 INFO  waiter.scheduler.composite [async-dispatch-17] -  sending 0 services along scheduler-state-chan
2018-10-19 09:47:51,751 INFO  waiter.scheduler.composite [async-dispatch-17] -  ipsum state chan has been closed
2018-10-19 09:47:51,751 INFO  waiter.scheduler.composite [async-dispatch-17] -  sending 0 services along scheduler-state-chan
2018-10-19 09:47:51,751 INFO  waiter.scheduler.composite [async-dispatch-17] -  ipsum state chan has been closed
...

/waiter-async endpoints should not propagate authorization header to backend

They currently pass along whatever authorization header was provided by the client. This was discovered when some of our tests attempted to pass the authorization header on /waiter-async routes, and the header was forwarded to kitchen, which then failed to authenticate because it expects the WAITER_USERNAME and WAITER_PASSWORD it was configured with.
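
A hedged sketch of one possible fix, written as a generic Ring-style middleware (the function name and route handling are illustrative, not Waiter's actual handler code):

(require '[clojure.string :as string])

;; Strip the client's authorization header before proxying /waiter-async requests,
;; so the backend only sees the credentials Waiter itself injects.
(defn wrap-strip-async-authorization
  [handler]
  (fn [{:keys [uri] :as request}]
    (handler
      (cond-> request
        (string/starts-with? uri "/waiter-async")
        (update :headers dissoc "authorization")))))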

`test-retry-failed-instances` is flaky

lein test :only waiter.scheduler.shell-test/test-retry-failed-instances

FAIL in (test-retry-failed-instances) (shell_test.clj:549)
expected: 1
  actual: 0

lein test :only waiter.scheduler.shell-test/test-retry-failed-instances

FAIL in (test-retry-failed-instances) (shell_test.clj:554)
expected: 2
  actual: 1

Look at events for messages about liveness probe failures

For now, we assume any SIGKILL (137) with the default "Error" reason was a livenessProbe kill.

This is used for setting the #{:never-passed-health-checks} flag.

(defn killed-by-k8s?
  ...
  ;; query the Kubernetes events for this instance's pod, filtered to reason=Unhealthy
  (api-request http-client "/api/v1/namespaces/<ns>/events?fieldSelector=involvedObject.namespace=<ns>,involvedObject.name=<instance-id>,reason=Unhealthy")
  ;; a liveness-probe kill shows up as an Unhealthy event whose message has this prefix
  (-> event :message (string/starts-with? "Liveness probe failed:")))

test-blacklisted-instance-not-reserved flaking

I've seen this multiple times today:

lein parallel-test :only waiter.killed-instance-test/test-blacklisted-instance-not-reserved

FAIL in (test-blacklisted-instance-not-reserved) (killed_instance_test.clj:58)
test-blacklisted-instance-not-reserved test-blacklisted-instance-not-reserved
[CID=unknown] Expected status: 503, actual:
 Body:
expected: (clojure.core/= 503 actual-status__17762__auto__)
  actual: (not (clojure.core/= 503 nil))

Log capture for Kubernetes tests in Travis

We currently have no way to capture and dump the output from Kubernetes pods running in our Travis test jobs. I found an example of a pretty simple setup (using fluentd) that would let us grab stdout/stderr on all running pods and cat them all into a single long-running system pod:

https://github.com/joatmon08/kubernetes-reference/tree/master/logging

We could then use kubectl logs to dump all the output to a file for debugging failures.

We should also ensure that Kitchen defaults to using stream=sys.stderr for logging when it's running in our test container on Kubernetes.

`test-request-queue-timeout-slow-start-app` is flaky with kubernetes scheduler

lein parallel-test :only waiter.request-timeout-test/test-request-queue-timeout-slow-start-app
FAIL in (test-request-queue-timeout-slow-start-app) (request_timeout_test.clj:55)
test-request-queue-timeout-slow-start-app
expected: (clojure.string/includes? body "Check that your service is able to start properly!")
  actual: (not (clojure.string/includes? "\n  Waiter Error 503\n  ================\n  \n    After 10 seconds, no instance available to handle request.\n  \n  Request Info\n  ============\n  \n            Host: 127.0.0.1:9091\n            Path: /req\n    Query String: \n          Method: POST\n             CID: test-request-queue-timeout-slow-start-app-b6a8186bb9-113a633c93c46be1\n            Time: 2019-01-20T16:49:28.869Z\n       Principal: travis\n      Service Id: waiter-service-testrequestqueuetimeoutslowstartapptravis1291530-f0735b4b12fa87ece3afa084afa0a2a3\n  \n  \n  Additional Info\n  ===============\n  \n    {:outstanding-requests 1,\n     :service-id\n     \"waiter-service-testrequestqueuetimeoutslowstartapptravis1291530-f0735b4b12fa87ece3afa084afa0a2a3\",\n     :work-stealing-offers-sent 0,\n     :work-stealing-offers-received 0,\n     :slots-assigned 0,\n     :slots-in-use 0,\n     :waiting-for-available-instance 1,\n     :status 503,\n     :slots-available 0,\n     :requests-waiting-to-stream 0}\n    \n  Getting Help\n  ============\n  \n    Waiter on GitHub: http://github.com/twosigma/waiter \n  \n" "Check that your service is able to start properly!"))

Alternative try-let syntax

We talked about using an alternate try-let syntax yesterday. Here's what I came up with:

(defrecord PassThru [caught-exception])

(defmacro try-let
  "Convenience macro for wrapping let binding expressions in a try/catch/finally block,
   while not catching exceptions thrown within the body of the let block.
   This construct adds optional :catch and :finally forms to the end of the let-binding vector.
   Each clause consists of a keyword-vector pair, similar to the :let clause in Clojure's `for` macro.
   The :catch vector must contain one or more forms,
   each of which should look like a `(catch ...)` clause, but without the `catch` keyword.
   The :finally vector's contents are simply wrapped in a `(finally ...)` block.
   At most one of each :catch and :finally may be provided,
   and if both are given, :catch must immediately precede :finally.

   Example:

   (try-let [x (throw (RuntimeException. \"Hi\"))
             :catch [(IllegalArgumentException e (comment Cannot happen))
                     (Exception e (println \"Caught exception:\" (.getMessage e)))]
             :finally [(println \"doing cleanup\")
                       (comment Free some resources)]] x)"
  [bindings-vec & body]
  (assert (even? (count bindings-vec))
          "Odd number of forms in try-let binding vector. Binding-expression pairs are required.")
  (let [[binding-pairs catch+finally] (->> bindings-vec
                                           (partition 2)
                                           (split-with (comp not keyword? first)))
        bindings' (apply concat binding-pairs)
        clauses (->> catch+finally (apply concat) (apply array-map))
        {catches-vec :catch finally-body :finally} clauses
        clause-count (count catch+finally)]
    (assert (and (<= 1 clause-count 2)
                 (== (count clauses) clause-count)
                 (or (nil? catches-vec) (= :catch (ffirst catch+finally))))
            "Bindings vector must end with a :catch clause, or a :finally clause, or a :catch clause followed by a :finally clause.")
    `(let [try-result# (try
                         (let [~@bindings']
                           (try
                             ~@body
                             (catch java.lang.Throwable t#
                               (PassThru. t#))))
                         ~@(map (partial list* `catch) catches-vec)
                         ~@(when finally-body
                             `((finally ~@finally-body))))]
       (if (instance? PassThru try-result#)
         (throw (:caught-exception try-result#))
         try-result#))))

Sample usage (copied from the docstring):

   (try-let [x (throw (RuntimeException. "Hi"))
             :catch [(IllegalArgumentException e (comment Cannot happen))
                     (Exception e (println "Caught exception:" (.getMessage e)))]
             :finally [(println "doing cleanup")
                       (comment Free some resources)]] x)

Advantages of this version of the macro:

  • There is very little additional overhead for this solution in the case that no exception is thrown in the try-let body. (The macro in the currently-used library has to create an auxiliary vector, and then it repeats the destructuring and binding logic in both the inner and outer scopes.)
  • The :catch/:finally definitions go in the bindings list. That syntax feels idiomatic since it resembles :when/:let in the for macro, and it clearly denotes the intended scope of the :catch clauses.
  • The macro definition is pretty straightforward, and trivially supports destructuring (since it just plops the binding forms unchanged into another let block).

Disadvantages:

  • Catches and re-throws exceptions coming from the try-let body. However, the overhead of creating the exception (especially constructing the stack trace) should completely dominate the catch/throw.
  • Adds the PassThru record definition.
  • Some people might think having the :catch and :finally pieces in the binding list is "ugly"...

Full directory-contents support for Waiter-K8s

We should expose the files inside the working directory of the pods via Waiter as we do with the Marathon scheduler implementation.

We could do this by starting a small Python file server in the home directory. Upon receiving a directory-contents request, we'd start the embedded file server with a reasonable idle timeout (e.g., 5 minutes); if no further requests arrive within that period, we kill the file-server process.
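
A rough sketch of the idle-timeout mechanism (a hypothetical helper; a real implementation would reset the timer on each directory-contents request):

;; Start Python's built-in http.server in the instance's home directory and
;; destroy the process if it is still running after the idle timeout.
(defn start-transient-file-server!
  [home-dir port timeout-ms]
  (let [process (-> (ProcessBuilder. ["python3" "-m" "http.server" (str port)])
                    (.directory (java.io.File. home-dir))
                    (.start))]
    (future
      (Thread/sleep timeout-ms)
      (when (.isAlive process)
        (.destroy process)))
    process))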

Using `binding` inside an `async/go` block is not generally safe

Consider the following example[1]:

(def ^:dynamic *foo* :original)

(defn -main [& _]
  (let [c (clojure.core.async/to-chan [42])]
    (clojure.core.async/go
      (binding [*foo* :rebound]
        (clojure.core.async/<! c)
        (println "done."))))

  (Thread/sleep 5000))

The go macro transforms its body to create a state machine that can be paused/resumed on parking operations. That state machine is then executed on a worker thread pool. Since the <! operation in the above example causes the state machine to park (i.e., yield), the worker thread that entered the binding block and pushed the new *foo* value onto its dynamic-bindings stack might not be the same thread that resumes the state machine (after the <! operation completes), which means that the dynamic-bindings pop operation at the end of the binding block can fail loudly:

Exception in thread "async-dispatch-3" java.lang.IllegalStateException: Pop without matching push

This is a known issue. The issue description says that it only applies to ahead-of-time-compiled code (e.g., apps run from an UberJAR)—but it's not clear to me why it wouldn't also apply to non-AOT-compiled code.

We've seen these errors cause Waiter to crash in our production environment. We should try our best to avoid using binding blocks within async/go blocks. If we can't avoid it, then we must ensure that there are no parking actions within the binding block.


[1]: Example adapted from the proof-of-concept in https://dev.clojure.org/jira/browse/ASYNC-170.
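
For reference, a sketch of one safe pattern (reusing the *foo* var from the example above): park on the channel first, and only then establish the binding, so no parking operation sits inside the binding block.

(defn -main [& _]
  (let [c (clojure.core.async/to-chan [42])]
    (clojure.core.async/go
      (let [v (clojure.core.async/<! c)]   ; park outside any binding
        (binding [*foo* :rebound]          ; no parking ops inside this block
          (println "done with" v *foo*)))))
  (Thread/sleep 5000))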

Add "Introduction to Waiter" to the docs

A rough outline for this page:

Batch Jobs Need Services

  • Many batch jobs repeating the same data loading / manipulation tasks
  • Desire to break out these tasks into shared services
  • The need for per-user services

Waiter Features

  • Simple service creation
  • Run-as-requestor
  • Scaling, up and down to zero
  • “Serverless”
  • Supports wide variety of request types and services
    • request duration
    • start-up time
    • concurrency guarantees

Waiter’s Value

  • Rough Usage Metrics
  • Long-lived services, e.g. BeakerX (websockets)

Add "docker" command type

We should add a docker command type in addition to our current shell command type.

The version string would be a docker image string (namespace/name:tag). We'd pull the image and run it via Marathon's docker support, or directly with the docker run command in the Shell scheduler.
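
For illustration only, a hedged sketch of how the shell scheduler might translate such a service description into a docker run invocation (the flag set and function name are assumptions, not an agreed design):

;; Interpret the service's version string as an image reference and build the
;; equivalent docker run command for the shell scheduler.
(defn docker-run-command
  [{:keys [version cmd]} port]
  ["docker" "run" "--rm"
   "-e" (str "PORT0=" port)
   "-p" (str port ":" port)
   version
   "/bin/sh" "-c" cmd])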

Gracefully handle release of non-existing instance

If a service/instance gets deleted, the release mechanism throws an error:

2019-03-14 16:15:44,030 INFO  waiter.scheduler.marathon [pool-1-thread-7] - [CID=test-request-parallel-streaming-dda564e72b-5e50d491a39bbd21] deleting service waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97

2019-03-14 16:20:02,913 ERROR waiter.service [async-dispatch-44] - [CID=test-request-parallel-streaming-a57d509d95-d9e244d9521fab3] Error while releasing instance #waiter.scheduler.ServiceInstance{:id waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97.d8c7641c-4673-11e9-b175-0242ac110005, :service-id waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97, :started-at #clj-time/date-time "2019-03-14T16:11:38.563Z", :healthy? true, :health-check-status nil, :flags #{}, :exit-code nil, :host 172.17.0.7, :port 31859, :extra-ports [], :protocol http, :log-directory nil, :message nil}

clojure.lang.ExceptionInfo: Unable to find release-chan. {:instance #waiter.scheduler.ServiceInstance{:id "waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97.d8c7641c-4673-11e9-b175-0242ac110005", :service-id "waiter-service-testrequestparallelstreamingtravis1268262-d6fb2db944182843a85fd87914c9ed97", :started-at #clj-time/date-time "2019-03-14T16:11:38.563Z", :healthy? true, :health-check-status nil, :flags #{}, :exit-code nil, :host "172.17.0.7", :port 31859, :extra-ports [], :protocol "http", :log-directory nil, :message nil}}
	at waiter.service$release_instance_go$fn__25178$state_machine__10453__auto____25185$fn__25188.invoke(service.clj:175)
	at waiter.service$release_instance_go$fn__25178$state_machine__10453__auto____25185.invoke(service.clj:175)
	at clojure.core.async.impl.ioc_macros$run_state_machine.invokeStatic(ioc_macros.clj:973)
	at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:972)
	at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invokeStatic(ioc_macros.clj:977)
	at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:975)
	at clojure.core.async.impl.ioc_macros$take_BANG_$fn__10471.invoke(ioc_macros.clj:986)
	at clojure.core.async.impl.channels.ManyToManyChannel$fn__5480.invoke(channels.clj:265)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
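
A minimal sketch of the graceful path (a hypothetical function; the real release logic lives in waiter.service): when no release-chan is registered for the instance, log and move on instead of throwing.

;; If the instance's release-chan is gone (e.g., the service was already deleted),
;; treat the release as a no-op rather than an error.
(defn release-instance-safely!
  [instance-id->release-chan instance]
  (if-let [release-chan (get instance-id->release-chan (:id instance))]
    (clojure.core.async/>!! release-chan :released)
    (println "no release-chan for" (:id instance) "- instance was likely already deleted")))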
