heroku / umpire

HTTP metrics monitoring endpoint

Umpire

Overview

Umpire provides a normalized HTTP endpoint that responds with 200 / non-200 according to the metric check parameters specified in the requested URL. This endpoint can then be composed with existing HTTP-URL-monitoring tools like Pingdom to enable self-service QoS monitoring of metrics.

Usage Examples

Grab an UMPIRE_URL that you can use to query against:

$ export UMPIRE_URL=https://u:$(heroku config:get API_KEY -a umpire-production)@umpire.yourdomain.com

To respond with 200 if the pulse.nginx-requests-per-second metric has had an average value of less than 400 over the last 300 seconds:

$ curl -i "$UMPIRE_URL/check?metric=pulse.nginx-requests-per-second&max=400&range=300"

To respond with 200 if the custom.api.production.requests.per-sec metric has had an average value of more than 40 over the past 60 seconds:

$ curl -i "$UMPIRE_URL/check?metric=custom.api.production.requests.per-sec&min=40&range=60"

The default metrics target is Graphite. If you'd like to check Librato Metrics, just add a backend=librato query param:

$ curl -i "$UMPIRE_URL/check?metric=custom.api.production.requests.per-sec&min=40&range=60&backend=librato"
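The query strings above can also be assembled programmatically. The following is a minimal Ruby sketch, not part of Umpire itself; the helper name `umpire_check_url` and the example host are assumptions made for illustration.

```ruby
require "uri"

# Hypothetical helper: builds a /check URL from a base URL and check parameters.
# The parameter names (metric, min, max, range, backend) are the ones used in
# the curl examples above.
def umpire_check_url(base_url, metric:, range:, min: nil, max: nil, backend: nil)
  params = { metric: metric, min: min, max: max, range: range, backend: backend }
  # Drop parameters that were not supplied so the URL stays minimal.
  query = URI.encode_www_form(params.reject { |_, value| value.nil? })
  "#{base_url}/check?#{query}"
end

puts umpire_check_url("https://umpire.example.com",
                      metric: "custom.api.production.requests.per-sec",
                      min: 40, range: 60, backend: "librato")
```

Using `URI.encode_www_form` keeps metric names that contain characters such as `:` or `*` correctly escaped.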

Librato Support

Librato returns multiple values from its API for a given metric. These include count, min, max, sum, value (aka mean) and a few others. In addition, Librato lets you optionally specify a statistical function, via a group_by param, that is used to generate an aggregated time series across sources.

We support both of these options by specifying your librato metrics like so: metric_name:<from>:<group_by>

Example: fetch the active-connections metric with group_by=sum, pulling values from the "count" field returned in the "all" block:

$ curl -i "$UMPIRE_URL/check?metric=active-connections:count:sum&backend=librato&range=60&min=1"
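To make the `metric_name:<from>:<group_by>` form concrete, here is a small Ruby sketch of how such a spec could be split into its parts. The struct, the function name, and the fallback to the "value" field when `<from>` is omitted are assumptions for this example, not Umpire's actual parser.

```ruby
# Hypothetical parser for the "metric_name:<from>:<group_by>" form.
LibratoSpec = Struct.new(:name, :from, :group_by)

def parse_librato_metric(spec)
  name, from, group_by = spec.split(":", 3)
  # Assumed default: read the "value" field when no <from> part is given.
  LibratoSpec.new(name, from || "value", group_by)
end

spec = parse_librato_metric("active-connections:count:sum")
puts "#{spec.name} from=#{spec.from} group_by=#{spec.group_by}"
```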

For more info look here

Pass empty_ok=true to have Umpire respond with 200 if the metric returns no values within the given range.
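The check semantics described so far (average over the range, compared against min and max, with an empty-result escape hatch) can be sketched in Ruby as follows. Only "200 vs non-200" comes from the text; the specific non-200 codes (404 for no values, 500 for out of range) and the inclusive boundary handling are assumptions of this sketch.

```ruby
# Sketch of the check decision, not Umpire's actual implementation.
def check_status(values, min: nil, max: nil, empty_ok: false)
  # No data in range: succeed only when the caller opted in with empty_ok.
  return (empty_ok ? 200 : 404) if values.empty?

  average = values.sum.to_f / values.size
  return 500 if min && average < min   # below the lower bound (assumed code)
  return 500 if max && average > max   # above the upper bound (assumed code)
  200
end

puts check_status([30, 50, 70], min: 40)
puts check_status([], max: 400, empty_ok: true)
```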

Aggregation

The default metric values aggregation method is averaging, but you can change it by adding an 'aggregate' query param. Possible aggregation methods are avg, sum, max and min.

The following query responds with 200 if the custom.api.production.requests.per-sec metric has had a maximum value of less than 400 over the last minute:

$ curl -i "$UMPIRE_URL/check?metric=custom.api.production.requests.per-sec&max=400&range=60&aggregate=max"

The following query responds with 200 if the summed count of the api.prod.addons.plan-changes.errors metric over the last five minutes is no more than 10, or if there are no values at all:

$ curl -i "$UMPIRE_URL/check?metric=api.prod.addons.plan-changes.errors:count&aggregate=sum&max=10&range=300&backend=librato&empty_ok=true"
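As an illustration of the four aggregation methods listed above, the following Ruby sketch applies each of them to a list of values. The constant and function names are made up for this example and are not taken from Umpire's source.

```ruby
# Sketch of the aggregation methods avg, sum, max and min.
AGGREGATORS = {
  "avg" => ->(v) { v.sum.to_f / v.size },
  "sum" => ->(v) { v.sum },
  "max" => ->(v) { v.max },
  "min" => ->(v) { v.min },
}.freeze

# Falls back to averaging, the documented default, when no method is given.
def aggregate(values, method = "avg")
  AGGREGATORS.fetch(method) { raise ArgumentError, "unknown aggregate: #{method}" }.call(values)
end

values = [120, 380, 250]
puts aggregate(values)
puts aggregate(values, "max")
```

With `aggregate=max` a single spike inside the range decides the outcome, whereas the default average smooths it out.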

Local Deploy

$ rvm use 1.9.2
$ bundle install
$ export DEPLOY=dev
$ export APP=umpire-$DEPLOY
$ export FORCE_HTTPS=false
$ export API_KEY=secret
$ export GRAPHITE_URL=https://graphite.yourdomain.com
$ foreman start
$ export UMPIRE_URL=http://umpire:$API_KEY@localhost:5000
$ curl -i "$UMPIRE_URL/check?metric=pulse.nginx-requests-per-second&max=400&range=300"

Local Docker

$ docker-compose build
$ docker-compose run --rm web bash
docker$ bundle install --path .
docker$ bundle exec rake

Platform Deploy

$ export DEPLOY=production/staging/you
$ export APP=umpire-$DEPLOY
$ export API_KEY=$(openssl rand -hex 16)
$ heroku create -s cedar -r $DEPLOY umpire-$DEPLOY
$ heroku config:add -r $DEPLOY DEPLOY=$DEPLOY
$ heroku config:add -r $DEPLOY FORCE_HTTPS=true
$ heroku config:add -r $DEPLOY API_KEY=$API_KEY
$ heroku config:add -r $DEPLOY GRAPHITE_URL=https://graphite.yourdomain.com
$ git push $DEPLOY master
$ heroku scale -r $DEPLOY web=3
$ export UMPIRE_URL=https://umpire:$API_KEY@umpire-$DEPLOY.herokuapp.com
$ curl -i "$UMPIRE_URL/check?metric=pulse.nginx-requests-per-second&max=400&range=300"

Testing

$ bundle install
$ bundle exec rake

Health

Check the health of the Umpire process itself with:

$ curl -i "$UMPIRE_URL/health"

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

umpire's Issues

Warning state

In nagios we currently have many checks with warning and critical thresholds. Currently umpire only provides a way to check for ok vs not-ok states. Does it make sense to provide a way to define warning thresholds and return a distinct http response code in that case?
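A three-state check in the Nagios style could look like the Ruby sketch below. The parameter names `warn` and `crit` are purely illustrative; Umpire defines no such parameters, and how the states would map to HTTP response codes is left open, as in the question above.

```ruby
# Hypothetical three-state classification for an upper-bound check.
def classify(value, warn:, crit:)
  return :critical if value >= crit
  return :warning if value >= warn
  :ok
end

puts classify(85, warn: 80, crit: 95)
```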

Configure API keys as salted hashes

API keys are essentially passwords. Current configuration method means keys are stored in plain-text.

Umpire could support a hashed API key configuration method, eg: HASHED_API_KEY(optional suffix)=(base64 encoded hash):(base64 encoded salt). The authorized? check could apply each salt to the basic auth credential, then apply the hash function and compare to the API key hash.

By using the HASHED_ prefix on the configuration entries, users can choose which mechanism they prefer and/or have a mix as they migrate.

If there is interest I may try to submit a PR to this effect.
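A rough Ruby sketch of the proposal above follows. It swaps the unspecified "hash function" for PBKDF2-HMAC-SHA256 from the standard `openssl` library; the iteration count, key length, and all function names are assumptions of this example rather than an agreed design.

```ruby
require "openssl"

ITERATIONS = 100_000 # assumed work factor, not taken from the issue

# Derives a 32-byte key from the credential and salt with PBKDF2-HMAC-SHA256.
def derive(credential, salt)
  OpenSSL::PKCS5.pbkdf2_hmac(credential, salt, ITERATIONS, 32, OpenSSL::Digest.new("SHA256"))
end

# Produces the "<base64 hash>:<base64 salt>" value proposed above.
def hashed_api_key(credential, salt = OpenSSL::Random.random_bytes(16))
  [[derive(credential, salt)].pack("m0"), [salt].pack("m0")].join(":")
end

# Constant-time comparison so timing does not reveal how many bytes matched.
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize
  a.bytes.zip(b.bytes).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end

# Re-derives the hash from the presented credential and the stored salt.
def authorized?(credential, config_value)
  hash64, salt64 = config_value.split(":", 2)
  secure_equal?(derive(credential, salt64.unpack1("m0")), hash64.unpack1("m0"))
end

config_value = hashed_api_key("secret")
puts authorized?("secret", config_value)
puts authorized?("wrong", config_value)
```

With several HASHED_API_KEY entries configured, the check would run `authorized?` against each entry and accept the request if any of them matches.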

Umpire doesn't respect graphite functions when using graphite style "{multiple,options}"

When using graphite functions like sumSeries() with multiple options specified as "{option1,option2,etc}", Umpire does not interpret them; instead it makes multiple calls to graphite (one for each option) and gives multiple responses.

For example:

This call returns only one response (OK):
/check?metric=sumSeries(server.cpu.*)&max=80&range=60
-> {"value":98.883335}
This call returns 3 responses (KO):
/check?metric=averageSeries(server.cpu.{user,system,wait})&max=80&range=60

-> [1/3]: http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.user)&min=40&range=120&backend=graphite -->
--curl--http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.user)&min=40&range=120&backend=graphite
{"value":11.475000999999999}

-> [2/3]: http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.system)&min=40&range=120&backend=graphite -->
--curl--http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.system)&min=40&range=120&backend=graphite
{"value":8.016667}

-> [3/3]: http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.wait)&min=40&range=120&backend=graphite -->
--curl--http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.wait)&min=40&range=120&backend=graphite
{"value":21.458335499999997}

Better support incomplete data in latest Librato bucket

For some classes of data and metrics, the latest bucket information that the Librato API returns can range anywhere from partially incomplete to very close to empty.

If the magnitude of the normal value is large, this greatly affects the calculations and can trigger false failures.

Perhaps umpire needs an option to skip the latest bucket for data and calculations?
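Such an option could be as simple as dropping the newest bucket before aggregating. The Ruby sketch below assumes points are hashes with "measure_time" and "value" keys, as in the Librato API output; the `skip_latest` option name is hypothetical.

```ruby
# Hypothetical option: ignore the newest (possibly incomplete) bucket.
def complete_buckets(points, skip_latest: false)
  sorted = points.sort_by { |point| point["measure_time"] }
  skip_latest ? sorted[0...-1] : sorted
end

points = [
  { "measure_time" => 1456956060, "value" => 120 },
  { "measure_time" => 1456956120, "value" => 3 }, # still filling up
]
puts complete_buckets(points, skip_latest: true).map { |point| point["value"] }.inspect
```

Note that with a single bucket in range this leaves no values at all, so the option would interact with empty_ok.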

Consider not using HTTP return codes to indicate check status

@dougoku pointed out in a recent standup that responding with non-200 technically indicates a problem with the umpire service itself. Perhaps we should consider something like a status field in the body that could return either OK, WARNING, or CRITICAL, but always with a HTTP code of 200.

Ref #4.

Somewhat related: how are we monitoring umpire? I.e., have we simulated an umpire service failure to see what effect that has on its clients? How would we quickly and definitively determine that there is an issue with umpire itself vs a large section of the platform?
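For discussion, a response that always uses HTTP 200 and carries the check result in the body might be shaped like this Ruby sketch. It returns a Rack-style `[status, headers, body]` triple; the field names "status" and "value" are assumptions.

```ruby
require "json"

# Sketch of the proposal: always answer 200 and report the state in the body.
def status_response(state, value)
  body = JSON.generate("status" => state.to_s.upcase, "value" => value)
  [200, { "Content-Type" => "application/json" }, [body]]
end

status, _headers, body = status_response(:warning, 87.5)
puts status
puts body.first
```

Under this scheme any non-200 response would unambiguously point at Umpire itself, at the cost of monitoring tools having to parse the body.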

composite metric queries should throw a 200 with missing metrics

If you're trying to query a composite metric and one of the metrics does not have a value, it throws a 404, even with empty_ok=true.

at=internal_error class=TypeError message="can't convert String into Integer"

Smaller ranges (<300) don't consistently return results

This seems to be an issue with how umpire is querying librato.

For example querying librato directly produces data (with count in the query):

$ curl -n 'https://metrics-api.librato.com/v1/metrics/midgard.requests.event.5xx?count=1&resolution=1&source=com.heroku.midgard.*' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   520  100   520    0     0   2453      0 --:--:-- --:--:-- --:--:--  2488
{
  "measurements": {
    "com.heroku.midgard.770036": [
      {
        "measure_time": 1456956060,
        "value": 1,
        "count": 120,
        "min": 1,
        "max": 1,
        "sum": 120,
        "sum_squares": 1
      }
    ]
  },
  "resolution": 1,
  "source_display_names": {},
  "name": "midgard.requests.event.5xx",
  "display_name": null,
  "type": "gauge",
  "attributes": {
    "l2met_type": "counter",
    "display_min": 0,
    "created_by_ua": "l2met/304fb673f7ad0e048ae519b030b1dbbf52bdbc36",
    "summarize_function": "sum",
    "aggregate": false,
    "display_stacked": false,
    "gap_detection": false
  },
  "description": null,
  "period": 1,
  "source_lag": null
}

However umpire does not see anything:

$ curl -in 'https://umpire.herokai.com/check?backend=librato&empty_ok=true&metric=midgard.requests.event.5xx:sum:sum&range=60&max=0&source=com.heroku.midgard.*'
HTTP/1.1 200 OK
Server: Cowboy
Date: Wed, 02 Mar 2016 22:07:49 GMT
Connection: keep-alive
Strict-Transport-Security: max-age=31536000
Content-Type: application/json;charset=utf-8
X-Content-Type-Options: nosniff
Content-Length: 94
Via: 1.1 vegur

{"error":"no values for metric in range","request_id":"718fda3d-ca23-4d38-be79-b974e9ab746a"}

log librato results

@joshuatobin @reidmix Would it be worth logging the results that umpire fetches from librato? Like we're doing in Laika.

more graceful handling of unreachable metrics service

Maybe return a nice 503 error message.
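One possible shape for this, sketched in Ruby: wrap the backend call and translate connectivity errors into a 503 with a readable message. The list of exception classes, the function name, and the error body are assumptions for illustration.

```ruby
require "net/http"
require "json"

# Errors that typically mean the metrics backend could not be reached.
UNREACHABLE = [SocketError, Errno::ECONNREFUSED, Errno::EHOSTUNREACH,
               Net::OpenTimeout, Net::ReadTimeout].freeze

# Runs the block that fetches metrics; maps connectivity failures to a 503.
def with_backend_guard
  [200, yield]
rescue *UNREACHABLE => e
  [503, JSON.generate("error" => "metrics service unreachable", "detail" => e.class.name)]
end

status, body = with_backend_guard { raise Errno::ECONNREFUSED }
puts status
puts body
```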

Librato empty results exception

During a partial logging outage, we've been seeing umpire return 500s. I tracked this down to a case where the metric exists in Librato but does not have any data in the range. It appears to return an empty hash for a result set, which causes an exception.

>  results = Umpire::LibratoMetrics.client.fetch(metric, :start_time => Time.now.to_i - 180, :summarize_sources => true)
=> {}
> results["all"].map
NoMethodError: undefined method `map' for nil:NilClass
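A defensive version of that lookup could fall back to an empty list when the "all" key is missing, as in this Ruby sketch. The function name and the default field "value" are assumptions; the empty list would then flow into the normal "no values in range" handling.

```ruby
# Sketch of a guard: an empty result hash has no "all" key, so use an empty
# list instead of calling map on nil.
def summarized_values(results, field = "value")
  results.fetch("all", []).map { |point| point[field] }
end

puts summarized_values({}).inspect
puts summarized_values({ "all" => [{ "value" => 1 }, { "value" => 2 }] }).inspect
```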

Log line missing status...

When empty_ok is set and the result is empty, we're not explicitly setting status 200, as sinatra sets that by default. However, because we're not explicitly setting it, it's not being logged with the "at=finish" log line, which is messing up splunk queries. We should add status 200 here: https://github.com/heroku/umpire/blob/master/lib/umpire/web.rb#L157-L159

Compose functions and empty values have problems

2013-05-09T22:02:56.581822+00:00 app[web.5]: TypeError - nil can't be coerced into Float:
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `+'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `block (2 levels) in initialize'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `each'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `inject'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `block in initialize'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:73:in `map'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:73:in `initialize'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:41:in `new'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:41:in `compose_values_for_range'
2013-05-09T22:02:56.582131+00:00 app[web.5]: /app/lib/umpire/web.rb:54:in `fetch_points'
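The trace shows `+` being applied to nil inside an inject. One way to avoid that when composing series is sketched below in Ruby: sum point by point and skip positions where any series has no value. Treating such positions as missing rather than as 0 is an assumption of this sketch, and `compose_sum` is a hypothetical name.

```ruby
# Sketch: sum several series position by position, dropping positions where
# at least one series has no value.
def compose_sum(series)
  return [] if series.empty?

  # zip pads shorter series with nil, which the nil check below also covers.
  series.first.zip(*series.drop(1)).map do |column|
    column.any?(&:nil?) ? nil : column.sum
  end.compact
end

puts compose_sum([[1.0, 2.0, nil], [4.0, 5.0, 6.0]]).inspect
```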

Return links to metrics involved for errors

Umpire Playbook needed.

Umpire needs a playbook.

Here's some items that should go in there:

Splunking

# basic
index=main app="umpire-api-va-prod" 

# sans checks
index=main app="umpire-api-va-prod" action!="check"

# sans health && checks
index=main app="umpire-api-va-prod" action!="check" action!="health"

Logs

heroku logs -t -a umpire-api-va-prod

Better documentation

It would really be nice to have a bunch of examples that shows what can be done in a librato instrument, then have the corresponding watchtower/umpire setup to replicate the same behavior.

For example, today we had an instrument that did a sum of sums on a metric. This turned out to require putting sum:sum after the metric name (not using composites or aggregates) to get the correct behavior.

I imagine a case where we have pictures on one side, and how to set it up in the watchtower form on the other.

Timeouts in the logs...

What's this all about?

2018-01-17T20:38:24.027961+00:00 app[web.2]: source=rack-timeout id=86037e90-28cc-9158-9885-0eb06b67f2d0 wait=0ms timeout=29000ms service=55ms state=completed at=info
2018-01-17T20:38:24.056531+00:00 app[web.2]: source=rack-timeout id=b7708bc7-f36c-92de-a87b-ff2d780f4077 wait=0ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.071656+00:00 app[web.2]: source=rack-timeout id=93dbe4e6-95a2-5d2c-4625-86f0e988fc5c wait=11ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.076675+00:00 app[web.2]: source=rack-timeout id=18496d93-714b-a343-208d-defaaf375846 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.077580+00:00 app[web.2]: source=rack-timeout id=f3623fc2-66d4-cef3-32bf-aa6a37d5637c wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.100988+00:00 app[web.2]: source=rack-timeout id=a723bcc2-c84d-1cbd-b3d0-2050fdfae93d wait=2ms timeout=29000ms service=133ms state=completed at=info
2018-01-17T20:38:24.105988+00:00 app[web.2]: source=rack-timeout id=84ad24b7-2330-1af6-8ad3-476ffb7308bc wait=1ms timeout=29000ms service=132ms state=completed at=info
2018-01-17T20:38:24.109326+00:00 app[web.2]: source=rack-timeout id=dd5d92d8-07f7-6a50-75bb-2226b0382cd3 wait=18ms timeout=29000ms service=189ms state=completed at=info
2018-01-17T20:38:24.120960+00:00 app[web.2]: source=rack-timeout id=f3623fc2-66d4-cef3-32bf-aa6a37d5637c wait=1ms timeout=29000ms service=43ms state=completed at=info
2018-01-17T20:38:24.029738+00:00 app[web.1]: source=rack-timeout id=a446f98e-fe68-c58c-8803-6ffb1fc0e078 wait=0ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.068162+00:00 app[web.1]: source=rack-timeout id=5e220dc1-1a10-5088-c9f3-dea2b3c404ad wait=1ms timeout=29000ms service=155ms state=completed at=info
2018-01-17T20:38:24.073882+00:00 app[web.1]: source=rack-timeout id=bafe112e-10ea-d121-6d06-c69c8e219112 wait=23ms timeout=29000ms service=255ms state=completed at=info
2018-01-17T20:38:24.079489+00:00 app[web.1]: source=rack-timeout id=1cd46f5f-3fad-1c0a-8b37-c57f0f748308 wait=38ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.084190+00:00 app[web.1]: source=rack-timeout id=cacad4d9-1863-56e1-96db-25706a2b44cf wait=2ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.087808+00:00 app[web.1]: source=rack-timeout id=095d5a05-856d-3404-f345-32ce76c073ef wait=2ms timeout=29000ms service=88ms state=completed at=info
2018-01-17T20:38:24.124230+00:00 app[web.2]: source=rack-timeout id=b7708bc7-f36c-92de-a87b-ff2d780f4077 wait=0ms timeout=29000ms service=72ms state=completed at=info
2018-01-17T20:38:24.125703+00:00 app[web.2]: source=rack-timeout id=18496d93-714b-a343-208d-defaaf375846 wait=1ms timeout=29000ms service=49ms state=completed at=info
2018-01-17T20:38:24.151127+00:00 app[web.2]: source=rack-timeout id=42301bbf-909e-5917-2d4f-f57f3cfe5c27 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.153035+00:00 app[web.2]: source=rack-timeout id=7c414525-4f84-a4f8-aa62-041243abde10 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.190697+00:00 app[web.2]: source=rack-timeout id=93dbe4e6-95a2-5d2c-4625-86f0e988fc5c wait=11ms timeout=29000ms service=119ms state=completed at=info
2018-01-17T20:38:24.201361+00:00 app[web.2]: source=rack-timeout id=42301bbf-909e-5917-2d4f-f57f3cfe5c27 wait=1ms timeout=29000ms service=50ms state=completed at=info
2018-01-17T20:38:24.087809+00:00 app[web.1]: source=rack-timeout id=f053dd5e-e9c5-b605-bdc6-1384364250cb wait=3ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.113436+00:00 app[web.1]: source=rack-timeout id=0597f1ed-082f-3674-62ee-5cc041994279 wait=2ms timeout=29000ms service=91ms state=completed at=info
2018-01-17T20:38:24.113666+00:00 app[web.1]: source=rack-timeout id=973ef69d-91e8-67c4-ed47-faea7a040e4f wait=20ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.148027+00:00 app[web.1]: source=rack-timeout id=a446f98e-fe68-c58c-8803-6ffb1fc0e078 wait=0ms timeout=29000ms service=120ms state=completed at=info
2018-01-17T20:38:24.215926+00:00 app[web.2]: source=rack-timeout id=7c414525-4f84-a4f8-aa62-041243abde10 wait=1ms timeout=29000ms service=63ms state=completed at=info
2018-01-17T20:38:24.226692+00:00 app[web.2]: source=rack-timeout id=8defa6f8-c842-0211-fe0f-fa824ae4631b wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.268269+00:00 app[web.2]: source=rack-timeout id=8defa6f8-c842-0211-fe0f-fa824ae4631b wait=1ms timeout=29000ms service=42ms state=completed at=info
2018-01-17T20:38:24.268919+00:00 app[web.2]: source=rack-timeout id=e162a7df-cf03-00c8-a276-edf63616b0be wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.152022+00:00 app[web.1]: source=rack-timeout id=1cd46f5f-3fad-1c0a-8b37-c57f0f748308 wait=38ms timeout=29000ms service=73ms state=completed at=info
2018-01-17T20:38:24.153085+00:00 app[web.1]: source=rack-timeout id=f72a51dc-8a14-504c-3b4a-ab77c48ff88e wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.153408+00:00 app[web.1]: source=rack-timeout id=78429aab-29e7-0cb1-9b2b-d997b0116280 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.162681+00:00 app[web.1]: source=rack-timeout id=f053dd5e-e9c5-b605-bdc6-1384364250cb wait=3ms timeout=29000ms service=75ms state=completed at=info
2018-01-17T20:38:24.187177+00:00 app[web.1]: source=rack-timeout id=cacad4d9-1863-56e1-96db-25706a2b44cf wait=2ms timeout=29000ms service=103ms state=completed at=info
2018-01-17T20:38:24.198225+00:00 app[web.1]: source=rack-timeout id=f72a51dc-8a14-504c-3b4a-ab77c48ff88e wait=1ms timeout=29000ms service=45ms state=completed at=info
2018-01-17T20:38:24.216474+00:00 app[web.1]: source=rack-timeout id=78429aab-29e7-0cb1-9b2b-d997b0116280 wait=1ms timeout=29000ms service=63ms state=completed at=info
2018-01-17T20:38:24.219008+00:00 app[web.1]: source=rack-timeout id=973ef69d-91e8-67c4-ed47-faea7a040e4f wait=20ms timeout=29000ms service=105ms state=completed at=info
2018-01-17T20:38:24.267240+00:00 app[web.1]: source=rack-timeout id=1649c12c-fd07-0718-1059-d70441409a6d wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.098950+00:00 app[web.3]: source=rack-timeout id=c1050bdf-f7cc-8607-6f5a-e3ecde65b4a4 wait=11ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.098952+00:00 app[web.3]: source=rack-timeout id=0d70a5cb-dfd2-15f7-1db6-34342b607f6a wait=1ms timeout=29000ms service=324ms state=completed at=info
2018-01-17T20:38:24.114681+00:00 app[web.3]: source=rack-timeout id=344031ad-c8b0-6947-deae-c92baf5843e8 wait=8ms timeout=29000ms service=131ms state=completed at=info
2018-01-17T20:38:24.115883+00:00 app[web.3]: source=rack-timeout id=861ddbb9-c349-0a7a-1596-8a16562ae5f5 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.119048+00:00 app[web.3]: source=rack-timeout id=00a19954-f916-9d4d-5886-66e886a64e69 wait=8ms timeout=29000ms service=136ms state=completed at=info
2018-01-17T20:38:24.158230+00:00 app[web.3]: source=rack-timeout id=cb0648a1-9d74-112d-81d9-4180e584f8ce wait=9ms timeout=29000ms service=118ms state=completed at=info
2018-01-17T20:38:24.171886+00:00 app[web.3]: source=rack-timeout id=78b37363-61fe-06b1-eb13-978fb4b5b42c wait=1ms timeout=29000ms service=150ms state=completed at=info
2018-01-17T20:38:24.172968+00:00 app[web.3]: source=rack-timeout id=7ef1327d-7c15-3aab-a336-30e8cb0f8a64 wait=4ms timeout=29000ms service=206ms state=completed at=info
2018-01-17T20:38:24.210389+00:00 app[web.3]: source=rack-timeout id=c0a3bff6-4610-b529-889b-184267c58998 wait=4ms timeout=29000ms service=267ms state=completed at=info
2018-01-17T20:38:24.216546+00:00 app[web.3]: source=rack-timeout id=c1050bdf-f7cc-8607-6f5a-e3ecde65b4a4 wait=11ms timeout=29000ms service=119ms state=completed at=info
2018-01-17T20:38:24.231652+00:00 app[web.3]: source=rack-timeout id=161d317c-d6c9-1387-745b-806ca420e57c wait=7ms timeout=29000ms service=186ms state=completed at=info
2018-01-17T20:38:24.245630+00:00 app[web.3]: source=rack-timeout id=4ab12628-fa44-b556-aa97-e07dd701fdad wait=2ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.271905+00:00 app[web.3]: source=rack-timeout id=861ddbb9-c349-0a7a-1596-8a16562ae5f5 wait=1ms timeout=29000ms service=156ms state=completed at=info
2018-01-17T20:38:24.312727+00:00 app[web.3]: source=rack-timeout id=664ae8bc-3156-2044-8bdd-244353e919f8 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.319758+00:00 app[web.3]: source=rack-timeout id=838e7d16-b9cb-83bb-4e0d-ad0cc31f110f wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.322598+00:00 app[web.3]: source=rack-timeout id=2425b640-7310-8d2e-9b75-35e32455c0bb wait=0ms timeout=29000ms state=ready at=info

Set User Agent to umpire / version and stuff

Implement StandardRB

Implement https://github.com/testdouble/standard for uniform formatting, catching styling issues, and catching errors early.

Undeclared dev dependency

I tried to run the tests but got an error about rack-timeout missing.

After running gem install rack-timeout the tests ran successfully.

revisit bamboo / cedar latency checks

We've been getting a lot of noise on the Bamboo and Cedar latency checks in Pingdom, including occasional flakes and a series of flakes that led to a pair of pages to @ricardochimal in the IC rotation.

To address this noise, we've recently increased the metric range from 60 to 300 seconds and also deployed a change that will give us better visibility in Pingdom into what caused 404s (probably the metric range issue above, but not positive).

@mmcgrana to revisit in a few days to see what these changes show us.

query ranges < 300

I know we've run into issues a few times where librato seemingly has the data but querying for a range < 300 returns 'no data'.

Example: https://github.com/heroku/watchtower-tng/issues/318#issuecomment-198045016

add median to the aggregation list.

Is this possible? librato supports it and it would be nice for generating fewer spurious alerts on brief large outlier values.
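For reference, a median over the fetched values is only a few lines of Ruby. This sketch computes it client-side from a list of values; whether Umpire would do this itself or request it from Librato is left open, and the function name is illustrative.

```ruby
# Sketch of a median aggregation: the middle value of the sorted list, or the
# mean of the two middle values when the list has an even length.
def median(values)
  raise ArgumentError, "no values" if values.empty?

  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

puts median([3, 2, 950, 4, 1])
```

A single large outlier such as 950 moves the average a lot but leaves the median unchanged, which is the effect the request is after.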