
umpire's Issues

Smaller ranges (<300) don't consistently return results

This seems to be an issue with how umpire is querying librato.

For example, querying Librato directly produces data (with count in the query):

$ curl -n 'https://metrics-api.librato.com/v1/metrics/midgard.requests.event.5xx?count=1&resolution=1&source=com.heroku.midgard.*' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   520  100   520    0     0   2453      0 --:--:-- --:--:-- --:--:--  2488
{
  "measurements": {
    "com.heroku.midgard.770036": [
      {
        "measure_time": 1456956060,
        "value": 1,
        "count": 120,
        "min": 1,
        "max": 1,
        "sum": 120,
        "sum_squares": 1
      }
    ]
  },
  "resolution": 1,
  "source_display_names": {},
  "name": "midgard.requests.event.5xx",
  "display_name": null,
  "type": "gauge",
  "attributes": {
    "l2met_type": "counter",
    "display_min": 0,
    "created_by_ua": "l2met/304fb673f7ad0e048ae519b030b1dbbf52bdbc36",
    "summarize_function": "sum",
    "aggregate": false,
    "display_stacked": false,
    "gap_detection": false
  },
  "description": null,
  "period": 1,
  "source_lag": null
}

However, umpire does not see anything for the same metric and source:

$ curl -in 'https://umpire.herokai.com/check?backend=librato&empty_ok=true&metric=midgard.requests.event.5xx:sum:sum&range=60&max=0&source=com.heroku.midgard.*'
HTTP/1.1 200 OK
Server: Cowboy
Date: Wed, 02 Mar 2016 22:07:49 GMT
Connection: keep-alive
Strict-Transport-Security: max-age=31536000
Content-Type: application/json;charset=utf-8
X-Content-Type-Options: nosniff
Content-Length: 94
Via: 1.1 vegur

{"error":"no values for metric in range","request_id":"718fda3d-ca23-4d38-be79-b974e9ab746a"}

Warning state

In Nagios we currently have many checks with warning and critical thresholds. Umpire only provides a way to distinguish OK from not-OK states. Does it make sense to provide a way to define warning thresholds and return a distinct HTTP response code in that case?

Consider *not* using HTTP return codes to indicate check status

@dougoku pointed out in a recent standup that responding with a non-200 code technically indicates a problem with the umpire service itself. Perhaps we should consider something like a status field in the body that could return either OK, WARNING, or CRITICAL, but always with an HTTP code of 200.
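
A minimal sketch of what such a body could look like, assuming upper-bound thresholds only. The names check_status, check_body, warn_max and crit_max are hypothetical and not part of umpire's current API:

```ruby
require "json"

# Hypothetical helper: classify a value against optional warning and
# critical maximums. Neither threshold exists in umpire today.
def check_status(value, warn_max: nil, crit_max: nil)
  return "CRITICAL" if crit_max && value > crit_max
  return "WARNING" if warn_max && value > warn_max
  "OK"
end

# Hypothetical response body: the check result travels in the "status"
# field, so the HTTP code can stay 200 whenever umpire itself is healthy.
def check_body(value, **thresholds)
  JSON.generate("value" => value, "status" => check_status(value, **thresholds))
end

puts check_body(12.5, warn_max: 10, crit_max: 20)
```

With this shape, a non-200 response would be reserved for failures of umpire itself, which also helps with the monitoring question below.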

Ref #4.

Somewhat related: how are we monitoring umpire? I.e., have we simulated an umpire service failure to see what effect that has on its clients? How would we quickly and definitively determine that there is an issue with umpire itself vs. a large section of the platform?

Configure API keys as salted hashes

API keys are essentially passwords. The current configuration method stores keys in plain text.

Umpire could support a hashed API key configuration method, e.g. HASHED_API_KEY(optional suffix)=(base64 encoded hash):(base64 encoded salt). The authorized? check could combine each salt with the basic auth credential, apply the hash function, and compare the result to the stored API key hash.

By using the HASHED_ prefix on the configuration entries, users can choose which mechanism they prefer and/or have a mix as they migrate.
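
A rough sketch of the proposal, not umpire's actual code. It assumes SHA-256 as the hash function and the salt prepended to the credential; both choices, and the helper names hashed_api_keys and secure_compare, are illustrative assumptions:

```ruby
require "base64"
require "digest"

# Hypothetical parser for entries named HASHED_API_KEY<suffix> whose value
# is "<base64 hash>:<base64 salt>".
def hashed_api_keys(env)
  env.select { |name, _| name.start_with?("HASHED_API_KEY") }.values.map do |entry|
    hash_b64, salt_b64 = entry.split(":", 2)
    { hash: Base64.strict_decode64(hash_b64), salt: Base64.strict_decode64(salt_b64) }
  end
end

# Compare two byte strings without returning early on the first mismatch.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize
  a.bytes.zip(b.bytes).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end

# Apply each configured salt to the presented credential, hash it, and
# compare against the stored hash.
def authorized?(credential, keys)
  keys.any? do |key|
    candidate = Digest::SHA256.digest(key[:salt] + credential.b)
    secure_compare(candidate, key[:hash])
  end
end

# Usage: build one entry the way an operator would, then verify against it.
salt = "example-salt"
entry = [Digest::SHA256.digest(salt + "secret-key"), salt]
        .map { |part| Base64.strict_encode64(part) }.join(":")
keys = hashed_api_keys("HASHED_API_KEY_DEPLOY" => entry, "DEPLOY" => "production")
puts authorized?("secret-key", keys)
puts authorized?("wrong-key", keys)
```

A slower key-derivation function may be preferable to a single SHA-256 pass if the keys are short or human-chosen.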

If there is interest I may try to submit a PR to this effect.

Umpire Playbook needed.

Umpire needs a playbook.

Here's some items that should go in there:

Links

Splunking

# basic
index=main app="umpire-api-va-prod" 

# sans checks
index=main app="umpire-api-va-prod" action!="check"

# sans health && checks
index=main app="umpire-api-va-prod" action!="health" action!="check"

Logs

heroku logs -t -a umpire-api-va-prod

revisit bamboo / cedar latency checks

We've been getting a lot of noise on the Bamboo and Cedar latency checks in Pingdom, including occasional flakes and a series of flakes that led to a pair of pages to @ricardochimal in the IC rotation.

To address this noise, we've recently increased the metric range from 60 to 300 seconds and also deployed a change that will give us better visibility in Pingdom into what caused 404s (probably the metric range issue above, but not positive).

@mmcgrana to revisit in a few days to see what these changes show us.

Librato empty results exception

During a partial logging outage, we've been seeing umpire return 500s. I tracked this down to a case where the metric exists in Librato but does not have any data in the range. It appears to return an empty hash for a result set, which causes an exception.

>  results = Umpire::LibratoMetrics.client.fetch(metric, :start_time => Time.now.to_i - 180, :summarize_sources => true)
=> {}
> results["all"].map
NoMethodError: undefined method `map' for nil:NilClass
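
One possible guard, sketched under the assumption that each point carries a "value" field as in the Librato responses. NoValuesError and summarized_values are hypothetical names:

```ruby
# Hypothetical error for "metric exists but has no data in the range", so
# the web layer can report it instead of failing with a NoMethodError.
class NoValuesError < StandardError; end

# Treat a missing "all" key (the empty-hash case above) the same way as an
# empty series.
def summarized_values(results, key = "all")
  points = results[key] || []
  raise NoValuesError, "no values for metric in range" if points.empty?
  points.map { |point| point["value"] }
end

begin
  summarized_values({})
rescue NoValuesError => e
  puts e.message
end
```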

Compose functions and empty values have problems

2013-05-09T22:02:56.581822+00:00 app[web.5]: TypeError - nil can't be coerced into Float:
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `+'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `block (2 levels) in initialize'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `each'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `inject'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `block in initialize'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:73:in `map'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:73:in `initialize'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:41:in `new'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:41:in `compose_values_for_range'
2013-05-09T22:02:56.582131+00:00 app[web.5]: /app/lib/umpire/web.rb:54:in `fetch_points'
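
The trace suggests a nil value reaching `+` inside an inject. The sketch below shows one nil-tolerant way to sum several series index by index. It is not the code in librato_metrics.rb; compose_sum and the array-of-arrays input shape are assumptions:

```ruby
# Hypothetical composition: sum the values found at each index across
# several series, ignoring nil (missing) values. An index where every
# series is missing stays nil rather than becoming 0.
def compose_sum(series)
  length = series.map(&:length).max || 0
  (0...length).map do |i|
    present = series.map { |values| values[i] }.compact
    present.empty? ? nil : present.sum
  end
end

p compose_sum([[1.0, nil, 3.0], [2.0, 4.0]])
```

Whether a missing value should count as zero, be skipped, or invalidate the whole bucket depends on the metric, so it may need to be a per-check option.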

Better support incomplete data in latest Librato bucket

For some classes of data and metrics, the latest bucket information that the Librato API returns can range anywhere from partially incomplete to very close to empty.

If the magnitude of the normal value is large, this greatly affects the calculations and can trigger false failures.

Perhaps umpire needs an option to skip the latest bucket for data and calculations?
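
A minimal sketch of such an option, assuming points are hashes with a "measure_time" field as in the Librato API responses. The drop_latest_bucket helper and its skip_latest flag are hypothetical:

```ruby
# Hypothetical option: discard the most recent bucket before aggregating,
# since it may still be filling up. A lone point is kept so the check
# still has something to evaluate.
def drop_latest_bucket(points, skip_latest: true)
  sorted = points.sort_by { |point| point["measure_time"] }
  skip_latest && sorted.length > 1 ? sorted[0...-1] : sorted
end

points = [
  { "measure_time" => 1456956120, "value" => 3 },   # newest, possibly partial
  { "measure_time" => 1456956000, "value" => 118 },
  { "measure_time" => 1456956060, "value" => 120 }
]
p drop_latest_bucket(points).map { |point| point["value"] }
```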

Better documentation

It would be really nice to have a set of examples that show what can be done in a Librato instrument, each with the corresponding watchtower/umpire setup that replicates the same behavior.

For example, today we had an instrument that did a sum of sums on a metric. This turned out to require putting sum:sum after the metric name (rather than using composites or aggregates) to get the correct behavior.

I imagine a layout where we have pictures on one side, and how to set it up in the watchtower form on the other.

Undeclared dev dependency

I tried to run the tests but got an error about rack-timeout missing.

After running gem install rack-timeout the tests ran successfully.

Timeouts in the logs...

What's this all about?

2018-01-17T20:38:24.027961+00:00 app[web.2]: source=rack-timeout id=86037e90-28cc-9158-9885-0eb06b67f2d0 wait=0ms timeout=29000ms service=55ms state=completed at=info
2018-01-17T20:38:24.056531+00:00 app[web.2]: source=rack-timeout id=b7708bc7-f36c-92de-a87b-ff2d780f4077 wait=0ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.071656+00:00 app[web.2]: source=rack-timeout id=93dbe4e6-95a2-5d2c-4625-86f0e988fc5c wait=11ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.076675+00:00 app[web.2]: source=rack-timeout id=18496d93-714b-a343-208d-defaaf375846 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.077580+00:00 app[web.2]: source=rack-timeout id=f3623fc2-66d4-cef3-32bf-aa6a37d5637c wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.100988+00:00 app[web.2]: source=rack-timeout id=a723bcc2-c84d-1cbd-b3d0-2050fdfae93d wait=2ms timeout=29000ms service=133ms state=completed at=info
2018-01-17T20:38:24.105988+00:00 app[web.2]: source=rack-timeout id=84ad24b7-2330-1af6-8ad3-476ffb7308bc wait=1ms timeout=29000ms service=132ms state=completed at=info
2018-01-17T20:38:24.109326+00:00 app[web.2]: source=rack-timeout id=dd5d92d8-07f7-6a50-75bb-2226b0382cd3 wait=18ms timeout=29000ms service=189ms state=completed at=info
2018-01-17T20:38:24.120960+00:00 app[web.2]: source=rack-timeout id=f3623fc2-66d4-cef3-32bf-aa6a37d5637c wait=1ms timeout=29000ms service=43ms state=completed at=info
2018-01-17T20:38:24.029738+00:00 app[web.1]: source=rack-timeout id=a446f98e-fe68-c58c-8803-6ffb1fc0e078 wait=0ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.068162+00:00 app[web.1]: source=rack-timeout id=5e220dc1-1a10-5088-c9f3-dea2b3c404ad wait=1ms timeout=29000ms service=155ms state=completed at=info
2018-01-17T20:38:24.073882+00:00 app[web.1]: source=rack-timeout id=bafe112e-10ea-d121-6d06-c69c8e219112 wait=23ms timeout=29000ms service=255ms state=completed at=info
2018-01-17T20:38:24.079489+00:00 app[web.1]: source=rack-timeout id=1cd46f5f-3fad-1c0a-8b37-c57f0f748308 wait=38ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.084190+00:00 app[web.1]: source=rack-timeout id=cacad4d9-1863-56e1-96db-25706a2b44cf wait=2ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.087808+00:00 app[web.1]: source=rack-timeout id=095d5a05-856d-3404-f345-32ce76c073ef wait=2ms timeout=29000ms service=88ms state=completed at=info
2018-01-17T20:38:24.124230+00:00 app[web.2]: source=rack-timeout id=b7708bc7-f36c-92de-a87b-ff2d780f4077 wait=0ms timeout=29000ms service=72ms state=completed at=info
2018-01-17T20:38:24.125703+00:00 app[web.2]: source=rack-timeout id=18496d93-714b-a343-208d-defaaf375846 wait=1ms timeout=29000ms service=49ms state=completed at=info
2018-01-17T20:38:24.151127+00:00 app[web.2]: source=rack-timeout id=42301bbf-909e-5917-2d4f-f57f3cfe5c27 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.153035+00:00 app[web.2]: source=rack-timeout id=7c414525-4f84-a4f8-aa62-041243abde10 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.190697+00:00 app[web.2]: source=rack-timeout id=93dbe4e6-95a2-5d2c-4625-86f0e988fc5c wait=11ms timeout=29000ms service=119ms state=completed at=info
2018-01-17T20:38:24.201361+00:00 app[web.2]: source=rack-timeout id=42301bbf-909e-5917-2d4f-f57f3cfe5c27 wait=1ms timeout=29000ms service=50ms state=completed at=info
2018-01-17T20:38:24.087809+00:00 app[web.1]: source=rack-timeout id=f053dd5e-e9c5-b605-bdc6-1384364250cb wait=3ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.113436+00:00 app[web.1]: source=rack-timeout id=0597f1ed-082f-3674-62ee-5cc041994279 wait=2ms timeout=29000ms service=91ms state=completed at=info
2018-01-17T20:38:24.113666+00:00 app[web.1]: source=rack-timeout id=973ef69d-91e8-67c4-ed47-faea7a040e4f wait=20ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.148027+00:00 app[web.1]: source=rack-timeout id=a446f98e-fe68-c58c-8803-6ffb1fc0e078 wait=0ms timeout=29000ms service=120ms state=completed at=info
2018-01-17T20:38:24.215926+00:00 app[web.2]: source=rack-timeout id=7c414525-4f84-a4f8-aa62-041243abde10 wait=1ms timeout=29000ms service=63ms state=completed at=info
2018-01-17T20:38:24.226692+00:00 app[web.2]: source=rack-timeout id=8defa6f8-c842-0211-fe0f-fa824ae4631b wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.268269+00:00 app[web.2]: source=rack-timeout id=8defa6f8-c842-0211-fe0f-fa824ae4631b wait=1ms timeout=29000ms service=42ms state=completed at=info
2018-01-17T20:38:24.268919+00:00 app[web.2]: source=rack-timeout id=e162a7df-cf03-00c8-a276-edf63616b0be wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.152022+00:00 app[web.1]: source=rack-timeout id=1cd46f5f-3fad-1c0a-8b37-c57f0f748308 wait=38ms timeout=29000ms service=73ms state=completed at=info
2018-01-17T20:38:24.153085+00:00 app[web.1]: source=rack-timeout id=f72a51dc-8a14-504c-3b4a-ab77c48ff88e wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.153408+00:00 app[web.1]: source=rack-timeout id=78429aab-29e7-0cb1-9b2b-d997b0116280 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.162681+00:00 app[web.1]: source=rack-timeout id=f053dd5e-e9c5-b605-bdc6-1384364250cb wait=3ms timeout=29000ms service=75ms state=completed at=info
2018-01-17T20:38:24.187177+00:00 app[web.1]: source=rack-timeout id=cacad4d9-1863-56e1-96db-25706a2b44cf wait=2ms timeout=29000ms service=103ms state=completed at=info
2018-01-17T20:38:24.198225+00:00 app[web.1]: source=rack-timeout id=f72a51dc-8a14-504c-3b4a-ab77c48ff88e wait=1ms timeout=29000ms service=45ms state=completed at=info
2018-01-17T20:38:24.216474+00:00 app[web.1]: source=rack-timeout id=78429aab-29e7-0cb1-9b2b-d997b0116280 wait=1ms timeout=29000ms service=63ms state=completed at=info
2018-01-17T20:38:24.219008+00:00 app[web.1]: source=rack-timeout id=973ef69d-91e8-67c4-ed47-faea7a040e4f wait=20ms timeout=29000ms service=105ms state=completed at=info
2018-01-17T20:38:24.267240+00:00 app[web.1]: source=rack-timeout id=1649c12c-fd07-0718-1059-d70441409a6d wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.098950+00:00 app[web.3]: source=rack-timeout id=c1050bdf-f7cc-8607-6f5a-e3ecde65b4a4 wait=11ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.098952+00:00 app[web.3]: source=rack-timeout id=0d70a5cb-dfd2-15f7-1db6-34342b607f6a wait=1ms timeout=29000ms service=324ms state=completed at=info
2018-01-17T20:38:24.114681+00:00 app[web.3]: source=rack-timeout id=344031ad-c8b0-6947-deae-c92baf5843e8 wait=8ms timeout=29000ms service=131ms state=completed at=info
2018-01-17T20:38:24.115883+00:00 app[web.3]: source=rack-timeout id=861ddbb9-c349-0a7a-1596-8a16562ae5f5 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.119048+00:00 app[web.3]: source=rack-timeout id=00a19954-f916-9d4d-5886-66e886a64e69 wait=8ms timeout=29000ms service=136ms state=completed at=info
2018-01-17T20:38:24.158230+00:00 app[web.3]: source=rack-timeout id=cb0648a1-9d74-112d-81d9-4180e584f8ce wait=9ms timeout=29000ms service=118ms state=completed at=info
2018-01-17T20:38:24.171886+00:00 app[web.3]: source=rack-timeout id=78b37363-61fe-06b1-eb13-978fb4b5b42c wait=1ms timeout=29000ms service=150ms state=completed at=info
2018-01-17T20:38:24.172968+00:00 app[web.3]: source=rack-timeout id=7ef1327d-7c15-3aab-a336-30e8cb0f8a64 wait=4ms timeout=29000ms service=206ms state=completed at=info
2018-01-17T20:38:24.210389+00:00 app[web.3]: source=rack-timeout id=c0a3bff6-4610-b529-889b-184267c58998 wait=4ms timeout=29000ms service=267ms state=completed at=info
2018-01-17T20:38:24.216546+00:00 app[web.3]: source=rack-timeout id=c1050bdf-f7cc-8607-6f5a-e3ecde65b4a4 wait=11ms timeout=29000ms service=119ms state=completed at=info
2018-01-17T20:38:24.231652+00:00 app[web.3]: source=rack-timeout id=161d317c-d6c9-1387-745b-806ca420e57c wait=7ms timeout=29000ms service=186ms state=completed at=info
2018-01-17T20:38:24.245630+00:00 app[web.3]: source=rack-timeout id=4ab12628-fa44-b556-aa97-e07dd701fdad wait=2ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.271905+00:00 app[web.3]: source=rack-timeout id=861ddbb9-c349-0a7a-1596-8a16562ae5f5 wait=1ms timeout=29000ms service=156ms state=completed at=info
2018-01-17T20:38:24.312727+00:00 app[web.3]: source=rack-timeout id=664ae8bc-3156-2044-8bdd-244353e919f8 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.319758+00:00 app[web.3]: source=rack-timeout id=838e7d16-b9cb-83bb-4e0d-ad0cc31f110f wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.322598+00:00 app[web.3]: source=rack-timeout id=2425b640-7310-8d2e-9b75-35e32455c0bb wait=0ms timeout=29000ms state=ready at=info

Umpire doesn't respect graphite functions when using graphite style "{multiple,options}"

When using graphite functions like sumSeries() with multiple options specified as "{option1,option2,etc}", Umpire does not interpret them. Instead, multiple calls are made to graphite (one for each option) and multiple responses come back.

For example:

  • This call returns only one response (OK):
    /check?metric=sumSeries(server.cpu.*)&max=80&range=60
    -> {"value":98.883335}
  • This call returns 3 responses (KO):
    /check?metric=averageSeries(server.cpu.{user,system,wait})&max=80&range=60

-> [1/3]: http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.user)&min=40&range=120&backend=graphite -->
--curl--http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.user)&min=40&range=120&backend=graphite
{"value":11.475000999999999}

-> [2/3]: http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.system)&min=40&range=120&backend=graphite -->
--curl--http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.system)&min=40&range=120&backend=graphite
{"value":8.016667}

-> [3/3]: http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.wait)&min=40&range=120&backend=graphite -->
--curl--http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.wait)&min=40&range=120&backend=graphite
{"value":21.458335499999997}
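
The `[1/3]` and `--curl--` markers in the output above resemble what curl prints when its URL globbing expands a `{a,b,c}` list on the client side, in which case the braces never reach umpire. That is an inference from the output, not something confirmed here. If so, disabling globbing with `curl -g` (`--globoff`) or percent-encoding the metric parameter should send a single request. A small sketch of the encoding approach; the check_url helper and the localhost base URL are placeholders:

```ruby
require "uri"

# Hypothetical helper: build a /check URL with every query parameter
# percent-encoded, so characters such as "{", "," and "}" are sent
# literally instead of being interpreted by the HTTP client.
def check_url(base, params)
  uri = URI.join(base, "/check")
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

url = check_url(
  "http://localhost:5000",
  metric: "averageSeries(server.cpu.{user,system,wait})",
  max: 80,
  range: 60,
  backend: "graphite"
)
puts url
```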
