heroku / umpire
HTTP metrics monitoring endpoint
This seems to be an issue with how umpire is querying Librato.
For example, querying Librato directly produces data (with count in the query):
$ curl -n 'https://metrics-api.librato.com/v1/metrics/midgard.requests.event.5xx?count=1&resolution=1&source=com.heroku.midgard.*' | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 520 100 520 0 0 2453 0 --:--:-- --:--:-- --:--:-- 2488
{
  "measurements": {
    "com.heroku.midgard.770036": [
      {
        "measure_time": 1456956060,
        "value": 1,
        "count": 120,
        "min": 1,
        "max": 1,
        "sum": 120,
        "sum_squares": 1
      }
    ]
  },
  "resolution": 1,
  "source_display_names": {},
  "name": "midgard.requests.event.5xx",
  "display_name": null,
  "type": "gauge",
  "attributes": {
    "l2met_type": "counter",
    "display_min": 0,
    "created_by_ua": "l2met/304fb673f7ad0e048ae519b030b1dbbf52bdbc36",
    "summarize_function": "sum",
    "aggregate": false,
    "display_stacked": false,
    "gap_detection": false
  },
  "description": null,
  "period": 1,
  "source_lag": null
}
However umpire does not see anything:
$ curl -in 'https://umpire.herokai.com/check?backend=librato&empty_ok=true&metric=midgard.requests.event.5xx:sum:sum&range=60&max=0&source=com.heroku.midgard.*'
HTTP/1.1 200 OK
Server: Cowboy
Date: Wed, 02 Mar 2016 22:07:49 GMT
Connection: keep-alive
Strict-Transport-Security: max-age=31536000
Content-Type: application/json;charset=utf-8
X-Content-Type-Options: nosniff
Content-Length: 94
Via: 1.1 vegur
{"error":"no values for metric in range","request_id":"718fda3d-ca23-4d38-be79-b974e9ab746a"}
In nagios we currently have many checks with warning and critical thresholds. Currently umpire only provides a way to check for ok vs not-ok states. Does it make sense to provide a way to define warning thresholds and return a distinct http response code in that case?
I know we've run into issues a few times where librato seemingly has the data but querying for a range < 300 returns 'no data'.
Example: https://github.com/heroku/watchtower-tng/issues/318#issuecomment-198045016
@dougoku pointed out in a recent standup that responding with a non-200 technically indicates a problem with the umpire service itself. Perhaps we should consider something like a status field in the body that could return either OK, WARNING, or CRITICAL, but always with an HTTP code of 200.
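A minimal sketch of that idea, assuming hypothetical warning_max and critical_max thresholds (neither is an existing umpire parameter): the check always answers with HTTP 200 and carries the state in the body.

```ruby
require "json"

# Hypothetical helper: classify a value against optional warning and
# critical maximums. The parameter names are assumptions for illustration.
def check_status(value, warning_max: nil, critical_max: nil)
  return "CRITICAL" if critical_max && value > critical_max
  return "WARNING" if warning_max && value > warning_max
  "OK"
end

# Build the JSON body; the HTTP status would stay 200 in every case,
# so a non-200 can only mean that umpire itself is unhealthy.
def check_body(value, **thresholds)
  JSON.generate("status" => check_status(value, **thresholds), "value" => value)
end

puts check_body(75, warning_max: 50, critical_max: 100)
# prints {"status":"WARNING","value":75}
```

A client would then key its alerting off the status field rather than the response code.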
Ref #4.
Somewhat related: how are we monitoring umpire? I.e., have we simulated an umpire service failure to see what effect that has on its clients? How would we quickly and definitively determine that there is an issue with umpire itself vs. a large section of the platform?
API keys are essentially passwords. Current configuration method means keys are stored in plain-text.
Umpire could support a hashed API key configuration method, e.g.: HASHED_API_KEY(optional suffix)=(base64 encoded hash):(base64 encoded salt). The authorized? check could apply each salt to the basic auth credential, then apply the hash function and compare to the API key hash.
By using the HASHED_ prefix on the configuration entries, users can choose which mechanism they prefer and/or have a mix as they migrate.
If there is interest I may try to submit a PR to this effect.
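A rough sketch of how such a check might look, under several assumptions not stated above: SHA-256 as the hash function, the salt prepended to the key, and entries read from an environment-like hash. The helper names are hypothetical and this is not umpire's current code.

```ruby
require "digest"

# Produce an entry in the assumed "<base64 hash>:<base64 salt>" format.
# pack("m0") is strict Base64 without line breaks.
def hashed_entry(api_key, salt)
  digest = Digest::SHA256.digest(salt.b + api_key.b)
  "#{[digest].pack('m0')}:#{[salt.b].pack('m0')}"
end

# Compare two byte strings without stopping at the first mismatch.
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize
  a.bytes.zip(b.bytes).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end

# Try the presented credential against every HASHED_API_KEY* entry.
def authorized?(candidate, env = ENV)
  env.each_pair.any? do |name, entry|
    next false unless name.start_with?("HASHED_API_KEY")
    hash_b64, salt_b64 = entry.split(":", 2)
    next false unless hash_b64 && salt_b64
    salt = salt_b64.unpack1("m0")
    expected = hash_b64.unpack1("m0")
    secure_equal?(Digest::SHA256.digest(salt + candidate.b), expected)
  end
end

config = { "HASHED_API_KEY_OPS" => hashed_entry("s3cret", "pepper") }
puts authorized?("s3cret", config) # prints true
puts authorized?("guess", config)  # prints false
```

A plain salted digest is only reasonable when the API keys are long and random; for human-chosen secrets a deliberately slow key derivation function would be the safer choice.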
If you're trying to query a composite metric and one of the metrics does not have a value, it throws a 404, even with empty_ok=true.
at=internal_error class=TypeError message="can't convert String into Integer"
Implement https://github.com/testdouble/standard for uniform formatting, catching styling issues, and catching errors early
Maybe return a nice 503 error message.
Umpire needs a playbook.
Here's some items that should go in there:
# basic
index=main app="umpire-api-va-prod"
# sans checks
index=main app="umpire-api-va-prod" action!="check"
# sans health && checks
index=main app="umpire-api-va-prod" action!="health"
heroku logs -t -a umpire-api-va-prod
When empty_ok is set and the result is empty, we're not explicitly setting status 200, as Sinatra sets that by default. However, because we're not explicitly setting it, it's not being logged with the "at=finish" log line, which is messing up Splunk queries. We should add status 200 here: https://github.com/heroku/umpire/blob/master/lib/umpire/web.rb#L157-L159
We've been getting a lot of noise on the Bamboo and Cedar latency checks in Pingdom, including occasional flakes and a series of flakes that led to a pair of pages to @ricardochimal in the IC rotation.
To address this noise, we've recently increased the metric range from 60 to 300 seconds and also deployed a change that will give us better visibility in Pingdom into what caused 404s (probably the metric range issue above, but not positive).
@mmcgrana to revisit in a few days to see what these changes show us.
During a partial logging outage, we've been seeing umpire return 500s. I tracked this down to a case where the metric exists in Librato but does not have any data in the range. It appears to return an empty hash for a result set, which causes an exception.
> results = Umpire::LibratoMetrics.client.fetch(metric, :start_time => Time.now.to_i - 180, :summarize_sources => true)
=> {}
> results["all"].map
NoMethodError: undefined method `map' for nil:NilClass
2013-05-09T22:02:56.581822+00:00 app[web.5]: TypeError - nil can't be coerced into Float:
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `+'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `block (2 levels) in initialize'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `each'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `inject'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:74:in `block in initialize'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:73:in `map'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:73:in `initialize'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:41:in `new'
2013-05-09T22:02:56.581822+00:00 app[web.5]: /app/lib/umpire/librato_metrics.rb:41:in `compose_values_for_range'
2013-05-09T22:02:56.582131+00:00 app[web.5]: /app/lib/umpire/web.rb:54:in `fetch_points'
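Both failures above (the nil "all" key and the nil that cannot be coerced into Float) could be guarded against before any arithmetic happens. A minimal sketch, assuming the result is a hash of source name to an array of point hashes; values_from and sum_or_nil are hypothetical helpers, not functions in umpire:

```ruby
# Flatten a Librato-style result into numeric values, tolerating a
# missing source key (empty result set) and nil measurements.
def values_from(results, source = "all", field = "value")
  Array(results[source]).map { |point| point[field] }.compact
end

# Return nil rather than raising when there is nothing to aggregate,
# so the caller can decide between "no data" and an empty_ok response.
def sum_or_nil(values)
  values.empty? ? nil : values.inject(0.0) { |acc, v| acc + v }
end

p values_from({})                                                  # prints []
p sum_or_nil(values_from("all" => [{ "value" => 1.5 }, { "value" => nil }]))
# prints 1.5
```

With a guard like this, an empty range would surface as a "no values" condition instead of a 500.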
For some classes of data and metrics, the latest bucket information that the Librato API returns can range anywhere from partially incomplete to very close to empty.
If the magnitude of the normal value is large, this greatly affects the calculations and can trigger false failures.
Perhaps umpire needs an option to skip the latest bucket for data and calculations?
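One way such an option might work, sketched under the assumption that points are hashes with "measure_time" and "value" keys as in the Librato measurement responses; without_latest and its skip argument are hypothetical:

```ruby
# Drop the newest bucket(s) before aggregating, since the latest bucket
# may still be filling up when the query runs.
def without_latest(points, skip: 1)
  sorted = points.sort_by { |point| point["measure_time"] }
  sorted.first([sorted.size - skip, 0].max)
end

points = [
  { "measure_time" => 60,  "value" => 1000 },
  { "measure_time" => 120, "value" => 980 },
  { "measure_time" => 180, "value" => 12 } # partially filled newest bucket
]

kept = without_latest(points)
average = kept.sum { |point| point["value"] } / kept.size.to_f
puts average # prints 990.0
```

Including the incomplete bucket would pull the average down to about 664 here, which is the kind of false failure described above.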
Is this possible? librato supports it and it would be nice for generating fewer spurious alerts on brief large outlier values.
It would really be nice to have a bunch of examples that shows what can be done in a librato instrument, then have the corresponding watchtower/umpire setup to replicate the same behavior.
For example, today we had an instrument that did a sum of sums on a metric. This turned out to require putting sum:sum after the metric name (not using composites or aggregates) to get the correct behavior.
I imagine a case where we have pictures on one side, and how to set it up in the watchtower form on the other.
@joshuatobin @reidmix Would it be worth logging the results that umpire fetches from Librato? Like we're doing in Laika.
I tried to run the tests but got an error about rack-timeout missing.
After running gem install rack-timeout the tests ran successfully.
What's this all about?
2018-01-17T20:38:24.027961+00:00 app[web.2]: source=rack-timeout id=86037e90-28cc-9158-9885-0eb06b67f2d0 wait=0ms timeout=29000ms service=55ms state=completed at=info
2018-01-17T20:38:24.056531+00:00 app[web.2]: source=rack-timeout id=b7708bc7-f36c-92de-a87b-ff2d780f4077 wait=0ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.071656+00:00 app[web.2]: source=rack-timeout id=93dbe4e6-95a2-5d2c-4625-86f0e988fc5c wait=11ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.076675+00:00 app[web.2]: source=rack-timeout id=18496d93-714b-a343-208d-defaaf375846 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.077580+00:00 app[web.2]: source=rack-timeout id=f3623fc2-66d4-cef3-32bf-aa6a37d5637c wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.100988+00:00 app[web.2]: source=rack-timeout id=a723bcc2-c84d-1cbd-b3d0-2050fdfae93d wait=2ms timeout=29000ms service=133ms state=completed at=info
2018-01-17T20:38:24.105988+00:00 app[web.2]: source=rack-timeout id=84ad24b7-2330-1af6-8ad3-476ffb7308bc wait=1ms timeout=29000ms service=132ms state=completed at=info
2018-01-17T20:38:24.109326+00:00 app[web.2]: source=rack-timeout id=dd5d92d8-07f7-6a50-75bb-2226b0382cd3 wait=18ms timeout=29000ms service=189ms state=completed at=info
2018-01-17T20:38:24.120960+00:00 app[web.2]: source=rack-timeout id=f3623fc2-66d4-cef3-32bf-aa6a37d5637c wait=1ms timeout=29000ms service=43ms state=completed at=info
2018-01-17T20:38:24.029738+00:00 app[web.1]: source=rack-timeout id=a446f98e-fe68-c58c-8803-6ffb1fc0e078 wait=0ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.068162+00:00 app[web.1]: source=rack-timeout id=5e220dc1-1a10-5088-c9f3-dea2b3c404ad wait=1ms timeout=29000ms service=155ms state=completed at=info
2018-01-17T20:38:24.073882+00:00 app[web.1]: source=rack-timeout id=bafe112e-10ea-d121-6d06-c69c8e219112 wait=23ms timeout=29000ms service=255ms state=completed at=info
2018-01-17T20:38:24.079489+00:00 app[web.1]: source=rack-timeout id=1cd46f5f-3fad-1c0a-8b37-c57f0f748308 wait=38ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.084190+00:00 app[web.1]: source=rack-timeout id=cacad4d9-1863-56e1-96db-25706a2b44cf wait=2ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.087808+00:00 app[web.1]: source=rack-timeout id=095d5a05-856d-3404-f345-32ce76c073ef wait=2ms timeout=29000ms service=88ms state=completed at=info
2018-01-17T20:38:24.124230+00:00 app[web.2]: source=rack-timeout id=b7708bc7-f36c-92de-a87b-ff2d780f4077 wait=0ms timeout=29000ms service=72ms state=completed at=info
2018-01-17T20:38:24.125703+00:00 app[web.2]: source=rack-timeout id=18496d93-714b-a343-208d-defaaf375846 wait=1ms timeout=29000ms service=49ms state=completed at=info
2018-01-17T20:38:24.151127+00:00 app[web.2]: source=rack-timeout id=42301bbf-909e-5917-2d4f-f57f3cfe5c27 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.153035+00:00 app[web.2]: source=rack-timeout id=7c414525-4f84-a4f8-aa62-041243abde10 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.190697+00:00 app[web.2]: source=rack-timeout id=93dbe4e6-95a2-5d2c-4625-86f0e988fc5c wait=11ms timeout=29000ms service=119ms state=completed at=info
2018-01-17T20:38:24.201361+00:00 app[web.2]: source=rack-timeout id=42301bbf-909e-5917-2d4f-f57f3cfe5c27 wait=1ms timeout=29000ms service=50ms state=completed at=info
2018-01-17T20:38:24.087809+00:00 app[web.1]: source=rack-timeout id=f053dd5e-e9c5-b605-bdc6-1384364250cb wait=3ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.113436+00:00 app[web.1]: source=rack-timeout id=0597f1ed-082f-3674-62ee-5cc041994279 wait=2ms timeout=29000ms service=91ms state=completed at=info
2018-01-17T20:38:24.113666+00:00 app[web.1]: source=rack-timeout id=973ef69d-91e8-67c4-ed47-faea7a040e4f wait=20ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.148027+00:00 app[web.1]: source=rack-timeout id=a446f98e-fe68-c58c-8803-6ffb1fc0e078 wait=0ms timeout=29000ms service=120ms state=completed at=info
2018-01-17T20:38:24.215926+00:00 app[web.2]: source=rack-timeout id=7c414525-4f84-a4f8-aa62-041243abde10 wait=1ms timeout=29000ms service=63ms state=completed at=info
2018-01-17T20:38:24.226692+00:00 app[web.2]: source=rack-timeout id=8defa6f8-c842-0211-fe0f-fa824ae4631b wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.268269+00:00 app[web.2]: source=rack-timeout id=8defa6f8-c842-0211-fe0f-fa824ae4631b wait=1ms timeout=29000ms service=42ms state=completed at=info
2018-01-17T20:38:24.268919+00:00 app[web.2]: source=rack-timeout id=e162a7df-cf03-00c8-a276-edf63616b0be wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.152022+00:00 app[web.1]: source=rack-timeout id=1cd46f5f-3fad-1c0a-8b37-c57f0f748308 wait=38ms timeout=29000ms service=73ms state=completed at=info
2018-01-17T20:38:24.153085+00:00 app[web.1]: source=rack-timeout id=f72a51dc-8a14-504c-3b4a-ab77c48ff88e wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.153408+00:00 app[web.1]: source=rack-timeout id=78429aab-29e7-0cb1-9b2b-d997b0116280 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.162681+00:00 app[web.1]: source=rack-timeout id=f053dd5e-e9c5-b605-bdc6-1384364250cb wait=3ms timeout=29000ms service=75ms state=completed at=info
2018-01-17T20:38:24.187177+00:00 app[web.1]: source=rack-timeout id=cacad4d9-1863-56e1-96db-25706a2b44cf wait=2ms timeout=29000ms service=103ms state=completed at=info
2018-01-17T20:38:24.198225+00:00 app[web.1]: source=rack-timeout id=f72a51dc-8a14-504c-3b4a-ab77c48ff88e wait=1ms timeout=29000ms service=45ms state=completed at=info
2018-01-17T20:38:24.216474+00:00 app[web.1]: source=rack-timeout id=78429aab-29e7-0cb1-9b2b-d997b0116280 wait=1ms timeout=29000ms service=63ms state=completed at=info
2018-01-17T20:38:24.219008+00:00 app[web.1]: source=rack-timeout id=973ef69d-91e8-67c4-ed47-faea7a040e4f wait=20ms timeout=29000ms service=105ms state=completed at=info
2018-01-17T20:38:24.267240+00:00 app[web.1]: source=rack-timeout id=1649c12c-fd07-0718-1059-d70441409a6d wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.098950+00:00 app[web.3]: source=rack-timeout id=c1050bdf-f7cc-8607-6f5a-e3ecde65b4a4 wait=11ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.098952+00:00 app[web.3]: source=rack-timeout id=0d70a5cb-dfd2-15f7-1db6-34342b607f6a wait=1ms timeout=29000ms service=324ms state=completed at=info
2018-01-17T20:38:24.114681+00:00 app[web.3]: source=rack-timeout id=344031ad-c8b0-6947-deae-c92baf5843e8 wait=8ms timeout=29000ms service=131ms state=completed at=info
2018-01-17T20:38:24.115883+00:00 app[web.3]: source=rack-timeout id=861ddbb9-c349-0a7a-1596-8a16562ae5f5 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.119048+00:00 app[web.3]: source=rack-timeout id=00a19954-f916-9d4d-5886-66e886a64e69 wait=8ms timeout=29000ms service=136ms state=completed at=info
2018-01-17T20:38:24.158230+00:00 app[web.3]: source=rack-timeout id=cb0648a1-9d74-112d-81d9-4180e584f8ce wait=9ms timeout=29000ms service=118ms state=completed at=info
2018-01-17T20:38:24.171886+00:00 app[web.3]: source=rack-timeout id=78b37363-61fe-06b1-eb13-978fb4b5b42c wait=1ms timeout=29000ms service=150ms state=completed at=info
2018-01-17T20:38:24.172968+00:00 app[web.3]: source=rack-timeout id=7ef1327d-7c15-3aab-a336-30e8cb0f8a64 wait=4ms timeout=29000ms service=206ms state=completed at=info
2018-01-17T20:38:24.210389+00:00 app[web.3]: source=rack-timeout id=c0a3bff6-4610-b529-889b-184267c58998 wait=4ms timeout=29000ms service=267ms state=completed at=info
2018-01-17T20:38:24.216546+00:00 app[web.3]: source=rack-timeout id=c1050bdf-f7cc-8607-6f5a-e3ecde65b4a4 wait=11ms timeout=29000ms service=119ms state=completed at=info
2018-01-17T20:38:24.231652+00:00 app[web.3]: source=rack-timeout id=161d317c-d6c9-1387-745b-806ca420e57c wait=7ms timeout=29000ms service=186ms state=completed at=info
2018-01-17T20:38:24.245630+00:00 app[web.3]: source=rack-timeout id=4ab12628-fa44-b556-aa97-e07dd701fdad wait=2ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.271905+00:00 app[web.3]: source=rack-timeout id=861ddbb9-c349-0a7a-1596-8a16562ae5f5 wait=1ms timeout=29000ms service=156ms state=completed at=info
2018-01-17T20:38:24.312727+00:00 app[web.3]: source=rack-timeout id=664ae8bc-3156-2044-8bdd-244353e919f8 wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.319758+00:00 app[web.3]: source=rack-timeout id=838e7d16-b9cb-83bb-4e0d-ad0cc31f110f wait=1ms timeout=29000ms state=ready at=info
2018-01-17T20:38:24.322598+00:00 app[web.3]: source=rack-timeout id=2425b640-7310-8d2e-9b75-35e32455c0bb wait=0ms timeout=29000ms state=ready at=info
When using Graphite functions like sumSeries() with multiple options specified as "{option1,option2,etc}", Umpire does not interpret them but makes multiple calls to Graphite (one for each option) and gives multiple responses.
For example:
-> [1/3]: http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.user)&min=40&range=120&backend=graphite -->
--curl--http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.user)&min=40&range=120&backend=graphite
{"value":11.475000999999999}
-> [2/3]: http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.system)&min=40&range=120&backend=graphite -->
--curl--http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.system)&min=40&range=120&backend=graphite
{"value":8.016667}
-> [3/3]: http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.wait)&min=40&range=120&backend=graphite -->
--curl--http://umpire:[email protected]:5000/check?metric=averageSeries(couch1.aggregation.cpu-average.cpu.wait)&min=40&range=120&backend=graphite
{"value":21.458335499999997}