Comments (7)
I think the issue is that if a new data point is added to Prometheus' TSDB between the `good` and the `valid` query, you get an offset between them, leading to this kind of behaviour.
You would need to send a finite timeframe to Prometheus and make sure that this timeframe ends further from "now" than the scrape interval of these metrics.
The only alternative is to make Prometheus perform the division itself and query only the resulting SLI, to ensure consistency (which may require development, depending on the backend's current implementation). The downside is that, by doing so, you can no longer export good and bad event metrics.
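For illustration, that single-query approach would compute the ratio entirely inside Prometheus, along these lines (the metric names below are placeholders, not the actual metrics from this issue):

```promql
# SLI computed inside Prometheus in one query: good events / valid events,
# so both sides are evaluated against the same snapshot of the TSDB
sum(rate(good_events_total[1h])) / sum(rate(valid_events_total[1h]))
```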
In my opinion, this issue is probably similar to #343 (although with different backends).
from slo-generator.
Hi @maksim-paskal, thanks for raising this issue.
Any chance you could export the timeseries that generated these results and share them here (assuming there is nothing confidential) so I can reproduce the issue on my machine?
Then can you also confirm the `filter_good` query has a double quote after `envoy_cluster_name="y`, just before `,kubernetes_namespace="z"`? Otherwise I am afraid the query might not give the expected results.
Finally, I am a bit surprised by the values returned by Prometheus. An SLI is usually computed by dividing the number of good events by the number of valid (= good + bad) events. These two numbers are usually integers. Here the logs show floating-point values. I am not a Prometheus expert but is it possible a `count` or `sum` is missing?
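For example, an availability SLI would typically aggregate each side to a single count with `sum(increase(...))` over identical windows. A sketch, where the response-code matcher is hypothetical:

```promql
# good: successful responses over the window
sum(increase(envoy_cluster_external_upstream_rq{envoy_response_code=~"2..|3.."}[1h]))

# valid: all responses over the same window
sum(increase(envoy_cluster_external_upstream_rq[1h]))
```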
@lvaylet, thanks for the quick response.
Sorry for the typo in `filter_good`, I changed it in the issue description. This is the real PromQL; it works sometimes, and sometimes returns this error.
My example data from Prometheus (we are actually using Thanos Query v0.26.0):
Query:

```promql
envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z"}[1m]
```

Returns:

```
envoy_cluster_external_upstream_rq{app="x", envoy_cluster_name="y", kubernetes_namespace="z"}
253117 @1674638934.622
253125 @1674638940.809
253127 @1674638955.809
253162 @1674638970.809
253197 @1674638985.809
```

Query:

```promql
increase(envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z"}[1m])
```

Returns:

```
195.75282786645047
```
It's sometimes an int, sometimes a float, in different windows. `increase` and `rate` in PromQL calculate per-second metrics, so I think that may be why it is a float.
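For what it's worth, the fractional values are expected even for a plain counter: `increase()` extrapolates the counter delta between the first and last samples out to the full window boundaries, so it can return a non-integer even though the underlying counter only ever takes integer values. Roughly:

```promql
# the raw delta between the first and last sample in the window is an integer,
# but increase() scales it to cover the whole 1m window, producing a float
increase(envoy_cluster_external_upstream_rq{app="x"}[1m])

# rate() is the same extrapolated delta divided by the window length (per second)
rate(envoy_cluster_external_upstream_rq{app="x"}[1m])
```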
Thanks @maksim-paskal. I need to investigate.
For the record, what type of SLI are you computing here? Availability? Also, what does `envoy_cluster_external_upstream_rq` represent? Requests with a response class or response code, as hinted in this documentation?
`envoy_cluster_external_upstream_rq` is an upstream counter of specific HTTP response codes (e.g., 201, 302, etc.). We plan to use it to calculate the availability of our service. You can simulate this environment with these files (you need Docker):
`docker-compose.yml`:

```yaml
version: '3.7'
services:
  prometheus:
    image: prom/prometheus:v2.36.2
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - 9090:9090
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
  envoy:
    image: envoyproxy/envoy:v1.21.5
    volumes:
      - ./envoy.yml:/etc/envoy/envoy.yaml:ro
    ports:
      - 10000:10000
```
`prometheus.yml`:

```yaml
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'envoy'
    metrics_path: /stats/prometheus
    static_configs:
      - targets: ['envoy:9901']
```
`envoy.yml`:

```yaml
admin:
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: some_service }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: some_service
      connect_timeout: 0.25s
      type: STATIC
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: some_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 9901
```
```shell
# run prometheus and envoy
docker-compose up

# generate some records, for example with https://github.com/tsenart/vegeta
echo "GET http://localhost:10000/ready" | vegeta attack -duration=60s -output=/dev/null
```

Then open Prometheus at http://127.0.0.1:9090 and run:

```promql
increase(envoy_cluster_external_upstream_rq{envoy_response_code="200"}[1d])
```
A workaround could be to use good/bad instead of good/valid.
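If it helps, a sketch of what that could look like in the SLO config, assuming the Prometheus backend's `good_bad_ratio` method accepts a `filter_bad` field in place of `filter_valid` (the label matchers are illustrative, taken from this issue):

```yaml
backend: prometheus
method: good_bad_ratio
service_level_indicator:
  filter_good: envoy_cluster_external_upstream_rq{app="x", envoy_cluster_name="y", kubernetes_namespace="z", envoy_response_code=~"2..|3.."}
  filter_bad: envoy_cluster_external_upstream_rq{app="x", envoy_cluster_name="y", kubernetes_namespace="z", envoy_response_code=~"5.."}
```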
@maksim-paskal I just discussed the issue with @bkamin29 and @mveroone.
We are pretty sure this behavior is caused by the tiny delay between the two requests (one for `good` and another one for `valid`). They are called and executed a few milliseconds apart, resulting in the same window length but slightly different start/end times. As a consequence, because we are looking at two different time horizons, the most recent one might contain a slightly different number of good/valid events. Also note that the backend itself can be busy ingesting and persisting "old" data points between the two calls, and so account for more data points during the second call.
Two options to mitigate this behavior:
- I can work on integrating the `@` modifier into the queries sent to Prometheus, so both calls share the exact same time horizon. I can use the timestamp of the report generation request for this value. Just note that this addition will not prevent the provider from ingesting/persisting new data points between the two calls.
- You can use the `query_sli` method instead of the `good_bad_ratio` one, and delegate the computation of the good/valid ratio to Prometheus itself. That would result in a single request and a single call. However, with this approach, you give up the ability to export the number of good and bad events.
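For illustration, the `@` modifier pins a range selector to an absolute evaluation timestamp, so both queries would cover exactly the same window (the metric names and the Unix timestamp below are hypothetical):

```promql
# both queries evaluated "as of" the same fixed Unix timestamp,
# regardless of when each request actually reaches Prometheus
sum(increase(good_events_total[1h] @ 1674638985))
sum(increase(valid_events_total[1h] @ 1674638985))
```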