Comments (3)
Do you have this option enabled https://github.com/thanos-io/thanos/blob/main/cmd/thanos/query.go#L212? It should solve your issue.
from thanos.
Yes, I tried store.response-timeout=15s and then 5s, but the latency is still similarly elevated. I also have query.timeout=30s set.
I noticed in the code that store.response-timeout drives a timer that acts as a manual timeout while waiting for the cl.Recv call [1]. So if the cl.Recv call itself takes longer than that amount (e.g. 40s), the overall call takes that long as well. AFAICT, the gRPC context used for cl.Recv has no timeout set.
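The behavior described above can be illustrated with a minimal sketch. This is not the Thanos code: `blockingRecv` and `recvWithFrameTimer` are hypothetical names, and the sketch assumes the per-frame timer is only inspected after the receive call returns, while the surrounding context carries the longer, query-level deadline.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// blockingRecv stands in for a stream Recv against an endpoint that never
// answers: it returns only when the context is cancelled or expires.
func blockingRecv(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// recvWithFrameTimer waits for a single frame. The frame timer is inspected
// only after recv has returned, so it cannot cut a blocked recv short.
// callTimeout plays the role of query.timeout, frameTimeout the role of
// store.response-timeout (both are assumptions of this sketch).
func recvWithFrameTimer(callTimeout, frameTimeout time.Duration) (time.Duration, error) {
	ctx, cancel := context.WithTimeout(context.Background(), callTimeout)
	defer cancel()

	start := time.Now()
	frameTimer := time.NewTimer(frameTimeout)
	defer frameTimer.Stop()

	err := blockingRecv(ctx) // blocks until callTimeout, not frameTimeout

	select {
	case <-frameTimer.C:
		// The timer fired long ago, but it is only noticed here.
	default:
	}
	return time.Since(start), err
}

func main() {
	elapsed, err := recvWithFrameTimer(300*time.Millisecond, 50*time.Millisecond)
	fmt.Printf("returned after %v: %v\n", elapsed.Round(10*time.Millisecond), err)
}
```

Under these assumptions the call returns only once the outer deadline expires, even though the frame timeout is much shorter.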
Ref:
from thanos.
I've dug through the code and I think I've found the problem.
When a store is listed as store-strict, it is always part of the active store set, as called out in the proposed (and completed) design 202001-thanos-query-health-handling.
However, because it is always part of the active store set, it always receives incoming queries as long as the labels published by that store endpoint match.
The problem arises when that store is completely down, as in my example above. The Dial's ClientConn keeps retrying to establish a connection in the background, but at the application layer the gRPC calls to that service are not aware of this and keep sending requests with the regular timeout (dictated by Thanos' CLI arg query.timeout, which defaults to 2m and which I've set to 30s). Since there is no actual pod behind the store, each request runs until that timeout instead of being cut off by store.response-timeout (see ref 1).
This seems fine for individual queries, since they time out "as expected". However, when a sizable number of queries go through this path (and time out), they fill up the "queue" from query-frontend and create a cascading failure in which other queries also start timing out, even though they never hit the strict store that is completely down.
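To get a rough feel for why the queue fills up, Little's law (average in-flight requests = arrival rate × time each request spends in the system) can be applied. The sketch below is only back-of-the-envelope arithmetic: the rate of 5 queries per second and the 200ms healthy latency are hypothetical numbers, and only the 30s value comes from the query.timeout setting mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// inFlight applies Little's law: average concurrency equals the arrival
// rate multiplied by the time each query spends in the system.
func inFlight(queriesPerSecond float64, latency time.Duration) float64 {
	return queriesPerSecond * latency.Seconds()
}

func main() {
	const qps = 5.0 // hypothetical rate of queries that fan out to the dead store

	// Hypothetical healthy case: each query finishes in about 200ms.
	fmt.Println("healthy:", inFlight(qps, 200*time.Millisecond))

	// Dead strict store: each query is held until query.timeout=30s.
	fmt.Println("store down:", inFlight(qps, 30*time.Second))
}
```

With the same arrival rate, the number of queries held open grows in proportion to the timeout, which is how a handful of slow queries can crowd out unrelated ones.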
Proposed fix:
Since the health of each store is checked every 5s (see ref 2), the query flow knows at query time which stores are unhealthy (in this case, timing out). The original proposal (see ref 3) is to not ignore the strict store (and return a partial response) rather than forgetting about the store completely (and returning an illusory success response). We can still accommodate that idea but fail fast on an unhealthy strict store: check whether the strict store is unhealthy and, if so, return an error without even attempting to send a gRPC request to it.
If this sounds good to you, I can prepare a PR to introduce this fail-fast behavior.
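A minimal sketch of what such a fail-fast check could look like follows. The `Endpoint` type, its fields, `selectEndpoints`, and `ErrStrictEndpointUnhealthy` are all hypothetical names made up for illustration and do not correspond to existing Thanos types; the sketch assumes the result of the latest periodic health check is already available on each endpoint.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStrictEndpointUnhealthy is a hypothetical sentinel error.
var ErrStrictEndpointUnhealthy = errors.New("strict endpoint is unhealthy")

// Endpoint is a hypothetical stand-in for a store endpoint.
type Endpoint struct {
	Addr    string
	Strict  bool // listed as a strict store
	Healthy bool // result of the most recent periodic health check
}

// selectEndpoints returns the endpoints a query should be sent to.
// Unhealthy non-strict endpoints are skipped; an unhealthy strict endpoint
// makes the whole selection fail immediately, before any request is sent.
func selectEndpoints(all []Endpoint) ([]Endpoint, error) {
	var usable []Endpoint
	for _, e := range all {
		switch {
		case e.Healthy:
			usable = append(usable, e)
		case e.Strict:
			return nil, fmt.Errorf("%s: %w", e.Addr, ErrStrictEndpointUnhealthy)
		}
	}
	return usable, nil
}

func main() {
	endpoints := []Endpoint{
		{Addr: "store-a:10901", Strict: false, Healthy: true},
		{Addr: "store-b:10901", Strict: true, Healthy: false},
	}
	if _, err := selectEndpoints(endpoints); err != nil {
		fmt.Println("failing fast:", err)
	}
}
```

The point of the design is that the error surfaces within the time it takes to read a cached health status, rather than after the full query timeout.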
Ref:
(1): store.response-timeout only gets triggered once something has been sent back from the server; it is then used to cut the stream off early if the wait between consecutive pieces of data received from the server exceeds that value:
thanos/pkg/store/proxy_heap.go, line 700 in 968899f
(2): The endpoint set is checked every 5s: line 519 in 194e1fa
(3): Original proposal of store-strict: https://thanos.io/v0.29/proposals-done/202001-thanos-query-health-handling.md/
from thanos.