Comments (4)
If quorum is lost, does the Receiver stop ingesting samples altogether? Is there a metric that can be used to fire an alert when quorum is lost?
I am struggling to understand best practices around scaling the hash ring. If http_requests_total{code="200"} on the Receiver goes to 0, does this imply that no metrics are being ingested?
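One way to make "goes to 0" precise: a Prometheus counter such as http_requests_total only ever increases (or resets on restart), so what drops to zero is its rate of increase, not its value. The following is a minimal sketch of that calculation from two scrapes. The `sample` type and `ratePerSecond` function are hypothetical names used only for illustration, and the sketch does not answer whether a zero rate of 200 responses means ingestion has stopped.

```go
package main

import "fmt"

// sample is a hypothetical (timestamp, value) reading of a counter,
// e.g. one scrape of http_requests_total{code="200"}.
type sample struct {
	tsSeconds float64
	value     float64
}

// ratePerSecond returns the per-second increase between two readings.
// A decrease is treated as a counter reset (process restart), in which
// case the current value is taken as the increase since the reset.
func ratePerSecond(prev, cur sample) float64 {
	dt := cur.tsSeconds - prev.tsSeconds
	if dt <= 0 {
		return 0
	}
	delta := cur.value - prev.value
	if delta < 0 {
		delta = cur.value
	}
	return delta / dt
}

func main() {
	prev := sample{tsSeconds: 0, value: 1200}
	cur := sample{tsSeconds: 60, value: 1200}
	if ratePerSecond(prev, cur) == 0 {
		fmt.Println("no successful requests observed in this window")
	}
}
```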
from thanos-receive-controller.
Thanks @bwplotka. What we ended up doing was something a little cruder (since this was just a test env). We stopped all receivers, rm -rf'ed the receiver PVs (yes, we lost 2 hours of data that had not yet been persisted to object storage), and then restarted the receivers with a higher replica count. It seemed to run more efficiently (less memory and CPU in aggregate) for the same workload.
Will try your suggestion and see how it works. But being able to increase the number of replicas dynamically, on the fly, is of course a real need.
BTW @bwplotka, do we have any recommendations on running an odd vs. even number of replicas?
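For the odd vs. even question, the arithmetic can be sketched as follows. This assumes a simple majority write quorum of floor(rf/2) + 1 over the replication factor, which is an assumption about how the Receiver computes quorum rather than something confirmed in this thread. The replication factor is also a separate setting from the number of receiver replicas in the hash ring. The function names are hypothetical.

```go
package main

import "fmt"

// writeQuorum returns how many replica writes must succeed,
// ASSUMING a simple majority rule: floor(rf/2) + 1.
func writeQuorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// tolerableFailures returns how many of the target replicas can be
// unavailable while a write can still reach quorum under that rule.
func tolerableFailures(replicationFactor int) int {
	return replicationFactor - writeQuorum(replicationFactor)
}

func main() {
	for rf := 1; rf <= 5; rf++ {
		fmt.Printf("replication factor %d: quorum %d, tolerates %d failure(s)\n",
			rf, writeQuorum(rf), tolerableFailures(rf))
	}
}
```

Under this assumption, an even replication factor tolerates no more failures than the odd number just below it (2 tolerates none, 4 tolerates one, the same as 3), so the extra copy adds write cost without adding failure tolerance.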
from thanos-receive-controller.
#70 might help :)
from thanos-receive-controller.
We hit the same kind of issue when terminating a k8s node that was hosting replicas, which eventually caused quorum to be lost.
We use a "Chaosmonkey" script that randomly terminates one EC2 instance per day in our EKS cluster.
It took approximately 30 minutes for quorum to be restored (with no manual action).
Logs
level=error ts=2022-01-03T15:48:28.468568897Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
level=error ts=2022-01-03T16:00:15.101622584Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
level=error ts=2022-01-03T16:03:05.711160692Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
level=error ts=2022-01-03T16:07:22.526307825Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
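Log lines like the ones above are long and repetitive, so it can help to tally which endpoints fail to reach quorum. The following is a rough sketch that counts "quorum not reached" occurrences per endpoint. The regular expression is derived only from the message shape visible in these logs and may not match other Thanos versions. The function name and the shortened sample line in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// quorumErr captures the host:port that follows
// "replicate write request for endpoint" when quorum was not reached.
var quorumErr = regexp.MustCompile(
	`replicate write request for endpoint ([^:\s]+:\d+): quorum not reached`)

// countQuorumFailures tallies quorum failures per endpoint across log lines.
// A single line can contain several failures ("2 errors: ...").
func countQuorumFailures(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		for _, m := range quorumErr.FindAllStringSubmatch(line, -1) {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Hypothetical, shortened sample in the same shape as the logs above.
	lines := []string{
		`level=error err="2 errors: replicate write request for endpoint ` +
			`receivers-17.example:10901: quorum not reached: conflict; ` +
			`replicate write request for endpoint receivers-16.example:10901: ` +
			`quorum not reached: conflict" msg="internal server error"`,
	}
	for endpoint, n := range countQuorumFailures(lines) {
		fmt.Printf("%s: %d\n", endpoint, n)
	}
}
```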
I see two things here:
- eliminate or reduce downtime when pods move, for example during scaling (this issue)
- identify the primary receivers of the quorum so they can be scheduled on different nodes, and forward to a live primary (I can maybe create another issue)
(I lack knowledge of how this works internally.)
from thanos-receive-controller.