
Comments (4)

michael-burt commented on June 7, 2024

If quorum is lost, does the Receiver stop ingesting samples altogether? Is there a metric that can be used to fire an alert when quorum is lost?

I am struggling to understand best practices around scaling the hash ring. If http_requests_total{code="200"} on the Receiver goes to 0, does this imply that no metrics are being ingested?

from thanos-receive-controller.

bjoydeep commented on June 7, 2024

Thanks @bwplotka. What we ended up doing was something a little cruder (since this was just a test env). We stopped all receivers, rm -rf'ed the receiver PVs (yes, we lost 2 hours of data that was not yet persisted in object storage), and then restarted the receivers with more replicas. It seemed to work more efficiently (less memory + CPU in aggregate) given the same workload.
Will try your suggestion and see how it works. But being able to increase the number of replicas dynamically, on the fly, is of course a real need.
BTW @bwplotka, do we have any recommendations on running an odd vs. even number of replicas?
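
On the odd vs. even question, a small worked example may help frame it. The sketch below assumes a simple majority write quorum, floor(rf / 2) + 1, computed over the replication factor. That formula is an assumption for illustration and is not confirmed anywhere in this thread, and the function names are hypothetical. Note that the replication factor is a separate setting from the number of receiver pods in the hashring.

```python
def write_quorum(replication_factor: int) -> int:
    """Majority quorum (assumed formula): more than half of the copies must succeed."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """How many copies of a write may fail while quorum is still reached."""
    return replication_factor - write_quorum(replication_factor)


if __name__ == "__main__":
    print("rf  quorum  tolerated_failures")
    for rf in range(1, 6):
        print(f"{rf:>2}  {write_quorum(rf):>6}  {tolerated_failures(rf):>18}")
```

Under that assumption, going from an odd replication factor to the next even one (3 to 4, say) raises the quorum without raising the number of failures that can be tolerated, which is the usual argument for preferring odd values.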


spaparaju commented on June 7, 2024

#70 might help :)


r0mdau commented on June 7, 2024

We hit the same kind of issue when terminating a k8s node that was hosting replicas, which eventually caused quorum to be lost.
We use a "Chaosmonkey" script that randomly terminates one EC2 instance per day in our EKS cluster.

It took approximately 30 minutes for the quorum to be restored (no manual actions).

Logs

level=error ts=2022-01-03T15:48:28.468568897Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:00:15.101622584Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:03:05.711160692Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:07:22.526307825Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
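
To gauge how long such an episode lasts and which endpoints are affected, log lines of this shape can be summarized with a short script. This is a minimal sketch that assumes each line carries a `ts=` field and the phrase "replicate write request for endpoint <host:port>: quorum not reached", as in the excerpt above. The function names are hypothetical, and it is not a substitute for a proper metric-based alert.

```python
import re
from collections import Counter
from datetime import datetime

# Patterns modelled on the log excerpt above.
TS_RE = re.compile(r"\bts=(\S+)")
TS_PARTS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")
FAILED_RE = re.compile(
    r"replicate write request for endpoint (\S+?): quorum not reached"
)


def parse_ts(raw: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp, truncating nanoseconds to microseconds."""
    match = TS_PARTS_RE.match(raw)
    if match is None:
        raise ValueError(f"unrecognized timestamp: {raw!r}")
    fraction = (match.group(2) or "0")[:6].ljust(6, "0")
    return datetime.strptime(f"{match.group(1)}.{fraction}", "%Y-%m-%dT%H:%M:%S.%f")


def summarize(lines):
    """Count quorum failures per endpoint and measure the first-to-last error span."""
    per_endpoint = Counter()
    stamps = []
    for line in lines:
        if "quorum not reached" not in line:
            continue
        ts = TS_RE.search(line)
        if ts:
            stamps.append(parse_ts(ts.group(1)))
        # One log line may report several failed replication targets.
        per_endpoint.update(FAILED_RE.findall(line))
    window = (max(stamps) - min(stamps)) if stamps else None
    return per_endpoint, window
```

Calling `summarize(open("receive.log"))` (the file name is a placeholder) returns a `Counter` keyed by endpoint and a `timedelta` between the first and last error seen. The span only covers the lines it is given, so it is a lower bound on the outage.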

I see two things here:

  • eliminate or reduce downtime when pods move, for example during scaling (this issue)
  • identify the primary receivers of the quorum so they can be scheduled on different nodes, and also forward to a live primary (I can maybe create another issue)

(I lack knowledge of how this works internally.)

