Comments (11)

tserong commented on May 27, 2024

This might be annoying. ceph_exporter talks to the cluster via librados, which AFAIK doesn't actually provide any sort of connect timeout functionality itself.

jan--f commented on May 27, 2024

But couldn't the ceph_exporter time out after some time?

tserong commented on May 27, 2024

Somehow, I guess :-) I was just thinking that might be annoying to implement in ceph_exporter itself if there's not already some suitable timeout functionality inside librados. But I'm speculating, really, not being familiar enough with the codebase.

sebastian-philipp commented on May 27, 2024

OK, in our experience you cannot time out librados. The only option is to kill the process, e.g. by spawning a new thread that calls kill(getpid(), sig) after a minute or so.
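
(For illustration, a minimal Go sketch of that watchdog idea, not code from ceph_exporter: the helper name and the one-minute budget are made up, and the assumption is that a supervisor such as systemd restarts the exporter after the kill.)

```go
package main

import (
	"os"
	"syscall"
	"time"
)

// watchdog sends SIGTERM to this process if stop() is not called within d.
// The assumption is that a supervisor (systemd, a container runtime, ...)
// restarts the exporter afterwards.
func watchdog(d time.Duration) (stop func()) {
	t := time.AfterFunc(d, func() {
		// The librados call is presumed stuck; killing the whole process
		// is the only reliable way out.
		syscall.Kill(os.Getpid(), syscall.SIGTERM)
	})
	return func() { t.Stop() }
}

func main() {
	stop := watchdog(time.Minute)
	defer stop()
	// ... run the librados-backed collection here ...
}
```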

neurodrone commented on May 27, 2024

Meanwhile, has anyone tried playing with rados_{mon,osd}_op_timeout to see if it does their bidding?
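
(For anyone who wants to try it: a hedged sketch of setting those options, assuming the connection goes through the go-ceph bindings. The 30-second value is arbitrary, and since these are plain librados config options they could just as well be set in ceph.conf.)

```go
package main

import (
	"log"

	"github.com/ceph/go-ceph/rados"
)

// connectWithTimeouts is a hypothetical helper: it opens a rados connection
// with MON and OSD operation timeouts so calls fail instead of blocking forever.
func connectWithTimeouts() (*rados.Conn, error) {
	conn, err := rados.NewConn()
	if err != nil {
		return nil, err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return nil, err
	}
	// Give up on MON/OSD operations after 30 seconds.
	for _, opt := range []string{"rados_mon_op_timeout", "rados_osd_op_timeout"} {
		if err := conn.SetConfigOption(opt, "30"); err != nil {
			return nil, err
		}
	}
	return conn, conn.Connect()
}

func main() {
	if _, err := connectWithTimeouts(); err != nil {
		log.Fatal(err)
	}
}
```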

sebastian-philipp commented on May 27, 2024

Do we really want to force users to set the general librados timeout to fix an issue with monitoring? In any case, AFAIK this timeout does not work in all cases; for example, if you shut down too many OSDs, RBD monitoring will hang indefinitely.

jan--f commented on May 27, 2024

Looking at the Prometheus docs, it seems like this should be handled more gracefully:
https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes
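
(The recommendation in those docs is to keep serving the scrape and expose a 0/1 "up"-style metric when the backend is unreachable. A rough client_golang sketch of that pattern follows; the metric name and helper are hypothetical, not ceph_exporter's actual code.)

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// upGauge reports whether the last attempt to talk to Ceph worked, so a
// down cluster shows up as a value of 0 instead of a hung scrape.
var upGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "ceph_exporter_up",
	Help: "Whether the last attempt to talk to the Ceph cluster succeeded.",
})

// collectCephStats is a placeholder for the real librados-backed collection.
func collectCephStats() error { return nil }

func scrape() {
	if err := collectCephStats(); err != nil {
		upGauge.Set(0) // cluster unreachable: report it instead of hanging
		return
	}
	upGauge.Set(1)
}

func main() {
	prometheus.MustRegister(upGauge)
	scrape()
}
```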

jan--f commented on May 27, 2024

Another reason to address this: after the cluster has been down for a while, the ceph_exporter consumes all the file handles it is allowed to open. This is then reported in the syslog.

I see two ways of resolving this:

  1. Destroy the cluster handle after each run and re-create it on the next scrape. We should get an error there if the cluster is down. The drawback is that this puts more load on the monitors, but it shouldn't be too bad if not run too often (once a minute would be fine imho). Essentially this would behave like running ceph -s every minute.

  2. Put the actual command execution in a child process and monitor that. If it doesn't return within a timeout, put out the appropriate scrape results and tear down the cluster handle.

I'd favour solution 2 but I suspect this would be more complex to implement. What do you guys think?
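
(For illustration, a small Go sketch of the child-process variant of option 2, using a context timeout so a hung call is killed rather than waited on; the exact command and the 30-second budget are placeholders.)

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func main() {
	// If the child does not finish within 30 seconds, the context cancels it
	// and CommandContext kills the process, so the scrape can fail cleanly.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	out, err := exec.CommandContext(ctx, "ceph", "-s", "--format", "json").Output()
	if err != nil {
		// Report a failed scrape instead of hanging the exporter.
		log.Printf("ceph status collection failed: %v", err)
		return
	}
	log.Printf("got %d bytes of cluster status", len(out))
}
```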

neurodrone commented on May 27, 2024

@sebastian-philipp ceph_exporter is assumed to be treated as a client of the system and is thus expected to have a separate configuration of its own (ideally along with a separate read-only auth user). A container is provided to make it easier to pick a configuration that doesn't need to overlap with the production one used for MONs or OSDs. If it works, I don't see any issue using the librados timeout, since there are no data-path ops that can accidentally be sent over it. It should apply even when several OSDs are down, because the timeout is applied per connection to each individual OSD. I would have really liked a better way of injecting timeouts into Ceph calls, but we have to make do with what we have.

@jan--f Agreed, that is indeed bad. Using timeouts to allow reclamation should help solve the resource consumption issue to some degree. The problem with recreating handles is that Prometheus exporters are not allowed to control scraping intervals. The values for those are decided on the server, and I think it's important we make a best effort to provide the values at whatever granularity they might be needed across all use-cases. Option 2 sounds better, where we have data being gathered within a goroutine that runs separately from the main loop. I will take a stab at implementing it, but if you already have something in the works, feel free to make a PR.
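
(A rough sketch of that goroutine-with-deadline idea; every name here is invented for illustration. Note the goroutine itself cannot be cancelled, so a stuck cluster handle still has to be torn down, or the process restarted, separately.)

```go
package main

import (
	"errors"
	"time"
)

// collect is a stand-in for the real librados-backed gathering of stats.
func collect() (map[string]float64, error) {
	return map[string]float64{}, nil
}

// collectWithTimeout runs collect in its own goroutine and gives up after d,
// so the scrape can report failure instead of hanging the main loop.
func collectWithTimeout(d time.Duration) (map[string]float64, error) {
	type result struct {
		stats map[string]float64
		err   error
	}
	ch := make(chan result, 1) // buffered so the goroutine never blocks on send
	go func() {
		stats, err := collect()
		ch <- result{stats, err}
	}()
	select {
	case r := <-ch:
		return r.stats, r.err
	case <-time.After(d):
		return nil, errors.New("ceph collection timed out")
	}
}

func main() {
	_, _ = collectWithTimeout(30 * time.Second)
}
```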

This also won't be an issue long-term, because Luminous will expose this information via ceph-mgr. That way, regardless of the state of the cluster, as long as the mgr is up we should be able to continually view the state of the system.

jan--f commented on May 27, 2024

@neurodrone Please go ahead. I'm fairly ignorant when it comes to Go, so it would take me significant time and effort to come up with something.
I also think fixing this will have serious benefits even over a longer time frame, since many people run older Ceph versions. Pretty sure there are still some clusters from the hammer (0.9x) release up and running. Upgrading a running Ceph cluster can be daunting.

jan--f commented on May 27, 2024

Fixed by #80
