Comments (11)

tserong commented on May 27, 2024

This might be annoying. ceph_exporter talks to the cluster via librados, which AFAIK doesn't actually provide any sort of connect timeout functionality itself.

jan--f commented on May 27, 2024

But couldn't the ceph_exporter time out after some time?

tserong commented on May 27, 2024

Somehow, I guess :-) I was just thinking that might be annoying to implement in ceph_exporter itself if there's not already some suitable timeout functionality inside librados. But I'm speculating, really, not being familiar enough with the codebase.

sebastian-philipp commented on May 27, 2024

OK, in our experience you cannot time out librados. The only option is to kill the process, e.g. by spawning a new thread that calls kill(getpid(), sig) after a minute or so.
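
(For illustration, a minimal Go sketch of that watchdog idea, not code from ceph_exporter: the helper name and the one-minute budget are made up, and the assumption is that a supervisor such as systemd restarts the exporter after the kill.)

```go
package main

import (
	"os"
	"syscall"
	"time"
)

// watchdog sends SIGTERM to this process if stop() is not called within d.
// The assumption is that a supervisor (systemd, a container runtime, ...)
// restarts the exporter afterwards.
func watchdog(d time.Duration) (stop func()) {
	t := time.AfterFunc(d, func() {
		// The librados call is presumed stuck; killing the whole process
		// is the only reliable way out.
		syscall.Kill(os.Getpid(), syscall.SIGTERM)
	})
	return func() { t.Stop() }
}

func main() {
	stop := watchdog(time.Minute)
	defer stop()
	// ... run the librados-backed collection here ...
}
```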

neurodrone commented on May 27, 2024

Meanwhile, has anyone tried playing with rados_{mon,osd}_op_timeout to see if it does their bidding?
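
(For anyone who wants to try it: a hedged sketch of setting those options, assuming the connection goes through the go-ceph bindings. The 30-second value is arbitrary, and since these are plain librados config options they could just as well be set in ceph.conf.)

```go
package main

import (
	"log"

	"github.com/ceph/go-ceph/rados"
)

// connectWithTimeouts is a hypothetical helper: it opens a rados connection
// with MON and OSD operation timeouts so calls fail instead of blocking forever.
func connectWithTimeouts() (*rados.Conn, error) {
	conn, err := rados.NewConn()
	if err != nil {
		return nil, err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return nil, err
	}
	// Give up on MON/OSD operations after 30 seconds.
	for _, opt := range []string{"rados_mon_op_timeout", "rados_osd_op_timeout"} {
		if err := conn.SetConfigOption(opt, "30"); err != nil {
			return nil, err
		}
	}
	return conn, conn.Connect()
}

func main() {
	if _, err := connectWithTimeouts(); err != nil {
		log.Fatal(err)
	}
}
```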

sebastian-philipp commented on May 27, 2024

Do we really want to force users to set the general librados timeout to fix an issue with monitoring? In any case, AFAIK this timeout does not work in all cases; for example, if you shut down too many OSDs, RBD monitoring will hang indefinitely.

jan--f commented on May 27, 2024

Looking at the Prometheus docs, it seems like this should be handled more gracefully:
https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes
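
(The recommendation in those docs is to keep serving the scrape and expose a 0/1 "up"-style metric when the backend is unreachable. A rough client_golang sketch of that pattern follows; the metric name and helper are hypothetical, not ceph_exporter's actual code.)

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// upGauge reports whether the last attempt to talk to Ceph worked, so a
// down cluster shows up as a value of 0 instead of a hung scrape.
var upGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "ceph_exporter_up",
	Help: "Whether the last attempt to talk to the Ceph cluster succeeded.",
})

// collectCephStats is a placeholder for the real librados-backed collection.
func collectCephStats() error { return nil }

func scrape() {
	if err := collectCephStats(); err != nil {
		upGauge.Set(0) // cluster unreachable: report it instead of hanging
		return
	}
	upGauge.Set(1)
}

func main() {
	prometheus.MustRegister(upGauge)
	scrape()
}
```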

jan--f commented on May 27, 2024

Another reason to address this: after the cluster has been down for a while, the ceph_exporter consumes all the file handles it is allowed to open. This is then reported in the syslog.

I see two ways of resolving this:

  1. Destroy the cluster handle after each run and re-create it on the next scrape. We should get an error there if the cluster is down. The drawback is that this puts more load on the monitors, but it shouldn't be too bad if not run too often (once a minute would be fine imho). Essentially this would behave like running ceph -s every minute.

  2. Put the actual command execution in a child process and monitor that. If it doesn't return within a timeout, put out the appropriate scrape results and tear down the cluster handle.

I'd favour solution 2 but I suspect this would be more complex to implement. What do you guys think?
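
(For illustration, a small Go sketch of the child-process variant of option 2, using a context timeout so a hung call is killed rather than waited on; the exact command and the 30-second budget are placeholders.)

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func main() {
	// If the child does not finish within 30 seconds, the context cancels it
	// and CommandContext kills the process, so the scrape can fail cleanly.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	out, err := exec.CommandContext(ctx, "ceph", "-s", "--format", "json").Output()
	if err != nil {
		// Report a failed scrape instead of hanging the exporter.
		log.Printf("ceph status collection failed: %v", err)
		return
	}
	log.Printf("got %d bytes of cluster status", len(out))
}
```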

neurodrone commented on May 27, 2024

@sebastian-philipp ceph_exporter is assumed to be treated as a client of the system and is thus expected to have a separate configuration of its own (ideally along with a separate read-only auth user). A container is provided to make it easier to pick a configuration that doesn't need to overlap with the production one used for MONs or OSDs. If it works, I don't see any issue using the librados timeout, since there are no data-path ops that can accidentally be sent over it. It should apply even when several OSDs are down, because the timeout is applied per connection to each individual OSD. I would have really liked a better way of injecting timeouts into Ceph calls, but we have to make do with what we have.

@jan--f Agreed, that is indeed bad. Using timeouts to allow reclamation should help solve the resource consumption issue to some degree. The problem with recreating handles is that Prometheus exporters are not allowed to control scraping intervals. The values for those are decided on the server, and I think it's important we make a best effort to provide the values at whatever granularity they might be needed across all use-cases. Option 2 sounds better, where we have data being gathered within a goroutine that runs separately from the main loop. I will take a stab at implementing it, but if you already have something in the works, feel free to make a PR.
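
(A rough sketch of that goroutine-with-deadline idea; every name here is invented for illustration. Note the goroutine itself cannot be cancelled, so a stuck cluster handle still has to be torn down, or the process restarted, separately.)

```go
package main

import (
	"errors"
	"time"
)

// collect is a stand-in for the real librados-backed gathering of stats.
func collect() (map[string]float64, error) {
	return map[string]float64{}, nil
}

// collectWithTimeout runs collect in its own goroutine and gives up after d,
// so the scrape can report failure instead of hanging the main loop.
func collectWithTimeout(d time.Duration) (map[string]float64, error) {
	type result struct {
		stats map[string]float64
		err   error
	}
	ch := make(chan result, 1) // buffered so the goroutine never blocks on send
	go func() {
		stats, err := collect()
		ch <- result{stats, err}
	}()
	select {
	case r := <-ch:
		return r.stats, r.err
	case <-time.After(d):
		return nil, errors.New("ceph collection timed out")
	}
}

func main() {
	_, _ = collectWithTimeout(30 * time.Second)
}
```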

This also won't be an issue long-term, because Luminous will expose this information via ceph-mgr. That way, regardless of the state of the cluster, as long as the mgr is up we should be able to continually view the state of the system.

jan--f commented on May 27, 2024

@neurodrone Please go ahead. I'm fairly ignorant when it comes to Go, so it would take me significant time and effort to come up with something.
I also think fixing this will have serious benefits even over a longer time frame, since many people run older Ceph versions. Pretty sure there are still some clusters from the hammer (0.9x) release up and running. Upgrading a running Ceph cluster can be daunting.

jan--f commented on May 27, 2024

Fixed by #80
