Comments (11)
This might be annoying. ceph_exporter talks to the cluster via librados, which AFAIK doesn't actually provide any sort of connect timeout functionality itself.
from ceph_exporter.
But couldn't ceph_exporter time out after some time?
Somehow, I guess :-) I was just thinking that might be annoying to implement in ceph_exporter itself if there's not already some suitable timeout functionality inside librados. But I'm speculating, really, not being familiar enough with the codebase.
Ok, in our experience, you cannot time out librados. The only option is to kill the process, e.g. by spawning a new thread that calls kill(getpid(), sig) after a minute or so.
Meanwhile, has anyone tried playing with rados_{mon,osd}_op_timeout to see if it does their bidding?
Do we really want to force users to set the general librados timeout to fix an issue with monitoring? In any case, AFAIK this timeout does not work in all cases: for example, if you shut down too many OSDs, RBD monitoring will hang indefinitely.
Looking into the prometheus docs, it seems like this should be handled more gracefully:
https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes
Another reason to address this: after the cluster has been down for a while, ceph_exporter will consume all the file handles it is allowed to open. This is then reported in the syslog.
I see two ways of resolving this:
1. Destroy the cluster handle after each run and re-create it on the next scrape. We should get an error there if the cluster is down. The drawback is that this puts more load on the monitors, but it shouldn't be too bad if not run too often (I think once a minute would be fine, IMHO). Essentially this would behave like running ceph -s every minute.
2. Put the actual command execution in a child process and monitor that. If it doesn't return within a timeout, put out the appropriate scrape results and tear down the cluster handle.
I'd favour solution 2, but I suspect it would be more complex to implement. What do you guys think?
@sebastian-philipp It is assumed that ceph_exporter is treated as a client of the system and is thus expected to have a separate configuration of its own (ideally along with a separate read-only auth user). A container is provided to make it easier to pick a configuration that doesn't need to overlap with the production one used for MONs or OSDs. If it works, I don't see any issue with using the librados timeout, since there are no data-path ops that can accidentally be sent over it. It should apply even when several OSDs are down, because the timeout is applied per connection to each individual OSD. I would have really liked a better way of injecting timeouts into Ceph calls, but we have to make do with what we have.
@jan--f Agreed, that is indeed bad. Using timeouts to allow reclamation should help solve the resource consumption issue to some degree. The problem with recreating handles is that Prometheus exporters are not allowed to control scraping intervals. Those values are decided on the server, and I think it's important that we make a best effort to provide the values at whatever granularity might be needed across all use cases. Option 2 sounds better, where we have data being gathered within a goroutine that runs separately from the main loop. I will take a swipe at implementing it, but if you already have something in the works, feel free to make a PR.
This also won't be an issue long-term, because Luminous will expose this information via ceph-mgr. That way, regardless of the state of the cluster, as long as the mgr is up we should be able to continually view the state of the system.
@neurodrone Please go ahead. I'm fairly ignorant when it comes to Go, so it would take me significant time and effort to come up with something.
I also think fixing this will have serious benefits even over a longer time frame, since many people run older Ceph versions. I'm pretty sure there are still some clusters from the Hammer (0.9x) release up and running. Upgrading a running Ceph cluster can be daunting.
Fixed by #80