digitalocean / ceph_exporter
Prometheus exporter that scrapes meta information about a ceph cluster.
License: Apache License 2.0
Hey,
We have issues with ceph_cluster_objects and ceph_osd_pgs: neither is displayed correctly.
First of all, ceph_cluster_objects: if it shows the total number of objects in Ceph, why do our graphs show ceph_misplaced_objects > ceph_cluster_objects?
ceph_osd_pgs also does not appear to give correct values. In our DEV cluster the number of placement groups is always the same, 10560, but the graph shows this number varying.
Moreover, the current value of sum(ceph_osd_pgs) is 30233, and even dividing by the number of replicas (2) does not equal what ceph -s reports, 10560.
ceph_exporter is running in a Docker container and we're using the https://grafana.net/dashboards/917 dashboard.
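On the ceph_osd_pgs question: each placement group is counted once on every OSD that holds a replica of it, so sum(ceph_osd_pgs) approximates the sum of pg_num × replica size over all pools, not the raw PG count from ceph -s. A minimal sketch with a hypothetical pool layout (the pg_num and size values below are assumptions; real ones come from ceph osd pool ls detail):

```go
package main

import "fmt"

// pool holds the PG count and replica size of one pool. The concrete
// numbers used below are hypothetical, chosen only to sum to 10560 PGs.
type pool struct {
	pgNum, size int
}

// pgTotals returns the cluster-wide PG count (what `ceph -s` reports) and
// the sum of per-OSD PG counts (what sum(ceph_osd_pgs) approximates):
// each PG appears once on every OSD holding a replica of it.
func pgTotals(pools []pool) (total, sumAcrossOSDs int) {
	for _, p := range pools {
		total += p.pgNum
		sumAcrossOSDs += p.pgNum * p.size
	}
	return
}

func main() {
	// 10560 PGs split across a size-3 pool and a size-2 pool lands the
	// summed metric near the observed 30233 even though `ceph -s` says 10560.
	total, sum := pgTotals([]pool{{8192, 3}, {2368, 2}})
	fmt.Println(total, sum) // 10560 29312
}
```

A mix of size-2 and size-3 pools would explain why dividing the sum by any single replica factor never lands exactly on 10560.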
Hello there,
I am trying to use the Docker image and have Grafana display the data via the Ceph dashboard here:
https://grafana.com/dashboards/917
However, I believe the Docker image does not include a Prometheus server. Is Grafana supported out of the box with this as a data source?
Or do I need to set up a full Prometheus server to perform the scrapes?
I ask because when going to http://$dockerhostIP:9128/metrics the metrics are clearly available.
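For what it's worth, Grafana itself never scrapes exporters; a Prometheus server sits in between and the dashboard reads from it as a data source. A minimal scrape-config sketch (the target host is a placeholder):

```yaml
# Hypothetical prometheus.yml fragment: scrape the exporter on port 9128,
# then point Grafana at this Prometheus server as its data source.
scrape_configs:
  - job_name: ceph
    static_configs:
      - targets: ['dockerhostIP:9128']
```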
Guys, in my work environment I have multiple Ceph clusters. Because I'm new to Ceph, I didn't modify their cluster names when I installed them. As a result they all have the same cluster name, so I cannot use "cluster" to distinguish them. So far there seems to be no easy way to change a Ceph cluster's name either.
The solution that came to mind is adding "fsid" as a new label. It should be unique. The shortcoming is also obvious: it will be harder to tell which cluster is which, but at least it should work.
What do you think about it?
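For reference, a sketch of how the fsid could be fetched with the same mon-command pattern the exporter already uses. The {"fsid": ...} response shape is an assumption based on ceph fsid -f json; in the real exporter, conn.MonCommand from go-ceph would carry the payload:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fsidCommand builds the JSON payload for the `fsid` mon command, the
// same way the exporter builds its other mon commands.
func fsidCommand() []byte {
	cmd, err := json.Marshal(map[string]interface{}{
		"prefix": "fsid",
		"format": "json",
	})
	if err != nil {
		panic(err)
	}
	return cmd
}

// parseFSID decodes the assumed {"fsid": "..."} response; the value
// could then be attached to every metric as a constant label.
func parseFSID(buf []byte) (string, error) {
	var out struct {
		FSID string `json:"fsid"`
	}
	err := json.Unmarshal(buf, &out)
	return out.FSID, err
}

func main() {
	fmt.Println(string(fsidCommand()))
	id, _ := parseFSID([]byte(`{"fsid":"abc-123"}`)) // hypothetical response
	fmt.Println(id)
}
```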
The latest version is listed as supporting Luminous. Will it work with Ceph v13.2.2 (Mimic)?
On OSX the Ceph packages are not available to build the binary and create a container. We use Ceph in our production systems, but installing golang on the nodes to build the package is something to be avoided.
This could be mitigated by using a CI system (Travis?) to do the builds and publish a public container for the exporter (which is somewhat implied by the digitalocean/ceph_exporter tag used to build the image).
Trying to build and test the ceph_exporter, but make fails:
make
Go version 1.5.3 required but not found in PATH.
About to download and install go1.5.3 to /home/jan/work/code/go/src/github.com/digitalocean/ceph_exporter/.build/go1.5.3
Abort now if you want to manually install it system-wide instead.
mkdir -p .build
# The archive contains a single directory called 'go/'.
curl -L https://golang.org/dl/go1.5.3.linux-amd64.tar.gz | tar -C .build -xzf -
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 87 100 87 0 0 285 0 --:--:-- --:--:-- --:--:-- 286
100 76.4M 100 76.4M 0 0 2218k 0 0:00:35 0:00:35 --:--:-- 2672k
rm -rf /home/jan/work/code/go/src/github.com/digitalocean/ceph_exporter/.build/go1.5.3
mv .build/go /home/jan/work/code/go/src/github.com/digitalocean/ceph_exporter/.build/go1.5.3
GO15VENDOREXPERIMENT=1 GOROOT=/home/jan/work/code/go/src/github.com/digitalocean/ceph_exporter/.build/go1.5.3 /home/jan/work/code/go/src/github.com/digitalocean/ceph_exporter/.build/go1.5.3/bin/go build -o ceph_exporter
# github.com/digitalocean/ceph_exporter/vendor/github.com/ceph/go-ceph/rados
cannot load DWARF output from $WORK/github.com/digitalocean/ceph_exporter/vendor/github.com/ceph/go-ceph/rados/_obj//_cgo_.o: decoding dwarf section info at offset 0x4: unsupported version 0
make: *** [Makefile.COMMON:86: ceph_exporter] Error 2
make 8.05s user 2.03s system 23% cpu 42.795 total
System-wide go version is go-2:1.8.3-1 and is ignored.
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90): the metric ceph_osd_crush_weight returns the OSD WEIGHT value.
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable): the metric ceph_osd_weight returns the OSD REWEIGHT value.
I expect the WEIGHT value to be returned, so this seems to be a bug.
Metrics that carry pool names or daemon IDs as labels persist after the respective entity has been deleted from the cluster.
To reproduce: create a pool, wait until ceph_exporter exports it, then delete it. The exporter will keep exporting the last metrics for that pool until it is restarted.
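The underlying pattern, sketched generically below (this is not the exporter's actual code): if per-entity series live in a persistent map or metric vector and are only ever updated, deleted entities linger until restart. Pruning anything absent from the latest scrape fixes it:

```go
package main

import "fmt"

// prune drops entries for entities (pools, OSDs, ...) that no longer
// appear in the current scrape, so deleted entities stop being exported.
func prune(metrics map[string]float64, current []string) {
	seen := make(map[string]bool, len(current))
	for _, name := range current {
		seen[name] = true
	}
	for name := range metrics {
		if !seen[name] {
			delete(metrics, name)
		}
	}
}

func main() {
	metrics := map[string]float64{"rbd": 42, "old-pool": 7}
	prune(metrics, []string{"rbd"}) // "old-pool" was deleted from the cluster
	fmt.Println(len(metrics))       // 1
}
```

With client_golang, the equivalent is calling Reset() on the vector at the start of each scrape, or emitting const metrics from Collect so nothing persists between scrapes.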
When trying to run "go install" inside ceph_exporter, I get the following error message:
can't load package: /usr/local/go/src/ceph_exporter/exporter.go:26:2: non-standard import "github.com/ceph/go-ceph/rados" in standard package "ceph_exporter"
root@cm03:/usr/local/go/src/github.com/ceph/go-ceph/rados# ls -ltr
total 40
-rw-r--r-- 1 root root 1989 May 18 09:15 rados.go
-rw-r--r-- 1 root root 23737 May 18 09:15 ioctx.go
-rw-r--r-- 1 root root 57 May 18 09:15 doc.go
-rw-r--r-- 1 root root 8174 May 18 09:15 conn.go
Any ideas? Thanks in advance,
Hello!
root@ceph-osd-mon02:~# ceph --version
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
Using the exporter from the Docker image:
docker run -v /etc/ceph:/etc/ceph -d --net=host digitalocean/ceph_exporter
OSD metrics are collected normally, but there are no monitor metrics. In the logs:
root@ceph-osd-mon02:~# docker logs -f a70ac49a79d3
2018/07/19 16:29:19 Starting ceph exporter on ":9128"
2018/07/19 16:29:29 failed collecting monitor metrics: rados: Invalid argument
2018/07/19 16:41:56 failed collecting monitor metrics: rados: Invalid argument
ceph.conf - https://pastebin.com/sYPgqPSR
ceph_osds_down is returning zero (0) when in fact a number of OSDs are down. We are running Jewel. We can use ceph_osds - ceph_osds_up as a workaround, because ceph_osds_up seems to report correctly.
I just compiled the most recent version and I'm not getting any ceph metrics.
Using ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
The log just shows:
2017/09/17 22:45:27 Starting ceph exporter on ":9128"
These are the only metrics I'm getting from ceph_exporter:
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 6
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.9"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 837360
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 837360
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.443282e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 424
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 169984
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 837360
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 933888
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.851392e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 8410
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 2.78528e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 21
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 8834
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 6944
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 29184
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 32768
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.473924e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 797478
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 360448
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 360448
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 5.605624e+06
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 7
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.07
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 18
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.7762688e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.50568112768e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.87976192e+08
In Nautilus the structure of the output of ceph osd perf has changed slightly, and as a result collecting OSD latency metrics appears to fail.
The following two metrics are not collected.
As far as I can see from the commit log, there was a fix for this, but it was somehow reverted.
1f2df8d#diff-1536b81b5897f95267830a7c215ad5ab
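A version-tolerant parse could accept both shapes of the `ceph osd perf` JSON. The field names below follow the upstream output as far as I can tell (pre-Nautilus: a top-level "osd_perf_infos" array; Nautilus: the same array nested under "osdstats"); verify against your release before relying on them:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// perfInfo mirrors one entry of the `ceph osd perf` JSON output.
type perfInfo struct {
	ID    int64 `json:"id"`
	Stats struct {
		CommitLatencyMs float64 `json:"commit_latency_ms"`
		ApplyLatencyMs  float64 `json:"apply_latency_ms"`
	} `json:"perf_stats"`
}

// parseOSDPerf tries the Nautilus shape first, then falls back to the
// pre-Nautilus shape, so one collector handles both releases.
func parseOSDPerf(buf []byte) ([]perfInfo, error) {
	var nautilus struct {
		OSDStats struct {
			Infos []perfInfo `json:"osd_perf_infos"`
		} `json:"osdstats"`
	}
	if err := json.Unmarshal(buf, &nautilus); err == nil && len(nautilus.OSDStats.Infos) > 0 {
		return nautilus.OSDStats.Infos, nil
	}
	var legacy struct {
		Infos []perfInfo `json:"osd_perf_infos"`
	}
	err := json.Unmarshal(buf, &legacy)
	return legacy.Infos, err
}

func main() {
	legacy := []byte(`{"osd_perf_infos":[{"id":0,"perf_stats":{"commit_latency_ms":2,"apply_latency_ms":3}}]}`)
	nautilus := []byte(`{"osdstats":{"osd_perf_infos":[{"id":1,"perf_stats":{"commit_latency_ms":4,"apply_latency_ms":5}}]}}`)
	for _, buf := range [][]byte{legacy, nautilus} {
		infos, err := parseOSDPerf(buf)
		fmt.Println(len(infos), err)
	}
}
```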
Hey!
We just updated our ceph_exporter to the latest image (our previous image was 3 weeks old) and noticed that while it works with Ceph cluster version 10.2.2, it gets connection timeouts with newer ones, i.e. 10.2.3:
# docker logs -f ceph-env-region1
2016/12/29 11:06:35 cannot connect to ceph cluster: rados: Connection timed out
2016/12/29 11:11:35 cannot connect to ceph cluster: rados: Connection timed out
2016/12/29 11:16:35 cannot connect to ceph cluster: rados: Connection timed out
2016/12/29 11:21:35 cannot connect to ceph cluster: rados: Connection timed out
2016/12/29 11:26:36 cannot connect to ceph cluster: rados: Connection timed out
2016/12/29 11:31:36 cannot connect to ceph cluster: rados: Connection timed out
Have you tried it with a version newer than 10.2.2? Also, it would be great if you tagged exporter releases, so we won't have any headaches reverting back :)
Hi,
Any news regarding full Luminous compatibility?
I want to collect metrics using the ceph_exporter collectors and use those values for another purpose, but I do not want to expose them on localhost:9128. How can this be done, using the collector functions or methods defined here as a dependency in another project?
I have deployed ceph_exporter on k8s in two versions, Jewel and Luminous.
For Ceph version 12.2.5 (Luminous), the IO and throughput values are always 0.
Please tell me how to troubleshoot this problem.
Thanks.
After building the ceph_exporter Go project, I try to run the ceph_exporter binary and get the following error messages:
root@cm03:/# /usr/local/go/bin/ceph_exporter
2017/05/18 14:33:48 Starting ceph exporter on ":9128"
2017/05/18 14:33:55 [ERROR] cannot extract total bytes: strconv.ParseFloat: parsing "": invalid syntax
2017/05/18 14:33:55 [ERROR] cannot extract used bytes: strconv.ParseFloat: parsing "": invalid syntax
2017/05/18 14:33:55 [ERROR] cannot extract available bytes: strconv.ParseFloat: parsing "": invalid syntax
2017/05/18 14:33:55 failed collecting cluster health metrics: strconv.ParseFloat: parsing "": invalid syntax
2017/05/18 14:33:55 [ERROR] Unable to collect data from ceph osd df rados: Invalid argument
2017/05/18 14:33:55 failed collecting osd metrics: rados: Invalid argument
Any ideas?
After upgrading jewel -> luminous, the ceph_health_status value is stuck at 1 and will not change?
Hi,
With Ceph Hammer I get incomplete values for the IO ops:
# HELP ceph_cache_promote_io_ops Total cache promote operations measured per second
# TYPE ceph_cache_promote_io_ops gauge
ceph_cache_promote_io_ops 0
# HELP ceph_client_io_ops Total client ops on the cluster measured per second
# TYPE ceph_client_io_ops gauge
ceph_client_io_ops 111
# HELP ceph_client_io_read_ops Total client read I/O ops on the cluster measured per second
# TYPE ceph_client_io_read_ops gauge
ceph_client_io_read_ops 0
# HELP ceph_client_io_write_ops Total client write I/O ops on the cluster measured per second
# TYPE ceph_client_io_write_ops gauge
ceph_client_io_write_ops 0
Is this a problem with the Hammer release?
Regards, Eckebrecht
Hi,
Unfortunately, building from source is not possible anymore.
After some research I guess this is caused by changes in the Ceph libs, but I'm just a noob here.
The Docker build gives me:
# github.com/digitalocean/ceph_exporter/vendor/github.com/ceph/go-ceph/rados
cgo-gcc-prolog: In function '_cgo_c6f595483c63_Cfunc_rados_objects_list_close':
cgo-gcc-prolog:348:2: warning: 'rados_objects_list_close' is deprecated [-Wdeprecated-declarations]
In file included from vendor/github.com/ceph/go-ceph/rados/ioctx.go:6:0:
/usr/include/rados/librados.h:3845:21: note: declared here
CEPH_RADOS_API void rados_objects_list_close(
^
cgo-gcc-prolog: In function '_cgo_c6f595483c63_Cfunc_rados_objects_list_get_pg_hash_position':
cgo-gcc-prolog:364:2: warning: 'rados_objects_list_get_pg_hash_position' is deprecated [-Wdeprecated-declarations]
In file included from vendor/github.com/ceph/go-ceph/rados/ioctx.go:6:0:
/usr/include/rados/librados.h:3836:25: note: declared here
CEPH_RADOS_API uint32_t rados_objects_list_get_pg_hash_position(
^
cgo-gcc-prolog: In function '_cgo_c6f595483c63_Cfunc_rados_objects_list_next':
cgo-gcc-prolog:384:2: warning: 'rados_objects_list_next' is deprecated [-Wdeprecated-declarations]
In file included from vendor/github.com/ceph/go-ceph/rados/ioctx.go:6:0:
/usr/include/rados/librados.h:3841:20: note: declared here
CEPH_RADOS_API int rados_objects_list_next(
^
cgo-gcc-prolog: In function '_cgo_c6f595483c63_Cfunc_rados_objects_list_open':
cgo-gcc-prolog:403:2: warning: 'rados_objects_list_open' is deprecated [-Wdeprecated-declarations]
In file included from vendor/github.com/ceph/go-ceph/rados/ioctx.go:6:0:
/usr/include/rados/librados.h:3833:20: note: declared here
CEPH_RADOS_API int rados_objects_list_open(
^
cgo-gcc-prolog: In function '_cgo_c6f595483c63_Cfunc_rados_objects_list_seek':
cgo-gcc-prolog:423:2: warning: 'rados_objects_list_seek' is deprecated [-Wdeprecated-declarations]
In file included from vendor/github.com/ceph/go-ceph/rados/ioctx.go:6:0:
/usr/include/rados/librados.h:3838:25: note: declared here
CEPH_RADOS_API uint32_t rados_objects_list_seek(
^
cgo-gcc-prolog: In function '_cgo_c6f595483c63_Cfunc_rados_read_op_omap_get_vals':
cgo-gcc-prolog:497:2: warning: 'rados_read_op_omap_get_vals' is deprecated [-Wdeprecated-declarations]
In file included from vendor/github.com/ceph/go-ceph/rados/ioctx.go:6:0:
/usr/include/rados/librados.h:3272:21: note: declared here
CEPH_RADOS_API void rados_read_op_omap_get_vals(rados_read_op_t read_op,
^
Despite being only "warning" messages, this causes the build to fail. The new Docker image does not get a new ceph_exporter binary but uses a binary that is obviously already included.
An attempt to build via "go build" directly on localhost gives me:
# github.com/digitalocean/ceph_exporter/collectors
collectors/conn.go:32:15: undefined: rados.Conn
But rados.Conn can be found and resolved in /vendor. The root cause might be recent changes in the Ceph libs, as mentioned above.
Steps to reproduce:
Does anyone know a workaround?
Thanks in advance.
Metrics coming from ceph daemon osd.X perf dump are very important for evaluating performance. The only problem I see in the code is that, so far, there is no other case where an API call is needed for every OSD, as this case requires. Is there any plan to support them?
The ceph_exporter from the luminous-2.0.0 branch fails when some OSDs are not online. It seems there are duplicate values. The Ceph release is 12.2.7. We are using the official Docker container with the luminous-2.0.0 tag...
An error has occurred during metrics gathering:
4 error(s) occurred:
When all OSDs are online, everything works as expected.
Both ceph_misplaced_objects and ceph_degraded_objects show 0, but my ceph -s output looks like this:
data:
pools: 3 pools, 364 pgs
objects: 1103M objects, 21756 GB
usage: 47129 GB used, 26597 GB / 73727 GB avail
pgs: 64413611/2314108426 objects degraded (2.784%)
466174533/2314108426 objects misplaced (20.145%)
237 active+clean
106 active+remapped+backfill_wait
15 active+undersized+degraded+remapped+backfill_wait
5 active+undersized+degraded+remapped+backfilling
1 active+remapped+backfilling
Is this normal?
Does ceph_exporter need to be installed on each node, or is it enough to install it on one node?
When removing an OSD from the cluster it does not vanish from the metrics.
Version: 1.0.0
Example:
We removed OSD.230 from the cluster after a complete node failure, so the OSD never returned. We removed the OSD, its keys, and the crush rules. When we reload the exporter, the metrics of the removed OSD are no longer exported. But the moment we query a node where we did not reload the exporter, we keep getting these results:
ceph_osd_avail_bytes{osd="osd.230"} 4.252156604e+12
ceph_osd_bytes{osd="osd.230"} 5.858434628e+12
ceph_osd_crush_weight{osd="osd.230"} 5.456085
ceph_osd_depth{osd="osd.230"} 2
ceph_osd_in{osd="osd.230"} 0
ceph_osd_perf_apply_latency_seconds{osd="osd.230"} 0
ceph_osd_perf_commit_latency_seconds{osd="osd.230"} 0
ceph_osd_pgs{osd="osd.230"} 131
ceph_osd_reweight{osd="osd.230"} 1
ceph osd tree|grep 230 returns nothing.
ceph auth list |grep 230 returns nothing.
According to our audit logs the osd was removed from the cluster on 2017-05-22, more than 9 days ago.
We removed the osd by:
ceph osd crush remove osd.{osd-num}
ceph auth del osd.{osd-num}
ceph osd rm {osd-num}
ceph osd crush remove {host}
Ceph version:
10.2.5
Restarting the exporter process clears the removed OSDs from the result.
We have 10 OSDs per node; all 10 removed OSDs from this node are still reported.
Currently we run one exporter per Ceph monitor node, so we have 5 exporters running.
This seems to be the same issue as #48, but we do not see the OSDs removed from the metrics even after a day.
When one of the OSDs is out, ceph osd df -f json displays -nan for utilization and variance, like the following:
{
"id": 4,
"name": "osd.4",
"type": "osd",
"type_id": 0,
"crush_weight": 0.047791,
"depth": 2,
"reweight": 0.000000,
"kb": 0,
"kb_used": 0,
"kb_avail": 0,
"utilization": -nan,
"var": -nan,
"pgs": 50
}
And the exporter says
2017/04/10 19:51:07 Starting ceph exporter on ":9128"
2017/04/10 19:51:08 failed collecting osd metrics: invalid character 'n' in numeric literal
The ceph_osd_average_utilization metric also becomes wrong in this case.
Radosgw exposes nice stats. I'm mostly interested in per-bucket utilization:
root@dev:/# radosgw-admin bucket stats --bucket complainer
{
"bucket": "complainer",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets.index",
"id": "default.1710998.19",
"marker": "default.1710998.19",
"owner": "complainer",
"ver": "0#8858394",
"master_ver": "0#0",
"mtime": "2016-05-23 15:21:12.000000",
"max_marker": "0#",
"usage": {
"rgw.none": {
"size_kb": 0,
"size_kb_actual": 0,
"num_objects": 0
},
"rgw.main": {
"size_kb": 200505199,
"size_kb_actual": 203796736,
"num_objects": 1342459
}
},
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}
Is it something that ceph_exporter should provide, or is it a job for a separate radosgw_exporter?
Hi,
When debugging OSD latency issues it would be very helpful to have a label connecting the OSD to its host (like the output of ceph osd tree).
It would also disambiguate things if the OSD ever moves to another host.
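Such a label could be derived from ceph osd tree -f json by walking the CRUSH nodes. A sketch, assuming the upstream field names ("nodes", "type", "children") and a flat host-over-osd hierarchy:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// treeNode mirrors one entry of `ceph osd tree -f json`'s "nodes" array.
type treeNode struct {
	ID       int64   `json:"id"`
	Name     string  `json:"name"`
	Type     string  `json:"type"`
	Children []int64 `json:"children"`
}

// osdHosts maps each OSD name to the host node that contains it, which
// could then back a `host` label on OSD metrics.
func osdHosts(buf []byte) (map[string]string, error) {
	var tree struct {
		Nodes []treeNode `json:"nodes"`
	}
	if err := json.Unmarshal(buf, &tree); err != nil {
		return nil, err
	}
	byID := make(map[int64]treeNode, len(tree.Nodes))
	for _, n := range tree.Nodes {
		byID[n.ID] = n
	}
	out := make(map[string]string)
	for _, n := range tree.Nodes {
		if n.Type != "host" {
			continue
		}
		for _, child := range n.Children {
			if osd, ok := byID[child]; ok && osd.Type == "osd" {
				out[osd.Name] = n.Name
			}
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{"nodes":[{"id":-2,"name":"node1","type":"host","children":[0]},{"id":0,"name":"osd.0","type":"osd"}]}`)
	m, _ := osdHosts(raw)
	fmt.Println(m["osd.0"]) // node1
}
```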
[root@VM_0_13_centos ceph_exporter]# go install
vendor/github.com/ceph/go-ceph/rados/ioctx.go:453:2: could not determine kind of name for C.rados_read_op_omap_get_vals2
[root@VM_0_13_centos ceph_exporter]# rpm -qa |grep rbd
librbd1-10.2.5-4.el7.x86_64
librbd1-devel-10.2.5-4.el7.x86_64
[root@VM_0_13_centos ceph_exporter]# rpm -qa |grep rados
librados2-devel-10.2.5-4.el7.x86_64
librados2-10.2.5-4.el7.x86_64
When running docker run -v /etc/ceph:/etc/ceph --net=host -it digitalocean/ceph_exporter
I get the following error:
cannot connect to ceph cluster: rados: No such file or directory
Am I doing something wrong?
When the cluster is unavailable (because e.g. 2/3 MONs are down), ceph_exporter seems to never return. Since ceph_exporter does not actually depend on a running cluster, it would be nicer if it could return an appropriate status. Right now a monitoring solution has to rely on the Prometheus scrape timeout to detect this.
Hi,
What could be wrong here ?
Thanks
You are using prometheus.Handler(), which has been deprecated (promhttp.Handler() is its replacement).
We've noticed an issue with ceph_exporter: it still displays the ceph_osd_up metric with value 0, even after that OSD has been removed from the cluster.
Does the exporter have counter resets or something like that?
We're using the Docker container with ceph_exporter version 0.1.0.
Thanks!
I downloaded this project and followed the steps below.
Here is the error:
./exporter.go:49: cannot use collectors.NewClusterUsageCollector(conn, cluster) (type *collectors.ClusterUsageCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.ClusterUsageCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:50: cannot use collectors.NewPoolUsageCollector(conn, cluster) (type *collectors.PoolUsageCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.PoolUsageCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:51: cannot use collectors.NewClusterHealthCollector(conn, cluster) (type *collectors.ClusterHealthCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.ClusterHealthCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:52: cannot use collectors.NewMonitorCollector(conn, cluster) (type *collectors.MonitorCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.MonitorCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:53: cannot use collectors.NewOSDCollector(conn, cluster) (type *collectors.OSDCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.OSDCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "github.com/prometheus/client_golang/prometheus".Metric)
How can I fix it? Thanks so much.
It would be nice to have metrics for the CephFS (POSIX mode) metadata server.
ceph_health_status_interp
In the source code I cannot find a soft_warn definition. Is ceph_health_status_interp useful for Ceph monitoring?
Can someone explain this metric? Thank you.
@neurodrone @nickvanw Jewel client IOPS still isn't reported correctly after building with 8af54c1 #16 (re: original report #15)
2016/05/17 21:17:03 failed collecting cluster recovery/client io: can't parse units "op"
2016/05/17 21:17:03 failed collecting cluster recovery/client io: can't parse units "op"
2016/05/17 21:17:06 failed collecting cluster recovery/client io: can't parse units "op"
2016/05/17 21:17:06 failed collecting cluster recovery/client io: can't parse units "op"
I see some complaints when querying /metrics when connected to a ceph jewel cluster:
2016/05/17 16:40:01 failed collecting cluster recovery/client io: can't parse units "op"
It is probably balking at this line in the ceph -s output: client io 20166 kB/s wr, 0 op/s rd, 42 op/s wr. Older versions only reported a single op/s figure, but jewel reports read ops and write ops separately.
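A unit-aware parse of that line could handle both styles. This is a guess at a fix, sketched against the sample line, not the code the exporter later shipped:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// opsRe matches "N op/s" optionally suffixed with " rd" or " wr", so it
// covers both the pre-jewel and jewel status-line formats.
var opsRe = regexp.MustCompile(`(\d+) op/s( rd| wr)?`)

// parseOps extracts per-direction op rates; older releases emit a single
// unlabeled "N op/s" figure, which lands in the unlabeled return value.
func parseOps(line string) (read, write, unlabeled int) {
	for _, m := range opsRe.FindAllStringSubmatch(line, -1) {
		n, _ := strconv.Atoi(m[1])
		switch m[2] {
		case " rd":
			read += n
		case " wr":
			write += n
		default:
			unlabeled += n
		}
	}
	return
}

func main() {
	r, w, _ := parseOps("client io 20166 kB/s wr, 0 op/s rd, 42 op/s wr")
	fmt.Println(r, w) // 0 42
}
```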
[root@client4ha ceph_exporter-1.0.0]# go build
./exporter.go:49: not enough arguments in call to collectors.NewClusterUsageCollector
have (*rados.Conn)
want (collectors.Conn, string)
./exporter.go:49: cannot use collectors.NewClusterUsageCollector(conn) (type *collectors.ClusterUsageCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.ClusterUsageCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:50: cannot use collectors.NewPoolUsageCollector(conn) (type *collectors.PoolUsageCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.PoolUsageCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:51: cannot use collectors.NewClusterHealthCollector(conn) (type *collectors.ClusterHealthCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.ClusterHealthCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:52: cannot use collectors.NewMonitorCollector(conn) (type *collectors.MonitorCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.MonitorCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:53: cannot use collectors.NewOSDCollector(conn) (type *collectors.OSDCollector) as type "github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.OSDCollector does not implement "github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
Hi,
I want to get RBD image info using ceph_exporter, but after reading the source code I found it is not supported.
There is also a question that confuses me: as you know, the commands we use to get RBD info start with "rbd", while in the ceph_exporter code it seems the commands start with "ceph"?
// cephOSDDFCommand builds the JSON payload sent to the monitors; it is
// equivalent to running `ceph osd df -f json` on the command line.
func (o *OSDCollector) cephOSDDFCommand() []byte {
	cmd, err := json.Marshal(map[string]interface{}{
		"prefix": "osd df", // the mon command to execute
		"format": "json",   // request machine-readable output
	})
	if err != nil {
		panic(err)
	}
	return cmd
}
If this needs to go somewhere else, I apologize; please point me in the right direction.
I have two Ceph clusters. Both config files are in /etc/ceph and they are named ceph.conf and ceph2.conf. However, I am unable to find documentation about what else I may have to do to send Prometheus information on both clusters. Both config files do appear in the container. Any help would be appreciated.
Does 2.0.7-luminous support Ceph Jewel?
[root@rg1-ceph01 ~/go/src/ceph_exporter]# go build
./exporter.go:85:39: cannot use collectors.NewClusterUsageCollector(conn, cluster) (type *collectors.ClusterUsageCollector) as type "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.ClusterUsageCollector does not implement "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:86:36: cannot use collectors.NewPoolUsageCollector(conn, cluster) (type *collectors.PoolUsageCollector) as type "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.PoolUsageCollector does not implement "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:87:40: cannot use collectors.NewClusterHealthCollector(conn, cluster) (type *collectors.ClusterHealthCollector) as type "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.ClusterHealthCollector does not implement "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:88:34: cannot use collectors.NewMonitorCollector(conn, cluster) (type *collectors.MonitorCollector) as type "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.MonitorCollector does not implement "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:89:30: cannot use collectors.NewOSDCollector(conn, cluster) (type *collectors.OSDCollector) as type "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector in array or slice literal:
*collectors.OSDCollector does not implement "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:95:24: cannot use collectors.NewRGWCollector(cluster, config, false) (type *collectors.RGWCollector) as type "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector in append:
*collectors.RGWCollector does not implement "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
./exporter.go:100:24: cannot use collectors.NewRGWCollector(cluster, config, true) (type *collectors.RGWCollector) as type "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector in append:
*collectors.RGWCollector does not implement "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Collector (wrong type for Collect method)
have Collect(chan<- "github.com/digitalocean/ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
want Collect(chan<- "ceph_exporter/vendor/github.com/prometheus/client_golang/prometheus".Metric)
We have a Ceph Luminous cluster running on Ubuntu 16.04.3 LTS. We also run ceph_exporter via Docker (image is 6 days old), but the IO metrics always report zero values.
Any advice on how to solve this?
Thank you
ceph_cache_evict_io_bytes{cluster="ceph"} 0
ceph_cache_flush_io_bytes{cluster="ceph"} 0
ceph_cache_promote_io_ops{cluster="ceph"} 0
ceph_client_io_ops{cluster="ceph"} 0
ceph_client_io_read_bytes{cluster="ceph"} 0
ceph_client_io_read_ops{cluster="ceph"} 0
ceph_client_io_write_bytes{cluster="ceph"} 0
ceph_client_io_write_ops{cluster="ceph"} 0
ceph_recovery_io_bytes{cluster="ceph"} 0
ceph_recovery_io_keys{cluster="ceph"} 0
ceph_recovery_io_objects{cluster="ceph"} 0
ceph11 ~ # ceph --version
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
ceph11 ~ # ceph -s
cluster:
id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph11,ceph12,ceph13
mgr: ceph12(active), standbys: ceph13, ceph11
osd: 28 osds: 28 up, 28 in
rgw: 3 daemons active
data:
pools: 6 pools, 1064 pgs
objects: 60452k objects, 6684 GB
usage: 23480 GB used, 16565 GB / 40045 GB avail
pgs: 1062 active+clean
2 active+clean+scrubbing+deep
io:
client: 231 kB/s rd, 4014 kB/s wr, 243 op/s rd, 393 op/s wr
ceph11 ~ #
Would it be possible to export counts for scrubs, deep scrubs, blocked requests, and OSDs with blocked requests? These metrics are very useful to graph against latency and I/O bandwidth when evaluating cluster performance.
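If such metrics were added, the scrub counts are recoverable from the pgs_by_state list in `ceph status --format json`, where compound states like "active+clean+scrubbing+deep" need to be split apart. A minimal sketch assuming that JSON shape; countState is a hypothetical helper, not exporter code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// pgState is one entry of pgmap.pgs_by_state in `ceph status --format json`.
type pgState struct {
	StateName string `json:"state_name"`
	Count     int    `json:"count"`
}

// countState sums PGs whose (possibly compound) state contains substate;
// e.g. "active+clean+scrubbing+deep" matches both "scrubbing" and "deep".
func countState(states []pgState, substate string) int {
	total := 0
	for _, s := range states {
		for _, part := range strings.Split(s.StateName, "+") {
			if part == substate {
				total += s.Count
				break
			}
		}
	}
	return total
}

func main() {
	// Mirrors the `ceph -s` output above: 1062 clean, 2 deep-scrubbing.
	raw := []byte(`[{"state_name":"active+clean","count":1062},{"state_name":"active+clean+scrubbing+deep","count":2}]`)
	var states []pgState
	if err := json.Unmarshal(raw, &states); err != nil {
		panic(err)
	}
	fmt.Println("scrubbing:", countState(states, "scrubbing"))
	fmt.Println("deep:", countState(states, "deep"))
}
```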
When connected to a Firefly cluster, the IO and health collectors fail to parse the cluster's output. Here is an extract of the exporter's log:
2016/08/31 11:32:11 Starting ceph exporter on ":9128"
2016/08/31 11:32:17 [ERROR] cannot extract total bytes: strconv.ParseFloat: parsing "": invalid syntax
2016/08/31 11:32:17 [ERROR] cannot extract used bytes: strconv.ParseFloat: parsing "": invalid syntax
2016/08/31 11:32:17 [ERROR] cannot extract available bytes: strconv.ParseFloat: parsing "": invalid syntax
2016/08/31 11:32:17 failed collecting cluster health metrics: strconv.ParseFloat: parsing "": invalid syntax
2016/08/31 11:32:17 [ERROR] Unable to collect data from ceph osd df rados: Invalid argument
2016/08/31 11:32:17 failed collecting osd metrics: rados: Invalid argument
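The ParseFloat errors on "" suggest Firefly emits some of these fields as empty strings, or omits them so the collector reads an empty value. A hedged sketch of a tolerant conversion that would avoid the error (toFloat is a hypothetical helper for illustration, not the exporter's code):

```go
package main

import (
	"fmt"
	"strconv"
)

// toFloat converts a value pulled out of a generic map[string]interface{}
// JSON decode into a float64, tolerating the shapes older Ceph releases
// may emit: plain numbers, numeric strings, or missing/empty values.
// An empty or absent value yields 0 instead of a ParseFloat error.
func toFloat(v interface{}) (float64, error) {
	switch t := v.(type) {
	case nil:
		return 0, nil
	case float64:
		return t, nil
	case string:
		if t == "" {
			return 0, nil
		}
		return strconv.ParseFloat(t, 64)
	default:
		return 0, fmt.Errorf("unexpected type %T", v)
	}
}

func main() {
	// The first two cases are exactly what trips the Firefly parse above.
	for _, v := range []interface{}{nil, "", "23480", float64(42)} {
		f, err := toFloat(v)
		fmt.Println(f, err)
	}
}
```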
What version of Ceph has this collector been developed and tested against?