
gluster-prometheus's Introduction

Prometheus exporter for Gluster Metrics


These exporters run on all Gluster peers, so it makes sense to collect only local metrics and aggregate them in the Prometheus server when required.

Install

mkdir -p $GOPATH/src/github.com/gluster
cd $GOPATH/src/github.com/gluster
git clone https://github.com/gluster/gluster-prometheus.git
cd gluster-prometheus

# Install the required dependencies.
# Hint: assumes that GOPATH and PATH are already configured.
./scripts/install-reqs.sh

PREFIX=/usr make
PREFIX=/usr make install

Usage

Run gluster-exporter with default settings; metrics are then available at http://localhost:9713/metrics

systemctl enable gluster-exporter
systemctl start gluster-exporter

The systemd service uses the following configuration file for global and collector-related configuration.

/etc/gluster-exporter/gluster-exporter.toml
[globals]
gluster-mgmt = "glusterd"
glusterd-dir = "/var/lib/glusterd"
gluster-binary-path = "gluster"
# If you want to connect to a remote gd1 host, set the variable gd1-remote-host.
# However, using a remote host restricts the gluster CLI to read-only commands.
# The following collectors won't work in remote mode: gluster_volume_counts, gluster_volume_profile
#gd1-remote-host = "localhost"
gd2-rest-endpoint = "http://127.0.0.1:24007"
port = 9713
metrics-path = "/metrics"
log-dir = "/var/log"
log-file = "gluster-exporter.log"
log-level = "info"

[collectors.gluster_ps]
name = "gluster_ps"
sync-interval = 5
disabled = false

[collectors.gluster_brick]
name = "gluster_brick"
sync-interval = 5
disabled = false

To use gluster-exporter without systemd,

gluster-exporter --config=/etc/gluster-exporter/gluster-exporter.toml

Metrics

The list of supported metrics is documented here.

Adding New metrics

  • Define the metric, for example:

glusterCPUPercentage = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: "gluster",
        Name:      "cpu_percentage",
        Help:      "CPU Percentage used by Gluster processes",
    },
    []string{"volume", "peerid", "brick_path"},
)
  • Implement the function to gather data, and register it to be collected at the required interval (a sketch follows this list)

prometheus.MustRegister(glusterCPUPercentage)

registerMetric("gluster_brick", brickUtilization)
  • Add an entry in /etc/gluster-exporter/gluster-exporter.toml

[collectors.gluster_ps]
name = "gluster_ps"
sync-interval = 5
disabled = false
  • That's it! The exporter will collect the registered metrics.
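A minimal sketch of such a gather function, assuming the exporter's internal registerMetric helper accepts a collector name and a func() error (the label values and function names below are illustrative, not the project's actual implementation):

// Illustrative gather function for the gauge defined above.
func glusterCPUPercentageGather() error {
    // Collect CPU usage of the local Gluster processes (details omitted)
    // and export one sample per brick; the values below are placeholders.
    glusterCPUPercentage.With(prometheus.Labels{
        "volume":     "gv0",
        "peerid":     "0ead4d8f-0000-0000-0000-000000000000",
        "brick_path": "/bricks/brick1",
    }).Set(12.5)
    return nil
}

func init() {
    // Register the metric with Prometheus and the gather function with the
    // exporter so it runs at the collector's configured sync-interval.
    prometheus.MustRegister(glusterCPUPercentage)
    registerMetric("gluster_ps", glusterCPUPercentageGather)
}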

gluster-prometheus's People

Contributors

april4, aravindavk, aruniiird, drevilish, foygl, keis, madhu-1, martialblog, rafis, sidharthanup, simu, smuth4, vlzzz, vredara, yjbdsky


gluster-prometheus's Issues

Brick Metrics not Loading

When I access the /metrics endpoint on my Gluster peers, the only gluster_* metrics I currently get are:

gluster_cpu_percentage
gluster_elapsed_time_seconds
gluster_memory_percentage
gluster_resident_memory
gluster_virtual_memory

Linux g01 4.14.5-92 #1 SMP PREEMPT Mon Dec 11 15:48:15 UTC 2017 armv7l armv7l armv7l GNU/Linux

glusterfs 4.1.4

Include gluster-prometheus in the Fedora package collection

In order to include the gluster_exporter in the official Fedora containers (and later CentOS), gluster-prometheus needs to be included in the Fedora repository. Please see if one of the main contributors can add and maintain the Fedora package. The steps in the Package Review Process explain what is needed to get the package included.

Feel free to contact me if any assistance is needed.

glusterd1 and glusterd2 compatibility

Gluster Prometheus exporter should be compatible with glusterd1 and
glusterd2 whenever it is possible.

The following exporter flags will help identify the management daemon in use.

--gluster-mgmt=<name>  (Choices: glusterd1, glusterd2. Default is "glusterd1")
--glusterd-dir=<path> (Default is `/var/lib/glusterd2` in case of
    glusterd2 else `/var/lib/glusterd`)

Plan

Metrics collectors should not run Gluster CLI or REST API calls themselves. They
all consume input JSON files and export the metrics, or collect the
additional details and export them.

Example - Number of volumes: a separate thread runs gluster volume info --xml if configured to use glusterd1, else calls the REST API
GET http://localhost:24007/v1/volumes. Both generate the output
JSON in the same format, which is understood by the metrics collector
thread. The metrics collector thread need not care whether the data comes from
glusterd1 or glusterd2.

Example - Brick utilization: with the volume info JSON collected as above, the
metrics collector looks into volinfo.json, collects the
utilization details, and exports them.

Example - Volume profile: the compatibility thread runs the CLI and
consumes XML in the case of glusterd1, and calls the REST API in the case of
glusterd2.
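A rough sketch of that separation, using hypothetical type names (not the project's actual code): both back ends produce the same daemon-independent structure, so collectors never see daemon-specific output.

// Volume is the common shape both back ends produce.
type Volume struct {
    Name   string   `json:"name"`
    State  string   `json:"state"`
    Bricks []string `json:"bricks"`
}

// volumeSource hides whether the data came from glusterd1 or glusterd2.
type volumeSource interface {
    VolumeInfo() ([]Volume, error)
}

// A gd1 implementation would run `gluster volume info --xml` and convert the XML;
// a gd2 implementation would call GET /v1/volumes and decode the JSON response.
// Collectors only ever consume []Volume.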

Support for enabling/disabling metrics

All exported metrics should expose configuration, via CLI flag or config file, to
enable/disable them. For example, --utilization-metrics=enabled. This
gives more control for choosing the metrics that are compatible
with the target deployment.

Auto-disable metrics collection if not compatible

If a metric is not supported by the configured management daemon, it should be
automatically disabled based on the --gluster-mgmt flag (for example, --gluster-mgmt=glusterd2).

Roadmap

Hi, please provide a roadmap for this Gluster Prometheus exporter.

Use toml format for Configuration

Currently the JSON format is used for the configuration files (global and collectors). We can merge both configs into one and use the TOML format for better usability.

Example config file

[globals]
gluster-mgmt = "glusterd"
glusterd-dir = "/var/lib/glusterd"
gluster-binary-path = "gluster"
gd2-rest-endpoint = "http://192.168.10.11:24007"
port = 8080
metrics-path = "/metrics"

[collectors.gluster_ps]
name = "gluster_ps"
sync-interval = 5
disabled = false

[collectors.gluster_brick]
name = "gluster_brick"
sync-interval = 5
disabled = false

deleted volume metrics are still present

Volumes that are deleted from the cluster still report metrics in all the counters. To reproduce, just create and delete a volume. Restarting gluster-exporter.service does not clear them, and lsof does not list a file that could be cleaned to reset the metrics. Am I missing something about how to operate the exporter?

Expose Geo replication related metrics

For Geo-replication, we will need:

  1. gluster_geo_rep_session_total
  2. gluster_geo_rep_session_up
  3. gluster_geo_rep_session_stopped
  4. gluster_geo_rep_session_created
  5. gluster_geo_rep_session_partial
  6. gluster_geo_rep_session_paused
  7. gluster_geo_rep_session_down

Document strategy for gluster-prometheus containers in kube/openshift

Problem:

There are a number of isolated discussions regarding how this code will be used in kube:

  • #3 - general how to deploy this repo
  • #17 - how to deploy
  • #22 - What to do about cluster-wide metrics vs node-level metrics

Aside from a request for general how-to documentation, there is a need to figure out how to provide both node-level and cluster-level metrics. Providing cluster-level metrics from all nodes (pods) will lead to duplication.

Discussion

The configuration seems to be going the way of being able to enable/disable various metrics per instance (#24). This seems like a reasonable way to fix the above issue. It would entail running a per-node collector for node-level metrics and a single Deployment per cluster for cluster-level metrics.

An alternative would be to have the node-level collectors participate in leader-election, with the result being a single instance that exports cluster-level metrics in addition to its node-level metrics. This strikes me as considerably more complicated than the static approach, above.

Request

Choose one of the above (or another) approaches and document it as the plan for eventual deployment. It need not be implemented immediately, but the choice here affects other projects, namely https://github.com/gluster/anthill.

Get Local PeerID

The peer ID can be pulled automatically either by:

$ gluster pool list | awk '/localhost/ {print $1}'

or by reading the file that stores the local peer's UUID:

$ cat /var/lib/glusterd/glusterd.info

The second seems the better option because Go code can read the file directly.

parts := strings.Split(gdInfo, "=")
parts[1]

Should return the UUID
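A minimal Go sketch of the second approach, assuming glusterd.info contains a line of the form UUID=<uuid> (the function name and error handling are illustrative, not the exporter's actual code):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// localPeerID reads glusterd.info and returns the value of the UUID= line.
func localPeerID(glusterdDir string) (string, error) {
    f, err := os.Open(glusterdDir + "/glusterd.info")
    if err != nil {
        return "", err
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        if strings.HasPrefix(line, "UUID=") {
            return strings.TrimPrefix(line, "UUID="), nil
        }
    }
    return "", fmt.Errorf("UUID not found in glusterd.info")
}

func main() {
    id, err := localPeerID("/var/lib/glusterd")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(id)
}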

Dependency isn't cross-arch compatible

While trying to perform an install of the latest software, I get an error:

Odroid HC2 - Armbian - Ubuntu 18.04.1 LTS - 4.14.69-odroidxu4
go version go1.10.1 linux/arm

Go1.9 or later is available on the system.

dep package is missing on the system
gometalinter package is missing on the system
Makefile:61: recipe for target 'check-reqs' failed
make: *** [check-reqs] Error 1

gometalinter isn't released for arm32 or arm64, so it has to be installed from source at build time.

Add snapshot brick count metrics for gluster volumes

This would be a useful metric for figuring out if the maximum allowed number of bricks on a Gluster node has been reached.
We can add a metric, namely gluster_volume_snapshot_brick_count, by looking at the number of snapshots for a volume and deriving the number of snapshot bricks.
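Following the pattern from the Adding New metrics section above, the proposed metric might be declared like this (only the metric name comes from this issue; the labels are an assumption):

glusterVolumeSnapshotBrickCount = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: "gluster",
        Name:      "volume_snapshot_brick_count",
        Help:      "Number of snapshot bricks for a Gluster volume",
    },
    []string{"volume", "peerid"},
)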

Build issues with Makefile

We depend on git for generating version strings. If the source tarball is copied to some other location that is not a git repo, make gluster-exporter fails; on CentOS I get:
/usr/lib/golang/pkg/tool/linux_amd64/link: -B argument must have even number of digits: 0xundefined

Expose Rebalance counters

  1. gluster_rebalance_files
  2. gluster_rebalance_bytes
  3. gluster_rebalance_skipped
  4. gluster_rebalance_failures

Clients connected to a volume or brick

Some volumes may be used by many clients. Performance may be impacted if many clients use the same volume. A count of the clients connected to a volume (or even per brick?) will be useful.

Integrate with GCS

With GCS v0.1, we now have a working gd2 + CSI driver combo that can be deployed on a k8s cluster, and one can start using PVCs to get PVs for their pods...

It would be great to get this project integrated with GCS to show the real value to our users!

Change the signature of internal registering gluster-exporter function

The current registering function has the following signature:
func () error
On the implementation side, we tend to use global variables (which are initialized in the main function).
This issue is created to standardize the registering function so that we avoid any use of global variables and instead pass the needed variables as arguments to the function.

For example, it can be changed to accept a GInterface object (as most of the time we use a GInterface object to get the LocalPeerID etc.):
func (glusterutils.GInterface) error

or, if config details are also needed in the registering function, the signature can be:
func (glusterutils.GInterface, *conf.Config) error

We may have to assess the existing code, see which global variables are being used, and generalize the function signature.
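A hedged sketch of what a collector's register function might look like with the second proposed signature, assuming GInterface exposes a LocalPeerID method as the issue text implies (the names here are illustrative):

// brickUtilization gathers brick metrics without relying on globals:
// everything it needs is passed in explicitly.
func brickUtilization(gluster glusterutils.GInterface, config *conf.Config) error {
    peerID, err := gluster.LocalPeerID()
    if err != nil {
        return err
    }
    // ... gather brick utilization for the local peer and update the
    // Prometheus gauges, using config for collector-specific settings.
    _ = peerID
    return nil
}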

grafana dashboard?

Can you provide a templated dashboard at grafana.net for the exporter? It's better to have it there, rather than having to hand-roll one yourself each time.

Docker?

I've created a Dockerfile, the project builds and runs, but it doesn't serve any Gluster info (no crash though).

I didn't give it access to any resources (like Gluster directories, files, devices and sockets), because I don't know which ones are needed.

Could someone explain (or document) these requirements?

Number of volumes

It would be useful to have the number of volumes provided as well. With this, users can see whether any Kubernetes/OpenShift jobs are requesting many volumes at a certain time.

glusterfs inside openshift monitoring

Sorry for creating an issue, but I was not sure where to place a question.

I am using GlusterFS storage inside OpenShift; it runs as pods and provides storage to other pods/apps in the OpenShift cluster.

I assume that gluster-prometheus is designed to run on the same host where Gluster is running. But how can I use it to scrape data from the pods in OpenShift, if that is at all possible?

Are there any metrics available to monitor gluster heal status and peer status?

Thanks!

Export LVM usage metrics

Based on the brick path, export metrics about the underlying LVM. The metrics below would be required:

  • LVM data size
  • LVM used percentage
  • LVM metadata size
  • LVM metadata usage percentage

@aravindavk @JohnStrunk @atinmu @amarts comments?
Or is there something else which we need to capture as well as part of LVM usage metrics?

Gluster Prometheus should also provide brick-level IOPS from interval stats

Currently GP provides gluster_volume_profile_total_writes, gluster_volume_profile_total_reads, and gluster_volume_profile_duration_secs, which are cumulative stats. These cannot be used to calculate the IOPS for a brick or a volume. GP should provide calculated brick-level IOPS based on GP's refresh interval.

Expose disk stats

Underlying disk stats for bricks would be meaningful to figure out how the actual disks are performing.

Gluster Install Version

Add a metric for the currently installed glusterfs and glusterd versions.
Easily see which nodes in the cluster aren't running at the latest/current stable version.

Brick metrics

Provide brick metrics like inode utilization, thin pool utilization, and free/total space at the brick level.
Provide FOP counts by enabling FOP diagnostics.

Define the hierarchy of metrics to be maintained in Prometheus

We need to define the hierarchical structure of the metrics that would be maintained in Prometheus. A sample structure could be something like:

clusters/{cluster-id}/{cluster-specific-metrics like cluster health, no of connections etc}
clusters/{cluster-id}/volumes/{volume-id}/{volume-specific-metrics like no of bricks, status etc}
clusters/{cluster-id}/volumes/{volume-id}/bricks/{brick-id/brick-name}/{brick-specific-metrics like utilization etc}

cpu_percentage metrics don't decrease

Hi,

I am not sure how to provide debug data, but while running some tests with the smallfile utility I noticed that the metrics exposed by gluster-prometheus-exporter become invalid at some point and take a really long time to recover or decrease. Example:

  • version tag 0.3
  • CPU metric hits > 100%
  • CPU metric remains high
  • after approx. 12 hrs the CPU metric slowly goes down but does not reflect reality
  • I do not run any aggregation on the metric
    (see graphs for example)
    As you can see, the CPU hit 200%; then, even after I stopped the volume, the metric remained high. Overnight, with the volume stopped, the metric decreased slowly but without reaching the actual value.
    Not sure if this could affect other counters, but it does not look like it. Also, if this percentage metric can go above 100, should it have a ceil() applied in Prometheus?

volume/cluster level metrics

Provide metrics at the volume level and the cluster level:
number of volumes, state of volumes, number of nodes with their state, volume utilization.

List of metrics from WA perspective

Created this list as one place where all the required metrics from the WA perspective are captured. These might overlap with other issues already listed.

  • Cluster Level
    -- Overall Status (healthy/unhealthy)
    -- No of snapshots
    -- No of peers
    -- No of volumes
    -- No of bricks
    -- geo-rep session (state wise breakup)
    -- no of client connections
    -- Overall IOPS
    -- Overall capacity utilization
    -- Capacity Available
    -- Weekly growth rate
    -- weeks pending
    -- Throughput
    -- Volume status wise breakup
    -- Host (peer) status wise breakup
    -- Brick status wise breakup

  • Volume Level
    -- State
    -- Bricks (status wise breakup)
    -- No of bricks
    -- geo-rep sessions (state wise breakup)
    -- rebalance counters
    -- rebalance status
    -- no of snapshots
    -- capacity utilization
    -- capacity available
    -- weekly growth rate
    -- weeks pending
    -- IOPS
    -- LVM utilization (metadata and data)
    -- profiling info (e.g. fop for locks, read-write, inode operations, entry operations)

  • Brick Level
    -- status
    -- capacity utilization
    -- capacity available
    -- weekly growth rate
    -- weeks pending
    -- heal information
    -- IOPS
    -- LVM usage (metadata and data)
    -- throughput
    -- latency
    -- mount point utilization

  • Host/Peer Level
    -- Status (connected/disconnected)
    -- no of bricks
    -- bricks status wise breakup
    -- memory available
    -- memory utilization
    -- swap free
    -- swap utilization
    -- CPU utilization
    -- network throughput
    -- dropped packets per sec
    -- errors per sec
