
gluster-prometheus's Introduction

Prometheus exporter for Gluster Metrics


These exporters run on all Gluster peers, so it makes sense to collect only local metrics and aggregate them in the Prometheus server when required.

Install

mkdir -p $GOPATH/src/github.com/gluster
cd $GOPATH/src/github.com/gluster
git clone https://github.com/gluster/gluster-prometheus.git
cd gluster-prometheus

# Install the required dependencies.
# Hint: assumes that GOPATH and PATH are already configured.
./scripts/install-reqs.sh

PREFIX=/usr make
PREFIX=/usr make install

Usage

Run gluster-exporter with default settings; metrics are then available at http://localhost:9713/metrics

systemctl enable gluster-exporter
systemctl start gluster-exporter

The systemd service uses the following configuration file for global and collector-related configuration.

/etc/gluster-exporter/gluster-exporter.toml
[globals]
gluster-mgmt = "glusterd"
glusterd-dir = "/var/lib/glusterd"
gluster-binary-path = "gluster"
# If you want to connect to a remote gd1 host, set the variable gd1-remote-host.
# However, using a remote host restricts the gluster CLI to read-only commands.
# The following collectors won't work in remote mode: gluster_volume_counts, gluster_volume_profile
#gd1-remote-host = "localhost"
gd2-rest-endpoint = "http://127.0.0.1:24007"
port = 9713
metrics-path = "/metrics"
log-dir = "/var/log"
log-file = "gluster-exporter.log"
log-level = "info"

[collectors.gluster_ps]
name = "gluster_ps"
sync-interval = 5
disabled = false

[collectors.gluster_brick]
name = "gluster_brick"
sync-interval = 5
disabled = false

To use gluster-exporter without systemd,

gluster-exporter --config=/etc/gluster-exporter/gluster-exporter.toml

Metrics

The list of supported metrics is documented here.

Adding New metrics

  • Define the metric, for example:

glusterCPUPercentage = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: "gluster",
        Name:      "cpu_percentage",
        Help:      "CPU Percentage used by Gluster processes",
    },
    []string{"volume", "peerid", "brick_path"},
)
  • Implement the function to gather data, and register it to be collected at the required interval (a sketch follows this list)

prometheus.MustRegister(glusterCPUPercentage)

registerMetric("gluster_brick", brickUtilization)
  • Add an entry in /etc/gluster-exporter/gluster-exporter.toml

[collectors.gluster_ps]
name = "gluster_ps"
sync-interval = 5
disabled = false
  • That's it! The exporter will collect the registered metrics.
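A minimal sketch of such a gather function, assuming the exporter's internal registerMetric helper accepts a collector name and a func() error (the label values and function names below are illustrative, not the project's actual implementation):

// Illustrative gather function for the gauge defined above.
func glusterCPUPercentageGather() error {
    // Collect CPU usage of the local Gluster processes (details omitted)
    // and export one sample per brick; the values below are placeholders.
    glusterCPUPercentage.With(prometheus.Labels{
        "volume":     "gv0",
        "peerid":     "0ead4d8f-0000-0000-0000-000000000000",
        "brick_path": "/bricks/brick1",
    }).Set(12.5)
    return nil
}

func init() {
    // Register the metric with Prometheus and the gather function with the
    // exporter so it runs at the collector's configured sync-interval.
    prometheus.MustRegister(glusterCPUPercentage)
    registerMetric("gluster_ps", glusterCPUPercentageGather)
}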

gluster-prometheus's People

Contributors

april4, aravindavk, aruniiird, drevilish, foygl, keis, madhu-1, martialblog, rafis, sidharthanup, simu, smuth4, vlzzz, vredara, yjbdsky


gluster-prometheus's Issues

Brick Metrics not Loading

When I access the /metrics endpoint on my Gluster peers, the only gluster_* metrics I currently get are:

gluster_cpu_percentage
gluster_elapsed_time_seconds
gluster_memory_percentage
gluster_resident_memory
gluster_virtual_memory

Linux g01 4.14.5-92 #1 SMP PREEMPT Mon Dec 11 15:48:15 UTC 2017 armv7l armv7l armv7l GNU/Linux

glusterfs 4.1.4

Include gluster-prometheus in the Fedora package collection

In order to include the gluster_exporter in the official Fedora containers (and later CentOS), gluster-prometheus needs to be included in the Fedora repository. Please see if one of the main contributors can add and maintain the Fedora package. The steps in the Package Review Process explain what is needed to get the package included.

Feel free to contact me if any assistance is needed.

glusterd1 and glusterd2 compatibility

Gluster Prometheus exporter should be compatible with glusterd1 and
glusterd2 whenever it is possible.

The following exporter flags will help identify the management daemon in use.

--gluster-mgmt=<name>  (Choices: glusterd1, glusterd2. Default is "glusterd1")
--glusterd-dir=<path> (Default is `/var/lib/glusterd2` in case of
    glusterd2 else `/var/lib/glusterd`)

Plan

Metrics collectors should not run Gluster CLI or REST API calls themselves. They
all consume input JSON files and export the metrics, or collect the
additional details and export them.

Example - Number of volumes: a separate thread runs gluster volume info --xml if configured to use glusterd1, else calls the REST API
GET http://localhost:24007/v1/volumes. Both generate the output
JSON in the same format, which is understood by the metrics collector
thread. The metrics collector thread need not care whether the data comes from
glusterd1 or glusterd2.

Example - Brick utilization: with the volume info JSON collected as above, the
metrics collector looks into volinfo.json, collects the
utilization details, and exports them.

Example - Volume profile: the compatibility thread runs the CLI and
consumes XML in the case of glusterd1, and calls the REST API in the case of
glusterd2.
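A rough sketch of that separation, using hypothetical type names (not the project's actual code): both back ends produce the same daemon-independent structure, so collectors never see daemon-specific output.

// Volume is the common shape both back ends produce.
type Volume struct {
    Name   string   `json:"name"`
    State  string   `json:"state"`
    Bricks []string `json:"bricks"`
}

// volumeSource hides whether the data came from glusterd1 or glusterd2.
type volumeSource interface {
    VolumeInfo() ([]Volume, error)
}

// A gd1 implementation would run `gluster volume info --xml` and convert the XML;
// a gd2 implementation would call GET /v1/volumes and decode the JSON response.
// Collectors only ever consume []Volume.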

Support for enabling/disabling metrics

All exported metrics should expose configuration, via CLI flag or config file, to
enable/disable them. For example, --utilization-metrics=enabled. This
gives more control for choosing the metrics that are compatible
with the target deployment.

Auto-disable metrics collection if not compatible

If a metric is not supported by the configured management daemon, it should be
automatically disabled based on the --gluster-mgmt flag (for example, --gluster-mgmt=glusterd2).

Roadmap

Hi, please provide a roadmap for this Gluster Prometheus exporter.

Use toml format for Configuration

Currently the JSON format is used for the configuration files (global and collectors). We can merge both configs into one and use the TOML format for better usability.

Example config file

[globals]
gluster-mgmt = "glusterd"
glusterd-dir = "/var/lib/glusterd"
gluster-binary-path = "gluster"
gd2-rest-endpoint = "http://192.168.10.11:24007"
port = 8080
metrics-path = "/metrics"

[collectors.gluster_ps]
name = "gluster_ps"
sync-interval = 5
disabled = false

[collectors.gluster_brick]
name = "gluster_brick"
sync-interval = 5
disabled = false

deleted volume metrics are still present

Volumes that are deleted from the cluster still report metrics in all the counters. To reproduce, just create and delete a volume. Restarting gluster-exporter.service does not clear them, and lsof does not list a file that could be cleaned to reset the metrics. Am I missing something about how to operate the exporter?

Expose Geo replication related metrics

For Geo-replication, we will need:

  1. gluster_geo_rep_session_total
  2. gluster_geo_rep_session_up
  3. gluster_geo_rep_session_stopped
  4. gluster_geo_rep_session_created
  5. gluster_geo_rep_session_partial
  6. gluster_geo_rep_session_paused
  7. gluster_geo_rep_session_down

Document strategy for gluster-prometheus containers in kube/openshift

Problem:

There are a number of isolated discussions regarding how this code will be used in kube:

  • #3 - general how to deploy this repo
  • #17 - how to deploy
  • #22 - What to do about cluster-wide metrics vs node-level metrics

Aside from a request for general how-to documentation, there is a need to figure out how to provide both node-level and cluster-level metrics. Providing cluster-level metrics from all nodes (pods) will lead to duplication.

Discussion

The configuration seems to be going the way of being able to enable/disable various metrics per instance (#24). This seems like a reasonable way to fix the above issue. It would entail running a per-node collector for node-level metrics and a single Deployment per cluster for cluster-level metrics.

An alternative would be to have the node-level collectors participate in leader-election, with the result being a single instance that exports cluster-level metrics in addition to its node-level metrics. This strikes me as considerably more complicated than the static approach, above.

Request

Choose one of the above (or another) approaches and document it as the plan for eventual deployment. It need not be implemented immediately, but the choice here affects other projects, namely https://github.com/gluster/anthill.

Get Local PeerID

The peer ID can be pulled automatically either by:

$ gluster pool list | awk '/localhost/ {print $1}'

or by reading the file that stores the local peer's UUID:

$ cat /var/lib/glusterd/glusterd.info

The second seems the better option because Go code can read the file directly.

parts := strings.Split(gdInfo, "=")
parts[1]

Should return the UUID
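A minimal Go sketch of the second approach, assuming glusterd.info contains a line of the form UUID=<uuid> (the function name and error handling are illustrative, not the exporter's actual code):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// localPeerID reads glusterd.info and returns the value of the UUID= line.
func localPeerID(glusterdDir string) (string, error) {
    f, err := os.Open(glusterdDir + "/glusterd.info")
    if err != nil {
        return "", err
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        if strings.HasPrefix(line, "UUID=") {
            return strings.TrimPrefix(line, "UUID="), nil
        }
    }
    return "", fmt.Errorf("UUID not found in glusterd.info")
}

func main() {
    id, err := localPeerID("/var/lib/glusterd")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(id)
}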

Dependency isn't cross-arch compatible

While trying to perform an install of the latest software, I get an error:

Odroid HC2 - Armbian - Ubuntu 18.04.1 LTS - 4.14.69-odroidxu4
go version go1.10.1 linux/arm

Go1.9 or later is available on the system.

dep package is missing on the system
gometalinter package is missing on the system
Makefile:61: recipe for target 'check-reqs' failed
make: *** [check-reqs] Error 1

gometalinter isn't released for arm32 or arm64, so it has to be installed from source at build time.

Add snapshot brick count metrics for gluster volumes

This would be a useful metric for figuring out if the maximum allowed number of bricks on a Gluster node has been reached.
We can add a metric, namely gluster_volume_snapshot_brick_count, by looking at the number of snapshots for a volume and deriving the number of snapshot bricks.
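Following the pattern from the Adding New metrics section above, the proposed metric might be declared like this (only the metric name comes from this issue; the labels are an assumption):

glusterVolumeSnapshotBrickCount = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: "gluster",
        Name:      "volume_snapshot_brick_count",
        Help:      "Number of snapshot bricks for a Gluster volume",
    },
    []string{"volume", "peerid"},
)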

Build issues with Makefile

We depend on git for generating version strings. If the source tarball is copied to some other location that is not a git repo, make gluster-exporter fails; on CentOS I get:
/usr/lib/golang/pkg/tool/linux_amd64/link: -B argument must have even number of digits: 0xundefined

Expose Rebalance counters

  1. gluster_rebalance_files
  2. gluster_rebalance_bytes
  3. gluster_rebalance_skipped
  4. gluster_rebalance_failures

Clients connected to a volume or brick

Some volumes may be used by many clients. Performance may be impacted if many clients use the same volume. A count of the clients connected to a volume (or even per brick?) will be useful.

Integrate with GCS

With GCS v0.1, we now have a working gd2 + CSI driver combo that can be deployed on a k8s cluster, and one can start using PVCs to get PVs for their pods...

It would be great to get this project integrated with GCS to show the real value to our users!

Change the signature of internal registering gluster-exporter function

The current registering function has the following signature:
func () error
On the implementation side, we tend to use global variables (which are initialized in the main function).
This issue is created to standardize the registering function so that we avoid any use of global variables and instead pass the needed variables as arguments to the function.

For example, it can be changed to accept a GInterface object (as most of the time we use a GInterface object to get the LocalPeerID etc.):
func (glusterutils.GInterface) error

or, if config details are also needed in the registering function, the signature can be:
func (glusterutils.GInterface, *conf.Config) error

We may have to assess the existing code, see which global variables are being used, and generalize the function signature.
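A hedged sketch of what a collector's register function might look like with the second proposed signature, assuming GInterface exposes a LocalPeerID method as the issue text implies (the names here are illustrative):

// brickUtilization gathers brick metrics without relying on globals:
// everything it needs is passed in explicitly.
func brickUtilization(gluster glusterutils.GInterface, config *conf.Config) error {
    peerID, err := gluster.LocalPeerID()
    if err != nil {
        return err
    }
    // ... gather brick utilization for the local peer and update the
    // Prometheus gauges, using config for collector-specific settings.
    _ = peerID
    return nil
}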

grafana dashboard?

Can you provide a templated dashboard at grafana.net for the exporter? It's better to have it there, rather than having to hand-roll one yourself each time.

Docker?

I've created a Dockerfile, the project builds and runs, but it doesn't serve any Gluster info (no crash though).

I didn't give it access to any resources (like Gluster directories, files, devices and sockets), because I don't know which ones are needed.

Could someone explain (or document) these requirements?

Number of volumes

It would be useful to have the number of volumes provided as well. With this, users can see whether any Kubernetes/OpenShift jobs are requesting many volumes at a certain time.

glusterfs inside openshift monitoring

Sorry for creating an issue, but I was not sure where to place a question.

I am using GlusterFS storage inside OpenShift; it runs as pods and provides storage to other pods/apps in the OpenShift cluster.

I assume that gluster-prometheus is designed to run on the same host where Gluster is running. But how can I use it to scrape data from the pods in OpenShift, if that is at all possible?

Are there any metrics available to monitor gluster heal status and peer status?

Thanks!

Export LVM usage metrics

Based on the brick path, export metrics about the underlying LVM. The metrics below would be required:

  • LVM data size
  • LVM used percentage
  • LVM metadata size
  • LVM metadata usage percentage

@aravindavk @JohnStrunk @atinmu @amarts comments?
Or is there something else which we need to capture as well as part of LVM usage metrics?

Gluster Prometheus should also provide brick-level IOPS from interval stats

Currently GP provides gluster_volume_profile_total_writes, gluster_volume_profile_total_reads, and gluster_volume_profile_duration_secs, which are cumulative stats. These cannot be used to calculate the IOPS for a brick or a volume. GP should provide calculated brick-level IOPS based on GP's refresh interval.

Expose disk stats

Underlying disk stats for bricks would be meaningful to figure out how the actual disks are performing.

Gluster Install Version

Add a metric for the currently installed glusterfs and glusterd versions.
Easily see which nodes in the cluster aren't running at the latest/current stable version.

Brick metrics

Provide brick metrics like inode utilization, thin pool utilization, and free/total space at the brick level.
Provide FOP counts by enabling FOP diagnostics.

Define the hierarchy of metrics to be maintained in Prometheus

We need to define the hierarchical structure of the metrics that would be maintained in Prometheus. A sample structure could be something like:

clusters/{cluster-id}/{cluster-specific-metrics like cluster health, no of connections etc}
clusters/{cluster-id}/volumes/{volume-id}/{volume-specific-metrics like no of bricks, status etc}
clusters/{cluster-id}/volumes/{volume-id}/bricks/{brick-id/brick-name}/{brick-specific-metrics like utilization etc}

cpu_percentage metrics don't decrease

Hi,

I am not sure how to provide debug data, but while running some tests with the smallfile utility I noticed that the metrics exposed by gluster-prometheus-exporter become invalid at some point and take a really long time to recover or decrease. Example:

  • version tag 0.3
  • CPU metric hits > 100%
  • CPU metric remains high
  • after approx. 12 hrs the CPU metric slowly goes down but does not reflect reality
  • I do not run any aggregation on the metric
    (see graphs for example)
    As you can see, the CPU hit 200%; then, even after I stopped the volume, the metric remained high. Overnight, with the volume stopped, the metric decreased slowly but without reaching the actual value.
    Not sure if this could affect other counters, but it does not look like it. Also, if this percentage metric can go above 100, should it have a ceil() applied in Prometheus?

volume/cluster level metrics

Provide metrics at the volume level and the cluster level:
number of volumes, state of volumes, number of nodes with their state, volume utilization.

List of metrics from WA perspective

Created this list as one place where all the required metrics from the WA perspective are captured. These might overlap with other issues already listed.

  • Cluster Level
    -- Overall Status (healthy/unhealthy)
    -- No of snapshots
    -- No of peers
    -- No of volumes
    -- No of bricks
    -- geo-rep session (state wise breakup)
    -- no of client connections
    -- Overall IOPS
    -- Overall capacity utilization
    -- Capacity Available
    -- Weekly growth rate
    -- weeks pending
    -- Throughput
    -- Volume status wise breakup
    -- Host (peer) status wise breakup
    -- Brick status wise breakup

  • Volume Level
    -- State
    -- Bricks (status wise breakup)
    -- No of bricks
    -- geo-rep sessions (state wise breakup)
    -- rebalance counters
    -- rebalance status
    -- no of snapshots
    -- capacity utilization
    -- capacity available
    -- weekly growth rate
    -- weeks pending
    -- IOPS
    -- LVM utilization (metadata and data)
    -- profiling info (e.g. fop for locks, read-write, inode operations, entry operations)

  • Brick Level
    -- status
    -- capacity utilization
    -- capacity available
    -- weekly growth rate
    -- weeks pending
    -- heal information
    -- IOPS
    -- LVM usage (metadata and data)
    -- throughput
    -- latency
    -- mount point utilization

  • Host/Peer Level
    -- Status (connected/disconnected)
    -- no of bricks
    -- bricks status wise breakup
    -- memory available
    -- memory utilization
    -- swap free
    -- swap utilization
    -- CPU utilization
    -- network throughput
    -- dropped packets per sec
    -- errors per sec
