gardener-metrics-exporter's Introduction

Gardener Metrics Exporter

The gardener-metrics-exporter is a Prometheus metrics exporter for Gardener service-level related metrics.

This application requires Go 1.9 or later.

Metrics

| Metric | Description | Scope | Type |
| --- | --- | --- | --- |
| garden_shoot_operation_states | Operation state of a Shoot | Shoot | Gauge |
| garden_shoot_info | Information about a Shoot | Shoot | Gauge |
| garden_shoot_condition | Condition state of a Shoot | Shoot | Gauge |
| garden_shoot_node_min_total | Min node count of a Shoot | Shoot | Gauge |
| garden_shoot_node_max_total | Max node count of a Shoot | Shoot | Gauge |
| garden_shoot_worker_node_min_total | Min node count of a Shoot worker group | Shoot | Gauge |
| garden_shoot_worker_node_max_total | Max node count of a Shoot worker group | Shoot | Gauge |
| garden_shoot_operations_total | Count of ongoing operations | Shoot | Gauge |
| garden_shoot_operation_progress_percent | Operation progress percentage of a Shoot | Shoot | Gauge |
| garden_seed_info | Information about a Seed | Seed | Gauge |
| garden_seed_capacity | Information regarding a Seed's capacity with respect to certain resources | Seed | Gauge |
| garden_seed_condition | Condition state of a Seed | Seed | Gauge |
| garden_seed_usage | Actual usage of a Seed by resource | Seed | Gauge |
| garden_projects_status | Status of Garden Projects | Projects | Gauge |
| garden_users_total | Count of users | Users | Gauge |
| garden_scrape_failure_total | Total count of scraping failures, grouped by kind/group of metric(s) | App | Counter |

Grafana Dashboards

Some Grafana dashboards are included in the dashboards folder. Import them and make sure your Prometheus data source is named cluster-prometheus.

Usage

First, clone the repo into your $GOPATH.

mkdir -p "$GOPATH/src/github.com/gardener"
git clone https://github.com/gardener/gardener-metrics-exporter.git \
          "$GOPATH/src/github.com/gardener/gardener-metrics-exporter"

cd "$GOPATH/src/github.com/gardener/gardener-metrics-exporter"

Local

The metrics exporter needs to run against a Gardener environment (a Kubernetes cluster extended with the core.gardener.cloud/v1alpha1 API group). Such an environment can be created by following the instructions in the Gardener local setup.

If the current-context of your $HOME/.kube/config points to a Gardener environment, you can simply run:

make start

To use a specific kubeconfig, build the app locally and pass the kubeconfig to the binary. The binary will be located in the ./bin directory of the repository.

# Build
make build-local

# Run
./bin/gardener-metrics-exporter --kubeconfig=<path-to-kubeconfig-file>

Be aware: The user in the kubeconfig needs permissions to GET, LIST, and WATCH the resources Shoot, Seed, and Project (core.gardener.cloud/v1alpha1) in all namespaces of the cluster.

Verify that everything works by calling the /metrics endpoint of the app.

curl http://localhost:2718/metrics
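
To check the response programmatically instead of by eye, a minimal Go sketch like the following could list which garden_* metrics a payload contains. It assumes the endpoint returns the standard Prometheus text exposition format; the function name gardenMetricNames and the sample payload (including its HELP text) are illustrative and not part of the exporter.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// gardenMetricNames returns the distinct metric names starting with "garden_"
// found in a Prometheus text exposition payload, in order of first appearance.
func gardenMetricNames(payload string) []string {
	seen := map[string]bool{}
	var names []string
	scanner := bufio.NewScanner(strings.NewReader(payload))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Skip blank lines and "# HELP" / "# TYPE" comment lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			continue
		}
		name := line[:end]
		if strings.HasPrefix(name, "garden_") && !seen[name] {
			seen[name] = true
			names = append(names, name)
		}
	}
	return names
}

func main() {
	// Hypothetical sample payload; a real one would be the body of
	// an HTTP GET against http://localhost:2718/metrics.
	sample := `# HELP garden_shoot_node_min_total Min node count of a Shoot
# TYPE garden_shoot_node_min_total gauge
garden_shoot_node_min_total{name="axeda",project="consus"} 2
garden_shoot_node_max_total{name="axeda",project="consus"} 10
go_goroutines 42
`
	fmt.Println(gardenMetricNames(sample))
}
```

An empty result would suggest that the exporter is running but cannot read any Gardener resources, for example because of missing permissions.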

Run a local Prometheus instance and add the following scrape config snippet to your config.

scrape_configs:
- job_name: 'gardener-metrics-exporter'
  static_configs:
    - targets:
      - localhost:2718
  metric_relabel_configs:
   - source_labels: [ __name__ ]
     regex: '^garden_.*$'
     action: keep

Now the metrics should be collected by Prometheus. Open the Prometheus console and query for garden_* metrics.
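
The effect of the metric_relabel_configs rule can be illustrated with a small Go sketch. It only mirrors the regex from the scrape config above and is not Prometheus code; the helper name keepMetric and the sample metric names are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepPattern mirrors the regex from the scrape config above.
var keepPattern = regexp.MustCompile(`^garden_.*$`)

// keepMetric reports whether a metric name would survive the "keep" rule.
func keepMetric(name string) bool {
	return keepPattern.MatchString(name)
}

func main() {
	// Only names that start with "garden_" are kept; everything else
	// exposed by the exporter (for example Go runtime metrics) is dropped.
	for _, name := range []string{"garden_shoot_info", "go_goroutines", "my_garden_metric"} {
		fmt.Printf("%-20s keep=%v\n", name, keepMetric(name))
	}
}
```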

In Cluster

Deploy the metrics-exporter to a Kubernetes cluster via Helm.

helm upgrade gardener-metrics-exporter charts/gardener-metrics-exporter \
     --install --namespace=<your-namespace> --values=<path-to-your-values.yaml>

For example, the scrape config for your Prometheus could look like this:

scrape_configs:
- job_name: 'gardener-metrics-exporter'
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - <helm-deployment-namespace>
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name]
    regex: gardener-metrics-exporter
    action: keep
  metric_relabel_configs:
   - source_labels: [ __name__ ]
     regex: '^garden_.*$'
     action: keep

gardener-metrics-exporter's Issues

Chart cannot be upgraded

What happened:

When using helm upgrade or Flux's Helm controller to upgrade gardener-metrics-exporter, the upgrade fails with:

Helm upgrade failed: cannot patch "gardener-metrics-exporter" with kind Deployment: Deployment.apps "gardener-metrics-exporter" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app":"gardener", "chart":"runtime-0.28.0", "heritage":"Helm", "release":"garden-gardener-metrics-exporter-runtime", "role":"metrics-exporter"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable

This happens because the Deployment's selector contains the chart version, while Deployment.spec.selector is immutable.

What you expected to happen:

gardener-metrics-exporter upgrade to succeed.

How to reproduce it (as minimally and precisely as possible):

From a published chart repository:

  1. Create a kind cluster:
export KUBECONFIG=/tmp/kind_kubeconfig.yaml
kind create cluster
  1. Add the gardener-charts repository:
helm repo add gardener-community https://gardener-community.github.io/gardener-charts
helm repo update
  1. Install the chart in version v0.27.0 and observe the selector:
$ helm install gardener-metrics-exporter gardener-community/gardener-metrics-exporter --set runtime.enabled=true --version 0.27.0
NAME: gardener-metrics-exporter
LAST DEPLOYED: Mon Dec 18 10:25:32 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

$ k get deploy gardener-metrics-exporter -oyaml | yq .spec.selector
matchLabels:
  app: gardener
  chart: runtime-0.27.0
  heritage: Helm
  release: gardener-metrics-exporter
  role: metrics-exporter
  1. Upgrade the chart to version v0.28.0 and observe the failure
$ helm upgrade gardener-metrics-exporter gardener-community/gardener-metrics-exporter --set runtime.enabled=true --version 0.28.0
Error: UPGRADE FAILED: cannot patch "gardener-metrics-exporter" with kind Deployment: Deployment.apps "gardener-metrics-exporter" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app":"gardener", "chart":"runtime-0.28.0", "heritage":"Helm", "release":"gardener-metrics-exporter", "role":"metrics-exporter"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable

From the local chart:

  1. Create a kind cluster:
export KUBECONFIG=/tmp/kind_kubeconfig.yaml
kind create cluster
  1. Install the chart in version 0.1.0 and observe the selector:
$ helm install gardener-metrics-exporter ./charts/gardener-metrics-exporter
NAME: gardener-metrics-exporter
LAST DEPLOYED: Mon Dec 18 10:31:11 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

$ k get deploy gardener-metrics-exporter -oyaml | yq .spec.selector
matchLabels:
  app: gardener
  chart: runtime-0.1.0
  heritage: Helm
  release: gardener-metrics-exporter
  role: metrics-exporter
  1. Change the chart version to 0.2.0:
sed -i 's/version: 0.1.0/version: 0.2.0/g' charts/**/Chart.yaml
  1. Upgrade the chart to version 0.2.0 and observe the failure
$ helm upgrade gardener-metrics-exporter ./charts/gardener-metrics-exporter
Error: UPGRADE FAILED: cannot patch "gardener-metrics-exporter" with kind Deployment: Deployment.apps "gardener-metrics-exporter" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app":"gardener", "chart":"runtime-0.2.0", "heritage":"Helm", "release":"gardener-metrics-exporter", "role":"metrics-exporter"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable

Anything else we need to know:

Note that fixing this will require changing the immutable selector itself.
I.e., operators will need to delete the deployment once before upgrading to a newer version, where the selector will not change again.
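
The underlying problem can be sketched in Go: two label sets that differ only in the chart label are not equal, so using the full set as a selector changes it on every chart version bump. The label values are taken from the output above; the helpers labelsFor, withoutKeys and sameLabels are hypothetical, and dropping the chart label from the selector is only one possible fix, not necessarily the one the chart adopted.

```go
package main

import "fmt"

// labelsFor builds the label set shown in the issue for a given chart version.
func labelsFor(chartVersion string) map[string]string {
	return map[string]string{
		"app":      "gardener",
		"chart":    "runtime-" + chartVersion,
		"heritage": "Helm",
		"release":  "gardener-metrics-exporter",
		"role":     "metrics-exporter",
	}
}

// withoutKeys returns a copy of labels with the given keys removed.
func withoutKeys(labels map[string]string, keys ...string) map[string]string {
	drop := make(map[string]bool, len(keys))
	for _, k := range keys {
		drop[k] = true
	}
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if !drop[k] {
			out[k] = v
		}
	}
	return out
}

// sameLabels reports whether two label sets are identical.
func sameLabels(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if bv, ok := b[k]; !ok || bv != v {
			return false
		}
	}
	return true
}

func main() {
	oldLabels, newLabels := labelsFor("0.27.0"), labelsFor("0.28.0")
	// The full label sets differ, which is what makes the upgrade fail.
	fmt.Println("full label sets equal:", sameLabels(oldLabels, newLabels))
	// Without the version-specific label the selector stays stable.
	fmt.Println("equal without chart label:",
		sameLabels(withoutKeys(oldLabels, "chart"), withoutKeys(newLabels, "chart")))
}
```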

Environment:

Static code analysis

Gardener informs its stakeholders in its CNCF CII Badge that static code checks are applied using Checkmarx. This repository has findings that have to be assessed by the component owner(s). As required, all high-priority findings have already been assessed. The timeline for assessing the remaining medium-priority findings is in the Wiki (restricted access). For the time being, you can ignore the low-priority findings. Background information and a link to the Checkmarx project for your repository are also in the Wiki (restricted access), along with instructions for obtaining the Checkmarx user required to do your assessment in the Checkmarx Web UI.

garden_shoot_node_max_total and min should contain the OS Version found in the shoot spec

What would you like to be added:
The metrics garden_shoot_node_max_total and garden_shoot_node_min_total should contain the OS version. This information is found in the shoot spec:

image:
  name: gardenlinux
  version: 27.0.0

but is not found in the monitoring.

Why is this needed:
This is needed for auditing and compliance reasons to monitor which versions of the OS exist in a gardener landscape.

Remove metric garden_shoot_response_duration_milliseconds

What would you like to be added:
garden_shoot_response_duration_milliseconds should be removed.
Why is this needed:
The metric garden_shoot_response_duration_milliseconds is no longer needed since gardener no longer tracks the response time in the shoot object.

metric decoration improvements

We are thinking about using this project to build dashboards, but some missing tags keep us from building more effective ones.

Some improvement ideas:

  • garden_shoot_condition and garden_shoot_operation_states should also have seed and iaas tags so that we can filter or group the ControlPlaneHealthy conditions and garden_shoot_operation_states by seed or cloud. I'd vote for adding project, seed and iaas tags to all shoot-specific metrics.
  • garden_shoot_operation_states should have a state tag that tells how the last operation ended.

For both of the above, garden_shoot_operations_total carries the required extra information, but join-like operations in Prometheus are not really feasible. To keep dashboards simple, the tags should be available directly on the metrics themselves rather than collected through extra queries.

Here is an example of what we have today:

garden_shoot_condition{condition="EveryNodeReady",mail_to="",name="axeda",operation="Reconcile",project="consus",purpose=""} 1
garden_shoot_condition{condition="SystemComponentsHealthy",mail_to="",name="axeda",operation="Reconcile",project="consus",purpose=""} 1
garden_shoot_info{iaas="azure",name="axeda",project="consus",region="eastus",seed="azure-eastus-01",version="1.11.2"} 0
garden_shoot_node_max_total{name="axeda",project="consus"} 10
garden_shoot_node_min_total{name="axeda",project="consus"} 2
garden_shoot_operation_states{mail_to="",name="axeda",operation="Create",project="consus"} 0
garden_shoot_operation_states{mail_to="",name="axeda",operation="Delete",project="consus"} 0
garden_shoot_operation_states{mail_to="",name="axeda",operation="Reconcile",project="consus"} 1

Handle duplicate shoot/seed conditions

What happened:
If a shoot or seed has duplicate conditions, the gardener-metrics-exporter (gme) fails with an error and does not expose any metrics.
For example, if a shoot has the following status, the gme will not be able to expose any metrics.

status:
  conditions:
  - lastTransitionTime: "2020-08-06T04:09:26Z"
    lastUpdateTime: "2020-08-06T08:56:29Z"
    message: API server /healthz endpoint responded with success status code. [response_time:2ms]
    reason: HealthzRequestSucceeded
    status: "True"
    type: APIServerAvailable
  - lastTransitionTime: "2020-08-06T04:09:26Z"
    lastUpdateTime: "2020-08-06T08:56:29Z"
    message: All control plane components are healthy.
    reason: ControlPlaneRunning
    status: "True"
    type: ControlPlaneHealthy
  - lastTransitionTime: "2020-08-06T04:09:26Z"
    lastUpdateTime: "2020-08-06T08:56:29Z"
    message: Every node registered to the cluster is ready.
    reason: EveryNodeReady
    status: "True"
    type: EveryNodeReady
  - lastTransitionTime: "2020-08-06T06:55:55Z"
    lastUpdateTime: "2020-08-06T08:56:29Z"
    message: All system components are healthy.
    reason: SystemComponentsRunning
    status: "True"
    type: SystemComponentsHealthy
  - lastTransitionTime: "2020-08-05T14:13:10Z"
    lastUpdateTime: "2020-08-05T14:13:10Z"
    message: Gardenlet is posting ready status.
    reason: GardenletReady
    status: "True"
    type: GardenletReady
  - lastTransitionTime: null
    lastUpdateTime: null
    message: ""
    reason: ""
    status: ""
    type: ""
  - lastTransitionTime: null
    lastUpdateTime: null
    message: ""
    reason: ""
    status: ""
    type: ""

In the example above there are two duplicate conditions that contain no data.
What you expected to happen:
Regardless of how duplicate conditions got there, the gme should be able to handle this without having an error. It could skip this metric and log an error instead of not exposing metrics.

How to reproduce it (as minimally and precisely as possible):
Add two duplicate conditions to a shoot/seed.
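
One way to tolerate such input is to deduplicate conditions before turning them into metrics. The Go sketch below assumes a trimmed-down, hypothetical Condition type and helper name; it is not the exporter's actual code. It keeps the first condition of each type and returns the rest so the caller can log them instead of failing.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down stand-in for a Gardener condition.
type Condition struct {
	Type   string
	Status string
}

// uniqueConditions keeps the first condition of each type. Conditions with an
// empty type or a type that was already seen are returned as skipped, so the
// caller can log them and still expose metrics for the rest.
func uniqueConditions(conds []Condition) (kept, skipped []Condition) {
	seen := make(map[string]bool, len(conds))
	for _, c := range conds {
		if c.Type == "" || seen[c.Type] {
			skipped = append(skipped, c)
			continue
		}
		seen[c.Type] = true
		kept = append(kept, c)
	}
	return kept, skipped
}

func main() {
	conds := []Condition{
		{Type: "APIServerAvailable", Status: "True"},
		{Type: "EveryNodeReady", Status: "True"},
		{Type: "", Status: ""},
		{Type: "", Status: ""},
	}
	kept, skipped := uniqueConditions(conds)
	fmt.Printf("kept %d conditions, skipped %d\n", len(kept), len(skipped))
}
```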

Minimum node count is Maximum node count and other way around. (wrong counting)

What happened:

  • The code does the following:
        nodeCountMax += worker.Minimum
        nodeCountMin += worker.Maximum

This is wrong: minimums should be added to the min count and maximums to the max count.

What you expected to happen:

  • I would expect to see the min in min and max in max
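
A minimal Go sketch of the expected behaviour, assuming a simplified, hypothetical Worker type with Minimum and Maximum fields as in the quoted lines:

```go
package main

import "fmt"

// Worker is a hypothetical, simplified stand-in for a Shoot worker group.
type Worker struct {
	Name    string
	Minimum int
	Maximum int
}

// nodeCounts sums the minimum and maximum node counts over all worker groups.
func nodeCounts(workers []Worker) (nodeCountMin, nodeCountMax int) {
	for _, worker := range workers {
		nodeCountMin += worker.Minimum // min is added to min
		nodeCountMax += worker.Maximum // max is added to max
	}
	return nodeCountMin, nodeCountMax
}

func main() {
	workers := []Worker{
		{Name: "pool-a", Minimum: 1, Maximum: 3},
		{Name: "pool-b", Minimum: 2, Maximum: 5},
	}
	lo, hi := nodeCounts(workers)
	fmt.Println("min:", lo, "max:", hi) // prints "min: 3 max: 8"
}
```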

How to reproduce it (as minimally and precisely as possible):

  • Run gardener
  • install exporter
  • check a shoot worker minimum and maximum node count with the response of the exporter

Anything else we need to know:

Environment:

  • Gardener - OpenStack

Do not exit on errors

What would you like to be added:

If the gardener metrics exporter has an error it should not exit. Instead it should log the error and try to continue. See this:

if err := run(ctx, &options); err != nil {
	log.Error(err.Error())
	os.Exit(1)
}

Why is this needed:

If there is a recurring issue, the gardener metrics exporter will crash and never properly recover. See #35
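
One possible shape for this, sketched in Go under the assumption that run can safely be called again after a failure: log the error, wait, and retry until the context is cancelled. The names runWithRetry and demo, the logging callback and the fixed delay are illustrative choices, not the exporter's actual code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithRetry calls run until it succeeds or ctx is cancelled. Each failure
// is reported through logf, followed by a pause of the given delay.
func runWithRetry(ctx context.Context, delay time.Duration,
	logf func(format string, args ...any), run func(context.Context) error) error {
	for {
		err := run(ctx)
		if err == nil {
			return nil
		}
		logf("run failed, retrying in %s: %v", delay, err)
		select {
		case <-ctx.Done():
			// Stop retrying once the process is asked to shut down.
			return ctx.Err()
		case <-time.After(delay):
		}
	}
}

// demo runs a task that fails twice before succeeding and returns the number
// of attempts together with the final error.
func demo() (int, error) {
	attempts := 0
	err := runWithRetry(context.Background(), 10*time.Millisecond,
		func(format string, args ...any) { fmt.Printf(format+"\n", args...) },
		func(context.Context) error {
			attempts++
			if attempts < 3 {
				return errors.New("transient failure")
			}
			return nil
		})
	return attempts, err
}

func main() {
	attempts, err := demo()
	fmt.Println("attempts:", attempts, "err:", err)
}
```

A fixed delay keeps the sketch short; in a real deployment an increasing backoff would likely be a better fit for a persistent failure.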

The `is_seed` label no longer works for seeds

What happened:

The is_seed label is no longer set to true for seeds. This is due to the introduction of the managed seed resource and deprecation of the shoot.gardener.cloud/use-as-seed annotation. See code.

What you expected to happen:
is_seed label should be true for seeds.

How to reproduce it (as minimally and precisely as possible):

Run gardener with managed seeds and exclude the shoot.gardener.cloud/use-as-seed annotation.

Missing metric in readme.md - `garden_shoot_node_info` / rename

What would you like to be added:

The metric garden_shoot_node_info is currently missing in the readme.md.
I also propose renaming it to garden_shoot_worker_info, since it describes the worker group rather than each individual worker node, although the information should be the same for every node in the group.

Why is this needed:

Improving the readme with the metrics the exporter supplies.

AKS Cluster (garden and seed) upgrade to k8s v1.12.5 is not supported by Gardener v0.15.2

Upgrading an Azure AKS cluster (garden and seed) to k8s v1.12.5 is not supported by Gardener v0.15.2, although according to our chat on Slack it should be.

praveend [12:58 PM]
@rfranzke which version of gardener supports AKS v1.12.5, didn't find much information in gardener release notes. Upgrading gardener is the only option as Azure doesn't support AKS k8s downgrade.

rfranzke [12:59 PM]
Any version of Gardener that supports 1.12 itself for shoot clusters, so any version 0.13.0+

Gardener Version: 0.15.2
AKS version: 1.12.5 (upgraded from 1.11.5)

Log from metrics server

$ kubectl logs gardener-metrics-exporter-5f5dbf6767-cvbvp -n garden
time="2019-02-18T11:21:36Z" level=error msg="Kubernetes cluster has version v1.12.5 which is not supported"

gardener-metrics-exporter fails with msg="open : no such file or directory"

Latest version of the exporter fails with the error:
time="2018-10-04T13:20:03Z" level=error msg="open : no such file or directory"

I'm using the included Helm chart to deploy, and deployment succeeded with the previous version, but the latest gardener-metrics-exporter refuses to start and reports the aforementioned error.

I did a quick test locally, and I think it expects a kubeconfig file to be specified. I think this is unnecessary for a pod that is usually intended to run in the same namespace as the gardener-controller-manager pod (although the kubeconfig option is useful for debugging), or maybe I'm misunderstanding the purpose of the kubeconfig file.

How should the deployment be done in the latest version? I'm happy to help fix the Helm chart.
