
component-openshift4-monitoring's Introduction

Commodore Component: OpenShift4 Monitoring

This is a Commodore Component for OpenShift4 Monitoring.

This repository is part of Project Syn. For documentation on Project Syn and this component, see syn.tools.

Documentation

The rendered documentation for this component is available on the Commodore Components Hub.

Documentation for this component is written using Asciidoc and Antora. It can be found in the docs folder. We use the Divio documentation structure to organize our documentation.

Run the make docs-serve command in the root of the project, and then browse to http://localhost:2020 to see a preview of the current state of the documentation.

After writing the documentation, please use the make docs-vale command and correct any warnings raised by the tool.

Contributing and license

This library is licensed under BSD-3-Clause. For information about how to contribute, see CONTRIBUTING.

component-openshift4-monitoring's People

Contributors

bastjan, ccremer, corvus-ch, debakelorakel, glrf, haasad, happytetrahedron, megian, mhutter, renovate-bot, schemen, simu, srueg, thebiglee, vshn-renovate


component-openshift4-monitoring's Issues

Openshift Console is unable to access prometheus

The OpenShift console, managed by https://github.com/appuio/component-openshift4-console, tries to access Prometheus to display data such as cluster utilization graphs. The default network policies prohibit this, so the openshift-console UI takes a very long time to load and monitoring information is not displayed.

Steps to Reproduce the Problem

  1. Log in to openshift-console
  2. Go to Administrator -> Overview
  3. Cluster Utilization graphs are empty and a lot of 502 errors are returned

Actual Behavior

openshift-console is unable to access prometheus and can therefore not display monitoring information

Expected Behavior

Requests from openshift-console to openshift-monitoring should be allowed so monitoring and cluster utilization can be displayed in the openshift console.
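One way to express the expected behavior is a NetworkPolicy in the openshift-monitoring namespace that admits ingress traffic from the console namespace. The following is a minimal sketch, not the component's actual fix. The policy name is hypothetical. The sketch assumes the console runs in a namespace named openshift-console and that namespaces carry the kubernetes.io/metadata.name label, which recent Kubernetes versions set automatically; older clusters would need a different namespace label.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  # Hypothetical name, chosen for illustration only
  name: allow-from-openshift-console
  namespace: openshift-monitoring
spec:
  # Empty podSelector: the policy applies to all pods in openshift-monitoring
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Assumption: the console runs in a namespace with this name
              kubernetes.io/metadata.name: openshift-console
```

Because network policies are additive, a policy like this would only open the extra path and leave the existing default policies untouched.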

Disabling the component removes default monitoring stack

When this component gets disabled the monitoring stack installed by default on OCP4 gets removed. The monitoring stack seems not to be installable outside a cluster installation.

Steps to Reproduce the Problem

  1. In a tenant repo disable the component for a cluster (~openshift4-monitoring)
  2. Commodore compile and push

Actual Behavior

ArgoCD removes the openshift-monitoring namespace and thus removes the whole OpenShift monitoring stack

Expected Behavior

Only additional configuration provided through this component is removed when disabling it.

Switch from hack to regular Prometheus rules

The component uses the hack Prometheus rules https://raw.githubusercontent.com/openshift/elasticsearch-operator/c89d0034c17449fc78420cd0719dde0fda795ae3/hack/prometheus-rules.yaml, which have been revised in openshift/elasticsearch-operator@f238b7d.

Removed Alerts:

  • ElasticsearchNodeDiskLowForSegmentMerges

Renamed Alerts:

  • ElasticsearchBulkRequestsRejectionJumps -> ElasticsearchWriteRequestsRejectionJumps

New alerts:

  • ElasticsearchOperatorCSVNotSuccessful
  • ElasticsearchDiskSpaceRunningLow
  • ElasticsearchHighFileDescriptorUsage

The component is not minor version aware

The component currently assumes that OpenShift 4.5 is running. As a result, some rules do not match other versions, for example OpenShift 4.7. In 4.7 the SDN OVS pods disappeared, but the component still applies the rules for OpenShift 4.5.

The alert NodeWithoutOVSPod has been removed: openshift/cluster-network-operator@614e6e1

Steps to Reproduce the Problem

  1. Install OpenShift 4.7 and use the component version v1.0.0
  2. Apply the component openshift4-monitoring

Actual Behavior

The alert NodeWithoutOVSPod fires. But this is irrelevant, as OpenShift 4.7 does not require these pods anymore.

A simple workaround is:

parameters:
  openshift4_monitoring:
    alerts:
      ignoreNames:
        # OpenShift >=4.7 does not have OVS pods running
        # https://github.com/openshift/cluster-network-operator/commit/614e6e165a783089ecada659a3547a41b0a21b5b
        - NodeWithoutOVSPod

Expected Behavior

The rule import is based on the OpenShift version release-4.x.

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses. Learn more

This repository currently has no open or pending branches.


  • Check this box to trigger a request for Renovate to run again on this repository

Component library has unsupported name

Officially, Commodore only supports component libraries which are named <component-name>.libj?sonnet.

This component, however, provides a component library named prom.libsonnet. While this actually works with the current implementation of Commodore, the library should be called openshift4-monitoring.libsonnet to avoid conflicts if there's ever a component-prom which also provides a component library named prom.libsonnet.

Alert when cluster operator is stuck in progress and or failing

Summary

When altering the cluster version, the cluster operator and other operators have a lot of work to do. That work can get blocked for various reasons. This tends not to put the cluster's availability at risk, so no alerts for service unavailability are triggered. However, such situations should be investigated and resolved.

As a cluster administrator
I want to get notified when the cluster operator remains in state "Progressing" for an extended period
So that I can ensure the cluster can finish up the version change.

Context

This situation is indicated at:

  • the output of oc adm upgrade --as cluster-admin

    info: An upgrade is in progress. Unable to apply 4.5.22: the cluster operator image-registry has not yet successfully rolled out
  • the OpenShift console at Administration > Cluster Settings > Update Status > View Details.

  • in monitoring using the Prometheus time series cluster_operator_conditions

Out of Scope

  • None

Further links

Acceptance criteria

  • Alert rule(s) that cover the subject
  • Guidelines/checklist explaining how to investigate those alerts.

Implementation Ideas

The PromQL expression cluster_operator_conditions{condition="Failing"} > 0 could be put directly into an alert rule with a for of 5 minutes. Other conditions might also be useful, for example Progressing: when an operator has been progressing for days, it would be worth a look.
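As an illustration of this idea, the expression could be wrapped in a PrometheusRule object roughly as follows. This is a sketch under stated assumptions, not a tested rule set: the object name, group name, alert names, severity label and message texts are hypothetical, and the one-day duration for Progressing is only one reading of "the range of days".

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  # Hypothetical object name; the namespace is an assumption
  name: cluster-operator-conditions
  namespace: openshift-monitoring
spec:
  groups:
    - name: cluster-operator-conditions.rules
      rules:
        # Hypothetical alert name; expression and duration are taken from the issue text
        - alert: ClusterOperatorFailing
          expr: cluster_operator_conditions{condition="Failing"} > 0
          for: 5m
          labels:
            severity: warning  # assumption
          annotations:
            message: "Cluster operator {{ $labels.name }} has reported condition Failing for 5 minutes."
        # Hypothetical alert for long-running rollouts; 1d is an assumed threshold
        - alert: ClusterOperatorProgressingTooLong
          expr: cluster_operator_conditions{condition="Progressing"} > 0
          for: 1d
          labels:
            severity: warning  # assumption
          annotations:
            message: "Cluster operator {{ $labels.name }} has been Progressing for more than a day."
```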

Provide OpsGenie Default Configuration in Component

Context

The configuration for OpsGenie has to be copy/pasted to the tenant repository. This component should provide a way to use the default configuration as it is described in https://github.com/appuio/component-openshift4-monitoring/blob/master/docs/modules/ROOT/pages/how-tos/opsgenie.adoc#full-component-configuration without copy/pasting the whole configuration.

In such a default configuration, the support of HTTP proxy vars would be nice.

Alternatives

The current alternative is to copy/paste the configuration which is not only error-prone but will be outdated when there are improvements in the guide.

Support to add a set of labels to a list of alerts

Context

Some alerts are more important than others. We want to route the more important ones to an on-call engineer, while the less important ones are left for reactive handling. To do so, a common label is needed to route those alerts differently. However, the stock alert rules do not come with labels we can work with.

Expose a new parameter that takes a set of alert names. The labels set on another parameter are added to those alerts. Should an alert not exist, the catalog compilation fails with an error message clearly stating what the culprit is.
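To make the request concrete, the configuration might look something like the sketch below. Both parameter names (labelNames and extraLabels) and the oncall label are hypothetical and do not exist in the component; only the surrounding openshift4_monitoring.alerts structure and the two alert names appear elsewhere in this document.

```yaml
parameters:
  openshift4_monitoring:
    alerts:
      # Hypothetical parameter: names of the alerts that should receive the labels.
      # Per the request, compilation would fail if a listed alert does not exist.
      labelNames:
        - ClusterNotUpgradeable
        - PrometheusOperatorSyncFailed
      # Hypothetical parameter: labels added to every alert listed above
      extraLabels:
        oncall: 'true'
```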

Alternatives

Right now, we can make use of patchRules. The way they are implemented, we are not notified should an alert no longer exist or have been renamed in a newer version.

Various SYN_PrometheusOperator* alertrules are missing

Various PrometheusOperator* alert rules are missing their SYN_* equivalents. This leads to missed alerts, for example:

Steps to Reproduce the Problem

On a synfected OpenShift 4 cluster:

  1. Generate an alertManagerConfig with errors (eg syntax error)
  2. Compile the catalog
  3. Wait for the change to be applied
  4. The prometheusOperator will not load the new config due to errors in the config
  5. The AlertRule PrometheusOperatorSyncFailed will fire, but is never escalated to e.g. OpsGenie.

Actual Behavior

Various alert rules regarding the PrometheusOperator are not seen in e.g. OpsGenie

Expected Behavior

All relevant alerts regarding the PrometheusOperator are forwarded to e.g. OpsGenie

Update component to use `patch-operator.libsonnet` directly

Context

The component currently uses resource-locker.libjsonnet in

local ns_patch =
  rl.Patch(
    kube.Namespace(ns),
    {
      metadata: {
        labels: {
          'network.openshift.io/policy-group': 'monitoring',
        } + if std.member(inv.applications, 'networkpolicy') then {
          [inv.parameters.networkpolicy.labels.noDefaults]: 'true',
          [inv.parameters.networkpolicy.labels.purgeDefaults]: 'true',
        } else {},
      },
    }
  );

We should update the code to use patch-operator.libsonnet directly.

SYN_ClusterNotUpgradeable messages not rendered

The "Message" annotation for said AlertRule is not rendered and instead displayed verbatim in Prometheus, AlertManager and Opsgenie.

The "original" upstream AlertRule (ClusterNotUpgradeable) is rendered as expected in Prometheus.

Ping me in Chat for Prometheus link if required

Silence cronjob fails with ErrImagePull / ImagePullBackoff

The silence cronjob uses the Docker image quay.io/appuio/oc, and has the image pinned to a specific sha256:

images:
  oc:
    image: quay.io/appuio/oc
    tag: v4.6@sha256:d7a50d8812bc13e2a1cef119d268ebb051e9ffb53960942180a36982ec59f647

The Docker image is built and pushed to Quay (and DockerHub) as a GitHub action on https://github.com/appuio/container-oc.

From what I can tell, Quay will delete image layers when they are not referenced anymore by any tag. Checking the sha256 referenced in the component results in "Manifest not found". The last update for the v4.6 tag was on Wed, Jan 27, 2021 8:10 AM, cf. https://quay.io/repository/appuio/oc?tab=history.

Because there's currently no Renovate support for Commodore components, the sha256 in the component's default.yml is not updated when the Docker image is rebuilt. Since Quay appears to delete unreferenced layers, users who have the image pinned to an old version of a tag will run into image pull errors.

As a workaround, users of this component can override the image tag to not include the specific sha256:

parameters:
  openshift4_monitoring:
    images:
      oc:
        tag: 'v4.6'

Steps to Reproduce the Problem

  1. Setup OpenShift 4 cluster
  2. Enable Project Syn and deploy this component
  3. Wait for the silence cronjob to be scheduled

Actual Behavior

The silence job fails with ErrImagePull and goes into ImagePullBackoff.

Expected Behavior

Silence job completes successfully

Provide component library to allow other components to configure alert rules to match the requirements of this component

Context

Right now, if another component wants to provide alert rules that work with the framework established by this component, that component needs to ensure that the alerts have the correct labels etc. (see e.g. projectsyn/component-rook-ceph#3)

While components can reimplement the necessary alert labels and so forth, this leads to a lot of duplicated code. Instead this component should provide a component library which has helper functions that will ensure that alerts are correctly configured within the framework established by this component.

Alternatives

As mentioned above, other components can try to keep up with reimplementing the requirements of the framework established by this component.

AlertRule "ClusterNotUpgradeable" provides no value

Current situation:

ClusterNotUpgradeable is firing into Opsgenie:

Description
Alerts Firing:

- Message: One or more cluster operators have been blocking minor version cluster upgrades for at least an hour for reason {{ with $cluster_operator_conditions := "cluster_operator_conditions" | query}}{{range $value := .}}{{if and (eq (label "name" $value) "version") (eq (label "condition" $value) "Upgradeable") (eq (label "endpoint" $value) "metrics") (eq (value $value) 0.0) (ne (len (label "reason" $value)) 0) }}{{label "reason" $value}}.{{end}}{{end}}{{end}} {{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more information refer to {{ label "url" (first $console_url ) }}/settings/cluster/.{{ end }}{{ end }}
Labels:
- alertname = SYN_ClusterNotUpgradeable
- cluster_id = c-xxx-ocp4
- condition = Upgradeable
- endpoint = metrics
- name = version
- prometheus = openshift-monitoring/k8s
- severity = warning
- syn = true
- tenant_id = t-purple-brook-2904
Annotations:
- message = One or more cluster operators have been blocking minor version cluster upgrades for at least an hour for reason {{ with $cluster_operator_conditions := "cluster_operator_conditions" | query}}{{range $value := .}}{{if and (eq (label "name" $value) "version") (eq (label "condition" $value) "Upgradeable") (eq (label "endpoint" $value) "metrics") (eq (value $value) 0.0) (ne (len (label "reason" $value)) 0) }}{{label "reason" $value}}.{{end}}{{end}}{{end}} {{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more information refer to {{ label "url" (first $console_url ) }}/settings/cluster/.{{ end }}{{ end }}
- syn_component = openshift4-monitoring

Upon investigating Prometheus we find that

One or more cluster operators have been blocking minor version cluster upgrades for at least an hour for reason IncompatibleOperatorsInstalled. For more information refer to https://console.apps.xxx/settings/cluster/.

Upon checking the ClusterVersion object we find

Cluster operator operator-lifecycle-manager cannot be upgraded between minor versions: The following operators block OpenShift upgrades: Operator elasticsearch-operator.5.1.0-96 in namespace openshift-operators-redhat is not compatible with OpenShift versions greater than 4.8.0,Operator mtc-operator.v1.5.0 in namespace openshift-migration is not compatible with OpenShift versions greater than 4.8.0

This is great, but there is no minor version greater than 4.8, so this alert becomes noise.

Steps to Reproduce the Problem

Upgrade a cluster to v4.8

Actual Behavior

I get noise in OpsGenie. I cannot close the alert since it is constantly firing.

Expected Behavior

I am informed about actual problems
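Until the alert is made more useful, one possible stopgap is to drop it through the alerts.ignoreNames parameter that another issue in this document uses for NodeWithoutOVSPod. This is only a sketch: it assumes ignoreNames takes the upstream alert name without the SYN_ prefix, as in that other example, and it would also suppress the alert in cases where it is legitimate.

```yaml
parameters:
  openshift4_monitoring:
    alerts:
      ignoreNames:
        # Assumption: the upstream name (without the SYN_ prefix) is expected here.
        # This removes the alert entirely, including for genuine upgrade blockers.
        - ClusterNotUpgradeable
```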
