appuio / component-openshift4-monitoring

Commodore component to manage OpenShift 4 cluster monitoring

License: BSD 3-Clause "New" or "Revised" License

Makefile 9.47% Jsonnet 88.50% Shell 2.03%

commodore-component projectsyn syn openshift openshift4 monitoring hacktoberfest vshn-project-ocp

component-openshift4-monitoring's Introduction

Commodore Component: OpenShift4 Monitoring

This is a Commodore Component for OpenShift4 Monitoring.

This repository is part of Project Syn. For documentation on Project Syn and this component, see syn.tools.

Documentation

The rendered documentation for this component is available on the Commodore Components Hub.

Documentation for this component is written using Asciidoc and Antora. It can be found in the docs folder. We use the Divio documentation structure to organize our documentation.

Run the make docs-serve command in the root of the project, and then browse to http://localhost:2020 to see a preview of the current state of the documentation.

After writing the documentation, please use the make docs-vale command and correct any warnings raised by the tool.

Contributing and license

This library is licensed under BSD-3-Clause. For information about how to contribute, see CONTRIBUTING.

component-openshift4-monitoring's Issues

OpenShift Console is unable to access Prometheus

The OpenShift console managed by https://github.com/appuio/component-openshift4-console tries to access Prometheus to display data such as cluster utilization graphs. The default network policies prohibit this, so the console UI takes a very long time to load and monitoring information is not displayed.

Steps to Reproduce the Problem

Log in to openshift-console
Go to Administrator -> Overview
Cluster Utilization graphs are empty and a lot of 502 errors are returned

Actual Behavior

openshift-console is unable to access Prometheus and therefore cannot display monitoring information.

Expected Behavior

Requests from openshift-console to openshift-monitoring should be allowed so monitoring and cluster utilization can be displayed in the openshift console.

AlertRules in compiled YAML manifests are reordered on each compilation

See #projectsyn-ig (don't want to post customer info here)

Actual Behavior

Each time the catalog is recompiled, alert rules are reordered, causing a new catalog push.

Expected Behavior

Catalog recompilation without actual changes does not cause a catalog push.
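
One way to get stable output is to sort rule groups and rules by a deterministic key before rendering. The Python sketch below only illustrates that idea. The component itself is written in Jsonnet, the issue does not state where the reordering comes from, and the function name `sort_rule_groups` and the chosen sort key are assumptions.

```python
import json


def sort_rule_groups(groups):
    """Return rule groups in a deterministic order.

    Groups are sorted by name; rules inside each group are sorted by
    alert (or record) name, with the expression as tie-breaker.
    """
    def rule_key(rule):
        return (rule.get("alert") or rule.get("record") or "", rule.get("expr", ""))

    return [
        {**group, "rules": sorted(group.get("rules", []), key=rule_key)}
        for group in sorted(groups, key=lambda g: g["name"])
    ]


# The same rules, listed in a different order.
a = [{"name": "g", "rules": [{"alert": "B", "expr": "up == 0"}, {"alert": "A", "expr": "up == 1"}]}]
b = [{"name": "g", "rules": [{"alert": "A", "expr": "up == 1"}, {"alert": "B", "expr": "up == 0"}]}]

# After sorting, both inputs serialize identically, so no spurious diff appears.
print(json.dumps(sort_rule_groups(a), sort_keys=True) == json.dumps(sort_rule_groups(b), sort_keys=True))
```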

Disabling the component removes default monitoring stack

When this component is disabled, the monitoring stack installed by default on OCP4 is removed. It seems the monitoring stack cannot be installed outside of a cluster installation.

Steps to Reproduce the Problem

In a tenant repo disable the component for a cluster (~openshift4-monitoring)
Commodore compile and push

Actual Behavior

ArgoCD removes the openshift-monitoring namespace and thus removes the whole OpenShift monitoring stack

Expected Behavior

Only additional configuration provided through this component is removed when disabling it.

Switch from hack to regular Prometheus rules

The component uses the hack Prometheus rules https://raw.githubusercontent.com/openshift/elasticsearch-operator/c89d0034c17449fc78420cd0719dde0fda795ae3/hack/prometheus-rules.yaml, which have been revised in openshift/elasticsearch-operator@f238b7d.

Removed Alerts:

ElasticsearchNodeDiskLowForSegmentMerges

Renamed Alerts:

ElasticsearchBulkRequestsRejectionJumps -> ElasticsearchWriteRequestsRejectionJumps

New alerts:

ElasticsearchOperatorCSVNotSuccessful
ElasticsearchDiskSpaceRunningLow
ElasticsearchHighFileDescriptorUsage

The component is not minor version aware

The component currently assumes that OpenShift 4.5 is running. As a result, some rules do not match other versions, for example OpenShift 4.7. In 4.7 the SDN OVS pods were removed, but the component still applies the rules for OpenShift 4.5.

The alert NodeWithoutOVSPod has been removed: openshift/cluster-network-operator@614e6e1

Steps to Reproduce the Problem

Install OpenShift 4.7 and use the component version v1.0.0
Apply the component openshift4-monitoring

Actual Behavior

The alert NodeWithoutOVSPod fires. This is irrelevant, as OpenShift 4.7 does not require these pods anymore.

A simple workaround is:

parameters:
  openshift4_monitoring:
    alerts:
      ignoreNames:
        # OpenShift >=4.7 does not have OVS pods running
        # https://github.com/openshift/cluster-network-operator/commit/614e6e165a783089ecada659a3547a41b0a21b5b
        - NodeWithoutOVSPod

Expected Behavior

The rule import is based on the OpenShift version release-4.x.
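
A minimal sketch of what version awareness could look like, written in Python for illustration (the component itself is Jsonnet). The helper names `parse_minor`, `rules_branch` and `version_ignored_alerts` are hypothetical. The only version-specific fact encoded is the one from this issue: NodeWithoutOVSPod no longer applies from 4.7 onwards.

```python
def parse_minor(version):
    """Split an OpenShift version such as '4.7' or '4.7.13' into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def rules_branch(version):
    """Map a cluster version to the branch naming scheme release-4.x."""
    major, minor = parse_minor(version)
    return f"release-{major}.{minor}"


def version_ignored_alerts(version):
    """Alerts to ignore for the given version (only the case from this issue)."""
    ignored = []
    if parse_minor(version) >= (4, 7):
        # OpenShift >= 4.7 does not have OVS pods running
        ignored.append("NodeWithoutOVSPod")
    return ignored


print(rules_branch("4.7.13"), version_ignored_alerts("4.7.13"))
print(rules_branch("4.5"), version_ignored_alerts("4.5"))
```

Comparing `(major, minor)` tuples of integers rather than strings keeps the comparison correct for versions such as 4.10.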

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses.

This repository currently has no open or pending branches.

Check this box to trigger a request for Renovate to run again on this repository

Component library has unsupported name

Officially, Commodore only supports component libraries which are named <component-name>.libj?sonnet.

This component, however, provides a component library named prom.libsonnet. While this works with the current implementation of Commodore, the library should be called openshift4-monitoring.libsonnet to avoid conflicts if there's ever a component-prom which also provides a component library named prom.libsonnet.
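
The naming rule can be checked mechanically. The following Python sketch tests a file name against the pattern `<component-name>.libj?sonnet`. The function name `is_supported_library_name` is hypothetical and this is not how Commodore itself validates names.

```python
import re


def is_supported_library_name(component_name, filename):
    """Check a file name against the pattern <component-name>.libj?sonnet."""
    pattern = re.escape(component_name) + r"\.libj?sonnet"
    return re.fullmatch(pattern, filename) is not None


# The current library name does not follow the convention ...
print(is_supported_library_name("openshift4-monitoring", "prom.libsonnet"))
# ... while the proposed name does.
print(is_supported_library_name("openshift4-monitoring", "openshift4-monitoring.libsonnet"))
```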

Clarify docs regarding opsgenie credentials

There can only be one Heartbeat integration per team. Clarify the documentation accordingly.

Alert when cluster operator is stuck in progress and/or failing

Summary

When altering the cluster version, the cluster operator and other operators have a lot of work to do. That work can get blocked for various reasons. This tends not to put the cluster's availability at risk, so no alerts due to service unavailability are triggered. However, such situations should be investigated and resolved.

As a cluster administrator
I want to get notified when the cluster operator remains in state "Progressing" for an extended period
So that I can ensure the cluster can finish up the version change.

Context

This situation is indicated in:

the output of oc adm upgrade --as cluster-admin

info: An upgrade is in progress. Unable to apply 4.5.22: the cluster operator image-registry has not yet successfully rolled out

the OpenShift console at Administration > Cluster Settings > Update Status > View Details.
in monitoring using the Prometheus time series cluster_operator_conditions

Out of Scope

Further links

https://kb.vshn.ch/oc4/how-tos/update_maintenance.html

Acceptance criteria

Alert rule(s) that cover the subject
Guidelines/checklist explaining how to investigate those alerts.

Implementation Ideas

The PromQL expression cluster_operator_conditions{condition="Failing"} > 0 could be put directly into an alert rule with a for of 5 minutes. Other conditions might also be useful, for example Progressing: when an operator has been progressing for days, it is worth a look.
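
As a rough sketch of this idea, the Python snippet below builds the two rules as dictionaries in the layout of a Prometheus rule file and prints them as JSON. The alert names, the group name, the severity label and the one-day threshold for Progressing are assumptions chosen for illustration. Only the Failing expression and its 5 minute duration come from the text above.

```python
import json


def operator_condition_rule(alert, condition, duration, severity="warning"):
    """Build an alerting rule dict in the Prometheus rule-file layout."""
    return {
        "alert": alert,
        "expr": f'cluster_operator_conditions{{condition="{condition}"}} > 0',
        "for": duration,
        "labels": {"severity": severity},
    }


rules = [
    # From the idea above: Failing for 5 minutes.
    operator_condition_rule("ClusterOperatorFailing", "Failing", "5m"),
    # Assumption: one day stands in for "progressing for days".
    operator_condition_rule("ClusterOperatorProgressingTooLong", "Progressing", "1d"),
]

print(json.dumps({"groups": [{"name": "cluster-operators", "rules": rules}]}, indent=2))
```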

Provide OpsGenie Default Configuration in Component

Context

The configuration for OpsGenie has to be copy/pasted to the tenant repository. This component should provide a way to use the default configuration as it is described in https://github.com/appuio/component-openshift4-monitoring/blob/master/docs/modules/ROOT/pages/how-tos/opsgenie.adoc#full-component-configuration without copy/pasting the whole configuration.

In such a default configuration, the support of HTTP proxy vars would be nice.

Alternatives

The current alternative is to copy/paste the configuration which is not only error-prone but will be outdated when there are improvements in the guide.

Support to add a set of labels to a list of alerts

Context

Some alerts are more important than others. The more important ones, we want to route to an on call engineer while the less important ones are left for reactive handling. In order to do so, a common label is needed to route those alerts differently. However the stock alert rules do not come with labels we can work with.

Expose a new parameter that takes a set of alert names. The labels set in another parameter are added to those alerts. Should an alert not exist, the catalog compilation fails with an error message clearly stating which alert is the culprit.
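
The requested behaviour can be sketched as follows. This is a Python illustration of the logic only, not the component's Jsonnet implementation. The function name `add_labels` and the `oncall` label are hypothetical, and the two alert names merely serve as sample data.

```python
def add_labels(rules, alert_names, labels):
    """Return a copy of rules with labels added to the named alerts.

    Raises ValueError naming every requested alert that does not exist.
    """
    known = {rule["alert"] for rule in rules if "alert" in rule}
    missing = sorted(set(alert_names) - known)
    if missing:
        raise ValueError("unknown alert(s): " + ", ".join(missing))

    wanted = set(alert_names)
    return [
        {**rule, "labels": {**rule.get("labels", {}), **labels}}
        if rule.get("alert") in wanted
        else rule
        for rule in rules
    ]


rules = [
    {"alert": "KubeNodeNotReady", "labels": {"severity": "warning"}},
    {"alert": "Watchdog"},
]
print(add_labels(rules, ["KubeNodeNotReady"], {"oncall": "true"}))
```

Failing on unknown names is what distinguishes this from a silent patch: a renamed or removed upstream alert is reported at compile time rather than going unnoticed.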

Alternatives

Right now, we can make use of patchRules. The way they are implemented, we are not notified should an alert no longer exist or have been renamed in a newer version.

Make Alertmanager configurable by this component

Context

The OpenShift 4 documentation explains how to configure Alertmanager. This is done by creating the secret alertmanager-main within the namespace openshift-monitoring.

The key alertmanager.yaml within that secret should be configurable by this component.
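
A minimal sketch of the resulting object, built in Python for illustration. The secret name, namespace and key come from the text above. The helper name `alertmanager_secret` and the one-receiver Alertmanager config are assumptions, and the config is serialized as JSON, which is also valid YAML.

```python
import json


def alertmanager_secret(config):
    """Wrap an Alertmanager config in the Secret described above."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "alertmanager-main", "namespace": "openshift-monitoring"},
        # stringData lets the API server do the base64 encoding.
        "stringData": {"alertmanager.yaml": json.dumps(config, indent=2)},
    }


# Assumption: a minimal config with a single catch-all receiver.
config = {"route": {"receiver": "default"}, "receivers": [{"name": "default"}]}
print(json.dumps(alertmanager_secret(config), indent=2))
```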

Various SYN_PrometheusOperator* alertrules are missing

Various PrometheusOperator* alert rules are missing the SYN_* equivalent. This leads to missed alerts, for example:

Steps to Reproduce the Problem

On a synfected OpenShift 4 cluster:

Generate an alertManagerConfig with errors (e.g. a syntax error)
Compile the catalog
Wait for the change to be applied
The Prometheus Operator will not load the new config due to the errors in it
The alert rule PrometheusOperatorSyncFailed will fire, but is never escalated to e.g. OpsGenie.

Actual Behavior

Various alert rules regarding the Prometheus Operator are not seen in e.g. OpsGenie.

Expected Behavior

All relevant alerts regarding the Prometheus Operator are forwarded to e.g. OpsGenie.
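
For illustration, deriving a SYN_ variant from an upstream rule could look like the Python sketch below. The `SYN_` prefix, the `syn = true` label and the `syn_component` annotation mirror the SYN_ClusterNotUpgradeable alert quoted in another issue on this page. The function name `syn_rule` is hypothetical and the component's real Jsonnet logic may differ.

```python
def syn_rule(rule, component="openshift4-monitoring"):
    """Derive the SYN_ variant of an upstream alert rule."""
    return {
        **rule,
        "alert": "SYN_" + rule["alert"],
        "labels": {**rule.get("labels", {}), "syn": "true"},
        "annotations": {**rule.get("annotations", {}), "syn_component": component},
    }


upstream = {"alert": "PrometheusOperatorSyncFailed", "labels": {"severity": "warning"}}
print(syn_rule(upstream))
```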

Update component to use `patch-operator.libsonnet` directly

Context

The component currently uses resource-locker.libjsonnet in

component-openshift4-monitoring/component/main.jsonnet

Lines 22 to 35 in 3f05771

local ns_patch =
  rl.Patch(
    kube.Namespace(ns),
    {
      metadata: {
        labels: {
          'network.openshift.io/policy-group': 'monitoring',
        } + if std.member(inv.applications, 'networkpolicy') then {
          [inv.parameters.networkpolicy.labels.noDefaults]: 'true',
          [inv.parameters.networkpolicy.labels.purgeDefaults]: 'true',
        } else {},
      },
    }
  );

We should update the code to use patch-operator.libsonnet directly.

SYN_ClusterNotUpgradeable messages not rendered

The "Message" annotation for said AlertRule is not rendered and instead displayed verbatim in Prometheus, AlertManager and Opsgenie.

The "original" upstream AlertRule (ClusterNotUpgradeable) is rendered as expected in Prometheus.

Ping me in Chat for Prometheus link if required

Silence cronjob fails with ErrImagePull / ImagePullBackoff

The silence cronjob uses the Docker image quay.io/appuio/oc, and has the image pinned to a specific sha256:

component-openshift4-monitoring/class/defaults.yml

Lines 67 to 70 in f8c38ba

images:
  oc:
    image: quay.io/appuio/oc
    tag: v4.6@sha256:d7a50d8812bc13e2a1cef119d268ebb051e9ffb53960942180a36982ec59f647

The Docker image is built and pushed to Quay (and DockerHub) as a GitHub action on https://github.com/appuio/container-oc.

From what I can tell, Quay will delete image layers when they are not referenced anymore by any tag. Checking the sha256 referenced in the component results in "Manifest not found". The last update for the v4.6 tag was on Wed, Jan 27, 2021 8:10 AM, cf. https://quay.io/repository/appuio/oc?tab=history.

Because there's currently no Renovate support for Commodore components, the sha256 in the component's defaults.yml is not updated when the Docker image is rebuilt. Since Quay appears to delete unreferenced layers, users who have the image pinned to an old version of a tag will run into image pull errors.

As a workaround, users of this component can override the image tag to not include the specific sha256:

parameters:
  openshift4_monitoring:
    images:
      oc:
        tag: 'v4.6'
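
The pinned value combines a tag and a digest, separated by `@`. The small Python sketch below splits the two parts, which is what the workaround does by hand. The helper name `split_image_tag` is hypothetical.

```python
def split_image_tag(tag):
    """Split 'v4.6@sha256:...' into (tag, digest); digest is None if absent."""
    name, sep, digest = tag.partition("@")
    return name, (digest if sep else None)


pinned = "v4.6@sha256:d7a50d8812bc13e2a1cef119d268ebb051e9ffb53960942180a36982ec59f647"
print(split_image_tag(pinned)[0])  # the tag without the digest
print(split_image_tag("v4.6"))     # an unpinned tag has no digest
```

Dropping the digest trades reproducibility for availability: the floating tag always resolves, but it may point to a different image after each rebuild.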

Steps to Reproduce the Problem

Set up an OpenShift 4 cluster
Enable Project Syn and deploy this component
Wait for the silence cronjob to be scheduled

Actual Behavior

The silence job fails with ErrImagePull and goes into ImagePullBackoff.

Expected Behavior

Silence job completes successfully

Provide component library to allow other components to configure alert rules to match the requirements of this component

Context

Right now, if another component wants to provide alert rules that work with the framework established by this component, that component needs to ensure that the alerts have the correct labels etc. (see e.g. projectsyn/component-rook-ceph#3)

While components can reimplement the necessary alert labels and so forth, this leads to a lot of duplicated code. Instead this component should provide a component library which has helper functions that will ensure that alerts are correctly configured within the framework established by this component.

Alternatives

As mentioned above, other components can try to keep up with reimplementing the requirements of the framework established by this component.

AlertRule "ClusterNotUpgradeable" provides no value

Current situation:

ClusterNotUpgradeable is firing into Opsgenie:

Description
Alerts Firing:

- Message: One or more cluster operators have been blocking minor version cluster upgrades for at least an hour for reason {{ with $cluster_operator_conditions := "cluster_operator_conditions" | query}}{{range $value := .}}{{if and (eq (label "name" $value) "version") (eq (label "condition" $value) "Upgradeable") (eq (label "endpoint" $value) "metrics") (eq (value $value) 0.0) (ne (len (label "reason" $value)) 0) }}{{label "reason" $value}}.{{end}}{{end}}{{end}} {{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more information refer to {{ label "url" (first $console_url ) }}/settings/cluster/.{{ end }}{{ end }}
Labels:
- alertname = SYN_ClusterNotUpgradeable
- cluster_id = c-xxx-ocp4
- condition = Upgradeable
- endpoint = metrics
- name = version
- prometheus = openshift-monitoring/k8s
- severity = warning
- syn = true
- tenant_id = t-purple-brook-2904
Annotations:
- message = One or more cluster operators have been blocking minor version cluster upgrades for at least an hour for reason {{ with $cluster_operator_conditions := "cluster_operator_conditions" | query}}{{range $value := .}}{{if and (eq (label "name" $value) "version") (eq (label "condition" $value) "Upgradeable") (eq (label "endpoint" $value) "metrics") (eq (value $value) 0.0) (ne (len (label "reason" $value)) 0) }}{{label "reason" $value}}.{{end}}{{end}}{{end}} {{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more information refer to {{ label "url" (first $console_url ) }}/settings/cluster/.{{ end }}{{ end }}
- syn_component = openshift4-monitoring

Upon investigating Prometheus we find that

One or more cluster operators have been blocking minor version cluster upgrades for at least an hour for reason IncompatibleOperatorsInstalled. For more information refer to https://console.apps.xxx/settings/cluster/.

Upon checking the ClusterVersion object we find

Cluster operator operator-lifecycle-manager cannot be upgraded between minor versions: The following operators block OpenShift upgrades: Operator elasticsearch-operator.5.1.0-96 in namespace openshift-operators-redhat is not compatible with OpenShift versions greater than 4.8.0,Operator mtc-operator.v1.5.0 in namespace openshift-migration is not compatible with OpenShift versions greater than 4.8.0

This is great, but there is no minor version greater than 4.8, so this alert becomes noise.

Steps to Reproduce the Problem

Upgrade a cluster to v4.8

Actual Behavior

I get noise in OpsGenie. I cannot close the alert since it is constantly firing.

Expected Behavior

I am informed about actual problems
