
cluster-version-operator's Introduction

Cluster Version Operator

The Cluster Version Operator (CVO) is one of the Cluster Operators that run in every OpenShift cluster. CVO consumes an artifact called a "release payload image," which represents a specific version of OpenShift. The release payload image contains the resource manifests necessary for the cluster to function, such as the manifests for all Cluster Operators. CVO reconciles the resources within the cluster to match the manifests in the release payload image, and in doing so it implements cluster upgrades: after being provided a release payload image for a newer OpenShift version, CVO reconciles all Cluster Operators to their updated versions, and the Cluster Operators similarly update their operands.

OpenShift Upgrades

For information about upgrading OpenShift clusters, please see the respective product documentation.

ClusterVersion Resource

Like other Cluster Operators, the Cluster Version Operator is configured by a Config API resource in the cluster: a ClusterVersion:

$ oc explain clusterversion
  KIND:     ClusterVersion
  VERSION:  config.openshift.io/v1

  DESCRIPTION:
       ClusterVersion is the configuration for the ClusterVersionOperator. This is
       where parameters related to automatic updates can be set. Compatibility
       level 1: Stable within a major release for a minimum of 12 months or 3
       minor releases (whichever is longer).

  FIELDS:
     ...
     spec	<Object> -required-
       spec is the desired state of the cluster version - the operator will work
       to ensure that the desired version is applied to the cluster.

     status	<Object>
       status contains information about the available updates and any in-progress
       updates.

The ClusterVersion resource follows the established Kubernetes pattern where the spec property describes the desired state that CVO should achieve and maintain, and the status property is populated by CVO to describe its status and the observed state.

In a typical OpenShift cluster, there will be a cluster-scoped ClusterVersion resource called version:

$ oc get clusterversion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.17   True        False         6d5h    Cluster version is 4.11.17

Note that as a user or a cluster administrator, you usually do not interact with the ClusterVersion resource directly but via either the oc adm upgrade CLI or the web console.

Understanding Upgrades

💡 This section is only a high-level overview. See the Update Process and Reconciliation documents in the dev-guide for more details.

The Cluster Version Operator continuously fetches information about upgrade paths for the configured channel from the OpenShift Update Service (OSUS). It stores the recommended update options in the status.availableUpdates field of its ClusterVersion resource.

The intent to upgrade the cluster to another version is expressed by storing the desired version in the spec.desiredUpdate field. When spec.desiredUpdate does not match the current cluster version, CVO will start updating the cluster. It downloads the release payload image, validates it, and systematically reconciles the Cluster Operator resources to match the updated manifests delivered in the release payload image.
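For illustration, the same intent can be expressed programmatically. Below is a minimal sketch assuming the standard openshift/client-go config clientset and a placeholder kubeconfig path; in practice you would normally use oc adm upgrade instead of writing code like this.

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

func main() {
	// Load a kubeconfig; the path is a placeholder for this example.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the cluster-scoped ClusterVersion resource named "version".
	cv, err := client.ConfigV1().ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Express the intent to upgrade; the target version is a placeholder.
	cv.Spec.DesiredUpdate = &configv1.Update{Version: "4.11.18"}
	if _, err := client.ConfigV1().ClusterVersions().Update(context.TODO(), cv, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
}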

Troubleshooting

A typical OpenShift cluster will have a Deployment resource called cluster-version-operator in the openshift-cluster-version namespace, configured to run a single CVO replica. Confirm that its Pod is up and optionally inspect its log:

$ oc get deployment -n openshift-cluster-version cluster-version-operator
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-version-operator   1/1     1            1           2y227d

$ oc get pods -n openshift-cluster-version -l k8s-app=cluster-version-operator
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-version-operator-6885cc574-674n6   1/1     Running   0          6d5h

$ oc logs -n openshift-cluster-version -l k8s-app=cluster-version-operator
...

The CVO follows the Kubernetes API conventions and sets Conditions in the status of its ClusterVersion resource. These conditions are surfaced by both the OpenShift web console and the oc adm upgrade CLI.
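As an illustration of consuming those conditions programmatically, here is a small sketch using the same openshift/client-go config clientset assumed earlier; it is not part of this repository.

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Print the conditions CVO sets on the ClusterVersion resource,
	// such as Available, Progressing, Failing and RetrievedUpdates.
	cv, err := client.ConfigV1().ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, cond := range cv.Status.Conditions {
		fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
	}
}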

Development

Contributions are welcome! Please follow CONTRIBUTING.md and the developer documentation.

cluster-version-operator's People

Contributors

abhinavdahiya, arjunrn, crawford, davoska, deads2k, derekwaynecarr, eggfoobar, enxebre, eparis, guillaumerose, jcpowermac, jottofar, lalatendumohanty, lucab, marun, openshift-ci[bot], openshift-merge-bot[bot], openshift-merge-robot, patrickdillon, paulfantom, petr-muller, qjkee, ravisantoshgudimetla, sdodson, smarterclayton, soltysh, sttts, vikaschoudhary16, vrutkovs, wking

cluster-version-operator's Issues

Question: how to config the CSV to use the latest version of one operator?

Hi, All

I built a cluster using openshift-installer v0.6.0 today, but I find that the OLM manifests are old. How can I use the latest version of the OLM? Details below:

[jzhang@dhcp-140-18 installer]$  oc get clusterversion version -o yaml
...
    current:
      payload: quay.io/openshift-release-dev/ocp-release@sha256:4f02d5c7183360a519a7c7dbe601f58123c9867cd5721ae503072ae62920575b
      version: 0.0.1-2018-12-08-172651
...
[jzhang@dhcp-140-18 installer]$ oc adm release extract --from=quay.io/openshift-release-dev/ocp-release@sha256:4f02d5c7183360a519a7c7dbe601f58123c9867cd5721ae503072ae62920575b --to=release-payload
[jzhang@dhcp-140-18 installer]$ ls release-payload/ | grep 30
0000_30_00-namespace.yaml
0000_30_01-olm-operator.serviceaccount.yaml
0000_30_02-clusterserviceversion.crd.yaml
0000_30_03-installplan.crd.yaml
0000_30_04-subscription.crd.yaml
0000_30_05-catalogsource.crd.yaml
0000_30_06-rh-operators.configmap.yaml
0000_30_07-certified-operators.configmap.yaml
0000_30_08-certified-operators.catalogsource.yaml
0000_30_09-rh-operators.catalogsource.yaml
0000_30_10-olm-operator.deployment.yaml
0000_30_11-catalog-operator.deployment.yaml
0000_30_12-aggregated.clusterrole.yaml
0000_30_13-packageserver.csv.yaml
0000_30_14-operatorgroup.crd.yaml

One more question: if I want to update this CVO to the latest version, how do I do that? The current version is 0.0.1-2018-12-08-172651.

Failure to update image registry service account due to cache sync error

CVO cannot deploy the image registry due to a cache sync error.

CVO payload: registry.svc.ci.openshift.org/openshift/origin-release:v4.0
CVO version: 4.0.0-0.alpha-2018-11-30-060640

Error in logs:

E1130 14:43:49.424010       1 sync.go:133] error running apply for serviceaccount "openshift-image-registry/cluster-image-registry-operator" (v1, 125 of 183): serviceaccounts "cluster-image-registry-operator" is forbidden: caches not synchronized

Proxy Configuration for 4.1

Hi

Is there any way to get the communication to api.openshift.com running through an HTTP proxy on version 4.1.7?

We can't update the cluster over the UI because it can't reach api.openshift.com, as the HTTP proxy does not seem to be configured in the operator.

Everything else on the cluster is configured to run through the proxy.

Regards

Cluster upgrade internal error caused by self signed cert

While upgrading from 4.7.0 to 4.7.9, an x509 error occurred in two places: one when calling api-int.xxxx and one when calling prometheus-operator.openshift-monitoring.svc:8080. I was able to fix the first by manually running update-ca-trust with the cert from the API server, but I am not sure what to do about the second, since it is a cluster-internal URI.

cluster-version-operator log shown below:

I0523 01:03:07.033591       1 cvo.go:481] Started syncing cluster version "openshift-cluster-version/version" (2021-05-23 01:03:07.033585342 +0000 UTC m=+82331.507064819)
I0523 01:03:07.041158       1 cvo.go:510] Desired version from spec is v1.Update{Version:"4.7.9", Image:"quay.io/openshift-release-dev/ocp-release@sha256:5a5433a5f82a10c78783d7aed7d556d26602295ee8e9dcfaba97ebc1ab0bc2ac", Force:false}
I0523 01:03:07.041264       1 sync_worker.go:227] Update work is equal to current target; no change required
I0523 01:03:07.041289       1 status.go:161] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:2, Step:"ApplyResources", Failure:error(nil), Done:8, Total:668, Completed:0, Reconciling:false, Initial:false, VersionHash:"qi_N6BhDM3k=", LastProgress:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, Actual:v1.Release{Version:"4.7.9", Image:"quay.io/openshift-release-dev/ocp-release@sha256:5a5433a5f82a10c78783d7aed7d556d26602295ee8e9dcfaba97ebc1ab0bc2ac", URL:"https://access.redhat.com/errata/RHBA-2021:1365", Channels:[]string(nil)}, Verified:false}
I0523 01:03:07.041331       1 status.go:81] merge into existing history completed=false desired=v1.Release{Version:"4.7.9", Image:"quay.io/openshift-release-dev/ocp-release@sha256:5a5433a5f82a10c78783d7aed7d556d26602295ee8e9dcfaba97ebc1ab0bc2ac", URL:"https://access.redhat.com/errata/RHBA-2021:1365", Channels:[]string{"candidate-4.7", "candidate-4.8", "fast-4.7", "stable-4.7"}} last=&v1.UpdateHistory{State:"Partial", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63757219493, loc:(*time.Location)(0x223c360)}}, CompletionTime:(*v1.Time)(nil), Version:"4.7.9", Image:"quay.io/openshift-release-dev/ocp-release@sha256:5a5433a5f82a10c78783d7aed7d556d26602295ee8e9dcfaba97ebc1ab0bc2ac", Verified:true}
I0523 01:03:07.041474       1 cvo.go:483] Finished syncing cluster version "openshift-cluster-version/version" (7.885146ms)
E0523 01:03:07.136035       1 task.go:112] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 668): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
I0523 01:03:07.193126       1 cvo.go:554] Finished syncing available updates "openshift-cluster-version/version" (162.643079ms

"Setting objects unmanaged" section issues

  • The cluster-network file is 0000_70_cluster-network-operator_03_daemonset.yaml

  • The release-image/0000_07_cluster-network-operator_03_daemonset.yaml file, even though its name says daemonset, is actually a Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: network-operator
  namespace: openshift-network-operator
- op: add
  path: /spec/overrides
  value:
  - kind: Deployment
    group: apps                        <--------- This
    name: cluster-network-operator
    namespace: openshift-network-operator
    unmanaged: true

nodeSelector for Level 1 operators

Hi,

We're trying to set up an OpenShift 4.4.3 cluster for a customer, and they need all of the worker nodes to be dedicated only to their business application pods (apart from daemonsets, of course). But the default installation comes with a bunch of cluster operators bundled in the release image, and their manifests cannot be changed. This leads to some cluster operator pods, like the marketplace operator, being scheduled on worker nodes because they have no nodeSelector, while other operators have only a generic linux nodeSelector. Of course we can create infra nodes and manually change the nodeSelector to move things there, but the manifests will get reset with upgrades and we will have to patch them manually again. Is there a way to properly define nodeSelectors for Level 1 operators such as cluster operators (assuming CVO is the Level 0 operator)? It would be great to get even a small lead on that, and we can contribute if the feature is not there. Thanks

Port conflict when running Calico CNI with Openshift

Hi,
I'm using the Cluster Version Operator on OpenShift 4.12 with the OVN-Kubernetes CNI plugin.
Now I want to test my cluster with Calico, but port 9099 is used by CVO, which prevents the Calico routing module (Felix) from starting.
I would like to know whether stopping CVO can cause any issues in my cluster.
How can I change the listening port of CVO?
Best Regards,

Able to trigger a leader election panic, use of closed channel

TEST_INTEGRATION=1 go test ./pkg/start/ -tags integration -count=4
E0325 22:45:48.713617   10713 sync_worker.go:276] unable to synchronize image (waiting 625ms): Could not update configmap "e2e-cvo-ff4l/config2" (2 of 2): the object is invalid, possibly due to local cluster configuration
E0325 22:45:48.922205   10713 leaderelection.go:256] error initially creating leader election record: namespaces "e2e-cvo-mlm6zv" not found
E0325 22:45:54.301108   10713 event.go:259] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'claytonc-mbp.local_4bb2f5bb-5f1d-4543-90dd-75de07986a26 stopped leading'
panic: close of closed channel

goroutine 63 [running]:
github.com/openshift/cluster-version-operator/pkg/start.(*Options).run.func3()
	/Users/clayton/projects/origin/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:190 +0x74
github.com/openshift/cluster-version-operator/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc0006e4240)
	/Users/clayton/projects/origin/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:148 +0x40
github.com/openshift/cluster-version-operator/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0006e4240, 0x216d1e0, 0xc000184600)
	/Users/clayton/projects/origin/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:157 +0x112
github.com/openshift/cluster-version-operator/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x216d220, 0xc0000da010, 0x2174320, 0xc0001e5d40, 0x14f46b0400, 0xa7a358200, 0x6fc23ac00, 0xc0003b03f0, 0xc0000d13b0, 0x0)
	/Users/clayton/projects/origin/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:166 +0x87
created by github.com/openshift/cluster-version-operator/pkg/start.(*Options).run
	/Users/clayton/projects/origin/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:157 +0x1ef
FAIL	github.com/openshift/cluster-version-operator/pkg/start	75.329s

Does the clusterversion still have `.status.current.image`?

I just ran the command at https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusterversion.md#finding-your-current-update-image , and the output is empty for my cluster, as follows:

[root@ocp42-inf ~]# oc get clusterversion -o jsonpath='{.status.current.image}{"\n"}' version

Then I checked the ClusterVersionStatus API at https://github.com/openshift/api/blob/master/config/v1/types_cluster_version.go#L81 and found that current is no longer a field. Am I missing anything? Should I get this from .status.history[0].image?

[root@ocp42-inf ~]# oc get clusterversion -o jsonpath='{.status.history[0].image}{"\n"}' version
quay.io/openshift-release-dev/ocp-release@sha256:c5337afd85b94c93ec513f21c8545e3f9e36a227f55d41bc1dfb8fcc3f2be129

Ubi image unavailable to public

Hey folks,

The contributing documentation makes no mention of requiring a login to build the container image. In this case it is the registry.ci.openshift.org/ocp/ubi container image that appears to require authentication.

Can you please clarify whether there is an authentication method for public contributors, or whether there should be a different URL there?

Assign a priority class to pods

Priority classes docs:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class

Example: https://github.com/openshift/cluster-monitoring-operator/search?q=priority&unscoped_q=priority

Notes: The pre-configured system priority classes (system-node-critical and system-cluster-critical) can only be assigned to pods in kube-system or openshift-* namespaces. Most likely, core operators and their pods should be assigned system-cluster-critical. Please do not assign system-node-critical (the highest priority) unless you are really sure about it.
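As a rough illustration of the suggestion (not taken from any repository manifest; all names and the image are placeholders), a core operator's Deployment could request the class like this in Go:

package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleDeployment sketches a core-operator Deployment that requests the
// system-cluster-critical priority class, as suggested in the notes above.
func exampleDeployment() *appsv1.Deployment {
	labels := map[string]string{"k8s-app": "example-operator"}
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "example-operator",
			Namespace: "openshift-example-operator", // system classes are only allowed in kube-system or openshift-* namespaces
		},
		Spec: appsv1.DeploymentSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					PriorityClassName: "system-cluster-critical",
					Containers: []corev1.Container{{
						Name:  "operator",
						Image: "example.invalid/operator:latest", // placeholder image
					}},
				},
			},
		},
	}
}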

Question: auto update enabled

Question: How does the CVO monitor for a new image when auto-update is enabled?

How does it look for a new ocp-release image when one is pushed to the container catalog (which would be used in the future for OCP 4.0, if I am not wrong)?

Let me know your thoughts on it.

Some informers do not check if the cache `HasSynced()`

Hi, team! I'm a newbie to OpenShift. While reading the CVO source code, I found that only coInformer's and cvInformer's caches are checked with HasSynced(); the others are not checked. Is that a deliberate design?

optr.coLister = coInformer.Lister()
optr.cacheSynced = append(optr.cacheSynced, coInformer.Informer().HasSynced)
optr.cvLister = cvInformer.Lister()
optr.cacheSynced = append(optr.cacheSynced, cvInformer.Informer().HasSynced)
optr.proxyLister = proxyInformer.Lister()
optr.cmConfigLister = cmConfigInformer.Lister().ConfigMaps(internal.ConfigNamespace)
optr.cmConfigManagedLister = cmConfigManagedInformer.Lister().ConfigMaps(internal.ConfigManagedNamespace)
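If the intent were to wait for every cache, a minimal sketch continuing the fragment above might look like the following. This is purely illustrative, uses cache.WaitForCacheSync from k8s.io/client-go/tools/cache, and is not the actual CVO code.

// Illustrative only: register HasSynced for the remaining informers as well ...
optr.cacheSynced = append(optr.cacheSynced,
	proxyInformer.Informer().HasSynced,
	cmConfigInformer.Informer().HasSynced,
	cmConfigManagedInformer.Informer().HasSynced,
)

// ... and block until every registered cache has synchronized before starting work.
// stopCh is assumed to be the operator's stop channel.
if !cache.WaitForCacheSync(stopCh, optr.cacheSynced...) {
	return fmt.Errorf("caches did not synchronize before shutdown")
}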

How to update object as the old spec seems to be deprecated

To get a list of current overrides, run:

$ oc get -o json clusterversion version | jq .spec.overrides
[
  {
    "kind": "APIService",
    "name": "v1alpha1.packages.apps.redhat.com",
    "unmanaged": true
  }
]

However, the spec does not contain an overrides section:

spec:
  channel: fast
  clusterID: 37be53b4-bdbc-4b65-b76e-ddf9c2b671c6
  upstream: http://localhost:8080/graph

So all the following steps won't work.

Similarly, the command below needs to be updated from:

oc get clusterversion -o jsonpath='{.status.current.payload}{"\n"}' version
to
oc get clusterversion -o jsonpath='{.status.desired.payload}{"\n"}' version

Imagestreams.image.openshift.io "origin-v4.0" not found

The error message imagestreams.image.openshift.io "origin-v4.0" not found is displayed when I execute the following command:

oc adm release new -n openshift --server https://api.ci.openshift.org \
    --from-image-stream=origin-v4.0 \
    --to-image-base=docker.io/abhinavdahiya/origin-cluster-version-operator:latest \
    --to-image docker.io/abhinavdahiya/origin-release:latest

The steps follow the doc here.

How can I execute this command correctly?

Errors while upgrading and sync loop interactions

The sync loop being inline with the status update loop makes upgrades hard to debug (we have to wait for the whole loop to complete before status is updated). Also, if a sync is running and you change the desired state, we need to wait for the sync to complete. There are some other errors that then show up to users and are non-obvious.

Instead, we should decouple the CV status from the sync loop, and make the sync loop cancellable.

To do that, I think we should:

  1. Create a new sync loop, similar to available updates
  2. Have the main loop (the one that reacts to CV changes) issue instructions to the sync loop like: Sync to version X
  3. The sync loop would handle reconciling payloads, rate limiting how often it starts a resync (vs a change to a new version), and reporting completion. On completion, the sync loop would requeue the CV (like available updates loop does).
  4. The main loop would observe the state of the sync loop and update status (like available updates).
  5. During a sync, have the CV status get updated, although no more often than every 30s or so
  6. Use context signalling to stop an update in progress (the code should be structured around a context)

End outcomes:

  1. When a user sets a desired update and we are syncing the previous version, we stop immediately
  2. When the user sets a desired update and we are still processing a previous update, we communicate that in status to the user immediately but continue processing if we aren't hitting a failure
  3. When the user clears desiredUpdate we immediately go back to syncing our current state
  4. Status updates from CVO feel immediate

Part of https://jira.coreos.com/browse/CORS-954
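A generic sketch of the context-based cancellation idea from point 6 could look like the toy example below. It is illustrative only and not the actual CVO sync worker; the types, channel, and timings are all made up for the example.

package main

import (
	"context"
	"fmt"
	"time"
)

// syncWorker is a toy model of a sync loop that can be cancelled as soon as a
// new desired version arrives, as proposed above.
type syncWorker struct {
	desired chan string // new desired versions arrive here
}

func (w *syncWorker) run(parent context.Context) {
	cancel := context.CancelFunc(func() {})
	for {
		select {
		case <-parent.Done():
			cancel()
			return
		case version := <-w.desired:
			// Stop any in-flight sync immediately and start syncing the new target.
			cancel()
			var ctx context.Context
			ctx, cancel = context.WithCancel(parent)
			go w.syncOnce(ctx, version)
		}
	}
}

func (w *syncWorker) syncOnce(ctx context.Context, version string) {
	// Apply "manifests" step by step, checking for cancellation between steps.
	for step := 0; step < 10; step++ {
		select {
		case <-ctx.Done():
			fmt.Printf("sync to %s cancelled at step %d\n", version, step)
			return
		case <-time.After(100 * time.Millisecond): // placeholder for real work
		}
	}
	fmt.Printf("sync to %s complete\n", version)
}

func main() {
	w := &syncWorker{desired: make(chan string)}
	ctx, stop := context.WithCancel(context.Background())
	go w.run(ctx)

	w.desired <- "4.1.0"
	time.Sleep(300 * time.Millisecond)
	w.desired <- "4.1.1" // cancels the in-flight sync to 4.1.0 and starts a new one
	time.Sleep(2 * time.Second)
	stop()
}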

Provide the cluster ID in cluster_version Prometheus metrics

Access to the Cluster Version k8s object requires cluster role access, which makes it difficult to obtain the cluster ID.

Operator-metering is a useful tool for building reports from Prometheus data. For upcoming flows and customer interactions (support, billing, etc) it would be beneficial for the reports to contain the cluster ID.

If the cluster ID was available as label or its own metric in Prometheus that would help to simplify report origination.
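A minimal sketch of exposing the cluster ID as a label on an info-style metric with client_golang follows. It is illustrative only; the metric name, port, and label values are placeholders, not the CVO's actual metrics.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// clusterVersionInfo carries the cluster ID (and version) as labels with a
// constant value of 1, the usual "info metric" pattern.
var clusterVersionInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "cluster_version_info",
	Help: "Constant metric carrying the cluster ID and current version as labels.",
}, []string{"cluster_id", "version"})

func main() {
	prometheus.MustRegister(clusterVersionInfo)

	// The values below are placeholders; a real operator would read them from
	// the ClusterVersion resource (spec.clusterID and status.desired.version).
	clusterVersionInfo.WithLabelValues("00000000-0000-0000-0000-000000000000", "4.11.17").Set(1)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9099", nil))
}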

Missing the specification for usage of --serving-key-file if --listen option is not unset.

While trying to run the CVO locally as per the doc/dev, the section specifies the following command:
./_output/linux/amd64/cluster-version-operator -v5 start --release-image 4.4.0-rc.4

But, as --listen is set to "0.0.0.0:9099" by default, the command fails with the following error unless we append --listen="" to it:
F0926 00:28:13.708624 62174 start.go:24] error: --listen was not set empty, so --serving-cert-file must be set

The documentation should be updated to either specify --serving-cert-file or at least to unset the listen option with --listen="".

a5a76d5 panic: interface conversion: runtime.Object is *unstructured.UnstructuredList, not *unstructured.Unstructured [recovered]

In a libvirt cluster I just launched using:

$ openshift-install version
openshift-install v0.1.0-52-gedc4d97104f7fefbe6ce778d18aaf53299f8af59
Terraform v0.11.8

I'm seeing:

[core@wking-bootstrap ~]$ kubectl logs -n openshift-cluster-version bootstrap-cluster-version-operator-wking-bootstrap
I1005 21:39:48.036769       1 start.go:67] ClusterVersionOperator v0.0.0-97-ga5a76d51-dirty
I1005 21:39:48.037010       1 start.go:180] Loading kube client config from path "/etc/kubernetes/kubeconfig"
...
E1005 21:42:10.909970       1 event.go:259] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cluster-version-operator", GenerateName:"", Namespace:"openshift-cluster-version", SelfLink:"/api/v1/namespaces/openshift-cluster-version/configmaps/cluster-version-operator", UID:"d45bde3c-c8e4-11e8-8408-0214269547a8", ResourceVersion:"9259", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63674371377, loc:(*time.Location)(0x1bf25a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"wking-bootstrap_efc11fa1-5144-45db-a2cb-568952d64f05\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2018-10-05T21:42:10Z\",\"renewTime\":\"2018-10-05T21:42:10Z\",\"leaderTransitions\":8}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap'. Will not report event: 'Normal' 'LeaderElection' 'wking-bootstrap_efc11fa1-5144-45db-a2cb-568952d64f05 became leader'
...
I1005 21:42:12.267560       1 sync.go:24] Running sync for (servicecertsigner.config.openshift.io/v1alpha1, Kind=ServiceCertSignerOperatorConfig) /instance
E1005 21:42:12.534057       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
I1005 21:42:12.556363       1 sync.go:60] Done syncing for (servicecertsigner.config.openshift.io/v1alpha1, Kind=ServiceCertSignerOperatorConfig) /instance
...
I1005 21:42:14.179926       1 sync.go:60] Done syncing for (/v1, Kind=Service) openshift-operator-lifecycle-manager/package-server
I1005 21:42:14.180040       1 sync.go:24] Running sync for (image.openshift.io/v1, Kind=ImageStream) /
I1005 21:42:14.336201       1 request.go:485] Throttling request took 155.956498ms, request: GET:https://wking-api.installer.testing:6443/apis/image.openshift.io/v1/imagestreams
I1005 21:42:14.349166       1 cvo.go:201] Finished syncing operator "openshift-cluster-version/cluster-version-operator" (3.337348292s)
E1005 21:42:14.349428       1 runtime.go:66] Observed a panic: &runtime.TypeAssertionError{interfaceString:"runtime.Object", concreteString:"*unstructured.UnstructuredList", assertedString:"*unstructured.Unstructured", missingMethod:""} (interface conversion: runtime.Object is *unstructured.UnstructuredList, not *unstructured.Unstructured)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:502
/usr/local/go/src/runtime/iface.go:252
/usr/local/go/src/runtime/iface.go:262
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/client-go/dynamic/simple.go:197
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/internal/generic.go:31
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/internal/generic.go:88
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync.go:51
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync.go:33
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:243
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:115
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:173
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:162
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:146
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:2361
panic: interface conversion: runtime.Object is *unstructured.UnstructuredList, not *unstructured.Unstructured [recovered]
	panic: interface conversion: runtime.Object is *unstructured.UnstructuredList, not *unstructured.Unstructured

goroutine 76 [running]:
github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x11562c0, 0xc4202d3e00)
	/usr/local/go/src/runtime/panic.go:502 +0x229
github.com/openshift/cluster-version-operator/vendor/k8s.io/client-go/dynamic.(*dynamicResourceClient).Get(0xc420e173b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/client-go/dynamic/simple.go:197 +0x90f
github.com/openshift/cluster-version-operator/pkg/cvo/internal.applyUnstructured(0x13e3760, 0xc420e173b0, 0xc4200a6ca0, 0xc4200a6ca0, 0x0, 0x0, 0x13e3760)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/internal/generic.go:31 +0x99
github.com/openshift/cluster-version-operator/pkg/cvo/internal.(*genericBuilder).Do(0xc420576180, 0xc420405b80, 0x29f)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/internal/generic.go:88 +0x72
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).syncUpdatePayload.func1(0xa, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync.go:51 +0x241
github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x2540be400, 0x3ff4cccccccccccd, 0x0, 0x3, 0xc4207e5b20, 0x2c0, 0xc42041ef00)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203 +0x9c
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).syncUpdatePayload(0xc42043cb00, 0xc420481380, 0xc4205511a0, 0x3b, 0xc4205511a0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync.go:33 +0x749
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).sync(0xc42043cb00, 0xc4203d4440, 0x32, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:243 +0x49a
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).(github.com/openshift/cluster-version-operator/pkg/cvo.sync)-fm(0xc4203d4440, 0x32, 0xc4203e3b00, 0x10e38c0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:115 +0x3e
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).processNextWorkItem(0xc42043cb00, 0xc4203d2800)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:173 +0xe0
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).worker(0xc42043cb00)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:162 +0x2b
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).(github.com/openshift/cluster-version-operator/pkg/cvo.worker)-fm()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:146 +0x2a
github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42026fae0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42026fae0, 0x3b9aca00, 0x0, 0x1, 0xc42008c900)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42026fae0, 0x3b9aca00, 0xc42008c900)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:146 +0x1d0

Unable to add whitelist annotation to cli downloads route

We would like to be able to restrict access by IP to the downloads route in the openshift-console project using the IP whitelist annotation.

This currently does not appear to be possible, as any update that adds the required annotation is reverted by the cluster-version-operator.

getOverrideForManifest does not check manifest.GVK.Group

We have the following override in our ClusterVersion:

    - group: imageregistry.operator.openshift.io
      kind: Config
      name: cluster
      namespace: ""
      unmanaged: true

This is causing cluster provisioning to fail, because when the operator encounters this manifest...

0000_30_config-operator_01_operator.cr.yaml

apiVersion: operator.openshift.io/v1
kind: Config
metadata:
  name: cluster
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
spec:
  managementState: Managed

... the getOverrideForManifest function is improperly matching it to the above imageregistry.operator.openshift.io override because it disregards the Group in its comparison (imageregistry.operator.openshift.io != operator.openshift.io).

As a result, the cluster-config-operator has no custom resource to act on and it blocks the cluster-version-operator from ever completing:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          3h18m   Working towards 4.9.7: 725 of 735 done (98% complete), waiting on config-operator
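For illustration, a group-aware comparison could look roughly like the sketch below. This is not the repository's actual getOverrideForManifest implementation, and the manifest fields are simplified assumptions.

package example

import (
	configv1 "github.com/openshift/api/config/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// manifestRef is a simplified stand-in for the fields CVO knows about a manifest.
type manifestRef struct {
	GVK       schema.GroupVersionKind
	Namespace string
	Name      string
}

// overrideFor returns the matching override, if any, comparing the Group as
// well as Kind, Namespace and Name, which is what this issue asks for.
func overrideFor(overrides []configv1.ComponentOverride, m manifestRef) *configv1.ComponentOverride {
	for i, o := range overrides {
		if o.Group == m.GVK.Group &&
			o.Kind == m.GVK.Kind &&
			o.Namespace == m.Namespace &&
			o.Name == m.Name {
			return &overrides[i]
		}
	}
	return nil
}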

openshift/api <-> openshift/cluster-version-operator ClusterStatusConditionType mismatch

What happened

When CVO reports a failing cluster on v4.4, it returns a Failing condition, e.g.:

conditions:
    - lastTransitionTime: "2020-08-12T07:08:36Z"
      message: Done applying 4.4.10
      status: "True"
      type: Available
    - lastTransitionTime: "2020-08-12T06:53:47Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-08-12T07:08:36Z"
      message: Cluster version is 4.4.10
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-08-12T07:08:57Z"
      message: The update channel has not been configured.
      reason: NoChannel
      status: "False"
      type: RetrievedUpdates

This is an expected state related to the code: https://github.com/openshift/cluster-version-operator/blob/release-4.4/pkg/cvo/status.go#L240

However, openshift/api expects one of the condition types defined at https://github.com/openshift/api/blob/release-4.4/config/v1/types_cluster_operator.go#L141:

	OperatorAvailable ClusterStatusConditionType = "Available"
	OperatorProgressing ClusterStatusConditionType = "Progressing"
	OperatorDegraded ClusterStatusConditionType = "Degraded"
	OperatorUpgradeable ClusterStatusConditionType = "Upgradeable"

What you expected to happen

openshift/cluster-version-operator and openshift/api to have matching conditions defined.

Server Forbidden Updates To This Resource

Not sure what I've got here but this is a CI failure bringing up a cluster, from the CVO pod logs:

E0213 17:40:08.957050       1 task.go:57] error running apply for service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): services "controller-manager-service" is forbidden: caches not synchronized
E0213 17:40:28.980158       1 task.go:57] error running apply for service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): services "controller-manager-service" is forbidden: caches not synchronized
I0213 17:40:38.501814       1 leaderelection.go:209] successfully renewed lease openshift-cluster-version/version
I0213 17:40:40.483182       1 reflector.go:286] github.com/openshift/cluster-version-operator/vendor/github.com/openshift/client-go/config/informers/externalversions/factory.go:101: forcing resync
E0213 17:40:51.996136       1 task.go:57] error running apply for service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): services "controller-manager-service" is forbidden: caches not synchronized
I0213 17:40:51.996197       1 task_graph.go:438] No more reachable nodes in graph, continue
I0213 17:40:51.996203       1 task_graph.go:474] No more work
I0213 17:40:51.996221       1 task_graph.go:494] No more work for 3
I0213 17:40:51.996227       1 task_graph.go:494] No more work for 6
I0213 17:40:51.996234       1 task_graph.go:494] No more work for 7
I0213 17:40:51.996240       1 task_graph.go:494] No more work for 1
I0213 17:40:51.996246       1 task_graph.go:494] No more work for 4
I0213 17:40:51.996252       1 task_graph.go:494] No more work for 2
I0213 17:40:51.996252       1 task_graph.go:494] No more work for 0
I0213 17:40:51.996257       1 task_graph.go:494] No more work for 5
I0213 17:40:51.996277       1 task_graph.go:510] Workers finished
I0213 17:40:51.996290       1 task_graph.go:518] Result of work: [Could not update service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): the server has forbidden updates to this resource]
E0213 17:40:51.996341       1 sync_worker.go:263] unable to synchronize image (waiting 3m19.747206386s): Could not update service "openshift-cloud-credential-operator/controller-manager-service" (84 of 273): the server has forbidden updates to this resource
I0213 17:40:51.996400       1 cvo.go:298] Started syncing cluster version "openshift-cluster-version/version" (2019-02-13 17:40:51.996393402 +0000 UTC m=+2487.354191867)
I0213 17:40:51.996446       1 cvo.go:326] Desired version from operator is v1.Update{Version:"0.0.1-2019-02-13-164905", Image:"registry.svc.ci.openshift.org/ci-op-girsxxlp/release@sha256:ded54f5fb7dfe10f53176ac710f6309b05828dc0aa276b448ce5aefc8e5eae78"}
I0213 17:40:51.996541       1 cvo.go:300] Finished syncing cluster version "openshift-cluster-version/version" (144.1µs)

More logs available here: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/31/pull-ci-openshift-cloud-credential-operator-master-e2e-aws/158

Upgrade jobs dead, clusterversion message unhelpful

Basically all our upgrade jobs are dead. The CVO should be able to say what isn't yet completed, but it just says "%x complete". We should try to make a reasonable message to aid debugging. The source error here:

all update jobs are basically dead

        {
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-06-29T12:41:31Z",
                "generation": 1,
                "name": "machine-config",
                "resourceVersion": "52233",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/machine-config",
                "uid": "38f06bd4-9a6b-11e9-b262-12ce5335583c"
            },
            "spec": {},
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2019-06-29T13:48:35Z",
                        "message": "Cluster not available for 0.0.1-2019-06-29-122423",
                        "status": "False",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2019-06-29T13:35:16Z",
                        "message": "Working towards 0.0.1-2019-06-29-122423",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-06-29T13:48:35Z",
                        "message": "Unable to apply 0.0.1-2019-06-29-122423: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)",
                        "reason": "FailedToSync",
                        "status": "True",
                        "type": "Degraded"
                    }
                ],
                "extension": {
                    "lastSyncError": "error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)",
                    "master": "pool is degraded because nodes fail with \"1 nodes are reporting degraded status on sync\": \"Node ip-10-0-132-28.ec2.internal is reporting: \\\"failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1\\\"\"",
                    "worker": "pool is degraded because nodes fail with \"1 nodes are reporting degraded status on sync\": \"Node ip-10-0-139-237.ec2.internal is reporting: \\\"failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1\\\"\""
                },

https://openshift-gce-devel.appspot.com/builds/origin-ci-test/pr-logs/directory/pull-ci-openshift-origin-master-e2e-aws-upgrade?before=2882

first two I looked at were the machine config operator

CVO cannot rely on openshift resources

The CVO is creating an SCC for itself. This fails when installing on a kube-cluster and since the openshift-apiserver is created via an operator installed by the CVO, this creates a cycle.

Instead, create a clusterrole and clusterrolebinding for the SCC that will eventually exist.
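A sketch of that pattern, granting use of a named SCC via RBAC instead of creating the SCC itself (illustrative only; the names are placeholders and this is not the actual CVO manifest):

package example

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sccUseRBAC builds a ClusterRole/ClusterRoleBinding pair that lets the given
// service account "use" an SCC that may only exist once openshift-apiserver is up.
func sccUseRBAC(sccName, serviceAccount, namespace string) (*rbacv1.ClusterRole, *rbacv1.ClusterRoleBinding) {
	role := &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "use-" + sccName},
		Rules: []rbacv1.PolicyRule{{
			APIGroups:     []string{"security.openshift.io"},
			Resources:     []string{"securitycontextconstraints"},
			ResourceNames: []string{sccName},
			Verbs:         []string{"use"},
		}},
	}
	binding := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "use-" + sccName},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     role.Name,
		},
		Subjects: []rbacv1.Subject{{
			Kind:      rbacv1.ServiceAccountKind,
			Name:      serviceAccount,
			Namespace: namespace,
		}},
	}
	return role, binding
}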

CVO needs to run without the service network being functional

The CVO needs to be able to install the operators that take care of creating a viable service network. To do this, it could run with hostNetwork=true and cleverly set KUBERNETES_SERVICE_PORT and KUBERNETES_SERVICE_HOST to point to the local kube-apiserver.

If we did this, I think it could come up after the apiserver, controller, and scheduler, and before anything else. I bumped into this while trying to get CVO running on a kube control plane.

@mfojtik @knobunc @ironcladlou @abhinavdahiya @smarterclayton @derekwaynecarr
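A minimal sketch of the hostNetwork idea in Go follows; it is illustrative only, and the image, addresses, and port are placeholders rather than the actual bootstrap pod definition.

package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hostNetworkCVOPod sketches a CVO pod that runs on the host network and talks
// to the local kube-apiserver directly, bypassing the service network.
func hostNetworkCVOPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "cluster-version-operator",
			Namespace: "openshift-cluster-version",
		},
		Spec: corev1.PodSpec{
			HostNetwork: true,
			Containers: []corev1.Container{{
				Name:  "cluster-version-operator",
				Image: "example.invalid/cluster-version-operator:latest", // placeholder
				Env: []corev1.EnvVar{
					// Point the in-cluster client at the local apiserver instead
					// of the (not yet functional) service network address.
					{Name: "KUBERNETES_SERVICE_HOST", Value: "127.0.0.1"},
					{Name: "KUBERNETES_SERVICE_PORT", Value: "6443"},
				},
			}},
		},
	}
}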

pkg/cvo: sync_worker reports incorrect metrics on Complete()

Referring to this code:

metricPayload.WithLabelValues(r.version, "pending").Set(float64(r.total))
metricPayload.WithLabelValues(r.version, "applied").Set(float64(r.total))

Assuming the sync went correctly this could look like:

metricPayload.WithLabelValues(r.version, "pending").Set(float64(r.total-r.done)) 
metricPayload.WithLabelValues(r.version, "applied").Set(float64(r.done)) 

which should be the same as:

metricPayload.WithLabelValues(r.version, "pending").Set(float64(0)) 
metricPayload.WithLabelValues(r.version, "applied").Set(float64(r.total)) 

I suggest putting it to the test and using the former.
