Bug Report
Deployed a dev workload cluster on vSphere - 1 control plane, 1 worker node.
Scaled worker nodes from 1 to 3. Success!
Scaled control plane nodes from 1 to 3. Failure!
What I observed:
- The second control plane node is successfully cloned, powered on, and receives an IP address via DHCP.
- The original control plane node appears to lose its network configuration (both the VM IP and the VIP for the K8s API server), as observed in the vSphere client UI.
- The K8s API server is no longer reachable via kubectl.
- CPU usage on the original control plane node/VM triggers a vSphere alarm (4.774 GHz used).
Switched to the management cluster context to look at the logs:
% kubectl logs capi-kubeadm-control-plane-controller-manager-5596569b-q6rxz -n capi-kubeadm-control-plane-system manager
I0705 08:24:34.446929 1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "Desired"=3 "Existing"=1
I0705 08:24:34.923501 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:24:35.396995 1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "Desired"=3 "Existing"=2
I0705 08:24:35.399412 1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "failures"="[machine workload-control-plane-gpgp6 does not have APIServerPodHealthy condition, machine workload-control-plane-gpgp6 does not have ControllerManagerPodHealthy condition, machine workload-control-plane-gpgp6 does not have SchedulerPodHealthy condition, machine workload-control-plane-gpgp6 does not have EtcdPodHealthy condition, machine workload-control-plane-gpgp6 does not have EtcdMemberHealthy condition]"
.
.
.
I0705 08:26:26.708159 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:27:27.066038 1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "Desired"=3 "Existing"=2
I0705 08:27:27.066330 1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "failures"="[machine workload-control-plane-kvj7j reports APIServerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports ControllerManagerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports SchedulerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports EtcdPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports EtcdMemberHealthy condition is unknown (Failed to get the node which is hosting the etcd member), machine workload-control-plane-gpgp6 reports APIServerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports ControllerManagerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports SchedulerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports EtcdPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports EtcdMemberHealthy condition is unknown (Failed to get the node which is hosting the etcd member)]"
I0705 08:27:42.486017 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="vcsa06-octoc" "kubeadmControlPlane"="vcsa06-octoc-control-plane" "namespace"="tkg-system"
I0705 08:27:57.078706 1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded"
I0705 08:27:57.117486 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:28:57.168628 1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded"
E0705 08:28:57.187424 1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="cannot get remote client to workload cluster: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)" "controller"="kubeadmcontrolplane" "name"="workload-control-plane" "namespace"="default"
I0705 08:28:57.188000 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:29:57.225857 1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
E0705 08:29:57.227366 1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="cannot get remote client to workload cluster: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded" "controller"="kubeadmcontrolplane" "name"="workload-control-plane" "namespace"="default"
I0705 08:29:57.227913 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:30:52.225222 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="vcsa06-octoc" "kubeadmControlPlane"="vcsa06-octoc-control-plane" "namespace"="tkg-system"
I0705 08:30:57.267482 1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded"
E0705 08:30:57.268704 1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="cannot get remote client to workload cluster: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" "controller"="kubeadmcontrolplane" "name"="workload-control-plane" "namespace"="default"
I0705 08:30:57.269114 1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
To try to regain access to the cluster, I reset the original control plane node/VM via the vSphere client. After the reboot the node regained its network configuration, and a few minutes later I could reach the API server again.
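For reference, the same hard reset can be issued from the command line with govc (the govmomi CLI) instead of the vSphere client UI. This is a hedged sketch: it assumes GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD are already exported, and that the VM name matches the node name as it does in this cluster.

```shell
# Hypothetical CLI equivalent of the vSphere client reset, using govc.
# The VM name below is the original control plane node from this report.
if command -v govc >/dev/null 2>&1; then
  govc vm.power -reset workload-control-plane-kvj7j
else
  echo "govc not installed; reset the VM via the vSphere client instead"
fi
```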
However, the control plane is still not reconciled:
% kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
workload-control-plane-gpgp6 NotReady <none> 38m v1.20.4+vmware.1 10.27.51.25 10.27.51.25 VMware Photon OS/Linux 4.19.174-5.ph3 containerd://1.4.3
workload-control-plane-kvj7j Ready control-plane,master 2d20h v1.20.4+vmware.1 10.27.51.61 10.27.51.61 VMware Photon OS/Linux 4.19.174-5.ph3 containerd://1.4.3
workload-md-0-984748884-g5884 Ready <none> 2d20h v1.20.4+vmware.1 10.27.51.63 10.27.51.63 VMware Photon OS/Linux 4.19.174-5.ph3 containerd://1.4.3
workload-md-0-984748884-jtg7q Ready <none> 2d20h v1.20.4+vmware.1 10.27.51.64 10.27.51.64 VMware Photon OS/Linux 4.19.174-5.ph3 containerd://1.4.3
workload-md-0-984748884-pvlnq Ready <none> 2d20h v1.20.4+vmware.1 10.27.51.62 10.27.51.62 VMware Photon OS/Linux 4.19.174-5.ph3 containerd://1.4.3
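To dig into why the gpgp6 node stays NotReady (and has no control-plane role label), a diagnostic sketch like the following may help. These are hypothetical next steps, assuming the current kubeconfig context points at the workload cluster.

```shell
# Hypothetical diagnostics for the NotReady node.
NODE=workload-control-plane-gpgp6
if command -v kubectl >/dev/null 2>&1; then
  # Conditions and Taints often explain NotReady (e.g. CNI not ready, or an
  # uninitialized cloud-provider taint the vSphere CPI should have removed)
  kubectl describe node "$NODE"
  # Check whether the system pods scheduled on this node are actually running
  kubectl -n kube-system get pods -o wide --field-selector spec.nodeName="$NODE"
else
  echo "kubectl not available; run these against the workload cluster"
fi
```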
The kubelet on the new node also reports an issue with the CSI driver, though I cannot tell whether this is the root cause:
% ssh [email protected]
Last login: Mon Jul 5 08:44:34 2021 from 10.30.3.96
08:52:54 up 27 min, 0 users, load average: 0.30, 0.34, 0.19
tdnf update info not available yet!
capv@workload-control-plane-gpgp6 [ ~ ]$ sudo su -
root@workload-control-plane-gpgp6 [ ~ ]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2021-07-05 08:28:52 UTC; 24min ago
Docs: https://kubernetes.io/docs/home/
Main PID: 2941 (kubelet)
Tasks: 16 (limit: 4714)
Memory: 46.1M
CGroup: /system.slice/kubelet.service
└─2941 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cloud-provider=external --container-runtime=remote --container-runtime-endpoint=/var/run/containerd/containerd.sock --tls-ciphe
r-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 --pod-infra-container-image=projects.registry.vmware.com/tkg/paus
e:3.2
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.152310 2941 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/csi.vsphere.vmware.com-reg.sock <nil> 0 <nil>}] <nil> <nil>}
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.152321 2941 clientconn.go:948] ClientConn switching balancer to "pick_first"
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153041 2941 csi_plugin.go:100] kubernetes.io/csi: Trying to validate a new CSI Driver with name: csi.vsphere.vmware.com endpoint: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock versions: 1.0.0
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153070 2941 csi_plugin.go:113] kubernetes.io/csi: Register new plugin with name: csi.vsphere.vmware.com at endpoint: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153104 2941 clientconn.go:106] parsed scheme: ""
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153113 2941 clientconn.go:106] scheme "" not registered, fallback to default scheme
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153146 2941 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock <nil> 0 <nil>}] <nil> <nil>}
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153154 2941 clientconn.go:948] ClientConn switching balancer to "pick_first"
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153185 2941 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
Jul 05 08:50:31 workload-control-plane-gpgp6 kubelet[2941]: E0705 08:50:31.970708 2941 nodeinfomanager.go:574] Invalid attach limit value 0 cannot be added to CSINode object for "csi.vsphere.vmware.com"
I have reproduced this scenario twice, on two different TKG clusters on vSphere.
Expected Behavior
The control plane should scale seamlessly from 1 to 3 nodes, without the original node losing its network configuration.
Steps to Reproduce the Bug
- Deploy a dev workload cluster with a single control plane node
- Scale the control plane to 3 nodes
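The scale step above can be sketched with the tanzu CLI. This is a hedged example: `--controlplane-machine-count` is the flag name in tanzu CLI releases of this era (v0.5.0 appears in the environment details below), but it should be verified against the installed version.

```shell
# Sketch of the reproduction step; assumes a cluster named "workload" already
# exists with one control plane node, as in this report.
if command -v tanzu >/dev/null 2>&1; then
  tanzu cluster scale workload --controlplane-machine-count 3
else
  echo "tanzu CLI not installed"
fi
```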
Environment Details
v0.5.0
version: v1.3.0
buildDate: 2021-06-03
sha: b261a8b
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4+vmware.1", GitCommit:"d475bbd9e7cd66c6db7069cb447766daada65e3b", GitTreeState:"clean", BuildDate:"2021-02-22T22:15:46Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
- Operating System (client):
macOS Big Sur version 11.4