A trunk initialization conflict was observed while the controller was adding a node and initializing its trunk resources. It looks like the node first failed to initialize for some reason and was removed by the controller. During a later attempt to add the node, the trunk interface was successfully initialized and added to the cache. Afterwards, another attempt tried to create and add a trunk interface for the same node again. Since the cache uses the node name as the key for the trunk resource, that attempt was rejected and the node failed to initialize. Because the trunk record is not removed from the cache, reconciliation keeps failing to initialize the node's resources, and as a result no pods using the SGP feature can be created on the node.
Observed Behavior:
Pods using the Security Groups for Pods feature cannot be created because of the following error in the VPC CNI log:
{"level":"info","ts":"2022-07-12T21:59:50.620Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: failed to get Branch ENI resource"}
The node reports the correct capacity and allocatable branch ENIs, and has a trunk ENI attached.
The failing pods have the branch ENI request/limit added but are missing the branch interface network annotation.
Expected Behavior:
The node should be added to the managed pool successfully even if some type of race/conflict occurs. The node's trunk/branch interface resources should be initialized successfully. Pods scheduled to the node should have their branch network resource annotation added successfully.
How to reproduce it (as minimally and precisely as possible):
N/A. This could be caused by a rare race condition between goroutines or by incorrect fallback handling when a conflict is encountered.
Additional Context:
{"level":"info","timestamp":"2022-07-11T13:09:28.873Z","logger":"node validation webhook","msg":"update request received from aws-node","node":"ip-x-x-x-x.us-west-2.compute.internal"}
{"level":"info","timestamp":"2022-07-11T17:58:28.121Z","logger":"node manager","msg":"node removed from data store","node name":"ip-x-x-x-x.us-west-2.compute.internal","request":"delete"}
{"level":"info","timestamp":"2022-07-11T17:58:28.122Z","logger":"controllers.Node","msg":"deleted the node from manager","node":"/ip-x-x-x-x.us-west-2.compute.internal"}
{"level":"info","timestamp":"2022-07-11T18:05:34.315Z","logger":"controllers.Node","msg":"adding node","node":"/ip-x-x-x-x.us-west-2.compute.internal"}
{"level":"info","timestamp":"2022-07-11T18:05:57.309Z","logger":"node manager","msg":"node was previously un-managed, will be added as managed node now","node name":"ip-x-x-x-x.us-west-2.compute.internal","request":"update"}
{"level":"error","timestamp":"2022-07-11T18:38:28.359Z","logger":"node manager","msg":"removing the node from cache as it failed to initialize","node":"ip-x-x-x-x.us-west-2.compute.internal","operation":"Init","error":"failed to load instance details: failed to find instance i-xxxxxxxxxx details from EC2 API","stacktrace":"github.com/aws/amazon-vpc-resource-controller-k8s/pkg/node/manager.(*manager).performAsyncOperation\n\t/workspace/pkg/node/manager/manager.go:307\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/worker.(*worker).processNextItem\n\t/workspace/pkg/worker/worker.go:147\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/worker.(*worker).runWorker\n\t/workspace/pkg/worker/worker.go:132"}
{"level":"info","timestamp":"2022-07-11T18:41:20.846Z","logger":"controllers.Node","msg":"adding node","node":"/ip-x-x-x-x.us-west-2.compute.internal"}
{"level":"info","timestamp":"2022-07-11T18:41:20.846Z","logger":"node manager","msg":"node added as a managed node","node name":"ip-x-x-x-x.us-west-2.compute.internal","request":"add"}
{"level":"info","timestamp":"2022-07-11T18:42:37.930Z","logger":"node manager.node resource handler","msg":"node is not initialized yet, will not advertise the capacity","node name":"ip-x-x-x-x.us-west-2.compute.internal"}
{"level":"info","timestamp":"2022-07-11T18:42:42.470Z","logger":"branch eni provider","msg":"created a new trunk interface","node name":"ip-x-x-x-x.us-west-2.compute.internal","request":"initialize","instance ID":{},"trunk id":"eni-xxxxxxxxx"}
{"level":"info","timestamp":"2022-07-11T18:42:42.470Z","logger":"branch eni provider","msg":"trunk added to cache successfully","node":"ip-x-x-x-x.us-west-2.compute.internal"}
{"level":"error","timestamp":"2022-07-11T18:42:53.725Z","logger":"branch eni provider","msg":"trunk already exist in cache","node":"ip-x-x-x-x.us-west-2.compute.internal","error":"trunk eni already exist in cache","stacktrace":"github.com/aws/amazon-vpc-resource-controller-k8s/pkg/provider/branch.(*branchENIProvider).addTrunkToCache\n\t/workspace/pkg/provider/branch/provider.go:421\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/provider/branch.(*branchENIProvider).InitResource\n\t/workspace/pkg/provider/branch/provider.go:172\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/node.(*node).InitResources\n\t/workspace/pkg/node/node.go:130\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/node/manager.(*manager).performAsyncOperation\n\t/workspace/pkg/node/manager/manager.go:305\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/worker.(*worker).processNextItem\n\t/workspace/pkg/worker/worker.go:147\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/worker.(*worker).runWorker\n\t/workspace/pkg/worker/worker.go:132"}
{"level":"error","timestamp":"2022-07-11T18:42:53.725Z","logger":"node manager.node resource handler","msg":"failed to init resource","node name":"ip-x-x-x-x.us-west-2.compute.internal","error":"trunk eni already exist in cache","stacktrace":"github.com/aws/amazon-vpc-resource-controller-k8s/pkg/node.(*node).InitResources\n\t/workspace/pkg/node/node.go:145\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/node/manager.(*manager).performAsyncOperation\n\t/workspace/pkg/node/manager/manager.go:305\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/worker.(*worker).processNextItem\n\t/workspace/pkg/worker/worker.go:147\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/worker.(*worker).runWorker\n\t/workspace/pkg/worker/worker.go:132"}
{"level":"error","timestamp":"2022-07-11T18:42:53.725Z","logger":"node manager","msg":"removing the node from cache as it failed to initialize","node":"ip-x-x-x-x.us-west-2.compute.internal","operation":"Init","error":"failed to init resources: trunk eni already exist in cache","stacktrace":"github.com/aws/amazon-vpc-resource-controller-k8s/pkg/node/manager.(*manager).performAsyncOperation\n\t/workspace/pkg/node/manager/manager.go:307\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/worker.(*worker).processNextItem\n\t/workspace/pkg/worker/worker.go:147\ngithub.com/aws/amazon-vpc-resource-controller-k8s/pkg/worker.(*worker).runWorker\n\t/workspace/pkg/worker/worker.go:132"}