
Comments (24)

andyzhangx commented on July 20, 2024

@Rob19999 we will pin the version so that if you upgrade to v1.29.5 or v1.30.0 the fixed driver version is still there. This process would take one or two days, stay tuned.

from azurefile-csi-driver.

andyzhangx commented on July 20, 2024

Thanks for raising this issue. Removing the SMB mapping is still essential; I think we could cache the <local path, remote path> mapping in GetRemoteServerFromTarget, which would avoid running the same PowerShell commands over and over. Running PowerShell commands inside the Windows driver is really expensive.

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

Thank you for the quick fix. Is there a way we can opt in early to the v1.30.3 release, or go back to v1.30.1? Currently we're having major issues with this. Usually it takes around 4-6 weeks before changes like that become available for our region.

from azurefile-csi-driver.

andyzhangx commented on July 20, 2024

@Rob19999 v1.29.6 also fixes the issue; we are going to roll out v1.29.6 next month.
Just email me the config if you want to make a quick fix on your cluster, thx.

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

We're deployed in WestEU and usually we're later on the rollout roadmap. I'm trying to puzzle out what release this change would be in. Currently we're on v20240513. v20240609 (https://github.com/Azure/AKS/releases/tag/2024-06-09) is being rolled out at the moment in WestEU, but this does not yet contain the fix. A new release has not been announced yet, and usually if there is no release announced it will take at least 4-6 weeks.

The csi driver v1.30.2 was introduced in https://github.com/Azure/AKS/releases/tag/2024-05-13

I'm unsure what config you would like me to send. Currently our cluster is on Kubernetes version 1.29.4, AKS v2024051, but we have no way of choosing the csi-driver version during creation or via an update command as far as I'm aware. I will raise this question with Microsoft support given AKS is a managed service, but I'm afraid it's pinned to the next vxxxxxx version that includes this driver version.

from azurefile-csi-driver.

andyzhangx commented on July 20, 2024

@Rob19999 the azure file csi driver is managed by AKS, and we have a way in the backend to pin your csi driver version to the fixed patch version if you want; otherwise you need to wait a few weeks.

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

I would love that, given the issues we have. We're more than willing to test the new version for you. If that still causes issues we could always go back to v1.30.1.

I assume I can raise a support request for this through the Microsoft portal?

from azurefile-csi-driver.

andyzhangx commented on July 20, 2024

> I would love that, given the issues we have. We're more than willing to test the new version for you. If that still causes issues we could always go back to v1.30.1.
>
> I assume I can raise a support request for this through the Microsoft portal?

@Rob19999 that also works, but it would go through a process and take time.

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

I understand. What is the easiest way to get this rolling? Me sending you our cluster details at your MS email ([email protected])? Then I can also do it from my corporate mail address for validation.

from azurefile-csi-driver.

andyzhangx commented on July 20, 2024

> I understand. What is the easiest way to get this rolling? Me sending you our cluster details at your MS email ([email protected])? Then I can also do it from my corporate mail address for validation.

nvm, I got your cluster now: aks-prd-rgxxxc5.hcp.westeurope.azmk8s.io. If you want to mitigate other clusters, just email me, thx.

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

Thank you. I will give it some time to propagate. A new node I added to an existing node pool still pulled mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.2-windows-hp.

Will the pinning of the version disappear when we upgrade the cluster, or do we need to reach out to you to get this changed?

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

Nothing has changed for us yet regarding the csi driver version. If I understand correctly, you have now pinned it to AKS version 1.29.5. Yesterday Release 2024-06-09 became available to use and we installed that, but this version does not contain AKS 1.29.5. A newer version has not yet been announced, and we don't know if that will contain 1.29.5; even then it will take 4-6 weeks before the rollout is complete in our region.

Am I right in assuming we just need to wait for 1.29.5 to become available?

from azurefile-csi-driver.

andyzhangx commented on July 20, 2024

> Nothing has changed for us yet regarding the csi driver version. If I understand correctly, you have now pinned it to AKS version 1.29.5. Yesterday Release 2024-06-09 became available to use and we installed that, but this version does not contain AKS 1.29.5. A newer version has not yet been announced, and we don't know if that will contain 1.29.5; even then it will take 4-6 weeks before the rollout is complete in our region.
>
> Am I right in assuming we just need to wait for 1.29.5 to become available?

@Rob19999 pls check again, the azurefile-csi:v1.30.3-windows-hp image is deployed on your cluster now.

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

I see it now. Thanks for the support.

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

Good day. We have had the driver running for 3 days now and unfortunately we're still experiencing the same issue.

To test the issue in a more controlled manner we split our deployment over several node pools and started making changes in the way we use PVCs. What did seem to help is phasing out the dynamic PVCs we create with each deployment and instead creating one static (premium storage) PVC a day and using that. While this is workable, and we haven't had issues on this pool for over a week, it is not how it should work.

Our generic load is around 150-175 Helm releases a day (deploy/delete) with 1400-2000 deployments, mostly having one pod, where each release has its own dynamic PVC and the pods have a persistentVolumeClaim against azurefile-csi with 1Gi of storage, with some deployments using 30Gi or 100Gi. With the PVC change we moved around 40% of this load to a different node pool that has been stable for around a week now. The other node pool still had 3-4 nodes a day dying off.

We also tested with smaller pod counts per node (85 pods, 65 pods, etc.). This does not seem to lessen the issue.

We're now working on changing all our workloads to use as few PVCs as possible, which we pre-create each day. Other PVCs were already more permanent.

While we have a workaround now, I would still like to assist Microsoft in a more permanent fix, not just for our workload but also for other/future customers of AKS.

Is there anything we can do to help resolve the issue in the driver? I can imagine we generate a big load with our setup, but I also feel Windows should be able to handle this, given it relies on basic SMB functionality that works on any file server, where these numbers of SMB connections are not very large.

Thanks in advance.

from azurefile-csi-driver.

andyzhangx commented on July 20, 2024

@Rob19999 could you share the csi driver logs on that node again? What's the csi driver version on that node, and how many PVs are mounted on that node in total?

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

Needed to wait for a crash; here is the information. If it's easier we can also go on a call sometime. We can save a node for investigation. I redacted all relevant cluster information. To be sure, is there a way to mark this message as internal?

I think the user already removed some of the pods after I ran this command.
kubectl exec -n kube-system -it csi-azurefile-node-win-n6dgd -- powershell
get-smbglobalmapping

Disconnected            \\<redacted>.file.core.windows.net\pvc-44150571-38f1-458e-b438-7919a8353018
Disconnected            \\<redacted>.file.core.windows.net\pvc-a02a46f7-7325-4a51-bbd5-138dae704523
OK                      \\<redacted>.file.core.windows.net\pvc-9a3cacab-3290-42e2-982c-6555c6587df2
Disconnected            \\<redacted>.file.core.windows.net\pvc-7e437168-3a59-4801-b8ab-5f5261e7d29d
Disconnected            \\<redacted>.file.core.windows.net\pvc-36f51116-7143-4454-a7f9-0062a08b3e29
Disconnected            \\<redacted>.file.core.windows.net\pvc-d286d91a-7eab-4965-acb1-d45452bce160
OK                      \\<redacted>.file.core.windows.net\pvc-81161d5d-28b5-46f7-bdd7-9cc119496b25
OK                      \\<redacted>.file.core.windows.net\pvc-105a612d-2207-4202-8fe2-f452649159a5
Disconnected            \\<redacted>.file.core.windows.net\pvc-71aad805-b3a4-4c3d-baff-a7cdfaeafdac
Disconnected            \\<redacted>.file.core.windows.net\pvc-3c7f8f21-eb63-4460-9c6a-d6cbac6582dd
Disconnected            \\<redacted>.file.core.windows.net\pvc-558def79-d041-4263-a49c-8c021d679f96
Disconnected            \\<redacted>.file.core.windows.net\pvc-306793ef-5728-43dd-9ad5-fc7997c5c328
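
As an aside, a quick way to tally these mappings by state (run from the same kubectl exec PowerShell session as above; just an illustrative one-liner, not something the driver itself runs) is:

```powershell
# Count SMB global mappings per state; 'Disconnected' entries are stale mounts left behind on the node
Get-SmbGlobalMapping | Group-Object -Property Status | Select-Object Name, Count
```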

kubectl describe pod csi-azurefile-node-win-mzb8k -n kube-system

Name:                 csi-azurefile-node-win-mzb8k
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      csi-azurefile-node-sa
Node:                 aksnpwin200000f/10.241.13.203
Start Time:           Mon, 08 Jul 2024 17:37:34 +0200
Labels:               app=csi-azurefile-node-win
                      controller-revision-hash=7b6b5fd8fb
                      kubernetes.azure.com/managedby=aks
                      pod-template-generation=36
Annotations:          <none>
Status:               Running
SeccompProfile:       RuntimeDefault
IP:                   10.241.13.203
IPs:
  IP:           10.241.13.203
Controlled By:  DaemonSet/csi-azurefile-node-win
Init Containers:
  init:
    Container ID:  containerd://77ab7ccdef01617fb3121f003ea19ead88691fbbea44c4ed18ae0242811fade1
    Image:         mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.3-windows-hp
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi@sha256:30ce602c8928227e3eafe766c99ae970a8dc9eb6dc6a82ed82982bbf7093ac1d
    Port:          <none>
    Host Port:     <none>
    Command:
      powershell.exe
      -c
      New-Item -ItemType Directory -Path C:\var\lib\kubelet\plugins\file.csi.azure.com\ -Force
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 09 Jul 2024 08:39:08 +0200
      Finished:     Tue, 09 Jul 2024 08:39:09 +0200
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ws59v (ro)
Containers:
  node-driver-registrar:
    Container ID:  containerd://734ff0ccfb7db13d1db6a8d24f49563ff9a020505e6571b85a75da2ef0ec1426
    Image:         mcr.microsoft.com/oss/kubernetes-csi/csi-node-driver-registrar:v2.10.1
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/csi-node-driver-registrar@sha256:b3bbd7a6171bff15eeefd137316fa16415aa6a4c817e5ec609662793093b3526
    Port:          <none>
    Host Port:     <none>
    Command:
      csi-node-driver-registrar.exe
    Args:
      --csi-address=$(CSI_ENDPOINT)
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
      --plugin-registration-path=$(PLUGIN_REG_DIR)
      --v=2
    State:          Running
      Started:      Tue, 09 Jul 2024 08:39:14 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 08 Jul 2024 17:37:47 +0200
      Finished:     Tue, 09 Jul 2024 08:38:54 +0200
    Ready:          True
    Restart Count:  1
    Limits:
      memory:  150Mi
    Requests:
      cpu:     40m
      memory:  40Mi
    Environment:
      KUBERNETES_SERVICE_HOST:       <redacted>.hcp.westeurope.azmk8s.io
      KUBERNETES_PORT:               tcp://<redacted>.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://<redacted>.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP_ADDR:  <redacted>.hcp.westeurope.azmk8s.io
      CSI_ENDPOINT:                  unix://C:\\var\\lib\\kubelet\\plugins\\file.csi.azure.com\\csi.sock
      DRIVER_REG_SOCK_PATH:          C:\\var\\lib\\kubelet\\plugins\\file.csi.azure.com\\csi.sock
      PLUGIN_REG_DIR:                C:\\var\\lib\\kubelet\\plugins_registry\\
      KUBE_NODE_NAME:                 (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ws59v (ro)
  azurefile:
    Container ID:  containerd://b8b3ab85cd2c5efb74eef06039d88fb97130272f0725c4ab43fc2969ee70213c
    Image:         mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.3-windows-hp
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi@sha256:30ce602c8928227e3eafe766c99ae970a8dc9eb6dc6a82ed82982bbf7093ac1d
    Port:          <none>
    Host Port:     <none>
    Command:
      azurefileplugin.exe
    Args:
      --v=5
      --endpoint=$(CSI_ENDPOINT)
      --nodeid=$(KUBE_NODE_NAME)
      --enable-windows-host-process=true
    State:          Running
      Started:      Tue, 09 Jul 2024 08:39:15 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 08 Jul 2024 17:37:50 +0200
      Finished:     Tue, 09 Jul 2024 08:38:54 +0200
    Ready:          True
    Restart Count:  1
    Environment:
      KUBERNETES_SERVICE_HOST:       <redacted>.hcp.westeurope.azmk8s.io
      KUBERNETES_PORT:               tcp://<redacted>.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://<redacted>.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP_ADDR:  <redacted>.hcp.westeurope.azmk8s.io
      AZURE_CREDENTIAL_FILE:         C:\k\azure.json
      CSI_ENDPOINT:                  unix://C:\\var\\lib\\kubelet\\plugins\\file.csi.azure.com\\csi.sock
      KUBE_NODE_NAME:                 (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ws59v (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-ws59v:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoExecute op=Exists
                             :NoSchedule op=Exists
                             CriticalAddonsOnly op=Exists
Events:                      <none>

Logging:
See attachment. The node seems to have died around 13:00, although I am seeing timeouts a couple of hours before that.
csi-azurefile-node-win-mzb8.log

from azurefile-csi-driver.

daanroeterink commented on July 20, 2024

Hello,

Colleague of @Rob19999 here. What we notice when a node is going "dead" is that the command Get-SmbGlobalMapping takes a long time to respond, or doesn't respond at all. Do you know if the WMI layer that PowerShell uses does some sort of locking on the node? That would explain the seemingly random "timeouts" we see in the logging.
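
One way to check that on a suspect node (via the same host-process PowerShell session; only an illustrative snippet, not part of the driver) would be to time the call directly:

```powershell
# Time a single Get-SmbGlobalMapping call; on a healthy node this returns almost instantly
$elapsed = Measure-Command { Get-SmbGlobalMapping | Out-Null }
$count   = (Get-SmbGlobalMapping | Measure-Object).Count
"{0} mappings enumerated, call took {1:N1}s" -f $count, $elapsed.TotalSeconds
```

If the call itself hangs for minutes, that would support the theory that the WMI/CIM layer, rather than the driver, is what is blocking.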

from azurefile-csi-driver.

andyzhangx commented on July 20, 2024

I have disabled the SMB mount removal (set --remove-smb-mount-on-windows=false) in your cluster, could you check again? thx

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

Currently we're outside of working hours, so not much is happening on the cluster. The --remove-smb-mount-on-windows option was added after we made a support request at Microsoft (support ticket: 2403070050001502) to resolve the nodes breaking after a while, usually after 14 days or so, or when a node reached around 701 smbglobalshares.

Back then we got the error below. I will see if it returns or if I can force it with a certain number of deploys. Given the change we made on our end by reducing PVC mounts, it would be harder to reach this amount.

MountVolume.MountDevice failed for volume 'pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14' : rpc error: code = Internal desc = volume(mc_<redacted>-<redacted>_westeurope#fc7a964cdab3c4c3abd74c7#pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14###<redacted>) mount \\\\<redacted>.file.core.windows.net\\pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14 on \\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\file.csi.azure.com\\e0cef412d93e12842f12522496d30f960b5440e97bb4bc578a01da37d85dd7a1\\globalmount failed with NewSmbGlobalMapping failed. output: 'New-SmbGlobalMapping : Not enough memory resources are available to process this command. \\r\\nAt line:1 char:190\\r\\n+ ... ser, $PWord;New-SmbGlobalMapping -RemotePath $Env:smbremotepath -Cred ...\\r\\n+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\\r\\n + CategoryInfo : ResourceUnavailable: (MSFT_SmbGlobalMapping:ROOT/Microsoft/...mbGlobalMapping) [New-SmbG \\r\\n lobalMapping], CimException\\r\\n + FullyQualifiedErrorId : Windows System Error 8,New-SmbGlobalMapping\\r\\n \\r\\n', err: exit status 1

from azurefile-csi-driver.

andyzhangx commented on July 20, 2024

@Rob19999 so do you want to keep the remove-smb-mount-on-windows feature or not? Creating 701 smbglobalshares on one node is crazy. I will try to find out how to make the remove-smb-mount-on-windows feature use fewer resources; that will take time.

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

Let's keep remove-smb-mount-on-windows disabled for now. Without this functionality it was more stable for us. We will monitor the number of connections on the nodes, and when they reach 600 we will remove them.
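
Something along these lines (just a sketch we would run per Windows node from a host-process PowerShell session, not anything shipped with the driver) should cover that monitoring and cleanup:

```powershell
# Warn when the number of SMB global mappings approaches the level where nodes started failing,
# and clean up mappings that are already in the Disconnected state.
$mappings = Get-SmbGlobalMapping
$total    = @($mappings).Count
if ($total -ge 600) {
    Write-Warning "Node has $total SMB global mappings (threshold 600)"
}
$mappings |
    Where-Object { $_.Status -eq 'Disconnected' } |
    ForEach-Object { Remove-SmbGlobalMapping -RemotePath $_.RemotePath -Force }
```

We would of course first verify that removing Disconnected mappings does not break pods that still reference them.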

Just to be clear, since you mentioned resources: we got the 'New-SmbGlobalMapping : Not enough memory resources are available to process this command' error on v1.30.0, where remove-smb-mount-on-windows was not yet implemented. This error started showing up when we reached the 701 smbglobalshares. Most of the 701 connections were in a disconnected state back then due to them not being removed.

I can imagine that not many clusters reach 701 smbglobalshares; it depends on how often you upgrade your node images and how long your nodes keep running. But it can cause seemingly random node crashes.

from azurefile-csi-driver.

Rob19999 commented on July 20, 2024

We still have nodes crashing at the moment with SMB errors, yesterday one and today one. It's a lot better without the remove-smb-mount feature. We get this error without hitting 500 SMB mounts. I don't expect anything from you, but I just want to provide you with as much information as possible. See the full logs in the attachment.

The first error happened at:

```
06:05:49.377083 7780 utils.go:106] GRPC error: rpc error: code = Internal desc = volume(##pvc-d7735dfb-d4de-46ee-9087-b4b4f97f9be0###--suite) mount \.file.core.windows.net\pvc-d7735dfb-d4de-46ee-9087-b4b4f97f9be0 on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\8b7e8149f0c5af6863b83e652d281b9132a12a9ab63e55ad56bcbf18d14d2760\globalmount failed with NewSmbGlobalMapp failed. output: "", err: exit status 0xc0000142
Please refer to http://aka.ms/filemounterror for possible causes and solutions for mount errors.
```

[azurefile.log](https://github.com/user-attachments/files/16264155/azurefile.log)

from azurefile-csi-driver.
