Comments (24)
@Rob19999 we will pin the version; if you upgrade to v1.29.5 or v1.30.0, the fixed version is still there. This process would take one or two days, stay tuned.
from azurefile-csi-driver.
thanks for raising this issue. Removing the smb mapping is still essential; I think we could cache the <local path, remote path> mapping in GetRemoteServerFromTarget, which would avoid running the same powershell commands over and over. Running powershell commands inside the windows driver is really expensive.
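The caching idea could be sketched in Go (the driver's language) roughly like this. Everything below is illustrative: a simple in-process map keyed by target path, with `lookupRemoteServer` standing in for the expensive PowerShell query; the real `GetRemoteServerFromTarget` and its call sites look different.

```go
package main

import (
	"fmt"
	"sync"
)

// remoteServerCache memoizes target-path -> remote-server lookups so the
// driver only pays the PowerShell cost once per target path.
var remoteServerCache sync.Map

// lookupRemoteServer stands in for the expensive query (e.g. parsing
// Get-SmbGlobalMapping output) behind GetRemoteServerFromTarget.
func lookupRemoteServer(target string) string {
	return "smb-server-for-" + target // placeholder for the real query
}

// GetRemoteServerFromTargetCached returns the cached mapping when present,
// falling back to (and caching) the slow lookup otherwise.
func GetRemoteServerFromTargetCached(target string) string {
	if v, ok := remoteServerCache.Load(target); ok {
		return v.(string)
	}
	server := lookupRemoteServer(target)
	remoteServerCache.Store(target, server)
	return server
}

func main() {
	fmt.Println(GetRemoteServerFromTargetCached(`C:\var\lib\kubelet\pods\x\mount`)) // slow path
	fmt.Println(GetRemoteServerFromTargetCached(`C:\var\lib\kubelet\pods\x\mount`)) // cache hit
}
```

A `sync.Map` suits this write-once, read-many pattern; a plain map behind a mutex would also work. Cache invalidation on unmount is deliberately omitted here.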
Thank you for the quick fix. Is there a way we can opt in early to the v1.30.3 release, or go back to v1.30.1? Currently we're having major issues with this. Usually it takes around 4-6 weeks before changes like that become available for our region.
@Rob19999 v1.29.6 also fixes the issue, we are going to rollout v1.29.6 next month.
just email me the config if you want to make a quick fix on your cluster, thx
We're deployed in West EU and usually we're later on the rollout roadmap, so I'm trying to puzzle out what release this change would be in. Currently we're on v20240513. v20240609 (https://github.com/Azure/AKS/releases/tag/2024-06-09) is being rolled out in West EU at the moment, but it does not yet contain this fix. A new release has not been announced yet, and usually when no release is announced it takes at least 4-6 weeks.
The csi driver v1.30.2 was introduced in https://github.com/Azure/AKS/releases/tag/2024-05-13
I'm unsure what config you would like me to send. Currently our cluster is on Kubernetes version 1.29.4 (AKS v2024051), but as far as I'm aware we have no way of choosing the csi-driver version during creation or via an update command. I will raise this question with Microsoft support given AKS is a managed service, but I'm afraid it's pinned to the next vxxxxxx version that includes this driver version.
@Rob19999 the azure file csi driver is managed by aks, and we have a way in the backend to pin your csi driver version to the fixed patch version if you want; otherwise you need to wait a few weeks.
I would love that given the issues we have. We're more than willing to test the new version for you. If that still causes issues we could always go back to v1.30.1.
I assume I can raise a support request for this through the Microsoft portal?
@Rob19999 that also works, but it would go through a process and take time.
I understand. What is the easiest way to get this rolling? Me sending you our cluster details at your MS email address ([email protected])? Then I can also do it from my corporate mail address for validation.
nvm, I got your cluster now: aks-prd-rgxxxc5.hcp.westeurope.azmk8s.io, and if you want to mitigate other clusters, just email me, thx
Thank you. I will give it some time to propagate. A new node I added in an existing node pool still pulled mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.2-windows-hp.
Will the pinning of the version disappear when we upgrade the cluster, or do we need to reach out to you to get this changed?
Nothing has changed for us yet regarding the csi driver version. If I understand correctly you have now pinned it to AKS version 1.29.5. Yesterday Release 2024-06-09 became available and we installed it, but this version does not contain AKS 1.29.5. A newer release has not yet been announced, we don't know if it will contain 1.29.5, and even then it will take 4-6 weeks before the rollout completes in our region.
Am I right in assuming we just need to wait for 1.29.5 to become available?
@Rob19999 pls check again, azurefile-csi:v1.30.3-windows-hp image is deployed on your cluster now.
I see it now. Thanks for the support.
Good day. We have had the driver running for 3 days now, and unfortunately we're still experiencing the same issue.
To test the issue in a more controlled manner we split our deployment over several node pools and started making changes in the way we use PVCs. What did seem to help is phasing out the dynamic PVCs we create with each deployment and instead creating one static (premium storage) PVC a day and using that. While this is workable, and we haven't had issues on that pool for over a week, it is not how it should work.
Our generic load is around 150-175 Helm releases a day (deploy/delete) with 1400-2000 deployments, mostly having 1 pod, where each release has its own dynamic PVC and the pods have a persistentVolumeClaim to azurefile-csi with 1Gi of storage (some deploys have 30Gi or 100Gi). With the PVC change we moved around 40% of this load to a different node pool that has now been stable for around a week. The other node pool still had 3-4 nodes a day dying off.
We also tested with smaller pod counts on nodes (85 pods, 65 pods, etc.). This does not seem to lessen the issue.
We're now working on changing all our workloads to use as few PVCs as possible, pre-created each day. Other PVCs were already more permanent.
While we have a workaround now, I would still like to assist Microsoft with a more permanent fix, not just for our workload but also for possible other/future customers of AKS.
Is there anything we can do to help resolve the issue in the driver? I can imagine we cause a big load with our setup, but I also feel Windows should be able to handle this, given it relies on basic SMB functionality that also works on ordinary file servers where these numbers of SMB connections are not unusual.
Thanks in advance.
@Rob19999 could you share the csi driver logs on that node again? what's the csi driver version on that node? and how many PVs are mounted on that node in total?
I needed to wait for a crash; here is the information. If it's easier we can also get on a call sometime; we can save a node for investigation. I redacted all relevant cluster information. Just to be sure, is there a way to mark this message as internal?
I think the user had already removed some of the pods before I ran this command.
```
kubectl exec -n kube-system -it csi-azurefile-node-win-n6dgd -- powershell
Get-SmbGlobalMapping

Disconnected \\<redacted>.file.core.windows.net\pvc-44150571-38f1-458e-b438-7919a8353018
Disconnected \\<redacted>.file.core.windows.net\pvc-a02a46f7-7325-4a51-bbd5-138dae704523
OK           \\<redacted>.file.core.windows.net\pvc-9a3cacab-3290-42e2-982c-6555c6587df2
Disconnected \\<redacted>.file.core.windows.net\pvc-7e437168-3a59-4801-b8ab-5f5261e7d29d
Disconnected \\<redacted>.file.core.windows.net\pvc-36f51116-7143-4454-a7f9-0062a08b3e29
Disconnected \\<redacted>.file.core.windows.net\pvc-d286d91a-7eab-4965-acb1-d45452bce160
OK           \\<redacted>.file.core.windows.net\pvc-81161d5d-28b5-46f7-bdd7-9cc119496b25
OK           \\<redacted>.file.core.windows.net\pvc-105a612d-2207-4202-8fe2-f452649159a5
Disconnected \\<redacted>.file.core.windows.net\pvc-71aad805-b3a4-4c3d-baff-a7cdfaeafdac
Disconnected \\<redacted>.file.core.windows.net\pvc-3c7f8f21-eb63-4460-9c6a-d6cbac6582dd
Disconnected \\<redacted>.file.core.windows.net\pvc-558def79-d041-4263-a49c-8c021d679f96
Disconnected \\<redacted>.file.core.windows.net\pvc-306793ef-5728-43dd-9ad5-fc7997c5c328
```
kubectl describe pod csi-azurefile-node-win-mzb8k -n kube-system
Name: csi-azurefile-node-win-mzb8k
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: csi-azurefile-node-sa
Node: aksnpwin200000f/10.241.13.203
Start Time: Mon, 08 Jul 2024 17:37:34 +0200
Labels: app=csi-azurefile-node-win
controller-revision-hash=7b6b5fd8fb
kubernetes.azure.com/managedby=aks
pod-template-generation=36
Annotations: <none>
Status: Running
SeccompProfile: RuntimeDefault
IP: 10.241.13.203
IPs:
IP: 10.241.13.203
Controlled By: DaemonSet/csi-azurefile-node-win
Init Containers:
init:
Container ID: containerd://77ab7ccdef01617fb3121f003ea19ead88691fbbea44c4ed18ae0242811fade1
Image: mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.3-windows-hp
Image ID: mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi@sha256:30ce602c8928227e3eafe766c99ae970a8dc9eb6dc6a82ed82982bbf7093ac1d
Port: <none>
Host Port: <none>
Command:
powershell.exe
-c
New-Item -ItemType Directory -Path C:\var\lib\kubelet\plugins\file.csi.azure.com\ -Force
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 09 Jul 2024 08:39:08 +0200
Finished: Tue, 09 Jul 2024 08:39:09 +0200
Ready: True
Restart Count: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ws59v (ro)
Containers:
node-driver-registrar:
Container ID: containerd://734ff0ccfb7db13d1db6a8d24f49563ff9a020505e6571b85a75da2ef0ec1426
Image: mcr.microsoft.com/oss/kubernetes-csi/csi-node-driver-registrar:v2.10.1
Image ID: mcr.microsoft.com/oss/kubernetes-csi/csi-node-driver-registrar@sha256:b3bbd7a6171bff15eeefd137316fa16415aa6a4c817e5ec609662793093b3526
Port: <none>
Host Port: <none>
Command:
csi-node-driver-registrar.exe
Args:
--csi-address=$(CSI_ENDPOINT)
--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
--plugin-registration-path=$(PLUGIN_REG_DIR)
--v=2
State: Running
Started: Tue, 09 Jul 2024 08:39:14 +0200
Last State: Terminated
Reason: Unknown
Exit Code: 255
Started: Mon, 08 Jul 2024 17:37:47 +0200
Finished: Tue, 09 Jul 2024 08:38:54 +0200
Ready: True
Restart Count: 1
Limits:
memory: 150Mi
Requests:
cpu: 40m
memory: 40Mi
Environment:
KUBERNETES_SERVICE_HOST: <redacted>.hcp.westeurope.azmk8s.io
KUBERNETES_PORT: tcp://<redacted>.hcp.westeurope.azmk8s.io:443
KUBERNETES_PORT_443_TCP: tcp://<redacted>.hcp.westeurope.azmk8s.io:443
KUBERNETES_PORT_443_TCP_ADDR: <redacted>.hcp.westeurope.azmk8s.io
CSI_ENDPOINT: unix://C:\\var\\lib\\kubelet\\plugins\\file.csi.azure.com\\csi.sock
DRIVER_REG_SOCK_PATH: C:\\var\\lib\\kubelet\\plugins\\file.csi.azure.com\\csi.sock
PLUGIN_REG_DIR: C:\\var\\lib\\kubelet\\plugins_registry\\
KUBE_NODE_NAME: (v1:spec.nodeName)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ws59v (ro)
azurefile:
Container ID: containerd://b8b3ab85cd2c5efb74eef06039d88fb97130272f0725c4ab43fc2969ee70213c
Image: mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.3-windows-hp
Image ID: mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi@sha256:30ce602c8928227e3eafe766c99ae970a8dc9eb6dc6a82ed82982bbf7093ac1d
Port: <none>
Host Port: <none>
Command:
azurefileplugin.exe
Args:
--v=5
--endpoint=$(CSI_ENDPOINT)
--nodeid=$(KUBE_NODE_NAME)
--enable-windows-host-process=true
State: Running
Started: Tue, 09 Jul 2024 08:39:15 +0200
Last State: Terminated
Reason: Unknown
Exit Code: 255
Started: Mon, 08 Jul 2024 17:37:50 +0200
Finished: Tue, 09 Jul 2024 08:38:54 +0200
Ready: True
Restart Count: 1
Environment:
KUBERNETES_SERVICE_HOST: <redacted>.hcp.westeurope.azmk8s.io
KUBERNETES_PORT: tcp://<redacted>.hcp.westeurope.azmk8s.io:443
KUBERNETES_PORT_443_TCP: tcp://<redacted>.hcp.westeurope.azmk8s.io:443
KUBERNETES_PORT_443_TCP_ADDR: <redacted>.hcp.westeurope.azmk8s.io
AZURE_CREDENTIAL_FILE: C:\k\azure.json
CSI_ENDPOINT: unix://C:\\var\\lib\\kubelet\\plugins\\file.csi.azure.com\\csi.sock
KUBE_NODE_NAME: (v1:spec.nodeName)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ws59v (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-ws59v:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
:NoSchedule op=Exists
CriticalAddonsOnly op=Exists
Events: <none>
Logging:
See attachment. The node seems to have died around 13:00, although I am seeing timeouts a couple of hours before that.
csi-azurefile-node-win-mzb8.log
Hello,
Colleague of @Rob19999 here. What we notice when a node is going "dead" is that the command Get-SmbGlobalMapping takes a long time to respond, or doesn't respond at all. Do you know if the WMI layer that PowerShell uses does some sort of locking on the node? That would explain the seemingly random "timeouts" we see in the logging.
I have disabled the feature by setting --remove-smb-mount-on-windows=false in your cluster, could you check again? thx
Currently we're outside of working hours, so not much is happening on the cluster. The --remove-smb-mount-on-windows flag was added after we made a support request (support ticket: 2403070050001502) at Microsoft to resolve the nodes breaking after a while, usually after 14 days or so, or when a node reached around 701 smbglobalshares.
Back then we got the error below. I will see if it returns, or whether I can force it with a certain number of deploys. Given the change we made on our end by reducing PVC mounts, it will be harder to reach that amount.
```
MountVolume.MountDevice failed for volume 'pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14' : rpc error: code = Internal desc = volume(mc_<redacted>-<redacted>_westeurope#fc7a964cdab3c4c3abd74c7#pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14###<redacted>) mount \\\\<redacted>.file.core.windows.net\\pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14 on \\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\file.csi.azure.com\\e0cef412d93e12842f12522496d30f960b5440e97bb4bc578a01da37d85dd7a1\\globalmount failed with NewSmbGlobalMapping failed. output: 'New-SmbGlobalMapping : Not enough memory resources are available to process this command. \\r\\nAt line:1 char:190\\r\\n+ ... ser, $PWord;New-SmbGlobalMapping -RemotePath $Env:smbremotepath -Cred ...\\r\\n+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\\r\\n + CategoryInfo : ResourceUnavailable: (MSFT_SmbGlobalMapping:ROOT/Microsoft/...mbGlobalMapping) [New-SmbG \\r\\n lobalMapping], CimException\\r\\n + FullyQualifiedErrorId : Windows System Error 8,New-SmbGlobalMapping\\r\\n \\r\\n', err: exit status 1
```
@Rob19999 so do you want to keep the remove-smb-mount-on-windows feature or not? Creating 701 smbglobalshares on one node is crazy. I will try to find out how to improve the remove-smb-mount-on-windows feature to use fewer resources, but that will take time.
Let's keep remove-smb-mount-on-windows disabled for now. Without this functionality it has been more stable for us. We will monitor the number of connections on the nodes and remove them when they reach 600.
Just to make clear, since you mentioned resources: we got the 'New-SmbGlobalMapping : Not enough memory resources are available to process this command' error on v1.30.0, where remove-smb-mount-on-windows was not yet implemented. This error started showing up when we reached 701 smbglobalshares. Most of those 701 connections were in a disconnected state back then, due to them not being removed.
I can imagine that not many clusters reach 701 smbglobalshares; it depends on how often you upgrade your node images, assuming they're always running. But it can create seemingly random node crashes.
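The monitoring described above could tally mapping states from `Get-SmbGlobalMapping` output before deciding to clean up. A minimal Go sketch, assuming the plain two-column text format shown earlier in this thread; `countMappings` is a hypothetical helper, not driver code.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countMappings tallies SMB global mappings by status from text output shaped
// like "Disconnected \\acct.file.core.windows.net\pvc-...". A watchdog could
// alert or trigger cleanup once the total approaches a threshold (e.g. 600).
func countMappings(output string) (ok, disconnected int) {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue // skip blank or malformed lines
		}
		switch fields[0] {
		case "OK":
			ok++
		case "Disconnected":
			disconnected++
		}
	}
	return ok, disconnected
}

func main() {
	sample := "OK \\\\acct.file.core.windows.net\\pvc-1\n" +
		"Disconnected \\\\acct.file.core.windows.net\\pvc-2\n" +
		"Disconnected \\\\acct.file.core.windows.net\\pvc-3\n"
	ok, disc := countMappings(sample)
	fmt.Printf("OK=%d Disconnected=%d total=%d\n", ok, disc, ok+disc)
}
```

Feeding it real output would mean running `Get-SmbGlobalMapping` on the node and piping the text in; the parsing here only relies on the first whitespace-separated column being the status.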
We still have nodes crashing with SMB errors at the moment: one yesterday and one today. It's a lot better without the remove-smb-mount feature, but we now get this error without even hitting 500 SMB mounts. I don't expect anything from you; I just want to provide you with as much information as possible. See full logs in the attachment.
The first error happened at:
```
06:05:49.377083 7780 utils.go:106] GRPC error: rpc error: code = Internal desc = volume(##pvc-d7735dfb-d4de-46ee-9087-b4b4f97f9be0###--suite) mount \.file.core.windows.net\pvc-d7735dfb-d4de-46ee-9087-b4b4f97f9be0 on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\8b7e8149f0c5af6863b83e652d281b9132a12a9ab63e55ad56bcbf18d14d2760\globalmount failed with NewSmbGlobalMapp failed. output: "", err: exit status 0xc0000142
Please refer to http://aka.ms/filemounterror for possible causes and solutions for mount errors.
```
[azurefile.log](https://github.com/user-attachments/files/16264155/azurefile.log)