
container-engine-accelerators's Introduction

Hardware Accelerators in GKE

This repository is a collection of installation recipes and integration utilities for consuming Hardware Accelerators in Google Kubernetes Engine.

This is not an official Google product.

More details on the nvidia-gpu-device-plugin are here.

The official instructions for using GPUs on GKE are here.
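
In short (a hedged sketch assembled from the commands and manifests that appear in the issues below, not an excerpt from the official docs): the NVIDIA drivers are installed by applying the driver-installer DaemonSet for your node image, and workloads then request the nvidia.com/gpu extended resource.

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
# and in a pod spec:
#   resources:
#     limits:
#       nvidia.com/gpu: 1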

container-engine-accelerators's People

Contributors

arueth, ashaltu, aston-github, chardch, crystalzhaizhai, dependabot[bot], farbodmg, foxish, grac3gao, grac3gao-zz, jiaqicao257, jiayingz, jtyr, kaoet, karan, kyewei, linxiulei, melody789, pradvenkat, richardsliu, rochesterinnyc, rohitagarwal003, ruiwen-zhao, samuelkarp, tangenti, thebinaryone1, vishh, wizard-cxy, yguo0905, ymirkhang


container-engine-accelerators's Issues

init container error on v1.15.4-gke.15

The nvidia-driver-installer init container fails.
OS: COS
Using the latest config:
https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

When using the version of this config from September 22, 2019, everything works fine, so the error appears to be related to the two commits made in the last month. Logs:

+ COS_DOWNLOAD_GCS=https://storage.googleapis.com/cos-tools
+ COS_KERNEL_SRC_GIT=https://chromium.googlesource.com/chromiumos/third_party/kernel
+ COS_KERNEL_SRC_ARCHIVE=kernel-src.tar.gz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ARCHIVE=toolchain.tar.xz
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_DIR=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=418.67
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/home/kubernetes/bin/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO 2019-10-29 10:11:43 UTC] PRELOAD: false
[INFO 2019-10-29 10:11:43 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO 2019-10-29 10:11:43 UTC] Running on COS build id 12371.57.0
[INFO 2019-10-29 10:11:43 UTC] Checking if third party kernel modules can be installed
[INFO 2019-10-29 10:11:43 UTC] Checking cached version
[INFO 2019-10-29 10:11:43 UTC] Cache file /usr/local/nvidia/.cache not found.
[INFO 2019-10-29 10:11:43 UTC] Did not find cached version, building the drivers...
[INFO 2019-10-29 10:11:43 UTC] Downloading GPU installer ...
/usr/local/nvidia /
[INFO 2019-10-29 10:11:43 UTC] Downloading from https://storage.googleapis.com/nvidia-drivers-us-public/tesla/418.67/NVIDIA-Linux-x86_64-418.67.run
/
ls: cannot access '/build/usr/src/linux': No such file or directory
[INFO 2019-10-29 10:11:45 UTC] Kernel sources not found locally, downloading
[INFO 2019-10-29 10:11:45 UTC] Kernel source archive download URL: https://storage.googleapis.com/cos-tools/12371.57.0/kernel-src.tar.gz
/build/usr/src/linux /
/

real 0m1.351s
user 0m0.196s
sys 0m0.409s
/build/usr/src/linux /
/
[INFO 2019-10-29 10:11:53 UTC] Setting up compilation environment
[INFO 2019-10-29 10:11:53 UTC] Obtaining toolchain_env file from https://storage.googleapis.com/cos-tools/12371.57.0/toolchain_env

real 0m0.027s
user 0m0.015s
sys 0m0.004s
[INFO 2019-10-29 10:11:53 UTC] /build/cos-tools: bin
lib
toolchain.tar.xz
usr
[INFO 2019-10-29 10:11:53 UTC] Found existing toolchain package. Skipping download and installation
[INFO 2019-10-29 10:11:53 UTC] Configuring environment variables for cross-compilation
[INFO 2019-10-29 10:11:53 UTC] Configuring installation directories
/usr/local/nvidia /
[INFO 2019-10-29 10:11:53 UTC] Updating container's ld cache
/
[INFO 2019-10-29 10:11:53 UTC] Configuring kernel sources
/build/usr/src/linux /
/bin/sh: 1: x86_64-cros-linux-gnu-clang: Permission denied
HOSTCC scripts/basic/fixdep
/bin/sh: 1: x86_64-cros-linux-gnu-clang: Permission denied
HOSTCC scripts/kconfig/conf.o
YACC scripts/kconfig/zconf.tab.c
LEX scripts/kconfig/zconf.lex.c
HOSTCC scripts/kconfig/zconf.tab.o
HOSTLD scripts/kconfig/conf
scripts/kconfig/conf --olddefconfig Kconfig
./scripts/gcc-version.sh: 26: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 27: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 29: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 26: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 27: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-version.sh: 29: ./scripts/gcc-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
init/Kconfig:17: syntax error
init/Kconfig:16: invalid option
./scripts/clang-version.sh: 15: ./scripts/clang-version.sh: x86_64-cros-linux-gnu-clang: Permission denied
./scripts/gcc-plugin.sh: 11: ./scripts/gcc-plugin.sh: x86_64-cros-linux-gnu-clang: Permission denied
make[1]: *** [olddefconfig] Error 1
scripts/kconfig/Makefile:69: recipe for target 'olddefconfig' failed
make: *** [olddefconfig] Error 2
Makefile:529: recipe for target 'olddefconfig' failed

Allow nvidia driver installer daemonsets for cos and ubuntu to co-exist

For clusters that have both COS- and Ubuntu-based nodes, both the COS and Ubuntu driver-installer DaemonSets need to be installed, but with the current setup only one of them can be.

Proposal to allow both DaemonSets to be installed:

Make the changes below to the preloaded DaemonSets https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml and https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml:

-> rename the DaemonSet name and labels from "nvidia-driver-installer" to "nvidia-driver-installer-{cos/ubuntu}"

-> add an additional node selector term

nodeSelectorTerms:
- matchExpressions:
  - key: cloud.google.com/gke-accelerator
    operator: Exists
  - key: cloud.google.com/gke-os-distribution
    operator: In
    values:
    - {ubuntu/cos}

I will be happy to create a PR if this sounds good.
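
For concreteness, a hedged sketch of what the cos variant of the proposed affinity block could look like (label keys are the ones referenced above; the exact placement and values are an assumption):

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
              - key: cloud.google.com/gke-os-distribution
                operator: In
                values:
                - cos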

unknown field "spec.template.spec.initContainers[0].volumeMounts[5].hostPath"

Hello.

This daemonset is invalid:
https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml#L106

maxpain@Maksims-MacBook-Pro-4 ~ % kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
Error from server (BadRequest): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml": DaemonSet in version "v1" cannot be handled as a DaemonSet: strict decoding error: unknown field "spec.template.spec.initContainers[0].volumeMounts[5].hostPath"
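
A hedged reading of that strict-decoding error: hostPath is only valid under a pod's volumes section, while a volumeMounts entry only takes fields such as name and mountPath, roughly:

      volumes:
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
      ...
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia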

Kernel Download Fails for Pod Running on Ubuntu 18.04

Hi,

I've been running a bare-metal deployment of Kubernetes through Kubespray, which utilises the ubuntu-nvidia-driver-installer container for GPU-enabled nodes. All of the underlying machines are running Ubuntu 18.04/Bionic, which has been working well so far.

However with the recent release of kernel 4.15.0-44-generic for Ubuntu 18.04, the driver installer init container fails as it cannot locate the linux-headers-4.15.0-44-generic package in the 16.04/Xenial repositories. It looks like at the moment, the Xenial repositories only have 4.15.0-43 available.

For deployments on to 18.04, is it possible to use a container image built from 18.04 as well?

Many thanks.

+ NVIDIA_INSTALL_DIR_HOST=/home/kubernetes/bin/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
++ basename http://us.download.nvidia.com/XFree86/Linux-x86_64/415.27/NVIDIA-Linux-x86_64-415.27.run
+ NVIDIA_INSTALLER_RUNFILE=NVIDIA-Linux-x86_64-415.27.run
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
++ uname -r
+ KERNEL_VERSION=4.15.0-44-generic
+ set +x
Checking cached version
Cache file /usr/local/nvidia/.cache found but existing versions didn't match.
Downloading kernel sources...
Get:1 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB]
Get:3 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [107 kB]
Get:4 http://archive.ubuntu.com/ubuntu xenial/universe Sources [9802 kB]
Get:5 http://security.ubuntu.com/ubuntu xenial-security InRelease [109 kB]
Get:6 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages [1558 kB]
Get:7 http://security.ubuntu.com/ubuntu xenial-security/universe Sources [118 kB]
Get:8 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [767 kB]
Get:9 http://archive.ubuntu.com/ubuntu xenial/restricted amd64 Packages [14.1 kB]
Get:10 http://archive.ubuntu.com/ubuntu xenial/universe amd64 Packages [9827 kB]
Get:11 http://archive.ubuntu.com/ubuntu xenial/multiverse amd64 Packages [176 kB]
Get:12 http://archive.ubuntu.com/ubuntu xenial-updates/universe Sources [305 kB]
Get:13 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [1167 kB]
Get:14 http://archive.ubuntu.com/ubuntu xenial-updates/restricted amd64 Packages [13.1 kB]
Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [931 kB]
Get:16 http://archive.ubuntu.com/ubuntu xenial-updates/multiverse amd64 Packages [19.1 kB]
Get:17 http://security.ubuntu.com/ubuntu xenial-security/restricted amd64 Packages [12.7 kB]
Get:18 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [529 kB]
Get:19 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 Packages [6119 B]
Get:20 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [7942 B]
Get:21 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [8532 B]
Fetched 25.8 MB in 6s (3702 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package linux-headers-4.15.0-44-generic
E: Couldn't find any package by glob 'linux-headers-4.15.0-44-generic'
E: Couldn't find any package by regex 'linux-headers-4.15.0-44-generic'
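
One way to confirm what the configured (Xenial) repositories actually carry is to query apt from inside the installer image, a hedged sketch:

apt-get update
apt-cache policy linux-headers-4.15.0-44-generic    # shows candidate versions, or nothing if unavailable
apt-cache search linux-headers-4.15 | sort          # lists the header packages the repos do have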

dependency on k8s.io/kubernetes cause build failure

My application depends on github.com/GoogleCloudPlatform/container-engine-accelerators/pkg/gpu/nvidia/metrics, and k8s.io/kubernetes is among its indirect dependencies. Because of this, the go build command fails with the following messages.

go build
go: finding module for package github.com/GoogleCloudPlatform/container-engine-accelerators/pkg/gpu/nvidia/metrics
go: found github.com/GoogleCloudPlatform/container-engine-accelerators/pkg/gpu/nvidia/metrics in github.com/GoogleCloudPlatform/container-engine-accelerators v0.0.0-20201215190136-13a0dea71c2e
go: github.com/GoogleCloudPlatform/container-engine-accelerators@v0.0.0-20201215190136-13a0dea71c2e requires
	k8s.io/kubernetes@… requires
	k8s.io/api@v0.0.0: reading k8s.io/api/go.mod at revision v0.0.0: unknown revision v0.0.0

The issue is already reported upstream (kubernetes/kubernetes#90358); hopefully the workaround described in that issue thread can be applied here.
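
The workaround discussed in that upstream thread is, roughly, to pin the k8s.io staging modules to real release tags with replace directives in the consuming module's go.mod. A hedged sketch (v0.20.0 is only an illustrative tag; pick the one matching the k8s.io/kubernetes version being pulled in):

go mod edit -replace=k8s.io/api=k8s.io/api@v0.20.0
go mod edit -replace=k8s.io/apimachinery=k8s.io/apimachinery@v0.20.0
go mod edit -replace=k8s.io/client-go=k8s.io/client-go@v0.20.0
# ...repeat for every k8s.io/* staging module that resolves to v0.0.0, then:
go mod tidy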

Can't install NVIDIA Drivers using DaemonSet

I'm trying to follow the instructions to install the NVIDIA GPU drivers (tutorial here, which refers to the Google Cloud documentation).

It should be very easy: the following command should install the NVIDIA drivers automatically:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Instead, I get this error:

error: Missing or incomplete configuration info.  Please point to an existing, complete config file:

  1. Via the command-line flag --kubeconfig
  2. Via the KUBECONFIG environment variable
  3. In your home directory as ~/.kube/config

To view or setup config directly use the 'config' command.

I tried copying the manifest locally into the file ~/.kube/config/daemonset-preloaded.yaml and using that, but had no luck.
Using the command
kubectl --kubeconfig="~/.kube/config" apply -f daemonset-preloaded.yaml
gave me
error: stat ~/.kube/config: no such file or directory

Does someone know how to solve the problem?
Many thanks in advance
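
The error above usually just means kubectl has no cluster credentials yet. On GKE these are normally fetched with gcloud before applying the manifest; a hedged sketch (cluster name and zone are placeholders):

gcloud container clusters get-credentials CLUSTER_NAME --zone COMPUTE_ZONE
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml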

Help! Do I need to reinstall the driver when the machine reboots?

I tried this device plugin in my cluster and it worked fine at first. I ran the driver installer only once, with something like docker run --rm installer -v -e ..., instead of as a daemonset. After my machine rebooted, nvidia-smi failed with an error like NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Must I run the driver-installer container as a daemonset to keep the driver working?

Error occurred in CUDA_CALL: 35 after daemonset created.

After creating the daemonset according to https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/ without error, I manually pulled a Docker container containing our GPU-centric application and ran it to verify its ability to interact with the GPU. It failed with 'Error occurred in CUDA_CALL: 35'.

I'm not certain how to verify that the AMD/NVIDIA driver and libraries are installed. Please advise.

I found the following running on the cluster nodes:

docker ps | grep -i nvid
9313f94c7a91 c6bf69abba08 "/usr/bin/nvidia-g..." 17 hours ago Up 17 hours
k8s_nvidia-gpu-device-plugin_nvidia-gpu-device-plugin-j9wcp_kube-system_e8085fbd-96e1-11e9-9117-42010a8a0074_1
a6ff9582a414 k8s.gcr.io/pause:3.1 "/pause" 17 hours ago Up 17 hours
k8s_POD_nvidia-gpu-device-plugin-j9wcp_kube-system_e8085fbd-96e1-11e9-9117-42010a8a0074_1
122ef629bc2d 2b58359142b0 "/pause" 17 hours ago Up 17 hours
k8s_pause_nvidia-driver-installer-7k7r8_kube-system_e814bf5a-96e1-11e9-9117-42010a8a0074_1
e80b741c1d72 a8fd6d7f4414 "nvidia-device-plugin" 17 hours ago Up 17 hours
k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-f4mxg_kube-system_e8010024-96e1-11e9-9117-42010a8a0074_1
53e3b403ef15 k8s.gcr.io/pause:3.1 "/pause" 17 hours ago Up 17 hours
k8s_POD_nvidia-device-plugin-daemonset-f4mxg_kube-system_e8010024-96e1-11e9-9117-42010a8a0074_1
ce99e7e6536e k8s.gcr.io/pause:3.1 "/pause" 17 hours ago Up 17 hours
k8s_POD_nvidia-driver-installer-7k7r8_kube-system_e814bf5a-96e1-11e9-9117-42010a8a0074_1
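
To check whether the driver and libraries actually landed on a node, a hedged sketch (the install directory is the one shown in the installer environment elsewhere on this page; the exact layout under it may differ):

gcloud compute ssh NODE_NAME --zone COMPUTE_ZONE
ls /home/kubernetes/bin/nvidia                 # install dir used by the driver installer
/home/kubernetes/bin/nvidia/bin/nvidia-smi     # should list the GPU if the driver is loaded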

nvidia-driver-installer crash loop during GKE scale ups

We've been using the nvidia-driver-installer on Ubuntu node groups via GKE v1.15 per the official How-to GPU instructions specified here.

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

The daemonset deployed via daemonset-preloaded.yaml appeared to work correctly for some time, however we started noticing issues last Friday when new nodes were added to the node group via cluster autoscaling. The nvidia-driver-installer daemonset pods that were scheduled to these new nodes began to crash loop, as their initContainers were exiting with non-zero exit codes.

Upon examining pod logs, it appears that the failed pods contain the following lines as their last output before exiting.

Verifying Nvidia installation... DONE. 
ln: /root/home/kubernetes/bin/nvidia: cannot overwrite directory

See here for full log output from one of the failed pods.

I've logged into one of the nodes and manually removed the /root/home/kubernetes/bin/nvidia folder (which is presumably created by the very first instance of the nvidia-driver-installer pod scheduled to a node when it comes up) but the folder re-appears and the daemonset pods continue to crash in loop. Nodes that have daemonset pods in this state don't have the drivers correctly installed, and jobs that require them fail to import CUDA due to driver issues.

We've been experiencing this issue for 4 days now with nodes that receive live production traffic. Not every node that scales up experiences this problem, but most do. If a node comes up and its nvidia-driver-installer pod begins to crash, we've had no luck bringing it out of that state. Instead we've manually marked the node as unschedulable and brought it down, hoping the next to come up won't experience the same problem.

From our perspective, nothing has changed with our cluster configuration, node group configuration, or K8s manifests that would cause this issue to start occurring. We did experience something similar for a few hours in mid December, but the issue resolved itself within a few hours and we didn't think much of it. I'm happy to provide more logs or detailed information about the errors upon request!

Any thoughts about what could be causing this?

Installer freezes node on updating_container_ld_cache

Whenever I run this on an Ubuntu node, the node stops responding to SSH and never becomes available for GPU workloads.

This is what I see in the logs
kubectl logs nvidia-driver-installer-6vnpd -n kube-system -c nvidia-driver-installer -f

nvidia-gpu daemonset using hostNetworking

Copied from kubernetes/kubernetes#62357; this is probably a better place to track it.

Host networking is enabled for the nvidia-gpu daemonset. I'm trying to understand the purpose of using it.
https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml#L30

Disabling hostNetwork seemed to work, and GPU workloads could still be scheduled on the cluster.

How to reproduce it (as minimally and precisely as possible):
Remove this line and apply the daemonset to a GPU enabled cluster.
https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml#L30

Anything else we need to know?:
I would like to understand the purpose of using hostNetwork for the nvidia-gpu daemonset. It appears NVIDIA's implementation does not use it either: https://github.com/NVIDIA/k8s-device-plugin/blob/v1.10/nvidia-device-plugin.yml

Metrics export broken due to device naming mismatch

We're getting this error repeatedly in the logs, and no GPU metrics are being exported. This is on GKE:

GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found

I suspect it may be an issue with device names (nvidia0 vs nvidia0/gi0), but I'm not entirely sure.

Full logs:

➜  ~ k -n kube-system logs nvidia-gpu-device-plugin-4rcj2
I0518 15:07:21.825159       1 nvidia_gpu.go:75] device-plugin started
I0518 15:07:21.825230       1 nvidia_gpu.go:82] Reading GPU config file: /etc/nvidia/gpu_config.json
I0518 15:07:21.825362       1 nvidia_gpu.go:91] Using gpu config: {7g.40gb 0 { 0} []}
E0518 15:07:27.545855       1 nvidia_gpu.go:117] failed to start GPU device manager: failed to start mig device manager: Number of partitions (0) for GPU 0 does not match expected partition count (1)
I0518 15:07:32.547969       1 mig.go:175] Discovered GPU partition: nvidia0/gi0
I0518 15:07:32.549461       1 nvidia_gpu.go:122] Starting metrics server on port: 2112, endpoint path: /metrics, collection frequency: 30000
I0518 15:07:32.550354       1 metrics.go:134] Starting metrics server
I0518 15:07:32.550430       1 metrics.go:140] nvml initialized successfully. Driver version: 470.161.03
I0518 15:07:32.550446       1 devices.go:115] Found 1 GPU devices
I0518 15:07:32.556369       1 devices.go:126] Found device nvidia0 for metrics collection
I0518 15:07:32.556430       1 health_checker.go:65] Starting GPU Health Checker
I0518 15:07:32.556440       1 health_checker.go:68] Healthchecker receives device nvidia0/gi0, device {nvidia0/gi0 Healthy nil {} 0}+
I0518 15:07:32.556475       1 health_checker.go:77] Found 1 GPU devices
I0518 15:07:32.556667       1 health_checker.go:145] HealthChecker detects MIG is enabled on device nvidia0
I0518 15:07:32.560599       1 health_checker.go:164] Found mig device nvidia0/gi0 for health monitoring. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561030       1 health_checker.go:113] Registering device /dev/nvidia0. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561195       1 manager.go:385] will use alpha API
I0518 15:07:32.561206       1 manager.go:399] starting device-plugin server at: /device-plugin/nvidiaGPU-1684422452.sock
I0518 15:07:32.561393       1 manager.go:426] device-plugin server started serving
I0518 15:07:32.564986       1 beta_plugin.go:40] device-plugin: ListAndWatch start
I0518 15:07:32.565040       1 manager.go:434] device-plugin registered with the kubelet
I0518 15:07:32.565003       1 beta_plugin.go:138] ListAndWatch: send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:nvidia0/gi0,Health:Healthy,Topology:nil,},},}
E0518 15:08:02.557919       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:02.643081       1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:08:32.557732       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:32.683032       1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found

Downstream issue: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/76.

Pod Unschedulable

I am getting two errors after deploying my object detection model for prediction using GPUs (apparently a single Insufficient nvidia.com/gpu message split across two entries):
1. PodUnschedulable Cannot schedule pods: Insufficient nvidia
2. PodUnschedulable Cannot schedule pods: com/gpu.

I have followed this post for deploying my prediction model:
https://github.com/kubeflow/examples/blob/master/object_detection/tf_serving_gpu.md
and this one for installing NVIDIA drivers on my nodes.

I haven't used nvidia-docker.
This is the output of the kubectl describe pods command:

Name:           xyz-v1-5c5b57cf9c-ltw9m
Namespace:      default
Node:           <none>
Labels:         app=xyz
                pod-template-hash=1716137957
                version=v1
Annotations:    <none>
Status:         Pending
IP:             
Controlled By:  ReplicaSet/xyz-v1-5c5b57cf9c
Containers:
  xyz:
    Image:      tensorflow/serving:1.11.1-gpu
    Port:       9000/TCP
    Host Port:  0/TCP
    Command:
      /usr/bin/tensorflow_model_server
    Args:
      --port=9000
      --model_name=xyz
      --model_base_path=gs://xyz_kuber_app-xyz-identification/export/
    Limits:
      cpu:             4
      memory:          4Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
  xyz-http-proxy:
    Image:      gcr.io/kubeflow-images-public/tf-model-server-http-proxy:v20180606-9dfda4f2
    Port:       8000/TCP
    Host Port:  0/TCP
    Command:
      python
      /usr/src/app/server.py
      --port=8000
      --rpc_port=9000
      --rpc_timeout=10.0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        500m
      memory:     500Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-b6dpn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-b6dpn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason             Age                       From                Message
  ----     ------             ----                      ----                -------
  Warning  FailedScheduling   3m57s (x1276 over 6h17m)  default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient nvidia.com/gpu.
  Normal   NotTriggerScaleUp  3m20s (x8562 over 26h)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)

The output of kubectl describe pods | grep gpu is :

    Image:      tensorflow/serving:1.11.1-gpu
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
                 nvidia.com/gpu:NoSchedule
  Warning  FailedScheduling   48s (x1276 over 6h14m)  default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient nvidia.com/gpu.

I haven't used nvidia-docker. The kubectl get pods -n=kube-system command gives me:

NAME                                                      READY   STATUS                  RESTARTS   AGE
event-exporter-v0.2.3-54f94754f4-vd9l5                    2/2     Running                 0          1d
fluentd-gcp-scaler-6d7bbc67c5-m8gt6                       1/1     Running                 0          1d
fluentd-gcp-v3.1.0-4wnv9                                  2/2     Running                 0          1d
fluentd-gcp-v3.1.0-x5cdv                                  2/2     Running                 0          22h
heapster-v1.5.3-75bdcc556f-8z4x8                          3/3     Running                 0          1d
kube-dns-788979dc8f-59ftr                                 4/4     Running                 0          1d
kube-dns-788979dc8f-rm9lk                                 4/4     Running                 0          1d
kube-dns-autoscaler-79b4b844b9-9xg69                      1/1     Running                 0          1d
kube-proxy-gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7   1/1     Running                 0          22h
kube-proxy-gke-kuberflow-xyz-pool-1-57d75875-8f88     1/1     Running                 0          1d
l7-default-backend-75f847b979-2plm4                       1/1     Running                 0          1d
metrics-server-v0.2.1-7486f5bd67-mj99g                    2/2     Running                 0          1d
nvidia-driver-installer-zxn2m                             0/1     Init:CrashLoopBackOff   5          8m
nvidia-gpu-device-plugin-hc7h8                            1/1     Running                 0          22h

Looks like an issue with the nvidia driver installer.

Describing the nvidia driver installer pod with kubectl describe pods nvidia-driver-installer-zxn2m -n=kube-system:

   Name:           nvidia-driver-installer-zxn2m
Namespace:      kube-system
Node:           gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7/10.128.0.33
Start Time:     Sat, 16 Feb 2019 17:52:13 +0530
Labels:         controller-revision-hash=1137413470
                k8s-app=nvidia-driver-installer
                name=nvidia-driver-installer
                pod-template-generation=2
Annotations:    <none>
Status:         Pending
IP:             10.36.3.3
Controlled By:  DaemonSet/nvidia-driver-installer
Init Containers:
  nvidia-driver-installer:
    Container ID:   docker://09f8c44650c6180f2257d37d1b0922116ed4cc77838d7b6a0fc1d951306cf76b
    Image:          gke-nvidia-installer:fixed
    Image ID:       docker-pullable://gcr.io/cos-cloud/cos-gpu-installer@sha256:e7bf3b4c77ef0d43fedaf4a244bd6009e8f524d0af4828a0996559b7f5dca091
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Sat, 16 Feb 2019 18:01:42 +0530
    Last State:     Terminated
      Reason:       Error
      Exit Code:    32
      Started:      Sat, 16 Feb 2019 17:58:26 +0530
      Finished:     Sat, 16 Feb 2019 17:59:01 +0530
    Ready:          False
    Restart Count:  6
    Requests:
      cpu:        150m
    Environment:  <none>
    Mounts:
      /boot from boot (rw)
      /dev from dev (rw)
      /root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-n5t8z (ro)
Containers:
  pause:
    Container ID:   
    Image:          gcr.io/google-containers/pause:2.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-n5t8z (ro)
Conditions:
  Type           Status
  Initialized    False 
  Ready          False 
  PodScheduled   True 
Volumes:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  boot:
    Type:          HostPath (bare host directory volume)
    Path:          /boot
    HostPathType:  
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  default-token-n5t8z:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-n5t8z
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason                 Age                     From                                                   Message
  ----     ------                 ----                    ----                                                   -------
  Normal   SuccessfulMountVolume  9m51s                   kubelet, gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7  MountVolume.SetUp succeeded for volume "boot"
  Normal   SuccessfulMountVolume  9m51s                   kubelet, gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7  MountVolume.SetUp succeeded for volume "root-mount"
  Normal   SuccessfulMountVolume  9m51s                   kubelet, gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7  MountVolume.SetUp succeeded for volume "dev"
  Normal   SuccessfulMountVolume  9m51s                   kubelet, gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7  MountVolume.SetUp succeeded for volume "default-token-n5t8z"
  Normal   Pulled                 7m1s (x4 over 9m50s)    kubelet, gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7  Container image "gke-nvidia-installer:fixed" already present on machine
  Normal   Created                7m1s (x4 over 9m50s)    kubelet, gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7  Created container
  Normal   Started                7m1s (x4 over 9m50s)    kubelet, gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7  Started container
  Warning  BackOff                4m44s (x12 over 8m30s)  kubelet, gke-kuberflow-xyz-gpu-pool-a25f1a36-pvf7  Back-off restarting failed container

Error log from kubectl logs nvidia-driver-installer-zxn2m -n=kube-system:

Error from server (BadRequest): container "pause" in pod "nvidia-driver-installer-zxn2m" is waiting to start: PodInitializing

How can I fix this?
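
The pause container has no useful output; the actual installer failure shows up in the init container's logs, which can be requested explicitly with the same -c flag used elsewhere on this page, e.g.:

kubectl logs nvidia-driver-installer-zxn2m -n kube-system -c nvidia-driver-installer
kubectl logs nvidia-driver-installer-zxn2m -n kube-system -c nvidia-driver-installer --previous   # last crashed attempt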

OutOfnvidia.com/gpu when node is restarted

While running a sample ReplicaSet that starts pods with a GPU request to test our installation, we discovered that if we restart the node running one of the pods, that pod enters an OutOfnvidia.com/gpu state that seems to last forever.

Is this the normal behaviour when the resource is lost?

kubectl get pods -o wide | grep replicaset
gpu-replicaset-bw4gf                    1/1     Running               0          2h    10.2.2.9      controller-1.k8s.ml.prod.srcd.host   <none>
gpu-replicaset-bxbc2                    0/1     OutOfnvidia.com/gpu   0          2h    <none>        controller-2.k8s.ml.prod.srcd.host   <none>
gpu-replicaset-n8srl                    1/1     Running               0          2h    10.2.0.9      controller-3.k8s.ml.prod.srcd.host   <none>
gpu-replicaset-spvwb                    1/1     Running               0          2h    10.2.1.6      controller-2.k8s.ml.prod.srcd.host   <none>
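
Since OutOfnvidia.com/gpu is a terminal pod state, one hedged workaround is to delete the stuck pod and let the ReplicaSet create a replacement once the restarted node advertises the GPU again:

kubectl delete pod gpu-replicaset-bxbc2
kubectl get pods -o wide | grep gpu-replicaset   # the ReplicaSet should schedule a new pod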

Downloading driver fails on a K8S 1.18 GKE Cluster

Using daemonset-nvidia-v450.yaml fails due to a 403 error in a cluster with version 1.18.14-gke.1200. daemonset-preloaded.yaml works fine in a 1.17 cluster but also fails in a 1.18 cluster.

I've only captured the log of the v450 installer:

+ COS_KERNEL_INFO_FILENAME=kernel_info
+ COS_KERNEL_SRC_HEADER=kernel-headers.tgz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_HEADER=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=450.51.06
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/home/kubernetes/bin/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO    2021-01-22 22:11:36 UTC] PRELOAD: false
[INFO    2021-01-22 22:11:36 UTC] Running on COS build id 13310.1041.38
[INFO    2021-01-22 22:11:36 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/13310.1041.38
[INFO    2021-01-22 22:11:36 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO    2021-01-22 22:11:36 UTC] Checking if third party kernel modules can be installed
/tmp/esp /
/
[INFO    2021-01-22 22:11:36 UTC] Checking cached version
[INFO    2021-01-22 22:11:36 UTC] Cache file /usr/local/nvidia/.cache not found.
[INFO    2021-01-22 22:11:36 UTC] Did not find cached version, building the drivers...
[INFO    2021-01-22 22:11:36 UTC] Downloading GPU installer ...
/usr/local/nvidia /
[INFO    2021-01-22 22:11:37 UTC] Downloading from https://storage.googleapis.com/nvidia-drivers-eu-public/nvidia-cos-project/85/tesla/450_00/450.51.06/NVIDIA-Linux-x86_64-450.51.06_85-13310-1041-38.cos
[INFO    2021-01-22 22:11:37 UTC] Downloading GPU installer from https://storage.googleapis.com/nvidia-drivers-eu-public/nvidia-cos-project/85/tesla/450_00/450.51.06/NVIDIA-Linux-x86_64-450.51.06_85-13310-1041-38.cos
curl: (22) The requested URL returned error: 403

Support for CUDA 10.0

Looks like CUDA 10.0 requires driver version >= 410.48, and the maximum CUDA version supported by the current driver installer appears to be 9.2, given that it installs driver version 396.26:

root@launchtest-j0227-1834-778f-master-0:~# cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX x86_64 Kernel Module  396.26  Mon Apr 30 18:01:39 PDT 2018
GCC version:  gcc version 4.9.x 20150123 (prerelease) (4.9.2_cos_gg_4.9.2-r193-ac6128e0a17a52f011797f33ac3e7d6273a9368d_4.9.2-r193)

See also Table 1 here. This is corroborated by the GPUs on GKE docs, which state that the maximum supported CUDA version is 9.*.

Is it feasible to update the driver version to 410.48 in order to support all current CUDA versions?

This is especially relevant because tensorflow-gpu releases are now compiled for CUDA 10.0 by default (if I understand correctly), i.e. the version installed by pip install tensorflow-gpu==1.13.1 expects CUDA 10.0 and otherwise requires rebuilding from source.

/cc kubeflow/kubeflow#2573 @jlewi

Is there a solution to make all GPU devices visible to a pod that does not request `nvidia.com/gpu`?

When I use NVIDIA/k8s-device-plugin in my k8s cluster, I set NVIDIA_VISIBLE_DEVICES=all in the pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - args:
    - -c
    - top -b
    command:
    - /bin/sh
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    image: cuda:10.2-cudnn7-devel-ubuntu18.04
    name: test
    resources:
      limits:
        cpu: 150m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 200Mi

The devices.list file under /sys/fs/cgroup/devices/kubepods/burstable/podxxxxxx/xxxxxx/devices.list lists all GPU devices on this node.

I noticed that this GCE container-engine-accelerators project doesn't require nvidia-docker, so NVIDIA_VISIBLE_DEVICES may not work.
So, is there a solution to make all GPU devices visible to a pod that does not request nvidia.com/gpu?

DevicePlugin Pod keeps terminating

Hi,

I am using a derivative of the in-tree device plugin from the kubernetes repo[0] for a CoreOS based cluster. The NVIDIA driver installation is done by using the daemonset provided by @squat [1].

I noticed that the kubelet is removing the device plugin pods with the following message:

kubelet_pods.go:1121] Killing unwanted pod "nvidia-gpu-device-plugin-btrpv"

After removing the label

addonmanager.kubernetes.io/mode: Reconcile

from the manifest, this issue disappears.

I am trying now to understand why this is happening. Any idea?

Thanks,
Andreas

[0] https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
[1] https://github.com/squat/modulus

Update driver version: how?

When updating the driver version (for example by editing the DaemonSet's NVIDIA_DRIVER_VERSION environment variable), the nodes need a restart/reset:

  • with the stable branch the update strategy is OnDelete: we need to manually delete the pods first
  • the new pods then fail: the new driver installation fails on nodes where another kernel module version is already loaded.

Example:

Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 418.40.04................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in
       your kernel.  This may be because it is in use (for example, by an X
       server, a CUDA program, or the NVIDIA Persistence Daemon), but this
       may also happen if your kernel was configured without support for
       module unloading.  Please be sure to exit any programs that may be
       using the GPU(s) before attempting to upgrade your driver.  If no
       GPU-based programs are running, you know that your kernel supports
       module unloading, and you still receive this message, then an error
       may have occured that has corrupted an NVIDIA kernel module's usage
       count, for which the simplest remedy is to reboot your computer.


ERROR: Installation has failed.  Please see the file
       '/usr/local/nvidia/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available
       on the Linux driver download page at www.nvidia.com.

After a node restart/reset, the installation works with the new version.

=> Maybe document somewhere how to upgrade the driver version?

Remark: since #66 (included in master, not in stable, which is what https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers documents), the update strategy changed from OnDelete to RollingUpdate. This avoids having to manually delete the pods (which is required even with the node restart/reset), but the node restart/reset is still needed.
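
Until this is documented, a hedged sketch of the sequence described above (the env edit and label are taken from the manifests shown on this page; the drain and restart is the part that clears the already-loaded kernel module):

kubectl -n kube-system set env daemonset/nvidia-driver-installer NVIDIA_DRIVER_VERSION=418.40.04
# with the stable (OnDelete) branch, also delete the installer pods so they pick up the change:
kubectl -n kube-system delete pod -l name=nvidia-driver-installer
# then drain and restart/reset each GPU node so no old nvidia module is loaded:
kubectl drain NODE_NAME --ignore-daemonsets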

How to build the cmd/nvidia_gpu/ to a binary exec file

I have cloned this project and ran go get for the related projects, such as golang.org/x/net/context and google.golang.org/grpc, but I get the error message below when running "go build":

./nvidia_gpu.go:113:43: cannot use conn (type *"google.golang.org/grpc".ClientConn) as type *"k8s.io/kubernetes/vendor/google.golang.org/grpc".ClientConn in argument to deviceplugin.NewRegistrationClient

Is there anything else I need to set up for the GOPATH?

EGL doesn't work on kubernetes with k8s NVIDIA plugin

I'm trying to run the CARLA simulator in offscreen/EGL mode on GKE, but it fails.
The simulator uses the nvidia/opengl image.

It runs correctly under nvidia-docker2, where the drivers are mounted in /usr/lib/x86_64-linux-gnu. But on GKE the drivers are attached by the nvidia plugin under /usr/local/nvidia/lib64 by default.

Should the new driver library directories be added to the LD_LIBRARY_PATH of newly started containers?
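
One hedged workaround is to point LD_LIBRARY_PATH at the directory where the plugin mounts the drivers, the same way the device-plugin pod itself does elsewhere on this page, e.g. in the container spec:

    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64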

Unable to deploy nvidia-driver-installer daemonSet via terraform

Hi, we would like to use Terraform to deploy the nvidia-driver-installer:

resource "kubernetes_manifest" "nvidia_driver_installer" {
  manifest = yamldecode(file("../nvidia-driver-installer.yaml"))
}

But we got an inconsistent result error after applying, since the CPU request values for the init containers were serialized from 0.15 to 150m.

╷
│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to kubernetes_manifest.nvidia_driver_installer,
│ provider "provider[\"registry.terraform.io/hashicorp/kubernetes\"]"
│ produced an unexpected new value:
│ .object.spec.template.spec.initContainers[0].resources.requests["cpu"]: was
│ cty.StringVal("0.15"), but now cty.StringVal("150m").
│ 
│ This is a bug in the provider, which should be reported in the provider's
│ own issue tracker.
╵
╷
│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to kubernetes_manifest.nvidia_driver_installer,
│ provider "provider[\"registry.terraform.io/hashicorp/kubernetes\"]"
│ produced an unexpected new value:
│ .object.spec.template.spec.initContainers[1].resources.requests["cpu"]: was
│ cty.StringVal("0.15"), but now cty.StringVal("150m").
│ 
│ This is a bug in the provider, which should be reported in the provider's
│ own issue tracker.

Could you kindly change these values in manifests from 0.15 -> 150m to avoid this behavior?
We could do it on our side, but it seems like it would be better to do it in the original repo.
Thanks!
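
In other words, the manifest value would change only in spelling, not in quantity; a sketch of the normalized form the provider reports back:

        resources:
          requests:
            cpu: 150m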

Instructions for GKE ubuntu

I am trying to use this project with a GKE node created with --image-type=ubuntu, but it doesn't work out of the box. Any pointers on how to make it work?

What I tried so far:

Start GKE cluster and ssh to the instance

gcloud beta container clusters create \
  --accelerator=type=nvidia-tesla-k80,count=1 \
  --zone=$CLUSTER_ZONE \
  --num-nodes=1 \
  --cluster-version=1.9.2-gke.1 \
  --machine-type=n1-standard-8 \
  --image-type=ubuntu \
  --scopes=https://www.googleapis.com/auth/devstorage.read_write \
  $CLUSTER_NAME

Install GPU libraries on the instance

I followed installation steps from https://cloud.google.com/compute/docs/gpus/add-gpus

curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
apt-get update
apt-get install cuda-8-0 -y

Libraries seemed to be installed in /usr/lib/nvidia-390:

root@dev-kozikow-instance:/usr/lib/nvidia-390# ls /usr/lib/nvidia-390
alt_ld.so.conf             libGL.so                       libGLX_nvidia.so.0            libnvidia-egl-wayland.so.1           libnvidia-ifr.so.390.30
alternate-install-present  libGL.so.1                     libGLX_nvidia.so.390.30       libnvidia-egl-wayland.so.1.0.2       libnvidia-ml.so
bin                        libGL.so.390.30                libGLdispatch.so.0            libnvidia-eglcore.so.390.30          libnvidia-ml.so.1
bin-workdir                libGLESv1_CM.so                libOpenGL.so                  libnvidia-encode.so                  libnvidia-ml.so.390.30
drivers                    libGLESv1_CM.so.1              libOpenGL.so.0                libnvidia-encode.so.1                libnvidia-ptxjitcompiler.so
drivers-workdir            libGLESv1_CM_nvidia.so.1       libnvcuvid.so                 libnvidia-encode.so.390.30           libnvidia-ptxjitcompiler.so.1
ld.so.conf                 libGLESv1_CM_nvidia.so.390.30  libnvcuvid.so.1               libnvidia-fatbinaryloader.so.390.30  libnvidia-ptxjitcompiler.so.390.30
lib64                      libGLESv2.so                   libnvcuvid.so.390.30          libnvidia-fbc.so                     libnvidia-tls.so.390.30
lib64-workdir              libGLESv2.so.2                 libnvidia-cfg.so              libnvidia-fbc.so.1                   libnvidia-wfb.so.1
libEGL.so                  libGLESv2_nvidia.so.2          libnvidia-cfg.so.1            libnvidia-fbc.so.390.30              libnvidia-wfb.so.390.30
libEGL.so.1                libGLESv2_nvidia.so.390.30     libnvidia-cfg.so.390.30       libnvidia-glcore.so.390.30           tls
libEGL.so.390.30           libGLX.so                      libnvidia-compiler.so         libnvidia-glsi.so.390.30             vdpau
libEGL_nvidia.so.0         libGLX.so.0                    libnvidia-compiler.so.1       libnvidia-ifr.so                     xorg
libEGL_nvidia.so.390.30    libGLX_indirect.so.0           libnvidia-compiler.so.390.30  libnvidia-ifr.so.1

Start driver installer daemonset

Replaced all references to /home/kubernetes/bin/nvidia with /usr/lib/nvidia-390 in https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/ubuntu/daemonset.yaml:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - key: "nvidia.com/gpu"
        effect: "NoSchedule"
        operator: "Exists"
      hostNetwork: true
      hostPID: true
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: nvidia-install-dir-host
        hostPath:
          path: /usr/lib/nvidia-390
      - name: root-mount
        hostPath:
          path: /
      initContainers:
      - image: gcr.io/google-containers/ubuntu-nvidia-driver-installer@sha256:7ffaf40fcf6bcc5bc87501b6be295a47ce74e1f7aac914a9f3e6c6fb8dd780a4
        name: nvidia-driver-installer
        resources:
          requests:
            cpu: 0.5
            memory: 512Mi
        securityContext:
          privileged: true
        env:
          - name: NVIDIA_INSTALL_DIR_HOST
            value: /usr/lib/nvidia-390
          - name: NVIDIA_INSTALL_DIR_CONTAINER
            value: /usr/local/nvidia
          - name: ROOT_MOUNT_DIR
            value: /root
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        - name: dev
          mountPath: /dev
        - name: root-mount
          mountPath: /root
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause

Installer gets stuck

The installer gets stuck for hours on the line "Updating container's ld cache...". FWIW it also gets stuck on this line if the NVIDIA driver is not installed on the host.

kubectl logs -f nvidia-driver-installer-88lqp --namespace=kube-system -c nvidia-driver-installer

+ NVIDIA_DRIVER_VERSION=384.111
+ NVIDIA_DRIVER_DOWNLOAD_URL_DEFAULT=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_DRIVER_DOWNLOAD_URL=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALL_DIR_HOST=/usr/lib/nvidia-390
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
++ basename https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALLER_RUNFILE=NVIDIA-Linux-x86_64-384.111.run
+ ROOT_MOUNT_DIR=/root
+ set +x
Downloading kernel sources...
Get:1 http://security.ubuntu.com/ubuntu xenial-security InRelease [102 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:3 http://security.ubuntu.com/ubuntu xenial-security/universe Sources [72.8 kB]
Get:4 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [584 kB]
Get:5 http://security.ubuntu.com/ubuntu xenial-security/restricted amd64 Packages [12.7 kB]
Get:6 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [403 kB]
Get:7 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 Packages [3486 B]
Get:8 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [102 kB]
Get:9 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [102 kB]
Get:10 http://archive.ubuntu.com/ubuntu xenial/universe Sources [9802 kB]
Get:11 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages [1558 kB]
Get:12 http://archive.ubuntu.com/ubuntu xenial/restricted amd64 Packages [14.1 kB]
Get:13 http://archive.ubuntu.com/ubuntu xenial/universe amd64 Packages [9827 kB]
Get:14 http://archive.ubuntu.com/ubuntu xenial/multiverse amd64 Packages [176 kB]
Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/universe Sources [240 kB]
Get:16 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [951 kB]
Get:17 http://archive.ubuntu.com/ubuntu xenial-updates/restricted amd64 Packages [13.1 kB]
Get:18 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [760 kB]
Get:19 http://archive.ubuntu.com/ubuntu xenial-updates/multiverse amd64 Packages [18.5 kB]
Get:20 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [5153 B]
Get:21 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [7168 B]
Fetched 25.0 MB in 2s (9368 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  linux-gcp-headers-4.13.0-1007
The following NEW packages will be installed:
  linux-gcp-headers-4.13.0-1007 linux-headers-4.13.0-1007-gcp
0 upgraded, 2 newly installed, 0 to remove and 46 not upgraded.
Need to get 11.5 MB of archives.
After this operation, 84.5 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 linux-gcp-headers-4.13.0-1007 all 4.13.0-1007.10 [10.7 MB]
Get:2 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 linux-headers-4.13.0-1007-gcp amd64 4.13.0-1007.10 [726 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 11.5 MB in 1s (6909 kB/s)
Selecting previously unselected package linux-gcp-headers-4.13.0-1007.
(Reading database ... 9482 files and directories currently installed.)
Preparing to unpack .../linux-gcp-headers-4.13.0-1007_4.13.0-1007.10_all.deb ...
Unpacking linux-gcp-headers-4.13.0-1007 (4.13.0-1007.10) ...
Selecting previously unselected package linux-headers-4.13.0-1007-gcp.
Preparing to unpack .../linux-headers-4.13.0-1007-gcp_4.13.0-1007.10_amd64.deb ...
Unpacking linux-headers-4.13.0-1007-gcp (4.13.0-1007.10) ...
Setting up linux-gcp-headers-4.13.0-1007 (4.13.0-1007.10) ...
Setting up linux-headers-4.13.0-1007-gcp (4.13.0-1007.10) ...
Downloading kernel sources... DONE.
Configuring installation directories...
/usr/local/nvidia /
Updating container's ld cache...

nvidia-gpu-device-plugin gets OOM killed

Hey folks,
In our experiments running GPU workloads on GKE over at iguazio, we've hit an OOM kill of the NVIDIA GPU device plugin pod during a GPU load test:

$ kubectl -n kube-system describe pod nvidia-gpu-device-plugin-ngkrv | grep OOM -A15
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 27 Jul 2021 11:15:02 +0300
      Finished:     Tue, 27 Jul 2021 16:49:32 +0300
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     50m
      memory:  20Mi
    Requests:
      cpu:     50m
      memory:  20Mi
    Environment:
      LD_LIBRARY_PATH:  /usr/local/nvidia/lib64
    Mounts:
      /dev from dev (rw)

This happened on Ubuntu nodes with GPUs (n1-standard-16, though I don't think it matters 😄), running GKE 1.19.9-gke.1900.

I suspect the resources allocated for it (memory limits) might not be enough. Maybe up it to 40Mi?
We're using the documented way of installing it, as described in https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu
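
A hedged sketch of the suggested bump (the surrounding DaemonSet layout, and whether requests should be raised along with limits, are assumptions):

    resources:
      requests:
        cpu: 50m
        memory: 40Mi
      limits:
        cpu: 50m
        memory: 40Mi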

Is the jupyter-notebook Dockerfile in the examples directory up to date?

Hello,

I am trying to build the jupyter-notebook image based on Dockerfile.cpu in the examples directory, but got the following error:

Package libav-tools is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  ffmpeg

E: Package 'libav-tools' has no installation candidate
The command '/bin/sh -c apt-get update && apt-get install -yq --no-install-recommends     apt-transport-https     build-essential     bzip2     ca-certificates     curl     emacs     fonts-liberation     g++     git     inkscape     jed     libav-tools     libcupti-dev     libsm6     libxext-dev     libxrender1     lmodern     locales     lsb-release     openssh-client     pandoc     pkg-config     python     python-dev     sudo     unzip     vim     wget     zip     zlib1g-dev     && apt-get clean &&     rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100

Changing libav-tools to ffmpeg fixes this error. So is the Dockerfile up to date or not?
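
The fix described above amounts to swapping the obsolete package name in the Dockerfile's apt-get line, roughly (the elided package list is unchanged):

apt-get update && apt-get install -yq --no-install-recommends \
    apt-transport-https build-essential ... ffmpeg ... zlib1g-dev \
  && apt-get clean && rm -rf /var/lib/apt/lists/*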

CUDA unknown error when checking torch.cuda.is_available

I'm running torchserve in GKE and I've installed the nvidia-driver-installer according to the torchserve gpu installation instructions for GKE.

Unfortunately, after a recent reboot of a Kubernetes GPU node, my torchserve models failed to start. At startup they check whether a GPU is available, which results in the following error:

>>> torch.cuda.is_available()
/home/venv/lib/python3.8/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0

For context, I'm running pytorch 1.11.0+cu102:

>>> torch.__version__
'1.11.0+cu102'

I forgot to copy-paste this part, but when I checked the cuda version via cat /usr/local/cuda/version.txt it was 10.2. I don't recall the patch version.

I was able to resolve the issue by scaling my gpu node pool down to zero, and then rescaling it back up. Luckily, this issue impacted a node in our staging cluster, but it could just as easily have been a production node so it would be great to understand what went wrong here.

device plugin to emit metrics?

Just a question: is the device plugin responsible for emitting metrics for, say, allocation, or is this handled by the kubelet? cAdvisor previously had accelerator metrics, which are now deprecated in k8s 1.11.

From what I can see, the kubelet currently only has the kubelet_device_plugin_alloc_latency_microseconds metric.

Would you be willing to accept a PR for exposing a metrics endpoint?

Consider adding cuda libraries to path for container mounts

It would be helpful if we can copy or mount in other paths that CUDA (and by extension TensorFlow) need to run CUDA jobs.

Tensorflow will look for libraries like libcupti within LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64.

I'm not sure how GKE handles CUDA availability. I think it might be sufficient to manually copy the libraries from /usr/local/cuda/extras/CUPTI/lib64, which contains libcupti.so, libcupti.so.9.0, and libcupti.so.9.0.176.

Few options:

  • Simply document the need to copy libs to the correct path
  • Collect the necessary CUDA libraries at nvidia-installer runtime and copy them at install time (this assumes you installed CUDA on the host)
  • Add a new flag that can be used to mount CUDA libraries from the desired host-path

I can submit a PR for this if anyone finds it useful.
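As an illustration of the third option, a pod-level mount could look like the sketch below. The host path assumes CUDA (including CUPTI) is actually present on the node, which is not the case on stock GKE nodes, so this only shows the shape of the change; the container name is a placeholder.

# Hypothetical pod spec fragment; container name, host path, and versions are assumptions.
spec:
  containers:
  - name: trainer
    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64:/usr/local/cuda/extras/CUPTI/lib64
    volumeMounts:
    - name: cupti
      mountPath: /usr/local/cuda/extras/CUPTI/lib64
  volumes:
  - name: cupti
    hostPath:
      path: /usr/local/cuda/extras/CUPTI/lib64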

Init container erroring on 1.13.10-gke.0 Clusters

Cluster Version: 1.13.10-gke.0
OS: COS

Accelerator: K80

Config used:

https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Kube node version: 1.13.10-gke.0

nvidia-driver-installer init container fails

Failure logs:

2019-10-24T19:28:42.317771155Z [INFO    2019-10-24 19:28:42 UTC] Downloading prebuilt toolchain from https://storage.googleapis.com/chromiumos-sdk/2019/01/x86_64-cros-linux-gnu-2019.01.23.135701.tar.xz
2019-10-24T19:28:43.911019528Z real	0m1.592s
2019-10-24T19:28:43.911026583Z user	0m0.196s
2019-10-24T19:28:43.911030343Z sys	0m0.354s
2019-10-24T19:29:03.315272336Z /
2019-10-24T19:29:03.316660945Z [INFO    2019-10-24 19:29:03 UTC] Configuring environment variables for cross-compilation
2019-10-24T19:29:03.319151303Z [INFO    2019-10-24 19:29:03 UTC] Configuring installation directories
2019-10-24T19:29:03.319981075Z /usr/local/nvidia /
2019-10-24T19:29:03.350794019Z [INFO    2019-10-24 19:29:03 UTC] Updating container's ld cache
2019-10-24T19:29:03.609057980Z /
2019-10-24T19:29:03.610496603Z [INFO    2019-10-24 19:29:03 UTC] Configuring kernel sources
2019-10-24T19:29:03.610724643Z /build/usr/src/linux /
2019-10-24T19:29:04.180880974Z   HOSTCC  scripts/basic/fixdep
2019-10-24T19:29:04.507015722Z /bin/sh: 1: scripts/basic/fixdep: Permission denied
2019-10-24T19:29:04.507255541Z scripts/Makefile.host:102: recipe for target 'scripts/basic/fixdep' failed
2019-10-24T19:29:04.507305940Z make[1]: *** [scripts/basic/fixdep] Error 126
2019-10-24T19:29:04.507317478Z make[1]: *** Deleting file 'scripts/basic/fixdep'
2019-10-24T19:29:04.507643333Z make: *** [scripts_basic] Error 2
2019-10-24T19:29:04.507647588Z Makefile:464: recipe for target 'scripts_basic' failed

Getting X Insufficient nvidia.com/gpu error when scheduling pods

Summary:
Followed this guide to use GPU-enabled nodes in our existing cluster, but when we try to schedule pods we get an X Insufficient nvidia.com/gpu error.

Details:
We are trying to use GPUs in our existing cluster and have created a node pool with a single GPU-enabled node.

Then, as the next step according to the guide above, we created the daemonset, and it runs successfully.

But now, when we try to schedule a pod using the following resource section, the pod becomes unschedulable with the error X insufficient nvidia.com/gpu:

requests:
    limits:
       nvidia.com/gpu: "1"
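(Note that in the snippet above, requests: appears to wrap limits:, which is not how a container's resources block is structured; requests and limits are siblings under resources, and for GPUs the request is normally specified under limits. Whether or not that is the root cause here, a correctly nested GPU request looks like the sketch below; the container name and image are placeholders.)

# Minimal sketch of a container requesting one GPU.
containers:
- name: cuda-test        # placeholder
  image: IMAGE           # placeholder
  resources:
    limits:
      nvidia.com/gpu: "1"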

Specs:
Node version - v1.18.17-gke.700 (+ v1.17.17-gke.6000) tried on both
Instance type - n1-standard-4
image - cos
GPU - NVIDIA Tesla T4

Is it okay to change the Nvidia driver version to something other than the one given by the daemonset mentioned in the GPU for Kubernetes cluster docs?

We are currently using the nvidia driver daemonset given in https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Unfortunately it installs the 410.79 nvidia driver, which is compatible with Kubernetes 1.12.x. We want to upgrade the driver to 440.x and also upgrade our CUDA libraries (currently cuda-10.0); our GPU hardware is a Tesla T4 with tensor cores. We are already in production, but some of our TensorFlow code does not run. We use tensorflow-gpu==2.0.0. The most surprising fact is that the same code runs perfectly on a local computer with GPU support (NVIDIA GTX 1060 and 1080). It has become hard to track down what caused the problem.

Request to provide Dockerfile source code for Nvidia driver installation on COS

Would it be possible for repo maintainers to provide the Dockerfile and any scripts used to generate the image by this daemonset? https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-nvidia-v450.yaml

The reason for this request is, I'd like to install a specific version (470.57.02) of Nvidia drivers on a GKE cluster running container-optimized OS with containerd. The official GKE documentation provides this daemonset, which installs an older driver version. I assume daemonset-nvidia-v450.yaml in this repo can be modified to install a specific driver, by changing this line to an appropriate image:

      - image: gcr.io/cos-cloud/cos-gpu-installer@sha256:93f1abf0d6a27e14bebf43ffb00b8d819b20f6027012ad73306ba670bcac6c83

However, I cannot find the source code for this image, so it is not clear how I can install a different Nvidia driver version.

For example, for GKE ubuntu images, this repo provides the Dockerfile and entrypoint.sh source code. Would it be possible to share the COS equivalent?

nvidia-driver-installer fails to install drivers for G2 instance type with L4

Description

I am trying to use G2 with L4 GPU in GKE

GKE Control Plane version v1.24.12-gke.500
Node pool version v1.24.9-gke.3200

Based on the documentation, these are the requirements for installing L4 GPUs in Kubernetes.

Requirements

L4 GPUs:

  • You must use GKE version 1.22.17-gke.5400 or later.
  • You must ensure that you have enough quota for the underlying G2 Compute Engine machine type to use L4 GPUs.
  • The GKE version that you choose must include NVIDIA driver version 525 or later in Container-Optimized OS. If driver version 525 or later isn't the default or the latest version in your GKE version, you must manually install a supported driver on your nodes.

Daemonset drivers

For this I have configured a COS-based g2-standard-12 node pool, which includes an L4 GPU by default, and deployed it in my cluster.

I made sure to install the drivers mentioned in the documentation:

>kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Then I noticed the pods are in a CrashLoopBackOff state:

>kubectl -n kube-system get pods  -l k8s-app=nvidia-driver-installer 
NAME                            READY   STATUS                  RESTARTS      AGE
nvidia-driver-installer-8s57b   0/1     Init:CrashLoopBackOff   5 (49s ago)   6m19s
nvidia-driver-installer-g55lh   0/1     Init:CrashLoopBackOff   5 (52s ago)   6m24s

Logs

>kubectl -n kube-system logs -f -c nvidia-driver-installer -l k8s-app=nvidia-driver-installer  
E0710 16:10:43.531586   12050 utils.go:355] 
E0710 16:10:43.552776   12050 utils.go:355] 
E0710 16:10:52.897020   11343 utils.go:355] 
E0710 16:10:52.917705   11343 utils.go:355] 
E0710 16:10:52.917725   11343 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:10:43.552800   12050 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:10:43.552804   12050 utils.go:355] 
I0710 16:10:43.553060   12050 installer.go:272] Done linking drivers
I0710 16:10:43.721063   12050 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:10:43.722880   12050 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:10:43.722904   12050 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:10:43.724544   12050 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
E0710 16:10:44.054791   12050 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1
E0710 16:10:52.917729   11343 utils.go:355] 
I0710 16:10:52.917962   11343 installer.go:272] Done linking drivers
I0710 16:10:53.094929   11343 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:10:53.096824   11343 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:10:53.096845   11343 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:10:53.098248   11343 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
E0710 16:10:53.419759   11343 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1

Latest Daemonset drivers

I then installed the latest daemonset

>kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

just to see more or less the same errors in the driver installer

>kubectl -n kube-system logs -f -c nvidia-driver-installer -l k8s-app=nvidia-driver-installer  
I0710 16:15:18.143027   12652 installer.go:175] Linking drivers...
I0710 16:15:18.143081   12652 installer.go:200] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0710 16:15:18.067520   13595 installer.go:175] Linking drivers...
I0710 16:15:18.067586   13595 installer.go:200] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0710 16:15:18.255090   12652 installer.go:211] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0710 16:15:18.270294   12652 installer.go:234] Done linking drivers
I0710 16:15:18.181461   13595 installer.go:211] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0710 16:15:18.196234   13595 installer.go:234] Done linking drivers
I0710 16:15:18.366874   13595 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:18.368433   13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:18.368451   13595 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:18.369846   13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
I0710 16:15:18.889544   13595 install.go:329] Failed to load kernel module, err: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1. Retrying driver installation with legacy linking
I0710 16:15:18.889577   13595 installer.go:303] Running GPU driver installer
I0710 16:15:18.440314   12652 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:18.441901   12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:18.441918   12652 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:18.443135   12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
I0710 16:15:18.953207   12652 install.go:329] Failed to load kernel module, err: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1. Retrying driver installation with legacy linking
I0710 16:15:18.953238   12652 installer.go:303] Running GPU driver installer
I0710 16:15:28.274192   13595 installer.go:143] Extracting precompiled artifacts...
I0710 16:15:28.313699   12652 installer.go:143] Extracting precompiled artifacts...
I0710 16:15:28.466472   13595 installer.go:170] Done extracting precompiled artifacts
I0710 16:15:28.466501   13595 installer.go:239] Linking drivers using legacy method...
I0710 16:15:28.466554   13595 installer.go:270] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license]
I0710 16:15:28.505712   12652 installer.go:170] Done extracting precompiled artifacts
I0710 16:15:28.505737   12652 installer.go:239] Linking drivers using legacy method...
I0710 16:15:28.505774   12652 installer.go:270] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license]
E0710 16:15:29.007506   13595 utils.go:355] 
E0710 16:15:29.008106   13595 utils.go:355] ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
E0710 16:15:29.008128   13595 utils.go:355] 
E0710 16:15:29.008136   13595 utils.go:355] Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/usr/local/nvidia/nvidia-installer.log' for more information.
E0710 16:15:29.008143   13595 utils.go:355] 
E0710 16:15:29.030758   13595 utils.go:355] 
E0710 16:15:29.030788   13595 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:15:29.030792   13595 utils.go:355] 
I0710 16:15:29.031035   13595 installer.go:272] Done linking drivers
E0710 16:15:29.034164   12652 utils.go:355] 
E0710 16:15:29.034910   12652 utils.go:355] ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
E0710 16:15:29.034929   12652 utils.go:355] 
E0710 16:15:29.034937   12652 utils.go:355] Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/usr/local/nvidia/nvidia-installer.log' for more information.
E0710 16:15:29.034942   12652 utils.go:355] 
E0710 16:15:29.056397   12652 utils.go:355] 
E0710 16:15:29.056422   12652 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 16:15:29.056427   12652 utils.go:355] 
I0710 16:15:29.056712   12652 installer.go:272] Done linking drivers
I0710 16:15:29.227756   13595 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:29.229349   13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:29.229375   13595 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:29.230874   13595 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
I0710 16:15:29.230297   12652 modules.go:71] Loading gpu-key to keyring %keyring:.secondary_trusted_keys
I0710 16:15:29.231998   12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.secondary_trusted_keys.
I0710 16:15:29.232023   12652 modules.go:71] Loading gpu-key to keyring %keyring:.ima
I0710 16:15:29.233532   12652 modules.go:83] Successfully load key gpu-key into keyring %keyring:.ima.
E0710 16:15:29.575466   13595 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1
E0710 16:15:29.580685   12652 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1

(The logs are duplicated because there are two nodes and therefore two nvidia-driver-installer pods.)

Daemonset 525 drivers (installed manually)

I have also tried fetching the latest daemonset locally and editing it to install a specific version of the 525 driver (I tried all of them; they all failed with the same error):

        command: ['/cos-gpu-installer', 'install', '--allow-unsigned-driver', '--nvidia-installer-url=https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/93/tesla/525_00/525.85.12/NVIDIA-Linux-x86_64-525.85.12_93-16623-341-8.cos']

and the driver installation failed again

>kubectl -n kube-system logs -f -c nvidia-driver-installer -l k8s-app=nvidia-driver-installer  
I0710 17:11:04.007041   37416 install.go:205] Installing GPU driver from "https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/97/tesla/525_00/525.105.17/NVIDIA-Linux-x86_64-525.105.17_97-16919-294-48.cos"
I0710 17:11:04.007442   37416 cos.go:31] Checking kernel module signing.
I0710 17:11:04.007462   37416 installer.go:106] Configuring driver installation directories
I0710 17:11:04.033106   37416 utils.go:88] Downloading toolchain_env from https://storage.googleapis.com/cos-tools/16919.235.1/toolchain_env
I0710 17:11:04.106217   37416 cos.go:73] Installing the toolchain
I0710 17:11:04.106338   37416 cos.go:89] Found existing toolchain. Skipping download and installation.
I0710 17:11:04.106353   37416 cos.go:102] Found existing kernel headers. Skipping download and installation.
I0710 17:11:04.106377   37416 utils.go:88] Downloading Unofficial GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/97/tesla/525_00/525.105.17/NVIDIA-Linux-x86_64-525.105.17_97-16919-294-48.cos
I0710 17:11:08.647335   37416 installer.go:303] Running GPU driver installer


I0710 17:11:20.902272   37416 installer.go:143] Extracting precompiled artifacts...
I0710 17:11:21.131986   37416 installer.go:170] Done extracting precompiled artifacts
I0710 17:11:21.132014   37416 installer.go:175] Linking drivers...
I0710 17:11:21.132068   37416 installer.go:200] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0710 17:11:21.282000   37416 installer.go:211] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.10.161+/scripts/module.lds -r -o /tmp/extract/kernel/precompiled/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0710 17:11:21.297718   37416 installer.go:234] Done linking drivers
I0710 17:11:21.485882   37416 install.go:329] Failed to load kernel module, err: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1. Retrying driver installation with legacy linking
I0710 17:11:21.485910   37416 installer.go:303] Running GPU driver installer


I0710 17:11:33.823939   37416 installer.go:143] Extracting precompiled artifacts...
I0710 17:11:34.042251   37416 installer.go:170] Done extracting precompiled artifacts
I0710 17:11:34.042281   37416 installer.go:239] Linking drivers using legacy method...
I0710 17:11:34.042330   37416 installer.go:270] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license]
E0710 17:11:34.155440   37416 utils.go:355] 
E0710 17:11:34.155925   37416 utils.go:355] ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed.  If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.
E0710 17:11:34.155940   37416 utils.go:355] 
E0710 17:11:34.155944   37416 utils.go:355] 
E0710 17:11:34.155948   37416 utils.go:355] ERROR: Installation has failed.  Please see the file '/usr/local/nvidia/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
E0710 17:11:34.155953   37416 utils.go:355] 
I0710 17:11:34.155986   37416 installer.go:272] Done linking drivers
E0710 17:11:34.209832   37416 install.go:356] failed to run GPU driver installer: failed to load GPU drivers: failed to load module /usr/local/nvidia/drivers/nvidia.ko: failed to load module nvidia (/usr/local/nvidia/drivers/nvidia.ko): failed to run command `insmod /usr/local/nvidia/drivers/nvidia.ko`: exit status 1

Conclusion

I didn't have such issues with other GPU types in the past. I have now switched to P100 on ubuntu but I am really interested in using G2 with L4 as it is a better fit for our use case.

Is there any way to have G2 with L4 GPU with a working driver in GKE with either Ubuntu or a COS image type?

cos daemonset-preloaded-latest fails with driver verification issue

I'm on a GKE 1.22 and tried to install nvidia driver 470 as per instruction by issuing the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

However, the installation fails. Here is a log excerpt:

Info   2022-05-20 06:14:58.805 EDT  Getting the latest GPU driver version
Info   2022-05-20 06:14:58.805 EDT  Downloading gpu_latest_version from https://storage.googleapis.com/cos-tools/16623.102.23/gpu_latest_version
Info   2022-05-20 06:14:58.849 EDT  Installing GPU driver version 470.82.01
Info   2022-05-20 06:14:58.849 EDT  map[BUILD_ID:16623.102.23 DRIVER_VERSION:470.82.01]
Info   2022-05-20 06:14:58.866 EDT  Loading gpu-key to secondary system keyring
Info   2022-05-20 06:14:58.868 EDT  Successfully load key gpu-key into secondary system keyring.
Info   2022-05-20 06:14:58.873 EDT  Verifying GPU driver installation
Error  2022-05-20 06:14:58.902 EDT  failed to verify GPU driver installation: failed to verify GPU driver installation: exit status 255

When asking for 1 GPU device, the GPU is exposed as /dev/nvidia1, not /dev/nvidia0

I have a server with two GPUs in it. When I ask for 1 GPU in a pod and no other GPU pod is running on the server, the correct device is exposed into the pod as /dev/nvidia0. But when one GPU pod is already running on that server and I schedule another pod requesting 1 GPU, the new pod gets a GPU device, but it is exposed as /dev/nvidia1 rather than /dev/nvidia0, and this causes issues for some of the underlying applications.

I am running k8s 1.10 and the device plugin image with hash 0842734032018be107fa2490c98156992911e3e1f2a21e059ff0105b07dd8e9e. Let me know if you need any more info, but to me the issue might be in this code.

nvidia-driver-installer failing to install cuda libraries on some pods

I have many pods running on the same cluster/node pool, which has the nvidia-driver-installer daemonset installed. A small fraction of them (~few percent) have workloads that fail due to missing libcuda.so.1. When I manually check, I find the /usr/local/nvidia directory is not present. See below, showing two pods -- one incorrectly installed, the other correctly installed.

Worst case, if this is not easily resolvable, is there some way to automatically detect and remove the pods/nodes that get setup incorrectly?

xxxx@cloudshell:~ (xxx)$ kubectl exec dask-worker-38dee33fc0e144d0bb7d34cdbefe3741-567h6 --namespace dask-gateway -c dask-worker -- ls /usr/local/nvidia            
ls: cannot access '/usr/local/nvidia': No such file or directory
command terminated with exit code 2
xxxx@cloudshell:~ (xxx)$ kubectl exec dask-worker-38dee33fc0e144d0bb7d34cdbefe3741-2p8kg --namespace dask-gateway -c dask-worker -- ls /usr/local/nvidia
NVIDIA-Linux-x86_64-418.67_77-12371-208-0.cos
bin
bin-workdir
drivers
drivers-workdir
lib64
lib64-workdir
nvidia-installer.log
share
vulkan

Nvidia driver failed while using cuda 10.0 in Kubernetes Cluster

We run Docker containers in GKE (Google Kubernetes Engine) 1.12.x with CUDA 10.0 and cuDNN > 7.6.5. The nvidia driver installed as per the docs through https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml is 410.79 on the cluster. But it looks like CUDA installs a different nvidia driver in the container, which mismatches the kernel module version. Running nvidia-smi in the container, we get:
Failed to initialize NVML: Driver/library version mismatch
How can we solve this issue?
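One way to confirm the mismatch (a diagnostic sketch, not a fix) is to compare the version of the loaded kernel module with the user-space driver libraries the container actually resolves:

# On the node (or from a privileged pod): version of the loaded kernel module.
cat /proc/driver/nvidia/version
# Inside the application container: which libnvidia-ml / libcuda the loader sees.
ldconfig -p | grep -E 'libnvidia-ml|libcuda'

If the container image ships its own copy of the driver user-space libraries (for example from a driver installed inside the image), they can shadow the 410.79 libraries mounted in by the device plugin and produce exactly this mismatch.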

ubuntu installer kernel module problems

I cannot get past the kernel module install step on a bare-metal machine. I have included the install log and further details:

  • K8s 1.9 installed with kubeadm
  • 4.4.0-116-generic #140-Ubuntu 16.04
  • Tesla K80
docker logs --follow k8s_nvidia-driver-installer_nvidia-driver-installer-zs54f_kube-system_c63262e5-217f-11e8-b744-06cf778dd7fd_0 
+ NVIDIA_DRIVER_VERSION=384.111
+ NVIDIA_DRIVER_DOWNLOAD_URL_DEFAULT=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_DRIVER_DOWNLOAD_URL=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALL_DIR_HOST=/home/kubernetes/bin/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
++ basename https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALLER_RUNFILE=NVIDIA-Linux-x86_64-384.111.run
+ ROOT_MOUNT_DIR=/root
+ set +x
Downloading kernel sources...
Get:1 http://security.ubuntu.com/ubuntu xenial-security InRelease [102 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:3 http://security.ubuntu.com/ubuntu xenial-security/universe Sources [73.2 kB]
Get:4 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [102 kB]
Get:5 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [587 kB]
Get:6 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [102 kB]
Get:7 http://archive.ubuntu.com/ubuntu xenial/universe Sources [9802 kB]
Get:8 http://security.ubuntu.com/ubuntu xenial-security/restricted amd64 Packages [12.7 kB]
Get:9 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [406 kB]
Get:10 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 Packages [3486 B]
Get:11 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages [1558 kB]
Get:12 http://archive.ubuntu.com/ubuntu xenial/restricted amd64 Packages [14.1 kB]
Get:13 http://archive.ubuntu.com/ubuntu xenial/universe amd64 Packages [9827 kB]
Get:14 http://archive.ubuntu.com/ubuntu xenial/multiverse amd64 Packages [176 kB]
Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/universe Sources [242 kB]
Get:16 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [955 kB]
Get:17 http://archive.ubuntu.com/ubuntu xenial-updates/restricted amd64 Packages [13.1 kB]
Get:18 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [765 kB]
Get:19 http://archive.ubuntu.com/ubuntu xenial-updates/multiverse amd64 Packages [18.5 kB]
Get:20 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [5153 B]
Get:21 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [7168 B]
Fetched 25.0 MB in 9s (2653 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  linux-headers-4.4.0-116
The following NEW packages will be installed:
  linux-headers-4.4.0-116 linux-headers-4.4.0-116-generic
0 upgraded, 2 newly installed, 0 to remove and 46 not upgraded.
Need to get 10.7 MB of archives.
After this operation, 78.2 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 linux-headers-4.4.0-116 all 4.4.0-116.140 [9922 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 linux-headers-4.4.0-116-generic amd64 4.4.0-116.140 [774 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 10.7 MB in 1s (7015 kB/s)
Selecting previously unselected package linux-headers-4.4.0-116.
(Reading database ... 9482 files and directories currently installed.)
Preparing to unpack .../linux-headers-4.4.0-116_4.4.0-116.140_all.deb ...
Unpacking linux-headers-4.4.0-116 (4.4.0-116.140) ...
Selecting previously unselected package linux-headers-4.4.0-116-generic.
Preparing to unpack .../linux-headers-4.4.0-116-generic_4.4.0-116.140_amd64.deb ...
Unpacking linux-headers-4.4.0-116-generic (4.4.0-116.140) ...
Setting up linux-headers-4.4.0-116 (4.4.0-116.140) ...
Setting up linux-headers-4.4.0-116-generic (4.4.0-116.140) ...
Downloading kernel sources... DONE.
Configuring installation directories...
/usr/local/nvidia /
Updating container's ld cache...
Updating container's ld cache... DONE.
/
Configuring installation directories... DONE.
Downloading Nvidia installer...
/usr/local/nvidia /
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 97.2M  100 97.2M    0     0  17.3M      0  0:00:05  0:00:05 --:--:-- 21.4M
/
Downloading Nvidia installer... DONE.
Running Nvidia installer...
/usr/local/nvidia /
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.111.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS
         will not function with this installation of the NVIDIA driver.


ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most
       frequently when this kernel module was built against the wrong or
       improperly configured kernel sources, with a version of gcc that
       differs from the one used to build the target kernel, or if a driver
       such as rivafb, nvidiafb, or nouveau is present and prevents the
       NVIDIA kernel module from obtaining ownership of the NVIDIA graphics
       device(s), or no NVIDIA GPU installed in this system is supported by
       this NVIDIA Linux graphics driver release.
       
       Please see the log entries 'Kernel module load error' and 'Kernel
       messages' at the end of the file
       '/usr/local/nvidia/nvidia-installer.log' for more information.


ERROR: Installation has failed.  Please see the file
       '/usr/local/nvidia/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available
       on the Linux driver download page at www.nvidia.com.

How to use -host-path

I'm trying to install this plugin on my kubernetes nodes whose Nvidia driver is located at /usr/lib/nvidia-384

I read the following instruction:

You can specify the directory on the host containing nvidia libraries using -host-path

But I don't know how to use it. Is there an example script or command?
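For reference, the flag is an argument to the plugin binary itself, so when the plugin runs as a daemonset it would be passed roughly as in the sketch below. Everything except the -host-path flag (the image, names, and layout) is an assumption about your manifest, so adapt it to the daemonset you actually deploy.

# Hypothetical fragment of the device-plugin daemonset pod spec.
containers:
- name: nvidia-gpu-device-plugin
  image: DEVICE_PLUGIN_IMAGE            # placeholder
  args:
  - "-host-path=/usr/lib/nvidia-384"    # directory on the host that contains the nvidia libraries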

How about fpga device plugin

Hi~
I found that this repo only contains the GPU device plugin; any chance it will get an FPGA plugin? If one does not exist yet, I wonder how we could contribute one.
Thanks

ubuntu driver install container crashed when nvidia.ko is already loaded.

In https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/ubuntu/entrypoint.sh there is an if statement that checks whether the nvidia kernel module is loaded. It looks like this (line 139):

  if ! lsmod | grep -q -w 'nvidia'; then
    insmod "${NVIDIA_INSTALL_DIR_CONTAINER}/drivers/nvidia.ko"
  fi
  if ! lsmod | grep -q -w 'nvidia_uvm'; then
    insmod "${NVIDIA_INSTALL_DIR_CONTAINER}/drivers/nvidia-uvm.ko"
  fi

In entrypoint.sh, set -o pipefail is set at line 17. This causes the check at line 139 to always be true, so the script always tries to load nvidia.ko even if it is already loaded, which causes the container to crash.

A solution could be to remove set -o pipefail. I can do a PR if you want.
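If removing pipefail is undesirable, another option (just a sketch, untested against the installer image) is to avoid the pipeline altogether, for example by reading /proc/modules directly:

# Alternative check that forms no pipeline, so `set -o pipefail` has no effect on it.
if ! grep -q -w 'nvidia' /proc/modules; then
  insmod "${NVIDIA_INSTALL_DIR_CONTAINER}/drivers/nvidia.ko"
fi
if ! grep -q -w 'nvidia_uvm' /proc/modules; then
  insmod "${NVIDIA_INSTALL_DIR_CONTAINER}/drivers/nvidia-uvm.ko"
fi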

nvidia-driver-installer failed to build the driver

We are following the instructions under https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/. We have minikube up and running and installed the addons. Neither of them comes up, though.

repro

Follow the tutorial. Check that the addons run successfully:

kubectl get pods -n kube-system

The init container of nvidia-driver-installer fails. Logs here:
https://gist.github.com/ensonic/e23518399f040307597aac1f84aa7d47

This does not help a lot, so we copied /usr/local/nvidia/nvidia-installer.log:
https://gist.github.com/ensonic/ae9e466993a85b641ac3f6467d9642fd

Additional information

The graphics card is an NVIDIA Corporation GP102 [TITAN Xp] (rev a1). It is unbound from the host via vfio-pci.
Minikube is 1.5.2.
The Linux kernel is 5.2.17, but the driver installer is using 4.19.76 (from the minikube kvm2 VM).

Can you pin to a driver version that builds?

Installation on ubuntu 18.04 LTS VM on google cloud fails on "Unable to locate package linux-headers-5.3.0-1029-gcp"

I deployed Kubernetes with Kubespray on a set of Ubuntu VMs on Google Cloud.
One of the workers has a GPU (Tesla K80).

When running the daemonset to install the Nvidia driver, I get the following error message:

+ NVIDIA_DRIVER_VERSION=384.111
+ NVIDIA_DRIVER_DOWNLOAD_URL_DEFAULT=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_DRIVER_DOWNLOAD_URL=https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALL_DIR_HOST=/home/kubernetes/bin/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
++ basename https://us.download.nvidia.com/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
+ NVIDIA_INSTALLER_RUNFILE=NVIDIA-Linux-x86_64-384.111.run
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
++ uname -r
+ KERNEL_VERSION=5.3.0-1029-gcp
+ set +x
Checking cached version
Cache file /usr/local/nvidia/.cache not found.
Downloading kernel sources...
Get:1 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB]
Get:3 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [107 kB]
Get:4 http://security.ubuntu.com/ubuntu xenial-security InRelease [109 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages [1558 kB]
Get:6 http://archive.ubuntu.com/ubuntu xenial/restricted amd64 Packages [14.1 kB]
Get:7 http://archive.ubuntu.com/ubuntu xenial/universe amd64 Packages [9827 kB]
Get:8 http://archive.ubuntu.com/ubuntu xenial/multiverse amd64 Packages [176 kB]
Get:9 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [1495 kB]
Get:10 http://archive.ubuntu.com/ubuntu xenial-updates/restricted amd64 Packages [13.1 kB]
Get:11 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [1032 kB]
Get:12 http://archive.ubuntu.com/ubuntu xenial-updates/multiverse amd64 Packages [19.7 kB]
Get:13 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [7942 B]
Get:14 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [8807 B]
Get:15 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [1131 kB]
Get:16 http://security.ubuntu.com/ubuntu xenial-security/restricted amd64 Packages [12.7 kB]
Get:17 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [629 kB]
Get:18 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 Packages [6679 B]
Fetched 16.5 MB in 1s (8682 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package linux-headers-5.3.0-1029-gcp
E: Couldn't find any package by glob 'linux-headers-5.3.0-1029-gcp'
E: Couldn't find any package by regex 'linux-headers-5.3.0-1029-gcp'

The GPU-worker node has the following specs:

System Info:
 Machine ID:                 8840fa09cb0c8bf2bb021033c01b5a14
 System UUID:                8840fa09-cb0c-8bf2-bb02-1033c01b5a14
 Boot ID:                    44a9e2ab-6039-4ceb-a6f6-a1372fac0fe5
 Kernel Version:             5.3.0-1029-gcp
 OS Image:                   Ubuntu 18.04.4 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://19.3.11
 Kubelet Version:            v1.18.3
 Kube-Proxy Version:         v1.18.3

When I check on the VM itself, the package does seem to be there:

svendegroote@worker-gpu-0:~$ apt search linux-headers | grep installed

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

linux-headers-5.3.0-1029-gcp/bionic-updates,bionic-security,now 5.3.0-1029.31~18.04.1 amd64 [installed]
linux-headers-gcp/bionic-updates,bionic-security,now 5.3.0.1029.23 amd64 [installed]

Information on the image from the Google Cloud console is shown in a screenshot (omitted here).

Any idea why the script does not seem to find the installed linux headers?

(I also logged this issue on the Kubespray repo: kubernetes-sigs/kubespray#6340)

Using single GPU with multiple containers

I was able to find some unofficial support for this in the case of the Nvidia plugin. Is there similar support for this plugin?

I'm interested in this feature and would be willing to implement it. I want to know if there are other people who would be interested in such a feature. And also if the maintainers of this repo would like to add such a feature.

Adding NUMA/TopologyManager support to gpu device plugin

Hi,

I’m interested in adding TopologyManager/NUMA support to your GPU device plugin.

I believe this is a case of

  1. Upgrade container-engine-accelerators/vendor/k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto to a newer version that includes TopologyInfo
  2. Find a way of mapping paths under /dev to paths under /sys (for example map /dev/nvidia0 to /sys/devices/pci0000:00/0000:00:05.0)
  3. Read file such as /sys/devices/pci0000:00/0000:00:05.0/numa_node and return over protobuf

Steps 1 and 3 should be easy enough. Step 2 is harder because of the need to get the PCI Id for a device.

For device->PCI ID mapping, the options I'm aware of are:

  1. It would be great if we could do the mapping just by looking under /sys. This is easy for disks (just look under /sys/block) but doesn’t appear possible for GPUs (I’d love to be proven wrong here).
  2. Make use of NVML’s nvmlDeviceGetPciInfo function. Making the device plugin use NVML has already been attempted at https://github.com/GoogleCloudPlatform/container-engine-accelerators/pull/52/files, but this PR was never merged. If we could get this PR merged, adding a call to nvmlDeviceGetPciInfo would be trivial.
  3. Run nvidia-smi then parse the output.

What are your thoughts? It would be great to agree a design before I start work on a new PR.
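As a rough sketch of option 3 (assuming nvidia-smi is available where the plugin runs, and noting that nvidia-smi prints bus IDs with an 8-digit, upper-case PCI domain that needs normalizing before use as a sysfs path):

# Hypothetical sketch: map each GPU index to its NUMA node via nvidia-smi + sysfs.
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader |
while IFS=', ' read -r idx busid; do
  addr=$(echo "$busid" | sed 's/^0000//' | tr 'A-F' 'a-f')   # e.g. 00000000:00:05.0 -> 0000:00:05.0
  echo "GPU $idx -> NUMA node $(cat /sys/bus/pci/devices/$addr/numa_node)"
done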

Kind regards,

Rob
