What I tried
On Ubuntu 24.04, I have installed Kubernetes v1.34.3 and containerd v1.7.28. The GPU is an NVIDIA GeForce RTX 5090. The NVIDIA driver, nvidia-container-toolkit, and containerd are all installed and configured.
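For reference, containerd and the toolkit were wired together with roughly the standard steps below (a sketch from memory, not a verbatim transcript of what I ran):
# register the nvidia runtime in /etc/containerd/config.toml, then restart containerd
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd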
Env
nvidia driver
root@gpu-node-5090-1:~/proxy# nvidia-smi
Sun Dec 21 17:39:34 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03 Driver Version: 570.195.03 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:02:00.0 Off | N/A |
| 0% 35C P8 11W / 575W | 15MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2725 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
root@gpu-node-5090-1:~/proxy#
nvidia-container-toolkit version
root@gpu-node-5090-1:~/proxy# nvidia-container-cli --version
cli-version: 1.18.1
lib-version: 1.18.1
build date: 2025-11-24T14:45+00:00
build revision: 889a3bb5408c195ed7897ba2cb8341c7d249672f
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
containerd
root@gpu-node-5090-1:~/proxy# containerd version
INFO[2025-12-21T17:40:26.724003653+08:00] starting containerd revision= version=1.7.28
k8s
root@gpu-node-5090-1:~/proxy# kubectl version
Client Version: v1.34.3
Kustomize Version: v5.7.1
Server Version: v1.34.3
nvidia-device-plugin 0.17.1
What happened
Containers launched with the ctr command can run nvidia-smi inside without any problem.
root@gpu-node-5090-1:~/k8s-test# ctr -n k8s.io run --rm --runtime io.containerd.runc.v2 --gpus 0 docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04 cuda-test nvidia-smi
Fri Dec 19 15:17:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03 Driver Version: 570.195.03 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:02:00.0 Off | N/A |
| 0% 35C P8 11W / 575W | 15MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
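As far as I understand, ctr run --gpus 0 injects the GPU through the toolkit's prestart hook directly, so it can pass even when CRI-created containers do not go through the nvidia runtime. A closer equivalent test (a sketch, assuming the runtime binary sits at /usr/bin/nvidia-container-runtime) would be:
ctr -n k8s.io run --rm --runtime io.containerd.runc.v2 \
  --runc-binary /usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04 cuda-test-runtime nvidia-smi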
In Kubernetes, however, the nvidia-device-plugin pod runs normally, but its logs show that it fails to detect the GPU.
NVIDIA Device Plugin pod is running
root@gpu-node-5090-1:~/proxy# kubectl -n kube-system get po |grep nvidia
nvidia-device-plugin-daemonset-4ckwp 1/1 Running 0 42h
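As far as I can tell, the node also never advertises the GPU resource; this can be checked with something like the following (node name is from my cluster):
kubectl describe node gpu-node-5090-1 | grep -i 'nvidia.com/gpu'
kubectl get node gpu-node-5090-1 -o jsonpath='{.status.allocatable}'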
The plugin logs look like this:
I1219 12:04:10.062602 1 main.go:235] "Starting NVIDIA Device Plugin" version=<
3c378193
commit: 3c378193fcebf6e955f0d65bd6f2aeed099ad8ea
>
I1219 12:04:10.062626 1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1219 12:04:10.062648 1 main.go:245] Starting OS watcher.
I1219 12:04:10.063447 1 main.go:260] Starting Plugins.
I1219 12:04:10.063460 1 main.go:317] Loading configuration.
I1219 12:04:10.063674 1 main.go:342] Updating config with default resource matching patterns.
I1219 12:04:10.063800 1 main.go:353]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1219 12:04:10.063806 1 main.go:356] Retrieving plugins.
E1219 12:04:10.063872 1 factory.go:112] Incompatible strategy detected auto
E1219 12:04:10.063875 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1219 12:04:10.063877 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1219 12:04:10.063880 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1219 12:04:10.063882 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I1219 12:04:10.063885 1 main.go:381] No devices found. Waiting indefinitely.
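Since the error lines above point at the NVIDIA Container Toolkit / runtime configuration, I suppose the next things to check are how containerd hands the runtime to CRI-created containers, roughly along these lines (a sketch; I am not sure this is the right direction):
# which runtime CRI-created containers get by default
containerd config dump | grep -E 'default_runtime_name|nvidia'
# whether a RuntimeClass exists that the plugin DaemonSet could reference via runtimeClassName
kubectl get runtimeclass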
How should I resolve this issue?