
NVIDIA Device Plugin Fails to Detect GPU Despite Proper Environment Configuration: "Incompatible strategy detected auto" #1574

@UrmsOne

Description


What I tried

On Ubuntu 24.04, I installed Kubernetes v1.34.3 and containerd v1.7.28. The GPU is an NVIDIA GeForce RTX 5090. The NVIDIA driver, nvidia-container-toolkit, and containerd are all installed and configured.
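
For reference, containerd was configured for the toolkit roughly as in the official instructions. This is only a sketch from memory; I am not completely sure every detail (in particular the default-runtime part) matches what is actually on my node:

# Register the nvidia runtime with containerd and make it the default runtime
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd

# Afterwards /etc/containerd/config.toml should contain something like:
#   [plugins."io.containerd.grpc.v1.cri".containerd]
#     default_runtime_name = "nvidia"
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#     BinaryName = "/usr/bin/nvidia-container-runtime"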

Env

nvidia driver

root@gpu-node-5090-1:~/proxy# nvidia-smi
Sun Dec 21 17:39:34 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:02:00.0 Off |                  N/A |
|  0%   35C    P8             11W /  575W |      15MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2725      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+
root@gpu-node-5090-1:~/proxy#

nvidia-container-toolkit version

root@gpu-node-5090-1:~/proxy# nvidia-container-cli --version
cli-version: 1.18.1
lib-version: 1.18.1
build date: 2025-11-24T14:45+00:00
build revision: 889a3bb5408c195ed7897ba2cb8341c7d249672f
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64

containerd

root@gpu-node-5090-1:~/proxy# containerd version
INFO[2025-12-21T17:40:26.724003653+08:00] starting containerd                           revision= version=1.7.28

k8s

root@gpu-node-5090-1:~/proxy# kubectl version
Client Version: v1.34.3
Kustomize Version: v5.7.1
Server Version: v1.34.3

nvidia-device-plugin 0.17.1
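
The plugin was deployed as a DaemonSet with the static manifest from the k8s-device-plugin repository, along these lines (a sketch; the exact manifest path and tag are from memory and may not be exact):

# Static deployment of the device plugin (path/tag assumed; see the repo README)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml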

What I observed

Containers launched with the ctr command can run nvidia-smi normally inside:

root@gpu-node-5090-1:~/k8s-test# ctr -n k8s.io run  --rm    --runtime io.containerd.runc.v2     --gpus 0   docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04     cuda-test     nvidia-smi
Fri Dec 19 15:17:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:02:00.0 Off |                  N/A |
|  0%   35C    P8             11W /  575W |      15MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
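
As far as I understand, ctr run --gpus injects the GPU through containerd's nvidia contrib option (which calls nvidia-container-cli directly), so this test may not go through the nvidia runtime handler that kubelet/CRI pods use. A rough sketch of how the CRI-side runtime configuration can be checked instead (section names assumed for containerd 1.7, config version 2):

# Does the CRI plugin know about an "nvidia" runtime, and is it the default?
grep -A4 'runtimes.nvidia' /etc/containerd/config.toml
grep 'default_runtime_name' /etc/containerd/config.toml

# crictl can also dump the runtime configuration that kubelet actually sees
crictl info | grep -i -A3 nvidia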

In Kubernetes, however, the nvidia-device-plugin pod itself runs normally, but its logs indicate that it fails to detect the GPU.

The NVIDIA device plugin pod is running:

root@gpu-node-5090-1:~/proxy# kubectl -n kube-system get po  |grep nvidia
nvidia-device-plugin-daemonset-4ckwp      1/1     Running   0          42h

But its logs show:

I1219 12:04:10.062602       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
	3c378193
	commit: 3c378193fcebf6e955f0d65bd6f2aeed099ad8ea
 >
I1219 12:04:10.062626       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1219 12:04:10.062648       1 main.go:245] Starting OS watcher.
I1219 12:04:10.063447       1 main.go:260] Starting Plugins.
I1219 12:04:10.063460       1 main.go:317] Loading configuration.
I1219 12:04:10.063674       1 main.go:342] Updating config with default resource matching patterns.
I1219 12:04:10.063800       1 main.go:353]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1219 12:04:10.063806       1 main.go:356] Retrieving plugins.
E1219 12:04:10.063872       1 factory.go:112] Incompatible strategy detected auto
E1219 12:04:10.063875       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1219 12:04:10.063877       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1219 12:04:10.063880       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1219 12:04:10.063882       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I1219 12:04:10.063885       1 main.go:381] No devices found. Waiting indefinitely.

How should I resolve this issue?
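
In particular, is the likely cause that the device plugin pod itself is not started with the nvidia runtime? If containerd's default runtime stays runc, would I need a RuntimeClass plus runtimeClassName on the DaemonSet, along these lines (a sketch, assuming the containerd handler is named "nvidia")?

# Create a RuntimeClass that maps to the containerd handler (handler name assumed)
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# ...and point the device plugin DaemonSet at it
kubectl -n kube-system patch ds nvidia-device-plugin-daemonset \
  --type merge -p '{"spec":{"template":{"spec":{"runtimeClassName":"nvidia"}}}}'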
