Problem Description
Summary
The automatic VFIO binding feature for PF-Passthrough mode does not work in GPU Operator v1.4.0/v1.4.1. The worker pod completes successfully but GPUs remain bound to the amdgpu driver instead of vfio-pci.
Environment
- GPU Operator Version: v1.4.0 / v1.4.1
- GPU Model: AMD Instinct MI210 (Device ID: 740f)
- OS: Ubuntu
Expected Behavior
When driverType: pf-passthrough is configured, the GPU Operator should:
- Launch a worker pod to bind GPUs to vfio-pci
- Successfully execute vfio_bind.sh
- Leave the GPUs bound to the vfio-pci driver
Actual Behavior
- Worker pod starts and exits with code 0 (success)
- vfio.ready label is set on the node
- GPUs remain bound to the amdgpu driver (no actual binding occurred)
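The mismatch between the node label and the real binding can be confirmed from sysfs, where each PCI device exposes a `driver` symlink pointing at the driver it is currently bound to. The following is a minimal sketch, not part of the operator: `bound_driver` and the `SYSFS_ROOT` override are hypothetical names introduced here so the check can also be run against a fake directory tree, and the PCI address in the usage comment is a placeholder.

```shell
# Print the name of the driver a PCI device is bound to, or "none".
# SYSFS_ROOT is a hypothetical override (not used by the operator);
# it defaults to the real /sys.
bound_driver() {
    local bdf="$1"
    local link="${SYSFS_ROOT:-/sys}/bus/pci/devices/${bdf}/driver"
    if [ -L "$link" ]; then
        # "driver" is a symlink; the basename of its target is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Usage (placeholder address; substitute the GPU's own PCI address):
#   bound_driver 0000:03:00.0
```

On an affected node this would be expected to print amdgpu rather than vfio-pci, matching the behavior described above.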
Root Cause Analysis
We identified two bugs causing this silent failure:
Bug 1: Missing lspci in gpu-operator-utils image
The vfio_bind.sh script requires lspci to discover GPU devices:
LSPCI_OUTPUT=$(lspci -nn -d 1002:${PRODUCT_CODE})
However, rocm/gpu-operator-utils:v1.4.0 does not include pciutils.
Result: lspci: command not found
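Besides adding pciutils to the image, the script could refuse to continue when a required tool is missing. A failed command substitution such as the `lspci` call above only aborts a shell script if errexit is enabled or the status is checked explicitly; whether vfio_bind.sh does either is not shown in this report, so the following is a sketch under that assumption. `require_cmd` is a hypothetical helper, not an existing function in the script.

```shell
# Fail loudly when a required tool is missing instead of continuing.
# require_cmd is a hypothetical helper, not part of vfio_bind.sh.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "ERROR: required command '$1' not found" >&2
        return 1
    fi
}

# Intended use near the top of the script, before any device discovery:
#   require_cmd lspci || exit 1
```

With a guard like this, a missing lspci would make the worker pod exit non-zero, so the failure would surface in the pod status instead of being reported as success.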
Bug 2: Syntax error in vfio_bind.sh (line 30)
Current (incorrect: missing spaces after '[' and before ']')
if [-e "/sys/bus/pci/devices/$VFIO_DEVICE/driver/unbind"]; then
Correct
if [ -e "/sys/bus/pci/devices/$VFIO_DEVICE/driver/unbind" ]; then
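To illustrate why this typo fails quietly: without the space, the shell looks up `[-e` as a command name, does not find it, and returns status 127. Because the failure happens inside an `if` condition, it does not abort the script even under `set -e`; the branch is simply skipped. The snippet below is a standalone demonstration, not an excerpt from vfio_bind.sh, and it uses `/` as a placeholder path that always exists.

```shell
# Demonstration only: compare the malformed and corrected tests.
# "/" is a placeholder path that always exists.
target="/"

# Without a space, "[-e" is treated as a command name and is not found
# (status 127), so the condition is false and the branch is skipped.
if [-e "$target"] 2>/dev/null; then
    malformed="taken"
else
    malformed="skipped"
fi

# With spaces around the brackets, "[" runs as the test command.
if [ -e "$target" ]; then
    corrected="taken"
else
    corrected="skipped"
fi

echo "malformed: $malformed, corrected: $corrected"
```

The malformed form skips the branch even though the path exists, which is consistent with the unbind step never running while the script still finishes.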
Operating System
Ubuntu 22.04
CPU
AMD EPYC 7643P 48-Core Processor
GPU
AMD MI210
ROCm Version
ROCm-SMI-LIB 7.7.0