[Issue]: PF-Passthrough VFIO binding fails silently due to missing lspci and script syntax error #417

@meng-chiang

Problem Description

Summary

The automatic VFIO binding feature for PF-Passthrough mode does not work in GPU Operator v1.4.0/v1.4.1. The worker pod completes successfully but GPUs remain bound to the amdgpu driver instead of vfio-pci.

Environment

  • GPU Operator Version: v1.4.0 / v1.4.1
  • GPU Model: AMD Instinct MI210 (Device ID: 740f)
  • OS: Ubuntu

Expected Behavior

When driverType: pf-passthrough is configured, the GPU Operator should:

  1. Launch a worker pod to bind GPUs to vfio-pci
  2. Successfully execute vfio_bind.sh
  3. Leave the GPUs bound to the vfio-pci driver

Actual Behavior

  1. Worker pod starts and exits with code 0 (success)
  2. vfio.ready label is set on the node
  3. GPUs remain bound to amdgpu driver (no actual binding occurred)
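Because the pod exit code and the vfio.ready label both report success, the binding has to be checked on the node itself. The sketch below reads sysfs directly, so it does not depend on lspci. The function name report_gpu_drivers is hypothetical, the vendor and device IDs default to the MI210 values from this report (1002:740f), and the sysfs root is a parameter only so the function can be run against a test tree.

```shell
# Print "<PCI address> <bound driver>" for every PCI device that matches
# the given vendor/device ID. Defaults target the MI210 (1002:740f).
# The body runs in a subshell so its variables do not leak to the caller.
report_gpu_drivers() (
    sysfs_root="${1:-/sys}"
    vendor="${2:-0x1002}"
    device="${3:-0x740f}"

    for dev in "$sysfs_root"/bus/pci/devices/*; do
        # Skip entries without readable ID files (also covers an unmatched glob).
        [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || continue
        [ "$(cat "$dev/vendor")" = "$vendor" ] || continue
        [ "$(cat "$dev/device")" = "$device" ] || continue

        # The "driver" symlink points at the bound driver's sysfs directory.
        if [ -L "$dev/driver" ]; then
            drv=$(basename "$(readlink "$dev/driver")")
        else
            drv="none"
        fi
        echo "${dev##*/} $drv"
    done
)
```

Calling report_gpu_drivers with no arguments on the affected node would be expected to list amdgpu for each GPU while the bug is present, and vfio-pci once binding actually works.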

Root Cause Analysis

We identified two bugs causing this silent failure:

Bug 1: Missing lspci in gpu-operator-utils image

The vfio_bind.sh script requires lspci to discover GPU devices:

LSPCI_OUTPUT=$(lspci -nn -d 1002:${PRODUCT_CODE})

However, rocm/gpu-operator-utils:v1.4.0 does not include pciutils.
Result: lspci: command not found
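Independently of adding pciutils to the image, a guard at the top of the script would turn this into a visible failure instead of a silent one. The following is a minimal sketch, not the project's actual fix; require_cmd is a hypothetical helper name, and it assumes the script does not already check its dependencies.

```shell
# Return non-zero (with a message on stderr) when a required tool is missing.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: required command '$1' not found in PATH" >&2
        return 1
    fi
}
```

Placing a line such as `require_cmd lspci || exit 1` before the discovery step would make the worker pod exit non-zero when lspci is absent.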

Bug 2: Syntax error in vfio_bind.sh (line 30)

Current (incorrect - missing space after '[' and before ']')

if [-e "/sys/bus/pci/devices/$VFIO_DEVICE/driver/unbind"]; then

Correct

if [ -e "/sys/bus/pci/devices/$VFIO_DEVICE/driver/unbind" ]; then
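The reason this typo does not abort the script is that `[` is an ordinary command name in the shell. Without the space, the shell looks for a command literally called `[-e`, fails to find it, and the `if` simply treats the non-zero status as false and skips the branch. The snippet below demonstrates that behavior on a temporary file; the variable names are illustrative only.

```shell
probe=$(mktemp)   # a file that is guaranteed to exist

# Broken form: "[-e" is parsed as a single (nonexistent) command name,
# so the condition fails and the branch is skipped even though the file exists.
if [-e "$probe"] 2>/dev/null; then
    broken_result="taken"
else
    broken_result="skipped"
fi

# Fixed form: "[" is the test command, "-e" and "]" are separate arguments.
if [ -e "$probe" ]; then
    fixed_result="taken"
else
    fixed_result="skipped"
fi

echo "broken: $broken_result, fixed: $fixed_result"
rm -f "$probe"
```

This prints `broken: skipped, fixed: taken`, which matches the symptom in this issue: the unbind step is never reached, yet nothing stops the script.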

Operating System

Ubuntu 22.04

CPU

AMD EPYC 7643P 48-Core Processor

GPU

AMD MI210

ROCm Version

ROCm-SMI-LIB 7.7.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
