[Feature]: Automate node-labeller and device-plugin restart after config-manager applies a new profile #420

@ctiml

Suggestion Description

I am currently testing GPU partitioning on AMD Instinct MI300X using the GPU Operator's config-manager, following the official guide:
https://rocm.blogs.amd.com/software-tools-optimization/gpu-operator-partitioning/README.html#gpu-partitioning

I have successfully set up the config-manager with a custom ConfigMap containing profiles (e.g., default and dpx-nps1). I can switch between profiles by labeling the node (e.g., kubectl label nodes <NODE> dcm.amd.com/gpu-config-profile=dpx-nps1), and the config-manager logs confirm that the partitioning is applied successfully. The node is also correctly tagged with dcm.amd.com/gpu-config-profile-state=success.

However, I noticed two issues regarding resource updates:

  1. Node Labels (Metadata): The capacity/resource labels (e.g., amd.com/gpu.vram, amd.com/gpu.cu-count) managed by node-labeller do not update to reflect the new partitioned state.
  2. Allocatable Resources: The amd.com/gpu resource count or device mapping managed by the device-plugin does not refresh automatically.
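
One way to confirm the stale count described in item 2 is to compare what the node advertises with what the partition mode should yield. The sketch below is only an illustration: it assumes the usual mapping of one, two, four, and eight partitions per GPU for SPX, DPX, QPX, and CPX, and it takes the physical GPU count as an input rather than discovering it. The helper names are hypothetical and not part of the GPU Operator.

```python
# Assumed partitions per physical GPU for each compute partition mode.
PARTITIONS_PER_GPU = {"SPX": 1, "DPX": 2, "QPX": 4, "CPX": 8}


def expected_gpu_count(physical_gpus: int, compute_mode: str) -> int:
    """Number of amd.com/gpu devices the node should advertise."""
    return physical_gpus * PARTITIONS_PER_GPU[compute_mode.upper()]


def advertised_gpu_count(node: dict) -> int:
    """Read amd.com/gpu from a Node object, e.g. parsed `kubectl get node -o json`."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get("amd.com/gpu", "0"))


def is_stale(node: dict, physical_gpus: int, compute_mode: str) -> bool:
    """True when the advertised count does not match the partitioned hardware."""
    return advertised_gpu_count(node) != expected_gpu_count(physical_gpus, compute_mode)
```

For example, a node with 8 physical GPUs switched to dpx-nps1 would be expected to advertise 16 devices; if it still reports 8, the device-plugin has not picked up the change.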

To ensure Kubernetes correctly recognizes the updated resource capacity and device topology, I currently have to manually restart both the node-labeller and device-plugin pods.
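
For reference, the manual workaround can be scripted roughly as follows. This is a sketch under assumptions: the namespace and the pod label selectors below are placeholders that must be replaced with the values used in your cluster, and it relies on both components running as DaemonSets so that deleted pods are recreated. It only builds the kubectl commands; running them is left to the caller.

```python
# Placeholder values: adjust to the namespace and pod labels in your cluster.
DEFAULT_NAMESPACE = "kube-amd-gpu"
DEFAULT_SELECTORS = (
    "app.kubernetes.io/name=node-labeller",  # hypothetical selector
    "app.kubernetes.io/name=device-plugin",  # hypothetical selector
)


def restart_commands(node, namespace=DEFAULT_NAMESPACE, selectors=DEFAULT_SELECTORS):
    """Build one `kubectl delete pod` command per component, limited to one node."""
    return [
        [
            "kubectl", "delete", "pod",
            "-n", namespace,
            "-l", selector,
            "--field-selector", f"spec.nodeName={node}",
        ]
        for selector in selectors
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in restart_commands("<NODE>"):
        print(" ".join(cmd))
```

Deleting only the pods on the relabeled node avoids restarting the components on nodes whose partitioning did not change.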

Environment

  • GPU Operator Version: v1.4.1
  • Hardware: AMD Instinct MI300X OAM
  • Driver Version: 6.16.13

Expected Behavior / Suggestion

Ideally, the Kubernetes node state should stay in sync with the actual hardware partition state without manual intervention.

Would it be possible to implement a mechanism where both node-labeller and device-plugin are automatically restarted (or triggered to re-scan) once the config-manager completes a profile change?

Since the config-manager already updates the node label to dcm.amd.com/gpu-config-profile-state=success, perhaps the Operator could watch for this state change and trigger a refresh of these components. This would ensure that:

  1. Node labels accurately reflect the new hardware specs (VRAM/CU).
  2. The amd.com/gpu resource is correctly advertised to the Kubelet.
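
To make the suggestion concrete, the trigger condition could look roughly like this. It is a minimal sketch of the decision only, not the Operator's actual reconcile logic: `needs_refresh` is a hypothetical helper that compares a node's labels before and after an update, using the two label keys mentioned above. Since I do not know whether the state label leaves "success" while a new profile is being applied, the sketch treats either a state transition or a profile change as a trigger.

```python
PROFILE_LABEL = "dcm.amd.com/gpu-config-profile"
STATE_LABEL = "dcm.amd.com/gpu-config-profile-state"


def needs_refresh(old_labels: dict, new_labels: dict) -> bool:
    """Decide whether node-labeller and device-plugin should be refreshed."""
    # Only act once config-manager reports that the profile was applied.
    if new_labels.get(STATE_LABEL) != "success":
        return False
    state_became_success = old_labels.get(STATE_LABEL) != "success"
    profile_changed = old_labels.get(PROFILE_LABEL) != new_labels.get(PROFILE_LABEL)
    return state_became_success or profile_changed
```

A controller watching Node updates could call this on each event and, when it returns True, restart or re-scan the two components on that node only.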

Thank you!

Operating System

No response

GPU

No response

ROCm Component

No response

Labels

  • documentation: Improvements or additions to documentation
  • good first issue: Good for newcomers
  • question: Further information is requested
