Description
Suggestion Description
I am currently testing GPU partitioning on AMD Instinct MI300X using the GPU Operator's config-manager, following the official guide:
https://rocm.blogs.amd.com/software-tools-optimization/gpu-operator-partitioning/README.html#gpu-partitioning
I have successfully set up the config-manager with a custom ConfigMap containing profiles (e.g., default and dpx-nps1). I can switch between profiles by labeling the node (e.g., kubectl label nodes <NODE> dcm.amd.com/gpu-config-profile=dpx-nps1), and the config-manager logs confirm that the partitioning is applied successfully. The node is also correctly tagged with dcm.amd.com/gpu-config-profile-state=success.
However, I noticed two issues regarding resource updates:
- Node Labels (Metadata): The capacity/resource labels (e.g., amd.com/gpu.vram, amd.com/gpu.cu-count) managed by node-labeller do not update to reflect the new partitioned state.
- Allocatable Resources: The amd.com/gpu resource count or device mapping managed by the device-plugin does not refresh automatically.
To ensure Kubernetes correctly recognizes the updated resource capacity and device topology, I currently have to manually restart both the node-labeller and device-plugin pods.
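For reference, the manual workaround can be scripted roughly as follows. This is only a sketch: the namespace and the pod label selectors shown are hypothetical placeholders (they depend on how the operator was installed), and it assumes both pods are managed by a controller such as a DaemonSet, so that deleting them causes them to be recreated.

```python
import shlex


def restart_commands(node, namespace, selectors):
    """Build kubectl commands that delete the matching pods on one node.

    Assumption: the pods are owned by a controller (e.g. a DaemonSet),
    so deleting them results in fresh pods that re-scan the GPUs.
    """
    commands = []
    for selector in selectors:
        commands.append([
            "kubectl", "delete", "pod",
            "--namespace", namespace,
            "--selector", selector,
            # Restrict the deletion to pods scheduled on the given node.
            "--field-selector", f"spec.nodeName={node}",
        ])
    return commands


if __name__ == "__main__":
    # All three values below are hypothetical; substitute the ones
    # from your own cluster before running the printed commands.
    for cmd in restart_commands(
        node="<NODE>",
        namespace="<NAMESPACE>",
        selectors=["<NODE_LABELLER_SELECTOR>", "<DEVICE_PLUGIN_SELECTOR>"],
    ):
        print(shlex.join(cmd))
```

The script only prints the commands rather than executing them, so they can be reviewed before being run against the cluster.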
Environment
- GPU Operator Version: v1.4.1
- Hardware: AMD Instinct MI300X OAM
- Driver Version: 6.16.13
Expected Behavior / Suggestion
Ideally, the Kubernetes node state should stay in sync with the actual hardware partition state without manual intervention.
Would it be possible to implement a mechanism where both node-labeller and device-plugin are automatically restarted (or triggered to re-scan) once the config-manager completes a profile change?
Since the config-manager already updates the node label to dcm.amd.com/gpu-config-profile-state=success, perhaps the Operator could watch for this state change and trigger a refresh of these components. This would ensure that:
- Node labels accurately reflect the new hardware specs (VRAM/CU).
- The amd.com/gpu resource is correctly advertised to the Kubelet.
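To illustrate the idea, the check a watcher could run on each node update might look like the sketch below. This is not how the Operator currently works; the function name is hypothetical, and it assumes the config-manager moves the state label away from success (or removes it) while a new profile is being applied, so that a completed change shows up as a transition to success.

```python
STATE_LABEL = "dcm.amd.com/gpu-config-profile-state"


def needs_refresh(old_labels, new_labels):
    """Return True when a node update marks a newly completed profile change.

    A refresh is requested only when the state label becomes "success"
    and was absent or had any other value before the update.
    """
    was_success = old_labels.get(STATE_LABEL) == "success"
    is_success = new_labels.get(STATE_LABEL) == "success"
    return is_success and not was_success


if __name__ == "__main__":
    before = {"dcm.amd.com/gpu-config-profile": "dpx-nps1"}
    after = dict(before, **{STATE_LABEL: "success"})
    if needs_refresh(before, after):
        print("restart or re-scan node-labeller and device-plugin")
```

If the state label stays at success across two consecutive profile changes, this check would miss the second one, so a real implementation would probably also need to track which profile was last refreshed.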
Thank you!