[Feature]: Automate `node-labeller` and `device-plugin` restart after config-manager applies a new profile #420

### Suggestion Description

I am currently testing GPU partitioning on AMD Instinct MI300X using the GPU Operator's `config-manager`, following the official guide:
https://rocm.blogs.amd.com/software-tools-optimization/gpu-operator-partitioning/README.html#gpu-partitioning

I have successfully set up the `config-manager` with a custom ConfigMap containing profiles (e.g., `default` and `dpx-nps1`). I can switch between profiles by labeling the node (e.g., `kubectl label nodes <NODE> dcm.amd.com/gpu-config-profile=dpx-nps1`), and the `config-manager` logs confirm that the partitioning is applied successfully. The node is also correctly tagged with `dcm.amd.com/gpu-config-profile-state=success`.

However, I noticed two issues regarding resource updates:
1.  **Node Labels (Metadata):** The capacity/resource labels (e.g., `amd.com/gpu.vram`, `amd.com/gpu.cu-count`) managed by `node-labeller` do not update to reflect the new partitioned state.
2.  **Allocatable Resources:** The `amd.com/gpu` resource count or device mapping managed by the `device-plugin` does not refresh automatically.

To ensure Kubernetes correctly recognizes the updated resource capacity and device topology, I currently have to **manually restart** both the `node-labeller` and `device-plugin` pods.
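A minimal sketch of that manual step, for reference. The namespace and the two pod label selectors are placeholders assumed here for illustration, not values taken from the Operator; substitute whatever your deployment actually uses.

```shell
#!/usr/bin/env bash
# Manual workaround sketch: restart the node-labeller and device-plugin pods
# on a single node after a partition profile has been applied.
# ASSUMPTIONS (not from the Operator docs): NAMESPACE and both selectors below
# are placeholders and must be replaced with the values from your cluster.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-kube-amd-gpu}"
NODE_LABELLER_SELECTOR="${NODE_LABELLER_SELECTOR:-app=node-labeller-placeholder}"
DEVICE_PLUGIN_SELECTOR="${DEVICE_PLUGIN_SELECTOR:-app=device-plugin-placeholder}"

restart_gpu_pods_on_node() {
  local node="$1"
  local selector
  for selector in "$NODE_LABELLER_SELECTOR" "$DEVICE_PLUGIN_SELECTOR"; do
    # Deleting a DaemonSet-managed pod causes its controller to recreate it
    # on the same node, which forces the component to re-scan the GPUs.
    "$KUBECTL" -n "$NAMESPACE" delete pod \
      -l "$selector" \
      --field-selector "spec.nodeName=$node"
  done
}

# Usage (commented out so sourcing this file has no side effects):
# restart_gpu_pods_on_node "<NODE>"
```

Restricting the delete to `spec.nodeName` keeps the restart limited to the node whose profile changed, rather than restarting the components cluster-wide.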

### Environment
* **GPU Operator Version:** v1.4.1
* **Hardware:** AMD Instinct MI300X OAM
* **Driver Version:** 6.16.13

### Expected Behavior / Suggestion
Ideally, the Kubernetes node state should stay in sync with the actual hardware partition state without manual intervention.

**Would it be possible to implement a mechanism where both `node-labeller` and `device-plugin` are automatically restarted (or triggered to re-scan) once the `config-manager` completes a profile change?**

Since the `config-manager` already updates the node label to `dcm.amd.com/gpu-config-profile-state=success`, perhaps the Operator could watch for this state change and trigger a refresh of these components. This would ensure that:
1.  Node labels accurately reflect the new hardware specs (VRAM/CU).
2.  The `amd.com/gpu` resource is correctly advertised to the Kubelet.
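A rough sketch of the check described above, written in shell purely for illustration; an actual implementation would presumably live in the Operator's own reconcile logic. The `needs_refresh` helper and the idea of remembering the "last refreshed profile" are hypothetical, not existing Operator features. Only the two `dcm.amd.com/...` label keys come from this issue.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether node-labeller and device-plugin should be
# refreshed for a node, based on the labels that config-manager maintains.
KUBECTL="${KUBECTL:-kubectl}"

# Read a single label value from a node. Dots in the label key must be
# escaped with a backslash for kubectl's jsonpath syntax.
node_label() {
  local node="$1" key="$2"
  local escaped_key
  escaped_key="$(printf '%s' "$key" | sed 's/\./\\./g')"
  "$KUBECTL" get node "$node" -o "jsonpath={.metadata.labels.${escaped_key}}"
}

# Return success (0) when a refresh is due: the profile state is "success"
# and the applied profile differs from the one last refreshed for.
# The third argument is hypothetical bookkeeping (for example, a value the
# Operator could store in a node annotation).
needs_refresh() {
  local state="$1" profile="$2" last_refreshed="$3"
  [ "$state" = "success" ] && [ "$profile" != "$last_refreshed" ]
}

# Example wiring (commented out so sourcing this file has no side effects):
# state="$(node_label "<NODE>" dcm.amd.com/gpu-config-profile-state)"
# profile="$(node_label "<NODE>" dcm.amd.com/gpu-config-profile)"
# if needs_refresh "$state" "$profile" "$last_refreshed_profile"; then
#   echo "refresh node-labeller and device-plugin on <NODE>"
# fi
```

Comparing against the last refreshed profile is one way to avoid restarting the components repeatedly while the state label simply stays at `success`.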

Thank you!

### Operating System

_No response_

### GPU

_No response_

### ROCm Component

_No response_

