From a923c3213c769a09bf67c6c1547291c121b5b48e Mon Sep 17 00:00:00 2001 From: yansun1996 Date: Thu, 5 Feb 2026 08:04:13 +0000 Subject: [PATCH] [DOC] Fix docs sanity Signed-off-by: yansun1996 --- .github/workflows/linting.yml | 8 +- .markdownlint-cli2.jsonc | 7 + .wordlist.txt | 192 +++++++++++++++--- README.md | 21 +- docs/autoremediation/auto-remediation.md | 86 ++++---- docs/dcm/device-config-manager.md | 8 +- docs/dcm/systemd_integration.md | 45 ++-- docs/device_plugin/device-plugin.md | 7 +- docs/device_plugin/resource-allocation.md | 1 + docs/drivers/installation.md | 108 +++++----- docs/drivers/precompiled-driver.md | 104 +++++----- docs/drivers/upgrading.md | 4 +- docs/index.md | 1 - docs/installation/kubernetes-helm.md | 7 +- docs/installation/openshift-olm.md | 15 +- docs/knownlimitations.md | 7 +- docs/kubevirt/kubevirt.md | 52 +++-- docs/metrics/exporter.md | 1 - docs/metrics/health.md | 3 +- docs/metrics/kube-rbac-proxy.md | 3 + docs/metrics/prometheus-openshift.md | 21 +- docs/metrics/prometheus.md | 23 ++- docs/npd/node-problem-detector.md | 27 +-- docs/overview.md | 4 +- docs/releasenotes.md | 46 +++-- docs/slinky/slinky-example.md | 5 +- .../airgapped-install-openshift.md | 32 +-- docs/test/agfhc.md | 157 +++++++------- docs/test/appendix-test-recipe.md | 1 - docs/test/auto-unhealthy-device-test.md | 1 + docs/test/logs-export.md | 5 +- docs/test/manual-test.md | 10 + docs/test/pre-start-job-test.md | 21 +- docs/test/test-runner-overview.md | 2 +- docs/troubleshooting.md | 23 +-- docs/upgrades/upgrade.md | 10 +- .../metricsExporter/mtls-rbac-auth/README.md | 28 ++- .../token-based-auth/README.md | 16 +- helm-charts-k8s/README.md | 21 +- 39 files changed, 698 insertions(+), 435 deletions(-) create mode 100644 .markdownlint-cli2.jsonc diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml index 484cc667a..31aab75b3 100644 --- a/.github/workflows/linting.yml +++ b/.github/workflows/linting.yml @@ -2,15 +2,11 @@ name: Linting on: push: - branches: - - develop + branches: - main - - staging pull_request: - branches: - - develop + branches: - main - - staging jobs: call-workflow-passing-data: diff --git a/.markdownlint-cli2.jsonc b/.markdownlint-cli2.jsonc new file mode 100644 index 000000000..aafd3602b --- /dev/null +++ b/.markdownlint-cli2.jsonc @@ -0,0 +1,7 @@ +{ + "globs": ["**/*.md"], + "ignores": [ + "**/vendor/**", + "**/.git/**" + ] +} diff --git a/.wordlist.txt b/.wordlist.txt index 2231950e3..a6efcba9e 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -1,61 +1,203 @@ +amd +AFID +Affectioned +AGFHC +Allocatable +ACS +AKS +ARI Autobuild +bb +burnin +CheckUnitStatus +CleanupPreState CLI +CN CNI +computePartition +ConfigMap +ConfigMaps +ConditionalWorkflows +CoreOS +CPX +CrashLoopBackOff CRD -DKMS -DNS -DockerHub -GPUs -HTTPS -KMM -MOK -NFD -OLM -PCI -RBAC -ROCm -TLS -YAML -allocatable -bool -calico -clusterIP -config -configmap -cryptographic +CRDs +CRs +CronJob +Customizable daemonset +daemonsets +DaemonSet +Daemonsets +DCM +dcm +Depricated +deivce DeviceConfig +DeviceIDs +DevicePlugin +DevicePluginArguments +DevicePluginImage +DevicePluginImagePullPolicy +DevicePluginSpec +DKMS +dma +DMC +DNS Dockerfile +DockerHub +DPX +DriverToolkit +ECC +EnableNodeLabeller +ErrImagePull flannel +GPUs +gpup +Grafana +GracePeriodSeconds +gst +gpuagent +gpuClientSystemdServices +GKE +hbm +HealthThresholds +Helmify hostname hostnames +HSIO +HTTPS +iet +IfNotPresent +IgnoreDaemonSets +IgnoreNamespaces +ImageStream +jq json kaniko +KMM +kmod +kubectl +Kubelet +KubeVirt 
+Kuberntes Kubernetes kubeconfig labeller +Labeler lifecycle +lvl MachineConfig -modprobe +MachineConfigOperator +MCO +Mericsclient +MaxParallelWorkflows +MaxUnavailable +MCO +memoryPartition +MetricsExporter +MetricsExporterSpec +MinIO +Minio +MOK +MTLS namespace +NFD NMC +NodeCondition +NodeDrainPolicy +NodeIP +NodeLabeller +NodeLabellerArguments +NodeLabellerImage +NodeLabellerImagePullPolicy +Nodelabeller +nodename +Nodeport NodePort +NodeRemediationLabels +NodeRemediationTaints +NoExecute +NPD NotReady -OperatorHub +numGPUsAssigned +Observability +oc +OLM +OOM OpenShift +OperatorHub +Openshift +parition +paritioning +pbqt +pebb +PCI +pcie +perf +PFs +Perses plugin +PodIP PreFlight +PreStateDB prometheus +Promethues quay -rocminfo +QPX +RAS +RBAC +Redhat RedHat +RHCOS +RMA +rocminfo +rochpl +ROCm runtime +SAR schedulable SDK +selfcheck +ServiceAccounts +ServiceMonitor +ServiecMonitor +skippedGPUs +Slinkproject +SlinkProject +Slrum +Slurm +SPX +StopOnFailure +SubjectAccessReview systemd +TestCategory +TesterImage +TimeoutSeconds +TokenReview +Tolerations +TODO +TLS +tolerations +tst +TtlForFailedWorkflows ubuntu +UI +UID +UNCORRECT +Uncordoning +uninstallation unschedulable +Upgrademgr +UpgradePolicy validation verison +VC +VCN +VFIO +VFs +VMs webhook -uninstallation \ No newline at end of file +xgmi +YAML \ No newline at end of file diff --git a/README.md b/README.md index 764fec04a..d199dc416 100644 --- a/README.md +++ b/README.md @@ -83,20 +83,19 @@ Installation Options > It is strongly recommended to use AMD-optimized KMM images included in the operator release. This is not required when installing the GPU Operator on Red Hat OpenShift. ### 3. Install Custom Resource -After the installation of AMD GPU Operator: - * By default there will be a default `DeviceConfig` installed. If you are using default `DeviceConfig`, you can modify the default `DeviceConfig` to adjust the config for your own use case. `kubectl edit deviceconfigs -n kube-amd-gpu default` - * If you installed without default `DeviceConfig` (either by using `--set crds.defaultCR.install=false` or installing a chart prior to v1.3.0), you need to create the `DeviceConfig` custom resource in order to trigger the operator start to work. By preparing the `DeviceConfig` in the YAML file, you can create the resouce by running ```kubectl apply -f deviceconfigs.yaml```. - * For custom resource definition and more detailed information, please refer to [Custom Resource Installation Guide](https://dcgpu.docs.amd.com/projects/gpu-operator/en/latest/installation/kubernetes-helm.html#install-custom-resource). - * Potential Failures with default `DeviceConfig`: +After the installation of AMD GPU Operator: - a. Operand pods are stuck in ```Init:0/1``` state: It means your GPU worker doesn't have inbox GPU driver loaded. We suggest check the [Driver Installation Guide]([./drivers/installation.md](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/drivers/installation.html#driver-installation-guide)) then modify the default `DeviceConfig` to ask Operator to install the out-of-tree GPU driver for your worker nodes. +* By default there will be a default `DeviceConfig` installed. If you are using default `DeviceConfig`, you can modify the default `DeviceConfig` to adjust the config for your own use case. 
`kubectl edit deviceconfigs -n kube-amd-gpu default` +* If you installed without the default `DeviceConfig` (either by using `--set crds.defaultCR.install=false` or installing a chart prior to v1.3.0), you need to create the `DeviceConfig` custom resource in order to trigger the operator to start working. After preparing the `DeviceConfig` in a YAML file, you can create the resource by running ```kubectl apply -f deviceconfigs.yaml```. +* For the custom resource definition and more detailed information, please refer to the [Custom Resource Installation Guide](https://dcgpu.docs.amd.com/projects/gpu-operator/en/latest/installation/kubernetes-helm.html#install-custom-resource). +* Potential Failures with the default `DeviceConfig`: + a. Operand pods are stuck in ```Init:0/1``` state: this means your GPU worker doesn't have an inbox GPU driver loaded. We suggest checking the [Driver Installation Guide](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/drivers/installation.html#driver-installation-guide) and then modifying the default `DeviceConfig` to ask the Operator to install the out-of-tree GPU driver for your worker nodes. `kubectl edit deviceconfigs -n kube-amd-gpu default` - - b. No operand pods showed up: It is possible that default `DeviceConfig` selector `feature.node.kubernetes.io/amd-gpu: "true"` cannot find any matched node. - * Check node label `kubectl get node -oyaml | grep -e "amd-gpu:" -e "amd-vgpu:"` - * If you are using GPU in the VM, you may need to change the default `DeviceConfig` selector to `feature.node.kubernetes.io/amd-vgpu: "true"` - * You can always customize the node selector of the `DeviceConfig`. + b. No operand pods showed up: it is possible that the default `DeviceConfig` selector `feature.node.kubernetes.io/amd-gpu: "true"` cannot find any matched node. + * Check the node labels: `kubectl get node -oyaml | grep -e "amd-gpu:" -e "amd-vgpu:"` + * If you are using GPUs inside VMs, you may need to change the default `DeviceConfig` selector to `feature.node.kubernetes.io/amd-vgpu: "true"` + * You can always customize the node selector of the `DeviceConfig`; see the sketch below.
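For reference, here is a minimal sketch of what that selector change can look like in the default `DeviceConfig` (opened with `kubectl edit deviceconfigs -n kube-amd-gpu default`). Only the `spec.selector` fragment is shown, and the VM-oriented label is just one possible choice; substitute whatever node labels fit your cluster:

```yaml
# Sketch: DeviceConfig fragment selecting VM-based GPU workers instead of the
# default bare-metal label. The rest of the spec stays unchanged.
spec:
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"
```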
### Grafana Dashboards diff --git a/docs/autoremediation/auto-remediation.md b/docs/autoremediation/auto-remediation.md index 382ce32c2..c4dab6b72 100644 --- a/docs/autoremediation/auto-remediation.md +++ b/docs/autoremediation/auto-remediation.md @@ -6,42 +6,42 @@ The GPU Operator provides automatic remediation for GPU worker nodes that become The following diagram illustrates the end-to-end flow of automatic remediation: -``` +```text ┌─────────────────────────────────────────────────────────────────────────────┐ -│ GPU Worker Node │ +│ GPU Worker Node │ ├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌────────────────────────┐ │ -│ │ Device Metrics │ │ -│ │ Exporter │ Reports inband-RAS errors │ -│ └───────────┬────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌────────────────────────┐ │ -│ │ Node Problem │ Queries for inband-RAS errors │ -│ │ Detector (NPD) │ and marks node condition as True │ -│ └───────────┬────────────┘ │ -│ │ │ -└──────────────┼────────────────────────────────────────────────────────────────┘ +│ │ +│ ┌────────────────────────┐ │ +│ │ Device Metrics │ │ +│ │ Exporter │ Reports inband-RAS errors │ +│ └───────────┬────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌────────────────────────┐ │ +│ │ Node Problem │ Queries for inband-RAS errors │ +│ │ Detector (NPD) │ and marks node condition as True │ +│ └───────────┬────────────┘ │ +│ │ │ +└──────────────┼──────────────────────────────────────────────────────────────┘ │ │ Node condition status update ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ -│ Controller Node │ +│ Controller Node │ ├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌────────────────────────┐ │ -│ │ GPU Operator │ Observes node error conditions │ -│ │ │ │ -│ └───────────┬────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌────────────────────────┐ │ -│ │ Argo Workflow │ Triggers remediation workflow │ -│ │ Controller │ for the affected node │ -│ └────────────────────────┘ │ -│ │ │ -└──────────────┼────────────────────────────────────────────────────────────────┘ +│ │ +│ ┌────────────────────────┐ │ +│ │ GPU Operator │ Observes node error conditions │ +│ │ │ │ +│ └───────────┬────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌────────────────────────┐ │ +│ │ Argo Workflow │ Triggers remediation workflow │ +│ │ Controller │ for the affected node │ +│ └────────────────────────┘ │ +│ │ │ +└──────────────┼──────────────────────────────────────────────────────────────┘ │ │ Executes remediation steps ▼ @@ -69,7 +69,7 @@ The GPU Operator Helm installation includes the following Argo Workflows compone The GPU Operator installs Argo Workflows v3.6.5, using a [customized installation YAML](https://github.com/argoproj/argo-workflows/releases/download/v3.6.5/install.yaml) tailored for auto-remediation requirements. This customization excludes components not needed for remediation, such as the Argo workflow server. For more information about Argo Workflows concepts, refer to the [official documentation](https://argo-workflows.readthedocs.io/en/release-3.6/workflow-concepts/). > **Note:** By default, auto-remediation components (workflow controller and CRDs) are installed during Helm deployment. 
To disable the installation of these components, use the following Helm flag: -> +> > ```bash > --set remediation.enabled=false > ``` @@ -80,17 +80,17 @@ The GPU Operator installs Argo Workflows v3.6.5, using a [customized installatio The DeviceConfig Custom Resource includes a `RemediationWorkflowSpec` section for configuring and customizing the auto-remediation feature: -```yaml +```golang type RemediationWorkflowSpec struct { - Enable *bool + Enable *bool - ConditionalWorkflows *v1.LocalObjectReference + ConditionalWorkflows *v1.LocalObjectReference - TtlForFailedWorkflows int + TtlForFailedWorkflows int - TesterImage string + TesterImage string - MaxParallelWorkflows int + MaxParallelWorkflows int NodeRemediationLabels map[string]string @@ -138,9 +138,9 @@ The `NodeDrainPolicy` field accepts a `DrainSpec` object with the following conf **IgnoreNamespaces** - Defines a list of namespaces to exclude from pod eviction during the drain operation. Pods running in these namespaces will remain on the node, allowing critical infrastructure components to continue operating throughout the remediation process. By default, the following namespaces are excluded: `kube-system`, `cert-manager`, and the GPU Operator's namespace. -### Other Configuration options: +### Other Configuration options -**NPD Configuration** - NPD configuration is explained in more detail [here](../npd/node-problem-detector.md). The Node Problem Detector (NPD) DaemonSet must continue running during workflow execution to verify issue resolution. Add the following toleration to the NPD DaemonSet: +**NPD Configuration** - NPD configuration is explained in more detail [in this section](../npd/node-problem-detector.md). The Node Problem Detector (NPD) DaemonSet must continue running during workflow execution to verify issue resolution. Add the following toleration to the NPD DaemonSet: `amd-gpu-unhealthy:NoSchedule op=Exists` @@ -197,7 +197,7 @@ The following example demonstrates a complete error mapping configuration: > **Note:** The `default-template` is automatically created on the cluster by the GPU Operator. -The `default-template` workflow performs the following remediation steps: +The `default-template` workflow performs the following remediation steps: 1. **Label Node** - Applies custom labels to the node as specified in the `NodeRemediationLabels` field of the DeviceConfig Custom Resource. If no labels are configured, this step is skipped and the workflow proceeds to the next step. @@ -223,7 +223,7 @@ Each workflow step is executed as a separate Kubernetes pod. For advanced use ca While most workflow steps are self-explanatory, Steps 4, 5, and 7 require additional clarification. -### Workflow Step 4: Physical Intervention Check +### Workflow Step 4: Physical Intervention Check According to the AMD service action guide, certain GPU issues require physical intervention (e.g., checking wiring, securing screws, retorquing connections). When such conditions are detected, the workflow generates a Kubernetes event to notify the administrator of the required physical action before suspending at this step. The specific physical action for each node condition is defined in the `physicalActionNeeded` field within the corresponding ConfigMap mapping. @@ -233,10 +233,10 @@ This step enables administrators to identify nodes awaiting physical interventio The GPU Operator determines whether to automatically resume the workflow after it pauses in Step 4. This pause accommodates scenarios requiring manual intervention. 
The workflow may remain suspended in two primary cases: -1. **Excessive Remediation Attempts:** - When a `RecoveryPolicy` is configured in the `ConditionalWorkflowMappings` ConfigMap, it defines the maximum remediation attempts allowed within a specified time window. Nodes exceeding this threshold will have their workflows paused indefinitely until manual resumption. +1. **Excessive Remediation Attempts:** + When a `RecoveryPolicy` is configured in the `ConditionalWorkflowMappings` ConfigMap, it defines the maximum remediation attempts allowed within a specified time window. Nodes exceeding this threshold will have their workflows paused indefinitely until manual resumption. 2. **Physical Action Required:** - When a physical action is specified for a workflow in the `ConditionalWorkflowMappings` ConfigMap, the workflow pauses at this step, allowing administrators to perform the required maintenance. A notification event is generated to alert the user. + When a physical action is specified for a workflow in the `ConditionalWorkflowMappings` ConfigMap, the workflow pauses at this step, allowing administrators to perform the required maintenance. A notification event is generated to alert the user. If neither condition applies, the workflow automatically resumes without manual intervention. diff --git a/docs/dcm/device-config-manager.md b/docs/dcm/device-config-manager.md index 1aa6b5010..52aac7ab6 100644 --- a/docs/dcm/device-config-manager.md +++ b/docs/dcm/device-config-manager.md @@ -5,10 +5,12 @@ The Device Config Manager (DCM) is a component of the GPU Operator that is used to handle the configuration of AMD Instinct GPUs, specifically in regards to GPU partitioning. In the future, DCM will also be expanded to handle the configuration of AMD's AI-NIC. Like other GPU Operator components DCM runs as a daemonset on each GPU node in your cluster. DCM can be enabled via the GPU Operator's custom resource called "DeviceConfig". The current goal of the Device Config Manager is to handle the configuration and implementation of GPU partitioning on your Kubernetes cluster, allowing for partitioning modes to be set on each GPU Node based on partition profiles that you specify via a Kubernetes config-map. ## Supported Platforms - - Ubuntu 22.04, Ubuntu 24.04 + +- Ubuntu 22.04, Ubuntu 24.04 ## ROCM version - - ROCM 6.3, ROCM 6.4 + +- ROCM 6.3, ROCM 6.4 ## GPU Partition Overview @@ -74,4 +76,4 @@ kube-amd-gpu gpu-operator-device-plugin-zft6k ``` After DCM has completed the partitioning and once device plugin is brought up again, the resources (whether single gpus or partitioned gpus) are represented on the k8s node as per this documentation: -[Device Plugin Resources](../device_plugin/device-plugin.md) \ No newline at end of file +[Device Plugin Resources](../device_plugin/device-plugin.md) diff --git a/docs/dcm/systemd_integration.md b/docs/dcm/systemd_integration.md index 8e070ea08..317261918 100644 --- a/docs/dcm/systemd_integration.md +++ b/docs/dcm/systemd_integration.md @@ -1,24 +1,25 @@ # Device Config Manager Systemd Integration -## Background +## Background The Device Config Manager (DCM) orchestrates hardware-level tasks such as GPU partitioning. 
Before initiating partitioning, it gracefully stops specific systemd services defined in a configmap to prevent any processes (gpuagent, etc) from partition interference and ensure consistent device states ## K8S ConfigMap enhancement -The configmap contains a key "gpuClientSystemdServices" which declares the list of services to manage: +The configmap contains a key "gpuClientSystemdServices" which declares the list of services to manage: ```yaml "gpuClientSystemdServices": { - "names": ["amd-metrics-exporter", "gpuagent"] + "names": ["amd-metrics-exporter", "gpuagent"] } ``` + - These are the unit names (without the. service suffix) of systemd services related to GPU runtime agents. We add the suffix as a part of the code -- Users can add/modify services to the above list +- Users can add/modify services to the above list ## ConfigMap -```yaml +```yaml apiVersion: v1 kind: ConfigMap metadata: @@ -57,8 +58,8 @@ data: } }, "gpuClientSystemdServices": { - "names": ["amd-metrics-exporter", "gpuagent"] - } + "names": ["amd-metrics-exporter", "gpuagent"] + } } ``` @@ -73,26 +74,26 @@ data: ## Workflow -- DCM uses D-Bus APIs to query, stop, and restart systemd services programmatically, ensuring precise service orchestration. +- DCM uses D-Bus APIs to query, stop, and restart systemd services programmatically, ensuring precise service orchestration. -- Extract Service List: On startup, DCM parses the configmap and retrieves the names array under gpuClientSystemdServices. Each entry is appended with (. service) to form full unit names. +- Extract Service List: On startup, DCM parses the configmap and retrieves the names array under gpuClientSystemdServices. Each entry is appended with (. service) to form full unit names. - Capture Pre-State: - - For each service: - - It checks status using D-Bus via `org.freedesktop.systemd1.Manager.GetUnit.` - - Stores current state (e.g. `active`, `inactive`, `not-loaded`) in PreStateDB. - - This DB is used for restoring service state post-partitioning. + - For each service: + - It checks status using D-Bus via `org.freedesktop.systemd1.Manager.GetUnit.` + - Stores current state (e.g. `active`, `inactive`, `not-loaded`) in PreStateDB. + - This DB is used for restoring service state post-partitioning. -- Stop Services: Services are stopped gracefully using D-Bus APIs. This ensures they release GPU resources and don't disrupt the partitioning operation. We check if the service is present before stopping it using the CheckUnitStatus API. +- Stop Services: Services are stopped gracefully using D-Bus APIs. This ensures they release GPU resources and don't disrupt the partitioning operation. We check if the service is present before stopping it using the CheckUnitStatus API. -- Perform Partitioning: Once services are stopped temporarily, DCM initiates the partitioning logic (using node labels/configmap profiles) and completes the partitioning workflow +- Perform Partitioning: Once services are stopped temporarily, DCM initiates the partitioning logic (using node labels/configmap profiles) and completes the partitioning workflow -- Restart & Restore State After partitioning: - - DCM checks PreStateDB to determine which services were previously active. - - Only those Services are restarted accordingly using the D-Bus invocation APIs. - - Additionally, PreStateDB is cleared via a CleanupPreState() function to reset the tracker DB for the next run. +- Restart & Restore State After partitioning: + - DCM checks PreStateDB to determine which services were previously active. 
+ - Only those Services are restarted accordingly using the D-Bus invocation APIs. + - Additionally, PreStateDB is cleared via a CleanupPreState() function to reset the tracker DB for the next run. -# Conclusion +## Conclusion -- Avoids GPU contention during partitioning (device-busy errors aren’t seen during partition) -- Maintains service continuity with minimal downtime \ No newline at end of file +- Avoids GPU contention during partitioning (device-busy errors aren't seen during partition) +- Maintains service continuity with minimal downtime diff --git a/docs/device_plugin/device-plugin.md b/docs/device_plugin/device-plugin.md index c0c91ffba..8ed4b11e9 100644 --- a/docs/device_plugin/device-plugin.md +++ b/docs/device_plugin/device-plugin.md @@ -60,6 +60,7 @@ test-deviceconfig-node-labeller-bxk7x 1/1 Runnin | **EnableNodeLabeller** | Enable/Disable node labeller with True/False | | **DevicePluginArguments** | The flag/values to pass on to Device Plugin | | **NodeLabellerArguments** | The flags to pass on to Node Labeller | +
1. Both the `ImagePullPolicy` fields default to `Always` if `:latest` tag is specified on the respective Image, or defaults to `IfNotPresent` otherwise. This is default k8s behaviour for `ImagePullPolicy` @@ -71,7 +72,7 @@ test-deviceconfig-node-labeller-bxk7x 1/1 Runnin - {"compute-memory-partition", "compute-partitioning-supported", "memory-partitioning-supported"} - For the above new partition labels, the labels being set under this field will be applied by nodelabeller on the node - The below labels are enabled by nodelabeller by default internally : + The below labels are enabled by nodelabeller by default internally: - {"vram", "cu-count", "simd-count", "device-id", "family", "product-name", "driver-version"} ## How to choose Resource Naming Strategy @@ -79,7 +80,7 @@ test-deviceconfig-node-labeller-bxk7x 1/1 Runnin To customize the way device plugin reports gpu resources to kubernetes as allocatable k8s resources, use the `single` or `mixed` resource naming strategy in **DeviceConfig** CR Before understanding each strategy, please note the definition of homogeneous and heterogeneous nodes -Homogeneous node: A node whose gpu's follow the same compute-memory partition style +Homogeneous node: A node whose gpu's follow the same compute-memory partition style -> Example: A node of 8 GPU's where all 8 GPU's are following CPX-NPS4 partition style Heterogeneous node: A node whose gpu's follow different compute-memory partition styles @@ -118,7 +119,7 @@ A node which has 8 GPUs where 5 GPU's are following SPX-NPS1 and 3 GPU's are fol ```bash amd.com/spx_nps1: 5 amd.com/cpx_nps1: 24 -``` +``` #### **Notes** diff --git a/docs/device_plugin/resource-allocation.md b/docs/device_plugin/resource-allocation.md index f6a8d5190..c9ca29d82 100644 --- a/docs/device_plugin/resource-allocation.md +++ b/docs/device_plugin/resource-allocation.md @@ -11,6 +11,7 @@ Device Plugin has allocator package where we can define multiple policies on how ### Best-effort Allocation Policy Currently we use ```best-effort``` policy as the default allocation policy. This policy choses GPUs based on topology of the GPUs to ensure optimal affinity and better performance. During initialization phase, Device Plugin calculates a score for every pair of GPUs and stores it in memory. This score is calculated based on below criteria: + - Type of connectivity link between the pair. Most common AMD GPU deployments use either XGMI or PCIE links to connect the GPUs. ```XGMI``` connectivity offers better performance than PCIE connectivity. The score assigned for a pair connected using XGMI is lower than that of a pair connected using PCIE(lower score is better) - [NUMA affinity](https://rocm.blogs.amd.com/software-tools-optimization/affinity/part-1/README.html) of the GPU pair. GPU pair that is part of same NUMA domain get lower score than pair from different NUMA domains. - For scenarios that involve partitioned GPUs, partitions from same GPU are assigned better score than partitions from different GPUs. diff --git a/docs/drivers/installation.md b/docs/drivers/installation.md index 4fcf2d128..665d63e85 100644 --- a/docs/drivers/installation.md +++ b/docs/drivers/installation.md @@ -162,9 +162,11 @@ spec: ``` ```{note} -As for the configuration in `spec.driver.imageBuild`: -1. If the base OS image or source image is hosted in a registry that requires pull secrets to pull those images, you need to use `spec.driver.imageRegistrySecret` to inject the pull secret. -2. 
`spec.driver.imageRegistrySecret` was originally designed for providing secret to pull/push image to the repository specified in `spec.driver.image`, if unfortunately the base image and source image requires different secret to pull, please combine the access information into one single Kubernetes secret. +When configuring `spec.driver.imageBuild`, consider the following registry authentication requirements: + +1. **Single Secret for Multiple Registries**: If your base OS image or source image is hosted in a registry requiring pull secrets, use `spec.driver.imageRegistrySecret` to inject credentials. This secret was originally designed for the repository in `spec.driver.image`, but can be combined to support multiple registries. + +2. **Combining Multiple Registry Credentials**: If base and source images require different secrets, combine them into a single Kubernetes secret: ```bash REGISTRY1=https://index.docker.io/v1/ @@ -176,12 +178,12 @@ As for the configuration in `spec.driver.imageBuild`: cat > config.json < config.json <:$(echo $BUILDER_TOKEN | base64 -d)" | base64 -w0)" - }, - "image-registry.openshift-image-registry.svc.cluster.local:5000": { - "auth": "$(echo -n ":$(echo $BUILDER_TOKEN | base64 -d)" | base64 -w0)" - }, - "${REGISTRY1}": { - "auth": "$(echo -n "${USER1}:${PWD1}" | base64 -w0)" - } + "image-registry.openshift-image-registry.svc:5000": { + "auth": "$(echo -n ":$(echo $BUILDER_TOKEN | base64 -d)" | base64 -w0)" + }, + "image-registry.openshift-image-registry.svc.cluster.local:5000": { + "auth": "$(echo -n ":$(echo $BUILDER_TOKEN | base64 -d)" | base64 -w0)" + }, + "${REGISTRY1}": { + "auth": "$(echo -n "${USER1}:${PWD1}" | base64 -w0)" + } } } EOF @@ -253,10 +255,8 @@ As for the configuration in `spec.driver.imageBuild`: echo "✅ Secret '${SECRET_NAME}' created and ready." ``` - ``` - #### Configuration Reference To list existing `DeviceConfig` resources run `kubectl get deviceconfigs -A` @@ -265,55 +265,55 @@ To check the full spec of `DeviceConfig` definition run `kubectl get crds device #### `metadata` Parameters -| Parameter | Description | -|-----------|-------------| -| `name` | Unique identifier for the resource | -| `namespace` | Namespace where the operator is running | +| Parameter | Description | +| ----------- | -------------------------------------------- | +| `name` | Unique identifier for the resource | +| `namespace` | Namespace where the operator is running | #### `spec.driver` Parameters -| Parameter | Description | Default | -|-----------|-------------|-------------| -| `enable` | set to true for installing out-of-tree driver,
set it to false then operator will skip driver install
and directly use inbox / pre-installed driver | `true` | -| `blacklist` | set to true then operator will init node labeller daemonset
to add `amdgpu` into selected worker nodes modprobe blacklist,
set to false then operator will remove `amdgpu`
from selected nodes' modprobe blacklist | `false` | -| `version` | ROCm driver version (e.g., "6.2.2")
[See ROCm Versions](https://rocm.docs.amd.com/en/latest/release/versions.html) | Ubuntu: `6.1.3`
CoresOS: `6.2.2` | -| `image` | Registry URL and repository (without tag)
*Note: Operator manages tags automatically* | Vanilla k8s: `image-registry:5000/$MOD_NAMESPACE/amdgpu_kmod`
OpenShift: `image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod` | -| `imageRegistrySecret.name` | Name of registry credentials secret
to pull/push driver image | | -| `imageRegistryTLS.insecure` | If true, check if the container image
already exists using plain HTTP | `false` | -| `imageRegistryTLS.insecureSkipTLSVerify` | If true, skip any TLS server certificate validation | `false` | -| `imageSign.keySecret` | secret name of the private key
used to sign kernel modules after image building in cluster
see [secure boot](./secure-boot) doc for instructions to create the secret | | -| `imageSign.certSecret` | secret name of the public key
used to sign kernel modules after image building in cluster
see [secure boot](./secure-boot) doc for instructions to create the secret | | -| `tolerations` | List of tolerations that will be set for KMM module object and its components like build pod and worker pod | | -| `imageBuild.baseImageRegistry` | registry to host base OS image, e.g. when using Ubuntu 22.04 worker node with specified baseImageRegistry `docker.io` the operator will use base image from `docker.io/ubuntu:22.04` | `docker.io` | -| `imageBuild.baseImageRegistryTLS.insecure` | If true, check if the container image
already exists using plain HTTP | `false` | -| `imageBuild.baseImageRegistryTLS.insecureSkipTLSVerify` | If true, skip any TLS server certificate validation | `false` | -| `imageBuild.sourceImageRepo` | (Currently only applied to OpenShift) Image repository to host amdgpu source code image, operator will auto determine the image tag based on users system and `spec.driver.version`. E.g. for building driver from ROCm 7.0 + RHEL 9.6 + default source image repo, the image would be `docker.io/rocm/amdgpu-driver:coreos-9.6-7.0` | `docker.io/rocm/amdgpu-driver` | +| Parameter | Description | Default | +|---------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| +| `enable` | Set to true for installing out-of-tree driver.
Set to false to skip driver install and use inbox/pre-installed driver. | `true` | +| `blacklist` | Set to true to have the operator init the node labeller DaemonSet and add `amdgpu` to the selected worker nodes' modprobe blacklist.
Set to false to remove `amdgpu` from the selected nodes' modprobe blacklist. | `false` | +| `version` | ROCm driver version (e.g., "6.2.2").
See ROCm Versions: https://rocm.docs.amd.com/en/latest/release/versions.html | Ubuntu: `6.1.3`
CoreOS: `6.2.2` | +| `image` | Registry URL and repository (without tag).
Note: Operator manages tags automatically. | Vanilla k8s: `image-registry:5000/$MOD_NAMESPACE/amdgpu_kmod`
OpenShift: `image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod` | +| `imageRegistrySecret.name` | Name of registry credentials secret to pull/push driver image. | | +| `imageRegistryTLS.insecure` | If true, check if the container image already exists using plain HTTP. | `false` | +| `imageRegistryTLS.insecureSkipTLSVerify` | If true, skip any TLS server certificate validation. | `false` | +| `imageSign.keySecret` | Secret name of the private key used to sign kernel modules after image building in cluster.
See secure boot doc for instructions to create the secret: `./secure-boot`. | | +| `imageSign.certSecret` | Secret name of the public key used to sign kernel modules after image building in cluster.
See secure boot doc for instructions to create the secret: `./secure-boot`. | | +| `tolerations` | List of tolerations that will be set for KMM Module object and components like build pod and worker pod. | | +| `imageBuild.baseImageRegistry` | Registry hosting the base OS image.
Example: With Ubuntu 22.04 worker node and `docker.io`, the operator uses `docker.io/ubuntu:22.04` as base image. | `docker.io` | +| `imageBuild.baseImageRegistryTLS.insecure` | If true, check if the container image already exists using plain HTTP. | `false` | +| `imageBuild.baseImageRegistryTLS.insecureSkipTLSVerify` | If true, skip any TLS server certificate validation. | `false` | +| `imageBuild.sourceImageRepo` | (OpenShift only) Image repository hosting the amdgpu source code image. The operator determines the image tag based on the system and `spec.driver.version`.
Example: ROCm 7.0 + RHEL 9.6 → `docker.io/rocm/amdgpu-driver:coreos-9.6-7.0`. | `docker.io/rocm/amdgpu-driver` | #### `spec.devicePlugin` Parameters -| Parameter | Description | Default | -|-----------|-------------|---------| -| `devicePluginImage` | AMD GPU device plugin image | `rocm/k8s-device-plugin:latest` | -| `nodeLabellerImage` | Node labeller image | `rocm/k8s-device-plugin:labeller-latest` | -| `imageRegistrySecret.name` | Name of registry credentials secret
to pull device plugin / node labeller image | | -| `enableNodeLabeller` | enable / disable node labeller | `true` | +| Parameter | Description | Default | +|---------------------------------|-------------------------------------------------------------------------------------|------------------------------------------| +| `devicePluginImage` | AMD GPU device plugin image | `rocm/k8s-device-plugin:latest` | +| `nodeLabellerImage` | Node labeller image | `rocm/k8s-device-plugin:labeller-latest` | +| `imageRegistrySecret.name` | Name of registry credentials secret
to pull device plugin / node labeller image | | +| `enableNodeLabeller` | enable / disable node labeller | `true` | #### `spec.metricsExporter` Parameters | Parameter | Description | Default | -|-----------|-------------|---------| +| --------- | ----------- | ------- | | `enable` | Enable/disable metrics exporter | `false` | | `imageRegistrySecret.name` | Name of registry credentials secret
to pull metrics exporter image | | | `serviceType` | Service type for metrics endpoint
Options: "ClusterIP" or "NodePort" | `ClusterIP` | -| `port` | clsuter IP's internal service port
for reaching the metrics endpoint | `5000` | +| `port` | cluster IP's internal service port
for reaching the metrics endpoint | `5000` | | `nodePort` | Port number when using NodePort service type | automatically assigned | | `selector` | select which nodes to enable metrics exporter | same as `spec.selector` | #### `spec.selector` Parameters -| Parameter | Description | Default | -|-----------|-------------|---------| -| `selector` | Labels to select nodes for driver installation | `feature.node.kubernetes.io/amd-gpu: "true"` | +| Parameter | Description | Default | +|------------|-------------------------------------------------|-------------------------------------------------| +| `selector` | Labels to select nodes for driver installation +| `feature.node.kubernetes.io/amd-gpu: "true"` | ### Registry Secret Configuration diff --git a/docs/drivers/precompiled-driver.md b/docs/drivers/precompiled-driver.md index 4656fa702..cdda42037 100644 --- a/docs/drivers/precompiled-driver.md +++ b/docs/drivers/precompiled-driver.md @@ -22,18 +22,16 @@ KMM determines the appropriate driver image based on the combination of: KMM looks for driver images based on tags, the controller will use these methods to determine the image tag: 1. Parse the node's `osImage` field to determine the OS and version `kubectl get node -oyaml | grep -i osImage`: +2. Read the node's `kernelVersion` field to determine to kernel version `kubectl get node -oyaml | grep -i kernelVersion`. +3. Read user configured amdgpu driver version from `DeviceConfig` field `spec.driver.version`. | osImage | OS | version | -|---------|-----------|-------------------| +| --------- | ----------- | ------------------- | | `Ubuntu 24.04.1 LTS` | `Ubuntu` | `24.04` | | `Red Hat Enterprise Linux CoreOS 9.6.20250916-0 (Plow)` | `coreos` | `9.6` | -2. Read the node's `kernelVersion` field to determine to kernel version `kubectl get node -oyaml | grep -i kernelVersion`. -3. Read user configured amdgpu driver version from `DeviceConfig` field `spec.driver.version`. - - | OS | Tag Format | Example Image Tag | -|----|------------|-------------------| +| ---- | ------------ | ------------------- | | `ubuntu` | `ubuntu---` | `ubuntu-22.04-6.8.0-40-generic-6.1.3` | | `coreos` | `coreos---` | `coreos-9.6-5.14.0-427.28.1.el9_4.x86_64-6.2.2` | @@ -47,9 +45,9 @@ When a DeviceConfig is created with driver management enabled (`spec.driver.enab ### Ubuntu -Follow these image build steps to get a pre-compiled driver images, make sure your system matched with [ROCm required Linux system requirement](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html). +Follow these image build steps to get a pre-compiled driver images, make sure your system matched with [ROCm required Linux system requirement](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html). -1. Prepare the Dockerfile +#### Step 1: Prepare the Dockerfile ```dockerfile ARG OS_VERSION @@ -112,7 +110,7 @@ Build Steps Explanation: - Kernel modules: `/opt/lib/modules/${KERNEL_FULL_VERSION}/` - Firmware files: `/firmwareDir/updates/amdgpu/` -2. Trigger the build with the Dockerfile +#### Step 2: Trigger the build with the Dockerfile Make sure the build node has the same OS and kernel with your production nodes. @@ -130,7 +128,7 @@ docker build \ -t registry.example.com/amdgpu-driver:ubuntu-${VERSION_ID}-$(uname -r)-${AMDGPU_VERSION} . ``` -3. 
Push to the image to a registry +#### Step 3: Push to the image to a registry ```bash docker push registry.example.com/amdgpu-driver:ubuntu-${VERSION_ID}-$(uname -r)-${AMDGPU_VERSION} @@ -140,19 +138,20 @@ docker push registry.example.com/amdgpu-driver:ubuntu-${VERSION_ID}-$(uname -r)- Follow these image build steps to get a pre-compiled driver images for OpenShift cluster, make sure your RHEL version and driver version matched with [ROCm required Linux system requirement](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html). -1. Collect System Information +#### Step 1: Collect System Information Please collect system information from OpenShift build node before configuring the build process: -* kernel version: `uname -r` -* kernel compatible OpenShift DriverToolkit image: `oc adm release info --image-for driver-toolkit` +- kernel version: `uname -r` +- kernel compatible OpenShift DriverToolkit image: `oc adm release info --image-for driver-toolkit` -2. Prepare image registry: +#### Step 2: Prepare image registry Please decide where you want to push your pre-compiled driver image: - * Case 1: Use OpenShift internal registry: - * Enable internal registry (skip this step if you already enabled registry): +- Case 1: Use OpenShift internal registry: + - Enable internal registry (skip this step if you already enabled registry): + ```bash oc patch configs.imageregistry.operator.openshift.io cluster --type merge \ --patch '{"spec":{"storage":{"emptyDir":{}}}}' @@ -161,35 +160,40 @@ Please decide where you want to push your pre-compiled driver image: # make sure the image registry pods are running oc get pods -n openshift-image-registry ``` - * Create ImageStream + + - Create ImageStream + ```bash oc create imagestream amdgpu_kmod ``` - * Case 2: Use external image registry: - * Create secret to push image if required: + +- Case 2: Use external image registry: + - Create secret to push image if required: + ```bash kubectl create secret docker-registry docker-auth \ --docker-server=registry.example.com \ --docker-username=xxx \ --docker-password=xxx ``` -3. Create OpenShift `BuildConfig` -Please create the following YAML file, the full example is assuming you are using OpenShift internal image registry and build config will be saved in default namespace. +#### Step 3: Create OpenShift `BuildConfig` -* If you want to configure the build in other namespace, please change the namespace accordingly in the example steps. -* If you want to use other image registry, please replace the `spec.output` part with this: +Please create the following YAML file, the full example is assuming you are using OpenShift internal image registry and build config will be saved in default namespace. -```yaml -spec: - output: - pushSecret: - name: docker-auth - to: - kind: DockerImage - # follow the Image Tag Format section to get your image ta - name: registry.example.com/amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0 -``` +- If you want to configure the build in other namespace, please change the namespace accordingly in the example steps. 
+- If you want to use other image registry, please replace the `spec.output` part with this: + + ```yaml + spec: + output: + pushSecret: + name: docker-auth + to: + kind: DockerImage + # follow the Image Tag Format section to get your image ta + name: registry.example.com/amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0 + ``` Full example: @@ -274,22 +278,22 @@ spec: COPY --from=builder /lib/firmware/updates/amdgpu /firmwareDir/updates/amdgpu ``` -4. Trigger driver image build - -* Option 1 - Web Console: - * Login to OpenShift web console with username and password - * Select `Builds` then select `BuildConfigs` in the navigation bar - * Click `Create BuildConfig` then select YAML view, copy over the YAML file created in last step - * Select the `BuildConfig` in the list, click `Actions` then select `Start Build` - * Select `Builds` in the current `BuildConfig` page, a new build should be triggered and in running status. - * Wait for it to be completed, you can also monitor the progress in `Logs` section, in the end it should show push is successful. - * Delete the `BuildConfig` if needed. -* Option 2 - Command Line Interface (CLI): - * Create the `BuildConfig` by using the YAML file created in the last step: `oc apply -f build-config.yaml` - * Start the build: `oc start-build amd-gpu-operator-build` - * Check the build status: `oc get build` and `oc get pods | grep build` - * Wait for it to complete, the logs should show that push is successful - * Delete the `BuildConfig` if needed: `oc delete -f build-config.yaml` +#### Step 4: Trigger driver image build + +- Option 1 - Web Console: + - Login to OpenShift web console with username and password + - Select `Builds` then select `BuildConfigs` in the navigation bar + - Click `Create BuildConfig` then select YAML view, copy over the YAML file created in last step + - Select the `BuildConfig` in the list, click `Actions` then select `Start Build` + - Select `Builds` in the current `BuildConfig` page, a new build should be triggered and in running status. + - Wait for it to be completed, you can also monitor the progress in `Logs` section, in the end it should show push is successful. + - Delete the `BuildConfig` if needed. +- Option 2 - Command Line Interface (CLI): + - Create the `BuildConfig` by using the YAML file created in the last step: `oc apply -f build-config.yaml` + - Start the build: `oc start-build amd-gpu-operator-build` + - Check the build status: `oc get build` and `oc get pods | grep build` + - Wait for it to complete, the logs should show that push is successful + - Delete the `BuildConfig` if needed: `oc delete -f build-config.yaml` ## Using Pre-compiled Images @@ -328,4 +332,4 @@ kubectl create secret docker-registry docker-auth \ --docker-password=xxx ``` -- if you are hosting driver images in DockerHub, you don't need to specify the parameter ```--docker-server``` +- if you are hosting driver images in DockerHub, you don't need to specify the parameter `--docker-server` diff --git a/docs/drivers/upgrading.md b/docs/drivers/upgrading.md index 00f22da07..c1645ca54 100644 --- a/docs/drivers/upgrading.md +++ b/docs/drivers/upgrading.md @@ -141,8 +141,8 @@ The following are considered during the automatic upgrade process If it is observed that the upgrade status is in failed state for a specific node, the user can debug the node, fix it and then add this label to the node to restart upgrade on it. 
The upgrade state will be reset and it can be tracked as it was before - - Command: `kubectl label node operator.amd.com/gpu-driver-upgrade-state=upgrade-required` - - Label: `operator.amd.com/gpu-driver-upgrade-state: upgrade-required` +- Command: `kubectl label node operator.amd.com/gpu-driver-upgrade-state=upgrade-required` +- Label: `operator.amd.com/gpu-driver-upgrade-state: upgrade-required` ## 2. Manual Upgrade Process diff --git a/docs/index.md b/docs/index.md index 42715e2cc..b719e30ff 100644 --- a/docs/index.md +++ b/docs/index.md @@ -61,7 +61,6 @@ Below is a matrix of supported Operating systems and the corresponding Kubernete - Please refer to the [ROCM documentation](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) for the compatibility matrix for the AMD GPU DKMS driver. ## Prerequisites diff --git a/docs/installation/kubernetes-helm.md b/docs/installation/kubernetes-helm.md index 1f6efc7ce..78f5d301d 100644 --- a/docs/installation/kubernetes-helm.md +++ b/docs/installation/kubernetes-helm.md @@ -303,9 +303,10 @@ helm upgrade amd-gpu-operator amd/gpu-operator-helm \ ## Install Custom Resource After the installation of AMD GPU Operator: - * If you are using default `DeviceConfig`, you can modify the default `DeviceConfig` to adjust the config for your own use case. `kubectl edit deviceconfigs -n kube-amd-gpu default` - * If you installed without default `DeviceConfig` (either by using `--set crds.defaultCR.install=false` or installing a chart prior to v1.3.0), you need to create the `DeviceConfig` custom resource in order to trigger the operator start to work. By preparing the `DeviceConfig` in the YAML file, you can create the resouce by running ```kubectl apply -f deviceconfigs.yaml```. - * For custom resource definition and more detailed information, please refer to [Custom Resource Installation Guide](../drivers/installation). Here are some examples for common deployment scenarios. + +- If you are using default `DeviceConfig`, you can modify the default `DeviceConfig` to adjust the config for your own use case. `kubectl edit deviceconfigs -n kube-amd-gpu default` +- If you installed without default `DeviceConfig` (either by using `--set crds.defaultCR.install=false` or installing a chart prior to v1.3.0), you need to create the `DeviceConfig` custom resource in order to trigger the operator start to work. By preparing the `DeviceConfig` in the YAML file, you can create the resouce by running ```kubectl apply -f deviceconfigs.yaml```. +- For custom resource definition and more detailed information, please refer to [Custom Resource Installation Guide](../drivers/installation). Here are some examples for common deployment scenarios. ### Inbox or Pre-Installed AMD GPU Drivers diff --git a/docs/installation/openshift-olm.md b/docs/installation/openshift-olm.md index 54ed9b614..a9fab0c68 100644 --- a/docs/installation/openshift-olm.md +++ b/docs/installation/openshift-olm.md @@ -111,7 +111,7 @@ oc get pods -n openshift-image-registry Create an NFD custom resource to detect AMD GPU hardware, based on different deployment scenarios you need to choose creating `NodeFeatureDiscovery` or `NodeFeatureRule`. 
-* If your OpenShift cluster doesn't have `NodeFeatureDiscovery` deployed +- If your OpenShift cluster doesn't have `NodeFeatureDiscovery` deployed Please create the ```NodeFeatureDiscovery``` under the namespace where NFD operator is running: @@ -190,7 +190,7 @@ spec: ]} ``` -* If your OpenShift cluster already has `NodeFeatureDiscovery` deployed +- If your OpenShift cluster already has `NodeFeatureDiscovery` deployed You can alternatively create a namespaced `NodeFeatureRule` custom resource to avoid modifying `NodeFeatureDiscovery` which could possibly interrupt the existing node label. @@ -301,6 +301,7 @@ spec: ``` Things to note: + 1. By default, there is no need to specify the image field in CR for Openshift. Default will be used which is: image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod 2. If users specify image, $MOD_NAMESPACE can be a place holder , KMM Operator can automatically translate it to the namespace @@ -331,7 +332,7 @@ oc get node -o json | grep amd.com In order to enable the OpenShift native cluster monitoring stack to scrape metrics from metrics exporter, please: -* Label the namespace with OpenShift specific cluster monitoring label +- Label the namespace with OpenShift specific cluster monitoring label For example if AMD GPU Operator was deployed in namespace `openshift-amd-gpu`: @@ -339,7 +340,7 @@ For example if AMD GPU Operator was deployed in namespace `openshift-amd-gpu`: oc label namespace openshift-amd-gpu openshift.io/cluster-monitoring="true" ``` -* Enable the metrics exporter and configure the `serviceMonitor` in `DeviceConfig` +- Enable the metrics exporter and configure the `serviceMonitor` in `DeviceConfig` For example: @@ -357,9 +358,9 @@ spec: After applying this configuration, verify the metrics are being collected: -* Navigate to the OpenShift web console -* Go to **Observe** → **Targets** to confirm the metrics target is active -* Go to **Observe** → **Metrics** to query AMD GPU metrics +- Navigate to the OpenShift web console +- Go to **Observe** → **Targets** to confirm the metrics target is active +- Go to **Observe** → **Metrics** to query AMD GPU metrics ## Uninstallation diff --git a/docs/knownlimitations.md b/docs/knownlimitations.md index 78dae827f..92e64858e 100644 --- a/docs/knownlimitations.md +++ b/docs/knownlimitations.md @@ -27,12 +27,7 @@ 5. **Worker nodes where Kernel needs to be upgraded needs to taken out of the cluster and readded with Operator installed** - ***Impact:*** Node upgrade will not proceed automatically and requires manual intervention - ***Affected Configurations:*** All configurations - - ***Workaround:*** Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off: - - ```bash - kubectl cordon - ``` - + - ***Workaround:*** Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off: `kubectl cordon `.

6. **Due to issue with KMM 2.2 deletion of DeviceConfig Custom Resource gets stuck in Red Hat OpenShift** diff --git a/docs/kubevirt/kubevirt.md b/docs/kubevirt/kubevirt.md index ef2e0f8a2..46bf7d73c 100644 --- a/docs/kubevirt/kubevirt.md +++ b/docs/kubevirt/kubevirt.md @@ -1,3 +1,4 @@ + # KubeVirt Integration ## Overview @@ -23,55 +24,65 @@ The AMD GPU Operator now supports integration with [**KubeVirt**](https://kubevi You need to set up System BIOS to enable the virtualization related features. For example, sample System BIOS settings will look like this (depending on vendor and BIOS version): -* SR-IOV Support: Enable this option in the Advanced → PCI Subsystem Settings page. +- SR-IOV Support: Enable this option in the Advanced → PCI Subsystem Settings page. -* Above 4G Decoding: Enable this option in the Advanced → PCI Subsystem Settings page. +- Above 4G Decoding: Enable this option in the Advanced → PCI Subsystem Settings page. -* PCIe ARI Support: Enable this option in the Advanced → PCI Subsystem Settings page. +- PCIe ARI Support: Enable this option in the Advanced → PCI Subsystem Settings page. -* IOMMU: Enable this option in the Advanced → NB Configuration page. +- IOMMU: Enable this option in the Advanced → NB Configuration page. -* ACS Enabled: Enable this option in the Advanced → NB Configuration page. +- ACS Enabled: Enable this option in the Advanced → NB Configuration page. ### GRUB Config Update -* Edit GRUB Configuration File: +- Edit GRUB Configuration File: Use a text editor to modify the /etc/default/grub file (Following example uses “nano” text editor). Open the terminal and run the following command: + ```bash sudo nano /etc/default/grub ``` -* Modify the `GRUB_CMDLINE_LINUX` Line: +- Modify the `GRUB_CMDLINE_LINUX` Line: Look for the line that begins with `GRUB_CMDLINE_LINUX`. Modify it to include following parameters, : + ```bash GRUB_CMDLINE_LINUX="modprobe.blacklist=amdgpu iommu=on amd_iommu=on" ``` + If there are already parameters in the quotes, append your new parameters separated by spaces. + ```{note} Note: In case host machine is running Intel CPU, replace `amd_iommu` with `intel_iommu`. ``` -* After modifying the configuration file, you need to update the GRUB settings by running the following command: +- After modifying the configuration file, you need to update the GRUB settings by running the following command: + ```bash sudo update-grub ``` -* Reboot Your System: +- Reboot Your System: For the changes to take effect, reboot your system using the following command: + ```bash sudo reboot ``` -* Verifying changes: +- Verifying changes: After the system reboots, confirm that the GRUB parameters were applied successfully by running: + ```bash cat /proc/cmdline ``` + When you run the command above, you should see a line that includes: + ```bash modprobe.blacklist=amdgpu iommu=on amd_iommu=on ``` -This indicates that your changes have been applied correctly.  + +This indicates that your changes have been applied correctly. ## Configure KubeVirt @@ -81,6 +92,7 @@ After properly installing the KubeVirt, there will be a KubeVirt custom resource 2. Add the PF or VF PCI device information to the host devices permitted list. 
For example, in order to add MI300X VF: + ```yaml $ kubectl get kubevirt -n kubevirt kubevirt -oyaml apiVersion: kubevirt.io/v1 @@ -113,6 +125,7 @@ In order to bring up guest VM with VF based GPU-Passthrough, [AMD MxGPU GIM Driv If you already prepared the GPU hosts with GIM driver pre-installed and want to directly use it, you don't have to ask AMD GPU Operator to install it for you: 1. Disable the out-of-tree driver management in `DeviceConfig`: + ```yaml spec: driver: @@ -120,6 +133,7 @@ spec: ``` 2. Make sure the AMD GPU VF on your host is already bound to `vfio-pci` kernel module. + ```bash $ lspci -nnk | grep 1002 -A 3 85:00.0 Processing accelerators [1200]: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X] [1002:74a1] @@ -133,6 +147,7 @@ $ lspci -nnk | grep 1002 -A 3 ``` 3. Verify that the VF has been advertised as a resource by device plugin: + ```yaml $ kubectl get node -oyaml | grep -i allocatable -A 5 allocatable: @@ -144,6 +159,7 @@ $ kubectl get node -oyaml | grep -i allocatable -A 5 If you don't have GIM driver installed on the GPU hosts, AMD GPU Operator can help you install the out-of-tree GIM kernel module to your hosts and automatically bind the VF devices to the `vfio-pci` kernel module to make it ready for passthrough: 1. Enable the out-of-tree driver management in `DeviceConfig`: + ```yaml spec: driver: @@ -177,6 +193,7 @@ spec: ``` 2. Verify that the worker node is labeled with proper driver type and vfio ready labels: + ```yaml $ kubectl get node -oyaml | grep operator.amd gpu.operator.amd.com/kube-amd-gpu.test-deviceconfig.driver: vf-passthrough @@ -184,6 +201,7 @@ $ kubectl get node -oyaml | grep operator.amd ``` 3. Verify that the AMD GPU VF on your host is bound to `vfio-pci` kernel module. + ```bash $ lspci -nnk | grep 1002 -A 3 85:00.0 Processing accelerators [1200]: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X] [1002:74a1] @@ -197,6 +215,7 @@ $ lspci -nnk | grep 1002 -A 3 ``` 4. Verify that the VF has been advertised as a resource by device plugin: + ```yaml $ kubectl get node -oyaml | grep -i allocatable -A 5 allocatable: @@ -212,6 +231,7 @@ In order to bring up guest VM with PF based GPU-Passthrough, you don't have to i If you are using your own method to manage the PF device and it is already bound with `vfio-pci`, please: 1. Disable the driver management of AMD GPU Operator: + ```yaml spec: driver: @@ -219,6 +239,7 @@ spec: ``` 2. Verify that the AMD GPU PF on your host is already bound to `vfio-pci` kernel module. + ```bash $ lspci -nnk | grep 1002 -A 3 85:00.0 Processing accelerators [1200]: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X] [1002:74a1] @@ -228,6 +249,7 @@ $ lspci -nnk | grep 1002 -A 3 ``` 3. Verify that the PF has been advertised as a resource by device plugin: + ```yaml $ kubectl get node -oyaml | grep -i allocatable -A 5 allocatable: @@ -235,9 +257,11 @@ $ kubectl get node -oyaml | grep -i allocatable -A 5 ``` #### Use AMD GPU Operator to manage PF-Passthrough vfio binding + The AMD GPU Operator can help you bind the AMD GPU PF device to the `vfio-pci` kernel module on all the selected GPU hosts: 1. Configure the `DeviceConfig` custom resource to use PF-Passthrough: + ```yaml spec: driver: @@ -256,6 +280,7 @@ spec: ``` 2. 
Verify that the worker node is labeled with proper driver type and vfio ready labels: + ```yaml $ kubectl get node -oyaml | grep operator.amd gpu.operator.amd.com/kube-amd-gpu.test-deviceconfig.driver: pf-passthrough @@ -263,6 +288,7 @@ $ kubectl get node -oyaml | grep operator.amd ``` 3. Verify that the AMD GPU PF on your host is bound to `vfio-pci` kernel module. + ```bash $ lspci -nnk | grep 1002 -A 3 85:00.0 Processing accelerators [1200]: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X] [1002:74a1] @@ -272,13 +298,13 @@ $ lspci -nnk | grep 1002 -A 3 ``` 4. Verify that the PF has been advertised as a resource by device plugin: + ```yaml $ kubectl get node -oyaml | grep -i allocatable -A 5 allocatable: amd.com/gpu: "1" ``` - ## GPU Operator Components ### Device Plugin @@ -305,6 +331,7 @@ Similar to the Device Plugin, the Node Labeler can auto-detect the operational m Key labels for PF and VF passthrough modes are listed below. Placeholders like ``, ``, ``, and `` represent actual device IDs (e.g., `74a1`, `74b5`), device counts, and GIM driver versions (e.g., `8.1.0.K`) respectively. **PF Passthrough Mode Labels:** + - `amd.com/gpu.mode=pf-passthrough` - `beta.amd.com/gpu.mode=pf-passthrough` - `amd.com/gpu.device-id=` @@ -312,6 +339,7 @@ Key labels for PF and VF passthrough modes are listed below. Placeholders like ` - `beta.amd.com/gpu.device-id.=` **VF Passthrough Mode Labels:** + - `amd.com/gpu.mode=vf-passthrough` - `beta.amd.com/gpu.mode=vf-passthrough` - `amd.com/gpu.device-id=` diff --git a/docs/metrics/exporter.md b/docs/metrics/exporter.md index dbc05d99f..31cc3a291 100644 --- a/docs/metrics/exporter.md +++ b/docs/metrics/exporter.md @@ -27,7 +27,6 @@ | 6.4.x | 6.12.12 | v1.3.0 | MI3xx | | 6.4.x | 6.12.12 | v1.3.0.1 | MI2xx, MI3xx | - ## Configure metrics exporter To start the Device Metrics Exporter along with the GPU Operator configure the ``` spec/metricsExporter/enable ``` field in deviceconfig Custom Resource(CR) to enable/disable metrics exporter diff --git a/docs/metrics/health.md b/docs/metrics/health.md index 1f1387496..21f156478 100644 --- a/docs/metrics/health.md +++ b/docs/metrics/health.md @@ -56,7 +56,8 @@ information. amd.com/gpu: 1 ``` -### GPU Health Status : Unhealthy +### GPU Health Status: Unhealthy + The GPU health status if reported as "Unhealthy" on any node, makes the GPU unavailable for k8s jobs, any job requesting AMD GPU will not be scheduled in unhealthy GPU, but if any job is already scheduled will not be evicted when the GPU transitions from Healthy -> Unhealthy. If there are no job assciated with the GPU and a new request for GPU on unhealthy is created on K8s, the Job will be pending state and will not be allowed to run on an unhealthy GPU. This will reduce the number of Allocatable entries on the node by the total number of unhealthy GPU reported on that node. 
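As a quick way to observe this behavior, you can check the node's allocatable GPU count with the same command used elsewhere in these docs; this is a minimal sketch and `<node-name>` is a placeholder:

```bash
# Allocatable amd.com/gpu drops by the number of GPUs reported unhealthy on the node
kubectl get node <node-name> -oyaml | grep -i allocatable -A 5
```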
diff --git a/docs/metrics/kube-rbac-proxy.md b/docs/metrics/kube-rbac-proxy.md index 985d2fab6..abed7b92c 100644 --- a/docs/metrics/kube-rbac-proxy.md +++ b/docs/metrics/kube-rbac-proxy.md @@ -63,6 +63,7 @@ kubectl create configmap my-client-ca --from-file=ca.crt=path/to/ca.crt -n kube- ## DeviceConfig Configuration Examples Token-Based Authorization: + ```yaml metricsExporter: rbacConfig: @@ -74,6 +75,7 @@ metricsExporter: ``` mTLS with Certificate based RBAC Authorization: + ```yaml metricsExporter: rbacConfig: @@ -86,6 +88,7 @@ metricsExporter: ``` mTLS with Static Authorization: + ```yaml metricsExporter: rbacConfig: diff --git a/docs/metrics/prometheus-openshift.md b/docs/metrics/prometheus-openshift.md index 37ec55247..ca8293368 100644 --- a/docs/metrics/prometheus-openshift.md +++ b/docs/metrics/prometheus-openshift.md @@ -4,11 +4,12 @@ The AMD GPU Operator integrates with Prometheus to enable monitoring of GPU metr Prometheus integration is managed via the **ServiceMonitor** configuration in the DeviceConfig Custom Resource (CR). When enabled, the operator automatically creates a ServiceMonitor tailored to the metrics exported by the Device Metrics Exporter. The integration supports various authentication and authorization methods, including Bearer Tokens and mutual TLS (mTLS), providing flexibility to accommodate different security requirements. -Openshift has its own integrated Prometheus instances which we will utilize instead of a separate operator that vanilla k8s environments would utilize. Additionally, Openshift natively supports Perses for dashboards instead of grafana which is supported with our vanilla k8s deployment guide. +Openshift has its own integrated Prometheus instances which we will utilize instead of a separate operator that vanilla k8s environments would utilize. Additionally, Openshift natively supports Perses for dashboards instead of grafana which is supported with our vanilla k8s deployment guide. ## Prerequisites Before enabling Prometheus integration, ensure you have: + - Ensure you have enabled and configured the openshift-user-workload-monitoring - Have labeled the kube-amd-gpu namespace with `openshift.io/cluster-monitoring=true` - The Device Metrics Exporter enabled in your GPU Operator deployment. @@ -80,18 +81,19 @@ metricsExporter: - **bearerTokenFile**: (Deprecated) Path to a file containing the bearer token for authentication. Retained for legacy use case. Use authorization block instead to pass tokens. - **authorization**: Configures token-based authorization. Reference to the token stored in a Kubernetes Secret - **tlsConfig**: Configures TLS for secure connections: - - **insecureSkipVerify**: When true, skips certificate verification (not recommended for production) - - **serverName**: Server name used for certificate validation - - **ca**: ConfigMap containing the CA certificate for server verification - - **cert**: Secret containing the client certificate for mTLS - - **keySecret**: Secret containing the client key for mTLS - - **caFile/certFile/keyFile**: File equivalents for certificates/keys mounted in Prometheus pod. 
+ - **insecureSkipVerify**: When true, skips certificate verification (not recommended for production) + - **serverName**: Server name used for certificate validation + - **ca**: ConfigMap containing the CA certificate for server verification + - **cert**: Secret containing the client certificate for mTLS + - **keySecret**: Secret containing the client key for mTLS + - **caFile/certFile/keyFile**: File equivalents for certificates/keys mounted in Prometheus pod. These options allow secure metrics collection from AMD Device Metrics Exporter endpoints that are protected by the kube-rbac-proxy sidecar for authentication/authorization. ## Accessing Metrics with Openshift integrated Prometheus Upon applying the DeviceConfig with the correct settings, the GPU Operator automatically: + - Deploys the ServiceMonitor resource in the GPU Operator namespace. - Sets the required labels and namespace selectors in ServiceMonitor CR for Prometheus discovery. @@ -108,8 +110,9 @@ TODO When Prometheus scrapes targets defined by a `ServiceMonitor`, it automatically attaches labels to the metrics based on the target's metadata. One such label is `pod`, which identifies the Pod being scraped (in this case, the metrics exporter Pod itself). This creates a conflict: -1. **Exporter Metric Label:** `pod=""` (Indicates the actual GPU user) -2. **Prometheus Target Label:** `pod=""` (Indicates the source of the metric) + +1. **Exporter Metric Label:** `pod=""` (Indicates the actual GPU user) +2. **Prometheus Target Label:** `pod=""` (Indicates the source of the metric) ### Solution 1: `honorLabels: true` (Default) diff --git a/docs/metrics/prometheus.md b/docs/metrics/prometheus.md index b5001e74e..054bb3d0c 100644 --- a/docs/metrics/prometheus.md +++ b/docs/metrics/prometheus.md @@ -7,6 +7,7 @@ Prometheus integration is managed via the **ServiceMonitor** configuration in th ## Prerequisites Before enabling Prometheus integration, ensure you have: + - A running instance of the Prometheus Operator in your Kubernetes cluster. - The Device Metrics Exporter enabled in your GPU Operator deployment. - Properly configured kube-rbac-proxy in the DeviceConfig CR if the exporter endpoint is protected (Optional). @@ -72,18 +73,19 @@ metricsExporter: - **bearerTokenFile**: (Deprecated) Path to a file containing the bearer token for authentication. Retained for legacy use case. Use authorization block instead to pass tokens. - **authorization**: Configures token-based authorization. Reference to the token stored in a Kubernetes Secret - **tlsConfig**: Configures TLS for secure connections: - - **insecureSkipVerify**: When true, skips certificate verification (not recommended for production) - - **serverName**: Server name used for certificate validation - - **ca**: ConfigMap containing the CA certificate for server verification - - **cert**: Secret containing the client certificate for mTLS - - **keySecret**: Secret containing the client key for mTLS - - **caFile/certFile/keyFile**: File equivalents for certificates/keys mounted in Prometheus pod. + - **insecureSkipVerify**: When true, skips certificate verification (not recommended for production) + - **serverName**: Server name used for certificate validation + - **ca**: ConfigMap containing the CA certificate for server verification + - **cert**: Secret containing the client certificate for mTLS + - **keySecret**: Secret containing the client key for mTLS + - **caFile/certFile/keyFile**: File equivalents for certificates/keys mounted in Prometheus pod. 
These options allow secure metrics collection from AMD Device Metrics Exporter endpoints that are protected by the kube-rbac-proxy sidecar for authentication/authorization. ## Accessing Metrics with Prometheus Upon applying the DeviceConfig with the correct settings, the GPU Operator automatically: + - Deploys the ServiceMonitor resource in the GPU Operator namespace. - Sets the required labels and namespace selectors in ServiecMonitor CR for Prometheus discovery. @@ -98,16 +100,17 @@ These selectors help Prometheus identify the correct ServiceMonitor to use in th The [ROCm/device-metrics-exporter](https://github.com/ROCm/device-metrics-exporter/tree/main/grafana) repository includes Grafana dashboards designed to visualize the exported metrics, particularly focusing on job-level or pod-level GPU usage. More details can be found in [Device Metrics Exporter Grafana](https://instinct.docs.amd.com/projects/device-metrics-exporter/en/main/integrations/prometheus-grafana.html#prometheus-and-grafana-integration). These dashboards rely on specific labels exported by the metrics exporter, such as: -* `pod`: The name of the kubernetes workload Pod currently utilizing the GPU. -* `job_id`: An identifier for the slurm job associated with the workload. +- `pod`: The name of the kubernetes workload Pod currently utilizing the GPU. +- `job_id`: An identifier for the slurm job associated with the workload. ### The `pod` Label Conflict When Prometheus scrapes targets defined by a `ServiceMonitor`, it automatically attaches labels to the metrics based on the target's metadata. One such label is `pod`, which identifies the Pod being scraped (in this case, the metrics exporter Pod itself). This creates a conflict: -1. **Exporter Metric Label:** `pod=""` (Indicates the actual GPU user) -2. **Prometheus Target Label:** `pod=""` (Indicates the source of the metric) + +1. **Exporter Metric Label:** `pod=""` (Indicates the actual GPU user) +2. **Prometheus Target Label:** `pod=""` (Indicates the source of the metric) ### Solution 1: `honorLabels: true` (Default) diff --git a/docs/npd/node-problem-detector.md b/docs/npd/node-problem-detector.md index c57072082..fd4474f3d 100644 --- a/docs/npd/node-problem-detector.md +++ b/docs/npd/node-problem-detector.md @@ -1,3 +1,4 @@ + # Node Problem Detector Integration Node-problem-detector(NPD) aims to make various node problems visible to the upstream layers in the cluster management stack. It is a daemon that runs on each node, detects node problems and reports them to apiserver. NPD can be extended to detect AMD GPU problems. @@ -13,7 +14,8 @@ Custom plugin monitor is a plugin mechanism for node-problem-detector. It will e Exit codes 0, 1, and 2 are used for plugin monitor. Exit code 0 is treated as working state. Exit code 1 is treated as problem state. Exit code 2 is used for any unknown error. When plugin monitor detects exit code 1, it sets NodeCondition based on the rules defined in custom plugin monitor config file ## Node-Problem-Detector Integration -We provide a small utility, `amdgpuhealth`, queries various AMD GPU metrics from `device-metrics-exporter` and `Prometheus` endpoint. Based on user-configured thresholds, it determines if any AMD GPU is in problem state. NPD custom plugin monitor can invoke this program at configurable intervals to monitor various metrics and assess overall health of AMD GPUs. + +We provide a small utility, `amdgpuhealth`, queries various AMD GPU metrics from `device-metrics-exporter` and `Prometheus` endpoint. 
Based on user-configured thresholds, it determines if any AMD GPU is in problem state. NPD custom plugin monitor can invoke this program at configurable intervals to monitor various metrics and assess overall health of AMD GPUs. The utility `amdgpuhealth` is packaged with device-metrics-exporter docker image and will be copied to host path `/var/lib/amd-metrics-exporter`. NPD needs to mount this host path to be able to use the utility via custom plugin monitor. @@ -28,6 +30,7 @@ Example usage of amdgpuhealth CLI: In the above examples, the program queries either a counter or gauge metric. You can define a threshold for each metric. If the reported AMD GPU metric value exceeds the threshold, `amdgpuhealth` prints an error message to standard output and exits with code 1. The NPD plugin uses this exit code and output to update the node condition's status and message respectively, indicating problem with AMD GPU. Example custom plugin monitor config: + ```json { "plugin": "custom", @@ -71,8 +74,8 @@ Example custom plugin monitor config: "gauge-metric", "-m=GPUMetricField_GPU_EDGE_TEMPERATURE", "-t=100", - "-d=1h", - "--prometheus-endpoint=http://localhost:9090" + "-d=1h", + "--prometheus-endpoint=http://localhost:9090" ], "timeout": "10s" } @@ -92,13 +95,13 @@ If your AMD Device Metrics Exporter or Prometheus endpoints require token-based You can create a Kubernetes Secret to store the token for the AMD Device Metrics Exporter endpoint in two ways: -**From a file:** +- From a file: ```bash kubectl create secret generic -n amd-exporter-auth-token --from-file=token= ``` -**From a string literal** +- From a string literal: ```bash kubectl create secret genreic -n amd-exporter-auth-token --from-literal=token= @@ -118,25 +121,24 @@ Mount this secret as a volume in your NPD deployment yaml. The same path must be "counter-metric", "-m=GPU_ECC_UNCORRECT_UMC", "-t=1", - "--exporter-bearer-token=" + "--exporter-bearer-token=" ], "timeout": "10s" } ] ``` - 2. **Creating a Authorization token Secret for Prometheus endpoint:** Similarly create secret for Prometheus endpoint. This will be needed for gauge metrics -**From a file** +- From a file: ```bash kubectl create secret generic -n prometheus-auth-token --from-file=token= ``` -**From a string literal** +- From a string literal: ```bash kubectl create secret genreic -n prometheus-auth-token --from-literal=token= @@ -172,6 +174,7 @@ For TLS, NPD needs to have server endpoint's Root CA certificate to authenticate 1. **Creating Secret for AMD Device Metrics Exporter endpoint Root CA** Please make sure the key in the secret is set to `ca.crt` + ```bash kubectl create secret generic -n amd-exporter-rootca --from-file=ca.crt= ``` @@ -226,7 +229,7 @@ Mount this secret as a volume in your NPD deployment yaml. Pass the mount path i For mTLS, NPD needs to have a certificate and it's corresponding private key. Certificate information can be stored as Kubernetes TLS Secret and mounted as colume in the NPD pod. -1. **Creating Secret for NPD identity certificate** +3. **Creating Secret for NPD identity certificate** Please make sure you use the keys `tls.crt` and `tls.key` for certificate and key respectively @@ -249,7 +252,7 @@ Mount the secret as a volume in your NPD deployment yaml. Pass the mount path as "-m=GPU_ECC_UNCORRECT_UMC", "-t=1", "--exporter-root-ca=", - "--client-cert=" + "--client-cert=" ], "timeout": "10s" }, @@ -264,7 +267,7 @@ Mount the secret as a volume in your NPD deployment yaml. 
Pass the mount path as "-m=GPUMetricField_GPU_EDGE_TEMPERATURE", "-t=100", "--prometheus-root-ca=/ca.crt", - "--client-cert=" + "--client-cert=" ], "timeout": "10s" } diff --git a/docs/overview.md b/docs/overview.md index 264185515..4a9069454 100644 --- a/docs/overview.md +++ b/docs/overview.md @@ -92,7 +92,7 @@ The [Device Config Manager](https://github.com/ROCm/device-config-manager) is u - DCM will be handling the GPU partitioning configurations - Different partition types supported are: - - Memory partitions (NPS1, NPS2, NPS4) - - Compute partitions (SPX, DPX, QPX, CPX) + - Memory partitions (NPS1, NPS2, NPS4) + - Compute partitions (SPX, DPX, QPX, CPX) - Supports Systemd integration to start/stop service files - Report partition results as Kubernetes events. diff --git a/docs/releasenotes.md b/docs/releasenotes.md index 3a0ff3912..8f21c9088 100644 --- a/docs/releasenotes.md +++ b/docs/releasenotes.md @@ -10,9 +10,10 @@ The AMD GPU Operator v1.4.1 release extends platform support to OpenShift v4.20 - Starting with ROCm 7.1, the AMD GPU driver version numbering has diverged from the ROCm release version. The amdgpu driver now uses an independent versioning scheme (e.g., version 30.20 corresponds to ROCm 7.1). When specifying driver versions in the DeviceConfig CR `spec.driver.version`, users should reference the amdgpu driver version (e.g., "30.20") for ROCm 7.1 and later releases. For ROCm versions prior to 7.1, continue to use the ROCm version number (e.g., "6.4", "7.0"). Please refer to the [AMD ROCm documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html) for the driver version that corresponds to your desired ROCm release. All published amdgpu driver versions are available at [Radeon Repository](https://repo.radeon.com/amdgpu/). ### Release Highlights + - **OpenShift Platform Support Enhancements** - **Build Driver Images Directly within Disconnected OpenShift Clusters** - - Starting from v1.4.1, the AMD GPU Operator supports building driver kernel modules directly within disconnected OpenShift clusters. + - Starting from v1.4.1, the AMD GPU Operator supports building driver kernel modules directly within disconnected OpenShift clusters. - For Red Hat Enterprise Linux CoreOS (used by OpenShift), OpenShift will download source code and firmware from AMD provided [amdgpu-driver images](https://hub.docker.com/r/rocm/amdgpu-driver) into their [DriverToolKit](https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/specialized_hardware_and_driver_enablement/driver-toolkit) and directly build the kernel modules from source code without dependency on lots of RPM packages. - **Cluster Monitoring Enablement** - The v1.4.1 AMD GPU Operator automatically creates the RBAC resources required by the OpenShift [Cluster Monitoring stack](https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#configuring-prometheus-to-scrape-metrics). This reduces one manual configuration steps when setting up the OpenShift monitoring stack to scrape metrics from the device metrics exporter. @@ -28,10 +29,11 @@ The AMD GPU Operator v1.4.1 release extends platform support to OpenShift v4.20 - Test runner Kubernetes events now include additional information such as pod UID and test framework name (e.g., RVS, AGFHC) as event labels, providing more comprehensive test run information for improved tracking and diagnostics. ### Fixes + 1. 
**Node Feature Discovery Rule Fix** - * Fixed the PCI device ID for the Virtual Function (VF) of MI308X and MI300X-HF GPUs + - Fixed the PCI device ID for the Virtual Function (VF) of MI308X and MI300X-HF GPUs 2. **Helm Chart default DeviceConfig Fix** - * Fixed an issue where the Helm chart could not render the metrics exporter's pod resource API socket path in the default DeviceConfig when specified via `values.yaml` or the `--set` option. + - Fixed an issue where the Helm chart could not render the metrics exporter's pod resource API socket path in the default DeviceConfig when specified via `values.yaml` or the `--set` option. ### Known Limitations @@ -81,21 +83,23 @@ The AMD GPU Operator v1.4.0 adds MI35X platform support and updates all managed - GPU_JPEG_BUSY_INSTANTANEOUS - GPU_VCN_BUSY_INSTANTANEOUS - ### Platform Support - - Validated for vanilla kubernetes 1.32, 1.33 + +- Validated for vanilla kubernetes 1.32, 1.33 ### Fixes + 1. **Failed to load GPU Operator managed amdgpu kernel module on Ubuntu 24.04** - * When users are using GPU Operator to build and manage the amdgpu kernel module, it may fail on the Ubuntu 24.04 worker nodes if the node doesn't have `linux-modules-extra-$(uname -r)` installed. - * This issue was fixed by this release, `linux-modules-extra-$(uname -r)` won't be required to be installed on the worker node. + - When users are using GPU Operator to build and manage the amdgpu kernel module, it may fail on the Ubuntu 24.04 worker nodes if the node doesn't have `linux-modules-extra-$(uname -r)` installed. + - This issue was fixed by this release, `linux-modules-extra-$(uname -r)` won't be required to be installed on the worker node. 2. **Improved Test Runner Result Handling** - * Previously, if some test cases in a recipe were skipped while others passed, the test runner would incorrectly mark the entire recipe as failed. - * Now, the test runner marks the recipe as passed if at least some test cases pass. If all test cases are skipped, the recipe is marked as skipped. + - Previously, if some test cases in a recipe were skipped while others passed, the test runner would incorrectly mark the entire recipe as failed. + - Now, the test runner marks the recipe as passed if at least some test cases pass. If all test cases are skipped, the recipe is marked as skipped. 3. **Device Config Manager keeps retrying and waiting for unsupported memory partition type** - * This issue has been fixed, currently if users provide unsupported memory partitions for the GPU model, DMC would immediately fail the workflow and won't keep retrying on unsupported memory partition. + - This issue has been fixed, currently if users provide unsupported memory partitions for the GPU model, DMC would immediately fail the workflow and won't keep retrying on unsupported memory partition. ### Known Limitations +> > **Note:** All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked in the following documentation page: [Known Issues and Limitations](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/knownlimitations.html). Please refer to this page regularly for the most up-to-date information. @@ -142,7 +146,7 @@ The AMD GPU Operator v1.3.1 release extends platform support to OpenShift v4.19 1. 
**Test Runner pod restart failure with a GPU partition profile change** - Previously in v1.3.0 when users disabled test runner that cut down the ongoing test, then enabled it again with an underlying GPU partition profile change, the test runner would possibly fail to restart due to the device ID change caused by partition profile change. - - This has been fixed in v1.3.1 release. + - This has been fixed in v1.3.1 release. 2. **Device Config Manager memory partition failure when the driver was installed by Kernel Module Management (KMM) Operator** - Previously in v1.3.0 if users worker nodes have no inbox/pre-installed amdgpu driver (ROCm 6.4+) and users install the driver via the KMM operator, the memory partition configuration on Device Config Manager would fail @@ -185,7 +189,7 @@ The AMD GPU Operator v1.3.0 release introduces new features, most notably of whi ### Known Limitations -- The Device Config Manager is currently only supported on Kubernetes. We will be adding a Debian package to support bare metal installations in the next release of DCM. For the time being +- The Device Config Manager is currently only supported on Kubernetes. We will be adding a Debian package to support bare metal installations in the next release of DCM. For the time being 1. **The Device Config Manager requires running a docker container if you wish to run it in standalone mode (without Kubernetes).** @@ -241,7 +245,7 @@ The AMD GPU Operator v1.2.2 release introduces new features to support Device Me - *Root Cause*: - When node labeller pod launched it will remove all node labels within `amd.com` and `beta.amd.com` from current node then post the labels managed by itself. - When operator is executing the reconcile function, the removal of `DevicePlugin` or will remove all node labels under `amd.com` or `beta.amd.com` domain even if they are not managed by node labeller. - - *Resolution*: This issue has been fixed in v1.2.2 for both operator and node labeller side. Users can upgrade to v1.2.2 operator helm chart and use latest node labeller image then only node labeller managed labels will be auto removed. Other users defined labels under `amd.com` or `beta.amd.com` won't be auto removed by operator or node labeller. + - *Resolution*: This issue has been fixed in v1.2.2 for both operator and node labeller side. Users can upgrade to v1.2.2 operator helm chart and use latest node labeller image then only node labeller managed labels will be auto removed. Other users defined labels under `amd.com` or `beta.amd.com` won't be auto removed by operator or node labeller. 3. **During automatic driver upgrade nodes can get stuck in reboot-in-progress** - *Issue*: When users upgrade the driver version by using `DeviceConfig` automatic upgrade feature with `spec.driver.upgradePolicy.enable=true` and `spec.driver.upgradePolicy.rebootRequired=true`, some nodes may get stuck at reboot-in-progress state. @@ -493,28 +497,28 @@ Not Applicable as this is the initial release. 1. **GPU operator driver installs only DKMS package** - - *Impact:* Applications which require ROCM packages will need to install respective packages. - - *Affected Configurations:* All configurations - - *Workaround:* None as this is the intended behaviour +- *Impact:* Applications which require ROCM packages will need to install respective packages. +- *Affected Configurations:* All configurations +- *Workaround:* None as this is the intended behaviour -2. **When Using Operator to install amdgpu 6.1.3/6.2 a reboot is required to complete install** +1. 
**When Using Operator to install amdgpu 6.1.3/6.2 a reboot is required to complete install** - *Impact:* Node requires a reboot when upgrade is initiated due to ROCm bug. Driver install failures may be seen in dmesg - *Affected configurations:* Nodes with driver version >= ROCm 6.2.x - *Workaround:* Reboot the nodes upgraded manually to finish the driver install. This has been fixed in ROCm 6.3+ -3. **GPU Operator unable to install amdgpu driver if existing driver is already installed** +2. **GPU Operator unable to install amdgpu driver if existing driver is already installed** - *Impact:* Driver install will fail if amdgpu in-box Driver is present/already installed - *Affected Configurations:* All configurations - *Workaround:* When installing the amdgpu drivers using the GPU Operator, worker nodes should have amdgpu blacklisted or amdgpu drivers should not be pre-installed on the node. [Blacklist in-box driver](https://instinct.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/drivers/installation.html#blacklist-inbox-driver) so that it is not loaded or remove the pre-installed driver -4. **When GPU Operator is used in SKIP driver install mode, if amdgpu module is removed with device plugin installed it will not reflect active GPU available on the server** +3. **When GPU Operator is used in SKIP driver install mode, if amdgpu module is removed with device plugin installed it will not reflect active GPU available on the server** - *Impact:* Scheduling Workloads will have impact as it will scheduled on nodes which does have active GPU. - *Affected Configurations:* All configurations - *Workaround:* Restart the Device plugin pod deployed. -5. **Worker nodes where Kernel needs to be upgraded needs to taken out of the cluster and readded with Operator installed** +4. **Worker nodes where Kernel needs to be upgraded needs to taken out of the cluster and readded with Operator installed** - *Impact:* Node upgrade will not proceed automatically and requires manual intervention - *Affected Configurations:* All configurations - *Workaround:* Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off: @@ -523,7 +527,7 @@ Not Applicable as this is the initial release. kubectl cordon ``` -6. **When GPU Operator is installed with Exporter enabled, upgrade of driver is blocked as exporter is actively using the amdgpu module** +5. 
**When GPU Operator is installed with Exporter enabled, upgrade of driver is blocked as exporter is actively using the amdgpu module** - *Impact:* Driver upgrade is blocked - *Affected Configurations:* All configurations - *Workaround:* Disable the Metrics Exporter on specific node to allow driver upgrade as follows: diff --git a/docs/slinky/slinky-example.md b/docs/slinky/slinky-example.md index ae6e505d7..0f6094686 100644 --- a/docs/slinky/slinky-example.md +++ b/docs/slinky/slinky-example.md @@ -28,9 +28,9 @@ helm repo add bitnami https://charts.bitnami.com/bitnami helm repo add jetstack https://charts.jetstack.io helm repo update helm install cert-manager jetstack/cert-manager \ - --namespace cert-manager --create-namespace --set crds.enabled=true + --namespace cert-manager --create-namespace --set crds.enabled=true helm install prometheus prometheus-community/kube-prometheus-stack \ - --namespace prometheus --create-namespace --set installCRDs=true + --namespace prometheus --create-namespace --set installCRDs=true ``` ## Installing Slinky Operator @@ -62,7 +62,6 @@ You will need to build a Slurm docker image to be used for the Slurm compute nod - the `COPY patches/ patches/` line has been commented out as there are currently no patches to be applied - the `COPY --from=build /tmp/*.deb /tmp/` has also been commented out as there are no .deb files to copy - ## Installing Slurm Cluster Once the image has been built and pushed to a repository update the `values-slurm.yaml` file to specify the compute node image you will be using: diff --git a/docs/specialized_networks/airgapped-install-openshift.md b/docs/specialized_networks/airgapped-install-openshift.md index 83326c102..301a98997 100644 --- a/docs/specialized_networks/airgapped-install-openshift.md +++ b/docs/specialized_networks/airgapped-install-openshift.md @@ -4,8 +4,8 @@ This guide explains how to install the AMD GPU Operator in an air-gapped environ ## Prerequisites -1. OpenShift 4.16+ -2. Users should have followed the [OpenShift Official Documentation](https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html/disconnected_environments/mirroring-in-disconnected-environments) to install the air-gapped cluster and set up a Mirror Registry. +- OpenShift 4.16+ +- Users should have followed the [OpenShift Official Documentation](https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html/disconnected_environments/mirroring-in-disconnected-environments) to install the air-gapped cluster and set up a Mirror Registry. ![Air-gapped Installation Diagram](../_static/ocp_airgapped.png) @@ -85,9 +85,9 @@ mirror: helm: {} ``` -3. After mirroring setup, users should have installed NFD and KMM, and enabled the internal image registry in the air-gapped cluster. See [OpenShift OLM Installation](../installation/openshift-olm.md#configure-internal-registry) for details. +- After mirroring setup, users should have installed NFD and KMM, and enabled the internal image registry in the air-gapped cluster. See [OpenShift OLM Installation](../installation/openshift-olm.md#configure-internal-registry) for details. -4. Users should have installed the AMD GPU Operator in the air-gapped cluster without creating a `DeviceConfig`. +- Users should have installed the AMD GPU Operator in the air-gapped cluster without creating a `DeviceConfig`. 
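Before moving on to the installation steps, it can help to confirm that the prerequisites above are in place. The commands below are an illustrative sketch; the operator namespaces shown are assumptions and may differ in your cluster:

```bash
# Confirm NFD, KMM and the AMD GPU Operator pods are running in the air-gapped cluster
oc get pods -n openshift-nfd
oc get pods -n openshift-kmm
oc get pods -n openshift-amd-gpu

# No DeviceConfig should exist yet; it is created later in this guide
oc get deviceconfigs -n openshift-amd-gpu
```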
## Installation Steps @@ -109,14 +109,16 @@ Build the pre-compiled driver image in a build cluster that has internet access After successfully pushing the driver image, save it by running: -* If you are using OpenShift internal registry +- If you are using OpenShift internal registry + ```bash podman login -u deployer -p $(oc create token deployer) image-registry.openshift-image-registry.svc:5000 podman pull image-registry.openshift-image-registry.svc:5000/default/amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0 podman save image-registry.openshift-image-registry.svc:5000/default/amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0 -o driver-image.tar ``` -* If you are using other image registry +- If you are using other image registry + ```bash podman login -u username -p password/token registry.example.com podman pull registry.example.com/amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0 @@ -125,27 +127,31 @@ podman save registry.example.com/amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x8 ### 2. Import Pre-compiled Driver Image -A. Import images +A. Import images ```{Note} 1. This step is for using the pre-compiled driver image within the OpenShift internal registry (the OpenShift built-in image registry, not the mirror registry for air-gapped installation). 2. Users who have already pushed the pre-compiled driver image to another registry don't need to manually load it into the internal registry. Skip to step 3 and specify the image URL in `spec.driver.image`. ``` -* Import pre-compiled driver image +- Import pre-compiled driver image After copying the image files to the air-gapped cluster, switch to the air-gapped cluster and use podman to load the image, re-tag if needed, then push the image to the desired image registry: - * Load the image file: `podman load -i driver-image.tar` - * Re-tag if needed: `podman tag `. Remember to tag the image to the GPU operator's namespace. For example, if using the GPU operator in `openshift-amd-gpu`, tag the image to `image-registry.openshift-image-registry.svc:5000/openshift-amd-gpu/amdgpu_kmod`. - * Use podman to log in to the image registry if needed. For OpenShift internal registry: + +- Load the image file: `podman load -i driver-image.tar` +- Re-tag if needed: `podman tag `. Remember to tag the image to the GPU operator's namespace. For example, if using the GPU operator in `openshift-amd-gpu`, tag the image to `image-registry.openshift-image-registry.svc:5000/openshift-amd-gpu/amdgpu_kmod`. +- Use podman to log in to the image registry if needed. For OpenShift internal registry: + ```bash podman login -u builder -p $(oc create token builder) image-registry.openshift-image-registry.svc:5000 ``` - * Push the image: `podman push ` + +- Push the image: `podman push ` B. Verify that the required images are located in the internal registry. For example, if using the internal registry: + ```bash $ oc get is -n openshift-amd-gpu NAME IMAGE REPOSITORY TAGS UPDATED @@ -155,7 +161,7 @@ amdgpu_kmod image-registry.openshift-image-registry.svc:5000/opens ### 3. Deployment of DeviceConfig in Air-gapped Environment A. If pre-compiled driver images are present, the operator will directly pull and use the pre-compiled driver image. -B. If pre-compiled driver images are not present, the operator will build the kernel module based on the mirrored source image, which was previously mirrored from `docker.io/rocm/amdgpu-driver`. +B. 
If pre-compiled driver images are not present, the operator will build the kernel module based on the mirrored source image, which was previously mirrored from `docker.io/rocm/amdgpu-driver`. ```yaml apiVersion: amd.com/v1alpha1 diff --git a/docs/test/agfhc.md b/docs/test/agfhc.md index f6e281d58..f9712ae73 100644 --- a/docs/test/agfhc.md +++ b/docs/test/agfhc.md @@ -1,5 +1,6 @@ # AGFHC (AMD GPU Field Health Check) Support -AGFHC provides command-line interface for running GPU health checks, delivering consistent PASS/FAIL results with detailed logs and results.json for failures. Tests can be grouped into recipes (e.g., short or extended runs) to cover common scenarios, and updates to AGFHC improve coverage without requiring user changes. + +AGFHC provides command-line interface for running GPU health checks, delivering consistent PASS/FAIL results with detailed logs and results.json for failures. Tests can be grouped into recipes (e.g., short or extended runs) to cover common scenarios, and updates to AGFHC improve coverage without requiring user changes. In addition to RVS (ROCm Validation Suite), the test runner also supports AGFHC (AMD GPU Field Health Check) to ensure the health of AMD GPUs in production environments. The test runner image leverages AGFHC in a containerized environment to simplify execution and deployment. @@ -11,11 +12,12 @@ In addition to RVS (ROCm Validation Suite), the test runner also supports AGFHC 3. To access the full test runner image, which includes both RVS and the AGFHC toolkit, please contact your AMD representative to complete the authorization process. ``` -# Triggering AGFHC Tests +## Triggering AGFHC Tests To support more than one test framework, the test runner allows you to specify the test framework in the `config.json` file. Example Config Map to use AGFHC test framework: + ```yaml apiVersion: v1 kind: ConfigMap @@ -50,6 +52,7 @@ data: } } ``` + The default framework is RVS if not specified, but you can switch to AGFHC by setting the `Framework` field to `AGFHC` in the `TestCases` section of the `config.json`. The `Recipe` field specifies the test suite to run from the specified framework. You can supply additional optional arguments to the test cases using the `Arguments` field. At present, only 1 `testcase` can be run at a time. Please refer to the AGFHC documentation for available test recipes and additional configuration options. 
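For example, switching an existing configuration from RVS to an AGFHC recipe only requires changing the `TestCases` entry in `config.json`. The fragment below shows just that portion and uses an illustrative recipe; pick one supported by your GPU model from the matrix that follows:

```json
{
  "TestCases": [
    {
      "Framework": "AGFHC",
      "Recipe": "all_lvl1"
    }
  ]
}
```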
@@ -59,54 +62,54 @@ Please refer to the AGFHC documentation for available test recipes and additiona Here is the AGFHC test recipe support matrix and brief introduction to each recipe: | GPU | all_lvl1 | all_lvl2 | all_lvl3 | all_lvl4 | all_lvl5 | all_perf | single_pass | gfx_lvl1 | gfx_lvl2 | gfx_lvl3 | gfx_lvl4 | hbm_lvl1 | hbm_lvl2 | hbm_lvl3 | hbm_lvl4 | hbm_lvl5 | dma_lvl1 | dma_lvl2 | dma_lvl3 | dma_lvl4 | hsio | pcie_lvl1 | pcie_lvl2 | pcie_lvl3 | pcie_lvl4 | rochpl_isolation | thermal | xgmi_lvl1 | xgmi_lvl2 | xgmi_lvl3 | xgmi_lvl4 | all_burnin_4h | all_burnin_12h | all_burnin_24h | hbm_burnin_8h | hbm_burnin_24h | -|-----------|----------|----------|----------|----------|----------|----------|-------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|------|-----------|-----------|-----------|-----------|------------------|---------|-----------|-----------|-----------|-----------|---------------|----------------|----------------|---------------|----------------| -| MI300A | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | | | | | | | | | ✓ | | ✓ | ✓ | ✓ | ✓ | | | | | | -| MI300X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -| MI300X-HF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -| MI308X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -| MI308X-HF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -| MI325X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -| MI350X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | | | | | | -| MI355X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | | | | | | +| --------- | -------- | -------- | -------- | -------- | -------- | -------- | ----------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | ---- | --------- | --------- | --------- | --------- | ---------------- | ------- | --------- | --------- | --------- | --------- | ------------- | -------------- | -------------- | ------------- | -------------- | +| MI300A | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | | | | | | | | | ✓ | | ✓ | ✓ | ✓ | ✓ | | | | | | +| MI300X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | +| MI300X-HF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | +| MI308X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | +| MI308X-HF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | +| MI325X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ 
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | +| MI350X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | | | | | | +| MI355X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | | | | | | | Name | Title | -|--------------------|-----------------------------------| -| all_burnin_12h | A \~12h check across system | -| all_burnin_24h | A \~24h check across system | -| all_burnin_4h | A \~4h check across system | -| all_lvl1 | A \~5m check across system | -| all_lvl2 | A \~10m check across system | -| all_lvl3 | A \~30m check across system | -| all_lvl4 | A \~1h check across system | -| all_lvl5 | A \~2h check across system | -| all_perf | Run all performance based tests | -| dma_lvl1 | A \~5m DMA workload | +| ------------------ | --------------------------------- | +| all_burnin_12h | A \~12h check across system | +| all_burnin_24h | A \~24h check across system | +| all_burnin_4h | A \~4h check across system | +| all_lvl1 | A \~5m check across system | +| all_lvl2 | A \~10m check across system | +| all_lvl3 | A \~30m check across system | +| all_lvl4 | A \~1h check across system | +| all_lvl5 | A \~2h check across system | +| all_perf | Run all performance based tests | +| dma_lvl1 | A \~5m DMA workload | | dma_lvl2 | A \~10m DMA workload | | dma_lvl3 | A \~30m DMA workload | | dma_lvl4 | A \~1h DMA workload | -| gfx_lvl1 | A \~5m GFX workload | +| gfx_lvl1 | A \~5m GFX workload | | gfx_lvl2 | A \~10m GFX workload | | gfx_lvl3 | A \~30m GFX workload | | gfx_lvl4 | A \~1h GFX workload | -| hbm_burnin_24h | A \~24h extended hbm test | -| hbm_burnin_8h | A \~8h extended hbm test | -| hbm_lvl1 | A \~5m HBM workload | +| hbm_burnin_24h | A \~24h extended hbm test | +| hbm_burnin_8h | A \~8h extended hbm test | +| hbm_lvl1 | A \~5m HBM workload | | hbm_lvl2 | A \~10m HBM workload | | hbm_lvl3 | A \~30m HBM workload | | hbm_lvl4 | A \~1h HBM workload | | hbm_lvl5 | A \~2h HBM workload | -| hsio | Run all HSIO tests once | -| pcie_lvl1 | A \~5m PCIe workload | -| pcie_lvl2 | A \~10m PCIe workload | -| pcie_lvl3 | A \~30m PCIe workload | -| pcie_lvl4 | A \~1h PCIe workload | -| rochpl_isolation | Run rocHPL on each GPU | -| single_pass | Run all tests once | -| thermal | Verify thermal solution | -| xgmi_lvl1 | A \~5m xGMI workload | -| xgmi_lvl2 | A \~10m xGMI workload | -| xgmi_lvl3 | A \~30m xGMI workload | -| xgmi_lvl4 | A \~1h xGMI workload | +| hsio | Run all HSIO tests once | +| pcie_lvl1 | A \~5m PCIe workload | +| pcie_lvl2 | A \~10m PCIe workload | +| pcie_lvl3 | A \~30m PCIe workload | +| pcie_lvl4 | A \~1h PCIe workload | +| rochpl_isolation | Run rocHPL on each GPU | +| single_pass | Run all tests once | +| thermal | Verify thermal solution | +| xgmi_lvl1 | A \~5m xGMI workload | +| xgmi_lvl2 | A \~10m xGMI workload | +| xgmi_lvl3 | A \~30m xGMI workload | +| xgmi_lvl4 | A \~1h xGMI workload | NOTE: Each one of the aforementioned recipes could consist of multiple test cases. Execution of _individual_ AGFHC test case is currently not supported. 
@@ -115,47 +118,47 @@ NOTE: Each one of the aforementioned recipes could consist of multiple test case The Instinct GPU models could be configured with certain GPU partition profiles to execute AGFHC tests, the supported partition profiles are: | GPU Model | Compute Partition | Memory Partition | Number of GPUs for testing | -|-------------|------------------|------------------|------------------------| -| mi300a | SPX | NPS1 | 1 | -| mi300a | SPX | NPS1 | 2 | -| mi300a | SPX | NPS1 | 4 | -| mi300x | SPX | NPS1 | 1 | -| mi300x | SPX | NPS1 | 8 | -| mi308x | SPX | NPS1 | 1 | -| mi308x | SPX | NPS1 | 8 | -| mi325x | SPX | NPS1 | 1 | -| mi325x | SPX | NPS1 | 8 | -| mi308x-hf | SPX | NPS1 | 1 | -| mi308x-hf | SPX | NPS1 | 8 | -| mi300x-hf | SPX | NPS1 | 1 | -| mi300x-hf | SPX | NPS1 | 8 | -| mi350x | SPX | NPS1 | 1 | -| mi350x | SPX | NPS1 | 8 | -| mi355x | SPX | NPS1 | 1 | -| mi355x | SPX | NPS1 | 8 | +| ----------- | ----------------- | ---------------- | -------------------------- | +| mi300a | SPX | NPS1 | 1 | +| mi300a | SPX | NPS1 | 2 | +| mi300a | SPX | NPS1 | 4 | +| mi300x | SPX | NPS1 | 1 | +| mi300x | SPX | NPS1 | 8 | +| mi308x | SPX | NPS1 | 1 | +| mi308x | SPX | NPS1 | 8 | +| mi325x | SPX | NPS1 | 1 | +| mi325x | SPX | NPS1 | 8 | +| mi308x-hf | SPX | NPS1 | 1 | +| mi308x-hf | SPX | NPS1 | 8 | +| mi300x-hf | SPX | NPS1 | 1 | +| mi300x-hf | SPX | NPS1 | 8 | +| mi350x | SPX | NPS1 | 1 | +| mi350x | SPX | NPS1 | 8 | +| mi355x | SPX | NPS1 | 1 | +| mi355x | SPX | NPS1 | 8 | ### AGFHC arguments As for the AGFHC arguments, please refer to AGFHC official documents for the full list of available arguments. Here is a list of frequently used arguments: -| Argument | Description | Default/Example | -|----------------------------|-------------------------------------------------------------------------------------------------------|--------------------------------------------------| -| `--update-interval UPDATE_INTERVAL` | Set the interval to print elapsed timing updates on the console. | `--update-interval 20s` - updates every 20s | -| `--sysmon-interval SYSMON_INTERVAL` | Set to update the default sysmon interval | | -| `--tar-logs` | Generate a tar file of all logs | | -| `--disable-sysmon` | Set to disable system monitoring data collection. | Default: enabled | -| `--disable-numa-control` | Set to disable control of numa balancing. | Default: enabled | -| `--disable-ras-checks` | Set to disable ras checks. | Default: enabled | -| `--disable-bad-pages-checks` | Set to disable bad pages checks. | Default: enabled | -| `--disable-dmesg-checks` | Set to disable dmesg checks. | Default: enabled | -| `--ignore-dmesg` | Set to ignore dmesg fails, logs will still be created. | Default: dmesg fails enabled | -| `--ignore-ras` | Set to ignore ras fails, logs will still be created. | Default: ras fails enabled | -| `--ignore-performance` | Set to ignore performance to skip the performance analysis and perform only RAS/dmesg checks. | Default: performance analysis enabled | -| `--known-dmesg-only` | Do not fail on any unknown dmesg, but mark them as expected. | Default: any unknown dmesg fails | -| `--disable-hsio-gather` | Set to disable hsio gather. | Default: enabled | -| `--exit-on-failure`, `-e` | Exits the execution of test cases on failure of a test, marking remaining as skipped. 
| Default: keep running without exiting on failure | - -# Kubernetes events +| Argument | Description | Default/Example | +| ------------------------------------- | ----------------------------------------------------------------------------------------------------- | ------------------------------------------------ | +| `--update-interval UPDATE_INTERVAL` | Set the interval to print elapsed timing updates on the console. | `--update-interval 20s` - updates every 20s | +| `--sysmon-interval SYSMON_INTERVAL` | Set to update the default sysmon interval | | +| `--tar-logs` | Generate a tar file of all logs | | +| `--disable-sysmon` | Set to disable system monitoring data collection. | Default: enabled | +| `--disable-numa-control` | Set to disable control of numa balancing. | Default: enabled | +| `--disable-ras-checks` | Set to disable ras checks. | Default: enabled | +| `--disable-bad-pages-checks` | Set to disable bad pages checks. | Default: enabled | +| `--disable-dmesg-checks` | Set to disable dmesg checks. | Default: enabled | +| `--ignore-dmesg` | Set to ignore dmesg fails, logs will still be created. | Default: dmesg fails enabled | +| `--ignore-ras` | Set to ignore ras fails, logs will still be created. | Default: ras fails enabled | +| `--ignore-performance` | Set to ignore performance to skip the performance analysis and perform only RAS/dmesg checks. | Default: performance analysis enabled | +| `--known-dmesg-only` | Do not fail on any unknown dmesg, but mark them as expected. | Default: any unknown dmesg fails | +| `--disable-hsio-gather` | Set to disable hsio gather. | Default: enabled | +| `--exit-on-failure`, `-e` | Exits the execution of test cases on failure of a test, marking remaining as skipped. | Default: keep running without exiting on failure | + +## Kubernetes events Upon successful execution of the AGFHC test recipe, the results are output as Kubernetes events. You can view these events using the following command: @@ -166,11 +169,13 @@ LAST SEEN TYPE REASON OBJECT ``` If the test fails, the event will indicate a failure status. + ```bash $ kubectl get events LAST SEEN TYPE REASON OBJECT MESSAGE 63s Warning TestFailed pod/test-runner-manual-trigger-fs64h [{"number":1,"suitesResult":{"0":{"gfx_dgemm":"success","hbm_bw":"success","pcie_bidi_peak":"success","pcie_link_status":"success","xgmi_a2a":"success"},"2":{"gfx_dgemm":"success","hbm_bw":"failure","pcie_bidi_peak":"success","pcie_link_status":"success","xgmi_a2a":"success"}},"status":"completed"}] ``` -# Log export -By default, test execution logs are saved to `/var/log/amd-test-runner/` on the host. Log export functionality is also supported, similar to RVS. AGFHC provides more detailed logs than RVS and all the logs provided by the framework are included in the tarball. \ No newline at end of file +## Log export + +By default, test execution logs are saved to `/var/log/amd-test-runner/` on the host. Log export functionality is also supported, similar to RVS. AGFHC provides more detailed logs than RVS and all the logs provided by the framework are included in the tarball. diff --git a/docs/test/appendix-test-recipe.md b/docs/test/appendix-test-recipe.md index 0e47de08d..e9215bfd3 100644 --- a/docs/test/appendix-test-recipe.md +++ b/docs/test/appendix-test-recipe.md @@ -51,5 +51,4 @@ Test recipes are available for GPUs with specific partition profiles. To use a p | `--parallel`, `-p` | Enables or Disables parallel execution across multiple GPUs, this will help accelerate the RVS tests. 
| By default if this option is not specified the test won't execute in parallel. Use `-p` or `-p true` to enable parallel execution or use `-p false` to disable the parallel execution. | | `--debug`, `-d` | Specify the debug level for the output log. The range is 0-5 with 5 being the highest verbose level.| Example: Use `-d 5` to get the highest level debug output. | - For more information of test recipe details and explanation, please check [RVS official documentation](https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/conceptual/rvs-modules.html). diff --git a/docs/test/auto-unhealthy-device-test.md b/docs/test/auto-unhealthy-device-test.md index da23a2fb4..0f98636f6 100644 --- a/docs/test/auto-unhealthy-device-test.md +++ b/docs/test/auto-unhealthy-device-test.md @@ -309,6 +309,7 @@ Config map explanation: * DeviceIDs (Only works for ```manual``` and ```pre-start-job-check``` test trigger): List of string for GPU 0-indexed ID. A selector to filter which GPU would run the test. For example, if there are 2 GPUs the GPU ID would be 0 and 1. To select GPU0 to run the test only, please configure the DeviceIDs: + ```yaml { "Recipe": "gst_single", diff --git a/docs/test/logs-export.md b/docs/test/logs-export.md index aa28d8a3f..bb9eb04d7 100644 --- a/docs/test/logs-export.md +++ b/docs/test/logs-export.md @@ -27,8 +27,9 @@ Alternatively secrets can be created using kubectl CLI command without base64 en For AWS S3, the secret captures user [access key](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key) information and AWS region of bucket. The secret should include the following keys:​ -- `aws_access_key_id`: Your AWS access key ID​ -- `aws_secret_access_key`: Your AWS secret access key​ + +- `aws_access_key_id`: Your AWS access key ID +- `aws_secret_access_key`: Your AWS secret access key - `aws_region`: The AWS region where your S3 bucket resides Example: diff --git a/docs/test/manual-test.md b/docs/test/manual-test.md index 4def1b107..991ca8b62 100644 --- a/docs/test/manual-test.md +++ b/docs/test/manual-test.md @@ -9,6 +9,7 @@ The RVS test recipes in the Test Runner are not compatible with partitioned GPUs ``` ## Use Case 1 - GPU is unhealthy on the node + When any GPU on a specific worker node is unhealthy, you can manually trigger a test / benchmark run on that worker node to check more details on the unhealthy state by mounting the AMD device related files or folders (`/dev/dri` and `/dev/kfd`) into the test runner container. The test job requires RBAC config to grant the test runner access to export events and add node labels to the cluster. Here is an example of configuring the RBAC and Job resources: @@ -153,6 +154,7 @@ spec: ``` ## Use Case 2 - GPUs are healthy on the node + When all GPUs on a worker node are healthy, you can manually trigger a benchmark test by requesting GPU resources (`amd.com/gpu`) on that node, rather than mounting device files or folders. If other GPU workloads are running and resources are unavailable, the system will wait until enough resources are free before starting the test. The test job requires RBAC config to grant the test runner access to export events and add node labels to the cluster. 
Here is an example of configuring the RBAC and Job resources: @@ -292,6 +294,7 @@ spec: When test is running: ```bash + $ kubectl get job NAME STATUS COMPLETIONS DURATION AGE test-runner-manual-trigger Running 0/1 31s 31s @@ -304,6 +307,7 @@ test-runner-manual-trigger-fnvhn 1/1 Running 0 65s When test is completed: ```bash + $ kubectl get job NAME STATUS COMPLETIONS DURATION AGE test-runner-manual-trigger Complete 1/1 6m10s 7m21s @@ -462,6 +466,7 @@ spec: When the job gets scheduled, the CronJob resource will show active jobs and the job and pod resources will be created. ```bash + $ kubectl get cronjob NAME SCHEDULE TIMEZONE SUSPEND ACTIVE LAST SCHEDULE AGE test-runner-manual-trigger-cron-job-midnight 0 0 * * * False 1 2s 86s @@ -476,13 +481,17 @@ test-runner-manual-trigger-cron-job-midnight-28936820-kkqnj 1/1 Running ``` ## Check test running node labels + When the test is ongoing the corresponding label will be added to the node resource: ```"testrunner.amd.com.gpu_health_check.gst_single": "running"```, the test running label will be removed once the test completed. ## Check test result + Once the test run finishes, the Job's ```Status``` field captures the test run result. If the test run is successful, the status is marked as **Complete**. In case of failure, status is marked as **Failed**. ## Check test result event + The test runner generated event can be found from Job resource defined namespace. The event contains more granular details about the test run. It can be used to gather more information in case of failure. + ```bash $ kubectl get events -n kube-amd-gpu LAST SEEN TYPE REASON OBJECT MESSAGE @@ -492,6 +501,7 @@ LAST SEEN TYPE REASON OBJECT More detailed information about test result events can be found in [this section](./auto-unhealthy-device-test.md#check-test-result-event). ## Advanced Configuration - ConfigMap + You can create a config map to customize the test trigger and recipe configs. For the example config map and explanation please check [this section](./auto-unhealthy-device-test.md#advanced-configuration---configmap). After creating the config map, you can specify the volume and volume mount to mount the config map into test runner container. diff --git a/docs/test/pre-start-job-test.md b/docs/test/pre-start-job-test.md index 4326d3a04..b1aa8675e 100644 --- a/docs/test/pre-start-job-test.md +++ b/docs/test/pre-start-job-test.md @@ -17,6 +17,7 @@ The RVS test recipes in the Test Runner are not compatible with partitioned GPUs ``` ## Configure pre-start init container + The init container requires RBAC config to grant the pod access to export events and add node labels to the cluster. 
Here is an example of configuring the RBAC and Job resources:
 
 ```yaml
@@ -165,38 +166,48 @@ spec:
 
 ## Check test runner init container
 
 When test runner is running:
-```
+
+```bash
 $ kubectl get pod
 NAME                                      READY   STATUS     RESTARTS   AGE
 pytorch-gpu-deployment-7c6bb979f5-p2wlk   0/1     Init:0/1   0          2m52s
 ```
 
 Check test runner container logs:
 
-```$ kubectl logs pytorch-gpu-deployment-7c6bb979f5-p2wlk -c init-test-runner```
 
-When test runner is completed, the workload container started to run:
+```bash
+kubectl logs pytorch-gpu-deployment-7c6bb979f5-p2wlk -c init-test-runner
 ```
+
+When the test runner completes, the workload container starts to run:
+
+```bash
 $ kubectl get pod
 NAME                                      READY   STATUS    RESTARTS   AGE
 pytorch-gpu-deployment-7c6bb979f5-p2wlk   1/1     Running   0          7m46s
 ```
 
 ## Check test running node labels
+
 When the test is ongoing the corresponding label will be added to the node resource: ```"testrunner.amd.com.gpu_health_check.gst_single": "running"```, the test running label will be removed once the test completed.
 
 ## Check test result event
+
 The test runner generated event can be found from Job resource defined namespace
+
 ```bash
 $ kubectl get events -n kube-amd-gpu
 LAST SEEN   TYPE      REASON       OBJECT                                 MESSAGE
 8m8s        Normal    TestFailed   pod/test-runner-manual-trigger-c4hpw   [{"number":1,"suitesResult":{"42924":{"gpustress-3000-dgemm-false":"success","gpustress-41000-fp32-false":"failure","gst-1215Tflops-4K4K8K-rand-fp8":"failure","gst-8096-150000-fp16":"success"}}}]
 ```
+
 More detailed information about test result events can be found in [this section](./auto-unhealthy-device-test.md#check-test-result-event).
 
 ## Advanced Configuration - ConfigMap
+
 You can create a config map to customize the test trigger and recipe configs. For the example config map and explanation please check [this section](./manual-test.md#advanced-configuration---configmap).
 
-After creating the config map, you can specify the volume and volume mount to mount the config map into test runner container.
+After creating the config map, you can specify the volume and volume mount to mount the config map into the test runner container.
 
 * In the config map the file name must be named as ```config.json```
 * Within the test runner container the mount path should be ```/etc/test-runner/```
@@ -213,4 +224,4 @@ Here is an example of mounting the ```hostPath``` into the container, the key po
 * Mount the volume to a directory within test runner container
 * Use ```LOG_MOUNT_DIR``` environment variable to ask test runner to save logs into the mounted directory
 
-The example of mounting the ```hostPath``` volume into test runner container can be found at [this section](./manual-test.md#advanced-configuration---logs-mount).
\ No newline at end of file
+The example of mounting the ```hostPath``` volume into the test runner container can be found in [this section](./manual-test.md#advanced-configuration---logs-mount).
diff --git a/docs/test/test-runner-overview.md b/docs/test/test-runner-overview.md
index 27f92b6c8..250b72e0f 100644
--- a/docs/test/test-runner-overview.md
+++ b/docs/test/test-runner-overview.md
@@ -10,7 +10,7 @@ The test runner component offers hardware validation, diagnostics and benchmarki
 
 - Reporting test results as Kubernetes events
 
-Under the hood the Device Test runner leverages the ROCm Validation Suite (RVS) and AMD GPU Field Health Check (AGFHC) toolkit to run any number of tests including GPU stress tests, PCIE bandwidth benchmarks, memory tests, and longer burn-in tests if so desired. 
+Under the hood the Device Test runner leverages the ROCm Validation Suite (RVS) and AMD GPU Field Health Check (AGFHC) toolkit to run any number of tests, including GPU stress tests, PCIe bandwidth benchmarks, memory tests, and longer burn-in tests if so desired.
 
 ```{note}
 1. The [public test runner image](https://hub.docker.com/r/rocm/test-runner) only supports executing RVS test.
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 8225fbc44..42f897783 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -18,16 +18,15 @@ To collect logs from the AMD GPU Operator:
 kubectl logs -n kube-amd-gpu 
 ```
 
-
 ## Potential Issues with default ``DeviceConfig``
 
 * Please refer to {ref}`typical-deployment-scenarios` for more information and get corresponding ```helm install``` commands and configs that fits your specific use case.
 
-* If operand pods (e.g. device plugin, metrics exporter) are stuck in ``Init:0/1`` state, it means your GPU worker doesn't have GPU driver loaded or driver was not loaded properly.
+* If operand pods (e.g. device plugin, metrics exporter) are stuck in ``Init:0/1`` state, it means your GPU worker doesn't have a GPU driver loaded or the driver was not loaded properly.
+
+  * If you are trying to use an inbox or pre-installed driver, please check the node ``dmesg`` output to see why the driver was not loaded properly.
 
-  * If you try to use inbox or pre-installed driver please check the node ``dmesg`` to see why the driver was not loaded properly. 
-  
-  * If you want to deploy out-of-tree driver, we suggest check the [Driver Installation Guide](./drivers/installation) then modify the default ``DeviceConfig`` to ask Operator to install the out-of-tree GPU driver for your worker nodes. 
+  * If you want to deploy an out-of-tree driver, we suggest checking the [Driver Installation Guide](./drivers/installation) and then modifying the default ``DeviceConfig`` to ask the Operator to install the out-of-tree GPU driver for your worker nodes.
 
 ```bash
 kubectl edit deviceconfigs -n kube-amd-gpu default
@@ -37,19 +36,19 @@ kubectl edit deviceconfigs -n kube-amd-gpu default
 
 If the AMD GPU driver build fails:
 
-- Check the status of the build pod:
+* Check the status of the build pod:
 
 ```bash
 kubectl get pods -n kube-amd-gpu
 ```
 
-- View the build pod logs:
+* View the build pod logs:
 
 ```bash
 kubectl logs -n kube-amd-gpu 
 ```
 
-- Check events for more information:
+* Check events for more information:
 
 ```bash
 kubectl get events -n kube-amd-gpu
@@ -65,8 +64,8 @@ The [techsupport-dump script](https://github.com/ROCm/gpu-operator/blob/main/too
 
 Options:
 
-- `-w`: wide option
-- `-o yaml/json`: output format (default: json)
-- `-k kubeconfig`: path to kubeconfig (default: ~/.kube/config)
+* `-w`: wide option
+* `-o yaml/json`: output format (default: json)
+* `-k kubeconfig`: path to kubeconfig (default: ~/.kube/config)
 
-Please file an issue with collected techsupport bundle on our [GitHub Issues](https://github.com/ROCm/gpu-operator/issues) page
\ No newline at end of file
+Please file an issue with the collected techsupport bundle on our [GitHub Issues](https://github.com/ROCm/gpu-operator/issues) page.
diff --git a/docs/upgrades/upgrade.md b/docs/upgrades/upgrade.md
index 03178ffb1..412362100 100644
--- a/docs/upgrades/upgrade.md
+++ b/docs/upgrades/upgrade.md
@@ -36,8 +36,8 @@ All pods should be in the `Running` state. Resolve any issues such as restarts o
 
 * ```pre-upgrade-check```: The AMD GPU Operator includes a **pre-upgrade** hook that prevents upgrades if any **driver upgrades** are active. 
This ensures stability by blocking the upgrade when the operator is actively managing driver installations.
 * ```upgrade-crd```: This hook helps users to patch the new version Custom Resource Definition (CRD) to the helm deployment. Helm by default doesn't support automatic upgrade of CRD so we implemented this hook for auto-upgrade the CRDs.
 
-- **Manual Driver Upgrades in KMM:** Manual driver upgrades initiated by users through KMM are allowed but not recommended during an operator upgrade.
-- **Skipping the Hook:** If necessary, you can bypass the pre-upgrade hook (not recommended) by adding ```--no-hooks```, you would have to manually use new version's CRD to upgrade then in cluster.
+* **Manual Driver Upgrades in KMM:** Manual driver upgrades initiated by users through KMM are allowed but not recommended during an operator upgrade.
+* **Skipping the Hook:** If necessary, you can bypass the pre-upgrade hook (not recommended) by adding ```--no-hooks```; you would then have to manually apply the new version's CRDs to upgrade them in the cluster.
 
 #### Error Scenario
 
@@ -120,9 +120,9 @@
 kubectl get deviceconfigs -n kube-amd-gpu -oyaml
 
 #### **Notes**
 
-- Avoid upgrading during active driver upgrades initiated by the operator.
-- Use `--no-hooks` only if necessary and after assessing the potential impact.
-- For additional troubleshooting, check operator logs:
+* Avoid upgrading during active driver upgrades initiated by the operator.
+* Use `--no-hooks` only if necessary and after assessing the potential impact.
+* For additional troubleshooting, check operator logs:
 
 ```bash
 kubectl logs -n kube-amd-gpu amd-gpu-operator-controller-manager-848455579d-p6hlm
diff --git a/example/metricsExporter/mtls-rbac-auth/README.md b/example/metricsExporter/mtls-rbac-auth/README.md
index 7ee2ac11c..ec0811f4c 100644
--- a/example/metricsExporter/mtls-rbac-auth/README.md
+++ b/example/metricsExporter/mtls-rbac-auth/README.md
@@ -1,24 +1,31 @@
 # Metrics Exporter with Mutual TLS (mTLS) Authentication
+
 This example demonstrates how to securely expose the AMD GPU Metrics Exporter using kube-rbac-proxy with mutual TLS (mTLS) authentication. In this mode, clients must present a valid TLS certificate signed by a trusted CA, and kube-rbac-proxy verifies the certificate and uses the Common Name (CN) for authorization. In the static authorization mode, kube-rbac-proxy authorizes the client using a static config and avoids initiating a SubjectAccessReview to the K8s API server.
 
 This example supports:
+
 - Curl-based access from within/outside the cluster
 - Prometheus scraping using client certificate authentication
 
 **Note**: This mode does not use Kubernetes tokens. Even if provided, Bearer tokens are ignored. Authentication and authorization rely entirely on the client certificate's CN and Kubernetes SubjectAccessReview (SAR)/Static Authorization.
 
 ## Prerequisites
+
 - AMD GPU Operator is deployed.
 - Prometheus Operator is deployed in your cluster in monitoring namespace (optional, if testing Prometheus integration).
 
 ## 1. Generate TLS Certificates for Server and Client
+
 Create a Certificate Authority (CA), server certificate for kube-rbac-proxy, and a client certificate for Prometheus:
+
 ```bash
 # Generate CA
 openssl genrsa -out ca.key 2048
 openssl req -x509 -new -nodes -key ca.key -subj "/CN=my-ca" -days 3650 -out ca.crt
 ```
+
 Prometheus requires the server and client certificates to include a Subject Alternative Name (SAN) field. We'll first create a san file to define the SAN extension, then generate the certificate. 
+ ```bash # Create a SAN config file cat < san-server.cnf @@ -47,6 +54,7 @@ openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \ ``` Create the client certificate: + ```bash # Create SAN config file cat < san-client.cnf @@ -80,18 +88,21 @@ openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key \ ## 2. Create Kubernetes Secrets and ConfigMaps Server TLS Secret (for kube-rbac-proxy): + ```bash kubectl create secret tls server-metrics-tls \ --cert=server.crt --key=server.key -n kube-amd-gpu ``` Client CA ConfigMap (for kube-rbac-proxy to verify incoming client certs): + ```bash kubectl create configmap client-ca \ --from-file=ca.crt=ca.crt -n kube-amd-gpu ``` Client TLS Secret (for Prometheus, in GPU Operator namespace): + ```bash kubectl create secret generic prom-client-cert \ --from-file=client.crt=client.crt \ @@ -99,12 +110,14 @@ kubectl create secret generic prom-client-cert \ ``` Server CA ConfigMap (for Prometheus, in GPU Operator namespace): + ```bash kubectl create configmap prom-server-ca \ --from-file=ca.crt=ca.crt -n kube-amd-gpu ``` ## 3. Create RBAC for the Client Certificate CN + The CN from the client certificate must be authorized to access `/metrics`. We use Kubernetes RBAC to allow this by granting the `GET` permission to the CN extracted from the client cert. **Note**: This section only applies for the rbac authorization using `SubjectAccessReview` requests made to the API Server. Ignore these configs for Static Authorization. @@ -139,10 +152,12 @@ subjects: ## 4. Apply DeviceConfig This config enables: + - kube-rbac-proxy with mTLS - Automatic ServiceMonitor with client certificate authentication kube-rbac-proxy config (part of DeviceConfig): + ```yaml rbacConfig: enable: true @@ -153,6 +168,7 @@ rbacConfig: ``` ServiceMonitor config (part of DeviceConfig): + ```yaml prometheus: serviceMonitor: @@ -180,10 +196,13 @@ prometheus: serverName: my-metrics-service # Must match the CN or SAN in the server certificate insecureSkipVerify: false ``` + ### Using mTLS with Static Authorization + This mode simplifies the authorization flow by bypassing all RBAC lookups to the Kubernetes API server. Instead of checking the client's identity via `SubjectAccessReview`, the kube-rbac-proxy directly compares the Common Name (CN) in the client certificate against a preconfigured value. Mutual TLS (mTLS) is still required, Prometheus must present valid client certificates. kube-rbac-proxy compares the client CommonName (CN) to a configured string (`clientName`) in the `staticAuthorization` section. If it matches, access is allowed. To enable this mode, add the `staticAuthorization` block under `rbacConfig`. Prometheus config remains the same. + ```yaml metricsExporter: rbacConfig: @@ -198,12 +217,15 @@ metricsExporter: ``` Apply the DeviceConfig: + ```bash kubectl apply -f deviceconfig.yaml ``` ## 5. Scraping the Metrics + ### Scraping using `curl` + Get the metrics endpoint IP and port. You can use either: - For ClusterIP service: `kubectl get endpoints -n kube-amd-gpu` to find EndpointIP:ClusterPort @@ -218,7 +240,9 @@ curl --cert ./client.crt --key ./client.key --cacert ./ca.crt -v -s -k -H "Accep You should receive metrics if the client cert is valid and RBAC grants access (or static authorization matches). ### Scraping using Prometheus + To configure Prometheus to scrape the secured endpoint, you need to ensure it discovers the `ServiceMonitor` created by the GPU Operator. 
Configure the Prometheus spec: + ```yaml # prometheus spec: spec: @@ -230,12 +254,14 @@ spec: matchLabels: example: prom-mtls ``` + Refer to the Prometheus section in the [token-based-auth](../token-based-auth/README.md) example to edit the Prometheus object. Once Prometheus Operator discovers the `ServiceMonitor` and has the required permissions, it will configure the underlying Prometheus instance to scrape the `/metrics` endpoint using the mTLS configuration specified in the `ServiceMonitor`'s `tlsConfig`. You should see the targets being discovered and scraped in the Prometheus UI. ## Summary + This example walks through a secure mTLS configuration where: - kube-rbac-proxy verifies both server and client identities - Prometheus authenticates with a client certificate - RBAC policies grant access based on the certificate CN -- Tokens are not used or required in this setup. If static authentication is enabled, SAR requests to the API server are avoided too. \ No newline at end of file +- Tokens are not used or required in this setup. If static authentication is enabled, SAR requests to the API server are avoided too. diff --git a/example/metricsExporter/token-based-auth/README.md b/example/metricsExporter/token-based-auth/README.md index 0c7331954..f4cb1cc04 100644 --- a/example/metricsExporter/token-based-auth/README.md +++ b/example/metricsExporter/token-based-auth/README.md @@ -1,4 +1,5 @@ # Metrics Exporter with TLS & Token Authentication + This example demonstrates how to securely expose the AMD GPU Metrics Exporter using kube-rbac-proxy with TLS enabled and access control enforced via Kubernetes Bearer Tokens. This setup includes both: - Curl-based scraping (from inside and outside the cluster) @@ -21,6 +22,7 @@ openssl req -x509 -new -nodes -key ca.key -subj "/CN=my-ca" -days 3650 -out ca.c ``` Prometheus requires the certificates to include a Subject Alternative Name (SAN) field. We'll first create a san file to define the SAN extension, then generate the certificate. + ```bash # Create a SAN config file cat < san-server.cnf @@ -58,6 +60,7 @@ rules: ``` Bind the clusterrole to the default ServiceAccount in the metrics-reader namespace: + ```yaml roleRef: apiGroup: rbac.authorization.k8s.io @@ -76,7 +79,9 @@ kubectl create namespace metrics-reader kubectl apply -f clusterrole.yaml kubectl apply -f clusterrolebinding.yaml ``` + ## 3. Generate ServiceAccount token + Get the token for the `default` service account in the metrics-reader namespace: ```bash @@ -86,16 +91,19 @@ TOKEN=$(kubectl create token default -n metrics-reader --duration=24h | tr -d '\ ## 4. Create Kubernetes Secrets and ConfigMaps Server TLS Secret (for kube-rbac-proxy): + ```bash kubectl create secret tls server-metrics-tls --cert=server.crt --key=server.key -n kube-amd-gpu ``` Client Token Secret (for Prometheus, in GPU Operator namespace): + ```bash kubectl create secret generic prom-token --from-literal=token="$TOKEN" -n kube-amd-gpu ``` Client CA Certificate ConfigMap (for Prometheus, in GPU Operator namespace): + ```bash kubectl create configmap prom-server-ca --from-file=ca.crt=ca.crt -n kube-amd-gpu ``` @@ -103,10 +111,12 @@ kubectl create configmap prom-server-ca --from-file=ca.crt=ca.crt -n kube-amd-gp ## 5. 
Apply DeviceConfig
 
 This DeviceConfig enables:
+
 - kube-rbac-proxy serving https over TLS
 - Automatic ServiceMonitor creation with token-based auth and TLS
 
 kube-rbac-proxy config (part of DeviceConfig):
+
 ```yaml
 rbacConfig:
   enable: true
@@ -115,6 +125,7 @@ rbacConfig:
 ```
 
 ServiceMonitor Config (part of DeviceConfig):
+
 ```yaml
 prometheus:
   serviceMonitor:
@@ -140,11 +151,13 @@ prometheus:
 ```
 
 Apply the `DeviceConfig`:
+
 ```bash
 kubectl apply -f deviceconfig.yaml
 ```
 
 ## 6. Scraping the metrics
+
 ### Scraping using `curl`
 
 Get the metrics endpoint IP and port. You can use either:
@@ -206,6 +219,7 @@ spec:
 After saving the changes, Prometheus Operator will reconfigure Prometheus. You should see the GPU metrics endpoint appear as a target in the Prometheus UI (`Status` -> `Targets`).
 
 ## Summary
+
 This example walks through a secure token-based authentication setup where:
 
 - kube-rbac-proxy secures metrics endpoints with TLS
@@ -214,4 +228,4 @@ This example walks through a secure token-based authentication setup where:
 - Both curl requests and Prometheus can securely scrape metrics
 - A ServiceMonitor with proper TLS and token configuration enables automatic discovery
 
-This provides a robust way to secure metrics endpoints while allowing authorized access to monitoring systems. The [mtls-rbac-auth](../mtls-rbac-auth/README.md) example will build on this and demonstrate how to simplify authentication with mTLS using certificates, removing the need for ServiceAccount tokens and TokenReview API access. The static authorization section will further demonstate how to simplify RBAC, removing the need for SubjectAccessReview API access entirely.
\ No newline at end of file
+This provides a robust way to secure metrics endpoints while allowing authorized access to monitoring systems. The [mtls-rbac-auth](../mtls-rbac-auth/README.md) example will build on this and demonstrate how to simplify authentication with mTLS using certificates, removing the need for ServiceAccount tokens and TokenReview API access. The static authorization section will further demonstrate how to simplify RBAC, removing the need for SubjectAccessReview API access entirely.
diff --git a/helm-charts-k8s/README.md b/helm-charts-k8s/README.md
index 31355583d..cfe8d1dc2 100644
--- a/helm-charts-k8s/README.md
+++ b/helm-charts-k8s/README.md
@@ -83,20 +83,19 @@ Installation Options
 > It is strongly recommended to use AMD-optimized KMM images included in the operator release. This is not required when installing the GPU Operator on Red Hat OpenShift.
 
 ### 3. Install Custom Resource
-After the installation of AMD GPU Operator:
- * By default there will be a default `DeviceConfig` installed. If you are using default `DeviceConfig`, you can modify the default `DeviceConfig` to adjust the config for your own use case. `kubectl edit deviceconfigs -n kube-amd-gpu default`
- * If you installed without default `DeviceConfig` (either by using `--set crds.defaultCR.install=false` or installing a chart prior to v1.3.0), you need to create the `DeviceConfig` custom resource in order to trigger the operator start to work. By preparing the `DeviceConfig` in the YAML file, you can create the resouce by running ```kubectl apply -f deviceconfigs.yaml```.
- * For custom resource definition and more detailed information, please refer to [Custom Resource Installation Guide](https://dcgpu.docs.amd.com/projects/gpu-operator/en/latest/installation/kubernetes-helm.html#install-custom-resource).
- * Potential Failures with default `DeviceConfig`:
+After the installation of the AMD GPU Operator:
 
- a. Operand pods are stuck in ```Init:0/1``` state: It means your GPU worker doesn't have inbox GPU driver loaded. We suggest check the [Driver Installation Guide]([./drivers/installation.md](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/drivers/installation.html#driver-installation-guide)) then modify the default `DeviceConfig` to ask Operator to install the out-of-tree GPU driver for your worker nodes.
+* By default there will be a default `DeviceConfig` installed. If you are using the default `DeviceConfig`, you can modify it to adjust the config for your own use case. `kubectl edit deviceconfigs -n kube-amd-gpu default`
+* If you installed without the default `DeviceConfig` (either by using `--set crds.defaultCR.install=false` or installing a chart prior to v1.3.0), you need to create the `DeviceConfig` custom resource in order to trigger the operator to start working. After preparing the `DeviceConfig` in a YAML file, you can create the resource by running ```kubectl apply -f deviceconfigs.yaml```.
+* For the custom resource definition and more detailed information, please refer to the [Custom Resource Installation Guide](https://dcgpu.docs.amd.com/projects/gpu-operator/en/latest/installation/kubernetes-helm.html#install-custom-resource).
+* Potential Failures with default `DeviceConfig`:
 
+  a. Operand pods are stuck in ```Init:0/1``` state: It means your GPU worker doesn't have an inbox GPU driver loaded. We suggest checking the [Driver Installation Guide](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/drivers/installation.html#driver-installation-guide) and then modifying the default `DeviceConfig` to ask the Operator to install the out-of-tree GPU driver for your worker nodes.
 `kubectl edit deviceconfigs -n kube-amd-gpu default`
-
- b. No operand pods showed up: It is possible that default `DeviceConfig` selector `feature.node.kubernetes.io/amd-gpu: "true"` cannot find any matched node.
- * Check node label `kubectl get node -oyaml | grep -e "amd-gpu:" -e "amd-vgpu:"`
- * If you are using GPU in the VM, you may need to change the default `DeviceConfig` selector to `feature.node.kubernetes.io/amd-vgpu: "true"`
- * You can always customize the node selector of the `DeviceConfig`.
+  b. No operand pods showed up: It is possible that the default `DeviceConfig` selector `feature.node.kubernetes.io/amd-gpu: "true"` does not match any node.
+  * Check the node labels: `kubectl get node -oyaml | grep -e "amd-gpu:" -e "amd-vgpu:"`
+  * If you are using a GPU in a VM, you may need to change the default `DeviceConfig` selector to `feature.node.kubernetes.io/amd-vgpu: "true"`
+  * You can always customize the node selector of the `DeviceConfig`.
 
 ### Grafana Dashboards