Merged

Changes from all commits
8 changes: 2 additions & 6 deletions .github/workflows/linting.yml
@@ -2,15 +2,11 @@ name: Linting

 on:
   push:
-    branches:
-      - develop
+    branches:
       - main
       - staging
   pull_request:
-    branches:
-      - develop
+    branches:
       - main
       - staging
 
 jobs:
   call-workflow-passing-data:
7 changes: 7 additions & 0 deletions .markdownlint-cli2.jsonc
@@ -0,0 +1,7 @@
{
"globs": ["**/*.md"],
"ignores": [
"**/vendor/**",
"**/.git/**"
]
}
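The new `.markdownlint-cli2.jsonc` pairs an include pattern with ignore patterns. A minimal Python sketch of that selection logic (using `fnmatch` as a rough stand-in for the tool's real globstar matching — here `*` also crosses `/` boundaries, which globstar does not, so root-level files behave differently):

```python
import fnmatch
import json

# The .markdownlint-cli2.jsonc added in this PR (it contains no comments,
# so plain json.loads handles it).
CONFIG = json.loads("""
{
  "globs": ["**/*.md"],
  "ignores": ["**/vendor/**", "**/.git/**"]
}
""")

def is_linted(path, config=CONFIG):
    # A file is linted when it matches at least one glob
    # and matches no ignore pattern.
    matched = any(fnmatch.fnmatch(path, g) for g in config["globs"])
    ignored = any(fnmatch.fnmatch(path, i) for i in config["ignores"])
    return matched and not ignored

print(is_linted("docs/install/guide.md"))             # True
print(is_linted("third_party/vendor/lib/README.md"))  # False: under a vendor/ directory
```

The ignore list keeps vendored Markdown and anything under `.git` out of CI lint results while still covering the rest of the repository.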
192 changes: 167 additions & 25 deletions .wordlist.txt
@@ -1,61 +1,203 @@
amd
AFID
Affectioned
AGFHC
Allocatable
ACS
AKS
ARI
Autobuild
bb
burnin
CheckUnitStatus
CleanupPreState
CLI
CN
CNI
computePartition
ConfigMap
ConfigMaps
ConditionalWorkflows
CoreOS
CPX
CrashLoopBackOff
CRD
DKMS
DNS
DockerHub
GPUs
HTTPS
KMM
MOK
NFD
OLM
PCI
RBAC
ROCm
TLS
YAML
allocatable
bool
calico
clusterIP
config
configmap
cryptographic
CRDs
CRs
CronJob
Customizable
daemonset
daemonsets
DaemonSet
Daemonsets
DCM
dcm
Depricated
deivce
DeviceConfig
DeviceIDs
DevicePlugin
DevicePluginArguments
DevicePluginImage
DevicePluginImagePullPolicy
DevicePluginSpec
DKMS
dma
DMC
DNS
Dockerfile
DockerHub
DPX
DriverToolkit
ECC
EnableNodeLabeller
ErrImagePull
flannel
GPUs
gpup
Grafana
GracePeriodSeconds
gst
gpuagent
gpuClientSystemdServices
GKE
hbm
HealthThresholds
Helmify
hostname
hostnames
HSIO
HTTPS
iet
IfNotPresent
IgnoreDaemonSets
IgnoreNamespaces
ImageStream
jq
json
kaniko
KMM
kmod
kubectl
Kubelet
KubeVirt
Kuberntes
Kubernetes
kubeconfig
labeller
Labeler
lifecycle
lvl
MachineConfig
modprobe
MachineConfigOperator
MCO
Mericsclient
MaxParallelWorkflows
MaxUnavailable
MCO
memoryPartition
MetricsExporter
MetricsExporterSpec
MinIO
Minio
MOK
MTLS
namespace
NFD
NMC
NodeCondition
NodeDrainPolicy
NodeIP
NodeLabeller
NodeLabellerArguments
NodeLabellerImage
NodeLabellerImagePullPolicy
Nodelabeller
nodename
Nodeport
NodePort
NodeRemediationLabels
NodeRemediationTaints
NoExecute
NPD
NotReady
OperatorHub
numGPUsAssigned
Observability
oc
OLM
OOM
OpenShift
OperatorHub
Openshift
parition
paritioning
pbqt
pebb
PCI
pcie
perf
PFs
Perses
plugin
PodIP
PreFlight
PreStateDB
prometheus
Promethues
quay
rocminfo
QPX
RAS
RBAC
Redhat
RedHat
RHCOS
RMA
rocminfo
rochpl
ROCm
runtime
SAR
schedulable
SDK
selfcheck
ServiceAccounts
ServiceMonitor
ServiecMonitor
skippedGPUs
Slinkproject
SlinkProject
Slrum
Slurm
SPX
StopOnFailure
SubjectAccessReview
systemd
TestCategory
TesterImage
TimeoutSeconds
TokenReview
Tolerations
TODO
TLS
tolerations
tst
TtlForFailedWorkflows
ubuntu
UI
UID
UNCORRECT
Uncordoning
uninstallation
unschedulable
Upgrademgr
UpgradePolicy
validation
verison
VC
VCN
VFIO
VFs
VMs
webhook
uninstallation
xgmi
YAML
21 changes: 10 additions & 11 deletions README.md
@@ -83,20 +83,19 @@ Installation Options
> It is strongly recommended to use AMD-optimized KMM images included in the operator release. This is not required when installing the GPU Operator on Red Hat OpenShift.

### 3. Install Custom Resource
-After the installation of AMD GPU Operator:
-* By default there will be a default `DeviceConfig` installed. If you are using default `DeviceConfig`, you can modify the default `DeviceConfig` to adjust the config for your own use case. `kubectl edit deviceconfigs -n kube-amd-gpu default`
-* If you installed without default `DeviceConfig` (either by using `--set crds.defaultCR.install=false` or installing a chart prior to v1.3.0), you need to create the `DeviceConfig` custom resource in order to trigger the operator start to work. By preparing the `DeviceConfig` in the YAML file, you can create the resouce by running ```kubectl apply -f deviceconfigs.yaml```.
-* For custom resource definition and more detailed information, please refer to [Custom Resource Installation Guide](https://dcgpu.docs.amd.com/projects/gpu-operator/en/latest/installation/kubernetes-helm.html#install-custom-resource).
-* Potential Failures with default `DeviceConfig`:
-
-  a. Operand pods are stuck in ```Init:0/1``` state: It means your GPU worker doesn't have inbox GPU driver loaded. We suggest check the [Driver Installation Guide]([./drivers/installation.md](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/drivers/installation.html#driver-installation-guide)) then modify the default `DeviceConfig` to ask Operator to install the out-of-tree GPU driver for your worker nodes.
+After the installation of AMD GPU Operator:
+* By default there will be a default `DeviceConfig` installed. If you are using default `DeviceConfig`, you can modify the default `DeviceConfig` to adjust the config for your own use case. `kubectl edit deviceconfigs -n kube-amd-gpu default`
+* If you installed without default `DeviceConfig` (either by using `--set crds.defaultCR.install=false` or installing a chart prior to v1.3.0), you need to create the `DeviceConfig` custom resource in order to trigger the operator start to work. By preparing the `DeviceConfig` in the YAML file, you can create the resouce by running ```kubectl apply -f deviceconfigs.yaml```.
+* For custom resource definition and more detailed information, please refer to [Custom Resource Installation Guide](https://dcgpu.docs.amd.com/projects/gpu-operator/en/latest/installation/kubernetes-helm.html#install-custom-resource).
+* Potential Failures with default `DeviceConfig`:
+  a. Operand pods are stuck in ```Init:0/1``` state: It means your GPU worker doesn't have inbox GPU driver loaded. We suggest check the [Driver Installation Guide]([./drivers/installation.md](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/drivers/installation.html#driver-installation-guide)) then modify the default `DeviceConfig` to ask Operator to install the out-of-tree GPU driver for your worker nodes.
   `kubectl edit deviceconfigs -n kube-amd-gpu default`
 
-  b. No operand pods showed up: It is possible that default `DeviceConfig` selector `feature.node.kubernetes.io/amd-gpu: "true"` cannot find any matched node.
-  * Check node label `kubectl get node -oyaml | grep -e "amd-gpu:" -e "amd-vgpu:"`
-  * If you are using GPU in the VM, you may need to change the default `DeviceConfig` selector to `feature.node.kubernetes.io/amd-vgpu: "true"`
-  * You can always customize the node selector of the `DeviceConfig`.
+  b. No operand pods showed up: It is possible that default `DeviceConfig` selector `feature.node.kubernetes.io/amd-gpu: "true"` cannot find any matched node.
+  * Check node label `kubectl get node -oyaml | grep -e "amd-gpu:" -e "amd-vgpu:"`
+  * If you are using GPU in the VM, you may need to change the default `DeviceConfig` selector to `feature.node.kubernetes.io/amd-vgpu: "true"`
+  * You can always customize the node selector of the `DeviceConfig`.
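Failure (b) comes down to label selection: the operator's node selector only matches nodes carrying the exact label/value pair. A minimal Python sketch of Kubernetes equality-based selector matching (the node label sets below are hypothetical examples, not output from a real cluster):

```python
def selector_matches(selector, node_labels):
    # Kubernetes equality-based selectors: every key/value pair in the
    # selector must appear verbatim in the node's label set.
    return all(node_labels.get(key) == value for key, value in selector.items())

# Default selector from the DeviceConfig discussed above.
selector = {"feature.node.kubernetes.io/amd-gpu": "true"}

bare_metal_worker = {
    "kubernetes.io/os": "linux",
    "feature.node.kubernetes.io/amd-gpu": "true",
}
vm_worker = {
    "kubernetes.io/os": "linux",
    "feature.node.kubernetes.io/amd-vgpu": "true",  # VM workers get amd-vgpu instead
}

print(selector_matches(selector, bare_metal_worker))  # True
print(selector_matches(selector, vm_worker))          # False: no operands scheduled here
```

This is why switching the selector to `feature.node.kubernetes.io/amd-vgpu: "true"` (or customizing it) is the fix when GPUs are passed through to VMs.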

### Grafana Dashboards
