-
Notifications
You must be signed in to change notification settings - Fork 244
Adding support for AMDAMA (Supernova) GPUs. #7697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -980,6 +980,32 @@ ensureGPUDrivers() { | |
| fi | ||
| } | ||
|
|
||
| setupAmdAma() { | ||
| if [ "$(isARM64)" -eq 1 ]; then | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this even possible ? to have AMDAMA_MODE == true and with a arm64 arch ? What would be the consequences here if this cluase is true and we expect to have AMDAMA_MODE and we dont setup anymore. |
||
| return | ||
| fi | ||
|
|
||
| if isMarinerOrAzureLinux "$OS"; then | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same, could we have ubuntu ? if it's the case what will happen ? should we baill out instead and cause anode bootstrap failure ? I would expect that Ubuntu/Flatcar and ARM64 will all be blocked at the RP level to provide users with early validation failure ? One more question, does isMarinerOrAzureLinux return true for OSGuard I recall there were extrra check required to ensure it wasn't LinuxOSGuard |
||
| # Install driver | ||
| sudo tdnf install -y azurelinux-repos-amd | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we don't need sudo, since the CSE extension is running as root. we recently had an outage cause some scripts were calling SUDO using an expirer root password..
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we should use the helper function for any dnf command. which handle retries, etc |
||
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | ||
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can we combine into a single |
||
| # Install core package | ||
| sudo tdnf install -y libzip | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. all tdnf calls will not work on VM with restricted egress network. |
||
| sudo tdnf install -y azurelinux-repos-extended | ||
| sudo RPM_FRONTEND=noninteractive tdnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is very much hardcoded, if it's for testing purposes for now, I'm ok but we need to move the package to components.json and also ensure we can detect new versions automatically with renovate. Let's do that in another PR. |
||
| # Install AKS device plugin | ||
| sudo tdnf install -y amdama-device-plugin.x86_64 | ||
| # Configure huge pages | ||
| sudo sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf" | ||
| sudo sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" | ||
| if [ $(systemctl is-active kubelet) = "active" ]; then | ||
| sudo systemctl restart kubelet | ||
| fi | ||
| fi | ||
| } | ||
|
|
||
| disableSSH() { | ||
| # On ubuntu, the ssh service is named "ssh.service" | ||
| systemctlDisableAndStop ssh || exit $ERR_DISABLE_SSH | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -395,6 +395,13 @@ function nodePrep { | |
| echo $(date),$(hostname), "End configuring GPU drivers" | ||
| fi | ||
|
|
||
| # Install and configure AMD AMA (Supernova) drivers if this is an AMA node | ||
| if [ "${AMDAMA_NODE}" = "true" ]; then | ||
| logs_to_events "AKS.CSE.setupAmdAma" setupAmdAma | ||
| else | ||
| logs_to_events "AKS.CSE.setupAmdAma" "echo AMD AMA HW not found!" | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we need to log this eventk, knowing it will be the case 99.9% of the time ? |
||
| fi | ||
|
|
||
| export -f enableManagedGPUExperience | ||
| ENABLE_MANAGED_GPU_EXPERIENCE=$(retrycmd_silent 10 1 10 bash -cx enableManagedGPUExperience) | ||
| if [ "$?" -ne 0 ] && [ "${GPU_NODE}" = "true" ] && [ "${skip_nvidia_driver_install}" != "true" ]; then | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1083,6 +1083,9 @@ | |
| "GPUDriverType": func() string { | ||
| return GetGPUDriverType(profile.VMSize) | ||
| }, | ||
| "AmdAmaEnabledSKU": func() bool { | ||
| return IsAmdAmaEnabledSKU(profile.VMSize) | ||
| }, | ||
| "GetHnsRemediatorIntervalInMinutes": func() uint32 { | ||
| // Only need to enable HNSRemediator for Windows 2019 | ||
| if cs.Properties.WindowsProfile != nil && profile.Distro == datamodel.AKSWindows2019Containerd { | ||
|
|
@@ -1269,6 +1272,15 @@ | |
| return datamodel.FabricManagerGPUSizes[strings.ToLower(size)] | ||
| } | ||
|
|
||
| // IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support. | ||
| func IsAmdAmaEnabledSKU(vmSize string) bool { | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe this is the first time we perform a logic in the VHD to mapping SKU to feature... normally this is done in RP. |
||
| switch vmSize { | ||
| case "Standard_NM16ads_MA35D": | ||
| return true | ||
| } | ||
| return false | ||
| } | ||
|
|
||
| func areCustomCATrustCertsPopulated(config datamodel.NodeBootstrappingConfiguration) bool { | ||
| return config.CustomCATrustConfig != nil && len(config.CustomCATrustConfig.CustomCATrustCerts) > 0 | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -44,6 +44,14 @@ | |
| return false | ||
| } | ||
|
|
||
| func IsAmdAmaEnabledSKU(vmSize string) bool { | ||
| switch vmSize { | ||
| case "Standard_NM16ads_MA35D": | ||
| return true | ||
| } | ||
| return false | ||
| } | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This method duplicated 3 times. Should be no more than 2. (Ideally 1)
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just following the method used in each file. Presumably the ones we don't need will get removed when the move from scripted to scriptless is done?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also. would you recommend breaking form the model to make lint happy? |
||
|
|
||
| // GetStorageAccountType returns the support managed disk storage tier for a give VM size. | ||
| func GetStorageAccountType(sizeName string) (string, error) { | ||
| spl := strings.Split(sizeName, "_") | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -132,6 +132,7 @@ func getCSECommandVariables(config *datamodel.NodeBootstrappingConfiguration) pa | |
| "userAssignedIdentityID": config.UserAssignedIdentityClientID, | ||
| "isVHD": isVHD(profile), | ||
| "gpuNode": strconv.FormatBool(config.EnableNvidia), | ||
| "amdamaNode": strconv.FormatBool(datamodel.IsAmdAmaEnabledSKU(profile.VMSize)), | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks unused? |
||
| "sgxNode": strconv.FormatBool(datamodel.IsSgxEnabledSKU(profile.VMSize)), | ||
| "configGPUDriverIfNeeded": config.ConfigGPUDriverIfNeeded, | ||
| "enableGPUDevicePluginIfNeeded": config.EnableGPUDevicePluginIfNeeded, | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AzureLinux is the only OS that support this VMSku correct ?