-
Notifications
You must be signed in to change notification settings - Fork 104
feat: Add status validation to AKSNodeClass to ensure disk encryption set perms are present #1372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
||
| // Cache to avoid redundant DES RBAC validation checks | ||
| // Stores the DES ID and the timestamp when it was last successfully validated | ||
| validatedDES string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably not necessary to further store the ID here since it's available from opts...
| // DESValidationCacheTTL defines how long to cache successful DES RBAC validations | ||
| // After this duration, validation will be re-run to detect permission revocations | ||
| // 1 minute balances performance with timely detection of permission changes | ||
| DESValidationCacheTTL = 1 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TTL should probably be higher since we suspect this RBAC will not change very often if ever.
If it's successful on first try, we maybe don't want to recheck at all or very rarely.
If it's failure, maybe we should check each time (or at least frequently) to see if the role assignments have been added by the user yet
| return &ValidationReconciler{ | ||
| kubeClient: kubeClient, | ||
| diskEncryptionSetsAPI: diskEncryptionSetsAPI, | ||
| options: opts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe instead of passing in opts, we just set the DES id as one of the fields of the reconciler
| "Grant the Reader role on the DiskEncryptionSet to the controlling identity. "+ | ||
| "For self-hosted installations, this is the Karpenter workload identity. "+ | ||
| "For managed/AKS-hosted installations, this is the AKS cluster control plane identity. "+ | ||
| "See https://learn.microsoft.com/en-us/azure/aks/use-karpenter for details", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should instead have this link to the DES doc
|
|
||
| BeforeEach(func() { | ||
| ctx = context.Background() | ||
| reconciler = status.NewValidationReconciler() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: add unit tests for failure vs. success scenarios and how they impact caching here. maybe instead of "success time".
If we decide to do away with the cache for successful validation, then we could instead have a field on the reconciler:
validatedDES bool
| }) | ||
| ctx = options.ToContext(ctx) | ||
| statusController := status.NewController(env.Client, azureEnv.KubernetesVersionProvider, azureEnv.ImageProvider, env.KubernetesInterface, azureEnv.SubnetsAPI) | ||
| statusController := status.NewController(env.Client, azureEnv.KubernetesVersionProvider, azureEnv.ImageProvider, env.KubernetesInterface, azureEnv.SubnetsAPI, azureEnv.DiskEncryptionSetsAPI, options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
options will be removed if DES id is part of the reconciler constructor
| } | ||
|
|
||
| // HasRole checks if principalID has roleDefinitionID at scope. | ||
| func (r *RBACManager) HasRole(ctx context.Context, scope, principalID, roleDefinitionID string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove ==> no longer used
| } | ||
|
|
||
| // RemoveRole removes all role assignments of principalID at scope. | ||
| func (r *RBACManager) RemoveRole(ctx context.Context, scope, principalID string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove ==> no longer used
|
|
||
| ctx := context.Background() | ||
|
|
||
| // TODO(rbac): Use a shared constant for the RBAC error message here and in the implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take care of this ^^
Description
Adds additional validation so the AKSNodeClass will be marked as
NotReadyif Karpenter doesn't have permission to read disk encryption sets.I've gone with a minimal approach s.t. the controller will attempt to get the DiskEncryptionSet with the supplied ID. If it fails with a 403, the error message makes it clear that a user will need to assign a role to the controlling identity in order to create VMs.
Additionally, during development (and when delegating some tasks to AI), I found that the identity model for self-hosted vs. managed clusters was not clear and it made it more difficult to determine which to grant permissions to. I added a doc (which maybe should also be extended as a
skillfor AI) to help bridge this gap. I also had AI add a doc for different modes of running e2e tests since that, too, wasn't clear.How was this change tested?
Does this change impact docs?
Release Note