Skip to content

Conversation

@rakechill
Copy link
Contributor

Description
Adds additional validation so the AKSNodeClass will be marked as NotReady if Karpenter doesn't have permission to read disk encryption sets.

I've gone with a minimal approach s.t. the controller will attempt to get the DiskEncryptionSet with the supplied ID. If it fails with a 403, the error message makes it clear that a user will need to assign a role to the controlling identity in order to create VMs.

Additionally, during development (and when delegating some tasks to AI), I found that the identity model for self-hosted vs. managed clusters was not clear and it made it more difficult to determine which to grant permissions to. I added a doc (which maybe should also be extended as a skill for AI) to help bridge this gap. I also had AI add a doc for different modes of running e2e tests since that, too, wasn't clear.

How was this change tested?

  • Updated existing byok e2e test cases to ensure AKSNodeClass is Ready and added an additional OSS-only test to simulate a scenario where the controlling identity (Karpenter workload identity in the case of OSS) tries to create VMs w/o having read access to the disk encryption set.

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

Add AKSNodeClass status validation for BYOK RBAC 


// Cache to avoid redundant DES RBAC validation checks
// Stores the DES ID and the timestamp when it was last successfully validated
validatedDES string
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably not necessary to further store the ID here since it's available from opts...

// DESValidationCacheTTL defines how long to cache successful DES RBAC validations
// After this duration, validation will be re-run to detect permission revocations
// 1 minute balances performance with timely detection of permission changes
DESValidationCacheTTL = 1 * time.Minute
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TTL should probably be higher since we suspect this RBAC will not change very often if ever.

If it's successful on first try, we maybe don't want to recheck at all or very rarely.

If it's failure, maybe we should check each time (or at least frequently) to see if the role assignments have been added by the user yet

return &ValidationReconciler{
kubeClient: kubeClient,
diskEncryptionSetsAPI: diskEncryptionSetsAPI,
options: opts,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of passing in opts, we just set the DES id as one of the fields of the reconciler

"Grant the Reader role on the DiskEncryptionSet to the controlling identity. "+
"For self-hosted installations, this is the Karpenter workload identity. "+
"For managed/AKS-hosted installations, this is the AKS cluster control plane identity. "+
"See https://learn.microsoft.com/en-us/azure/aks/use-karpenter for details",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should instead have this link to the DES doc


BeforeEach(func() {
ctx = context.Background()
reconciler = status.NewValidationReconciler()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: add unit tests for failure vs. success scenarios and how they impact caching here. maybe instead of "success time".

If we decide to do away with the cache for successful validation, then we could instead have a field on the reconciler:
validatedDES bool

})
ctx = options.ToContext(ctx)
statusController := status.NewController(env.Client, azureEnv.KubernetesVersionProvider, azureEnv.ImageProvider, env.KubernetesInterface, azureEnv.SubnetsAPI)
statusController := status.NewController(env.Client, azureEnv.KubernetesVersionProvider, azureEnv.ImageProvider, env.KubernetesInterface, azureEnv.SubnetsAPI, azureEnv.DiskEncryptionSetsAPI, options)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

options will be removed if DES id is part of the reconciler constructor

}

// HasRole checks if principalID has roleDefinitionID at scope.
func (r *RBACManager) HasRole(ctx context.Context, scope, principalID, roleDefinitionID string) bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove ==> no longer used

}

// RemoveRole removes all role assignments of principalID at scope.
func (r *RBACManager) RemoveRole(ctx context.Context, scope, principalID string) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove ==> no longer used


ctx := context.Background()

// TODO(rbac): Use a shared constant for the RBAC error message here and in the implementation
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take care of this ^^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants