e2e cleanup of orphaned service principals and identities #4557
Conversation
shubhadapaithankar
left a comment
LGTM
mociarain
left a comment
Left some thoughts and questions but overall it looks great. This will be a nice change
This LGTM. Just wondering if you tested it in dry-run mode to see whether any edge cases appeared?
@mociarain I tested the changes in dryRun mode: https://msazure.visualstudio.com/AzureRedHatOpenShift/_build/results?buildId=149649712&view=logs&j=a4f1910f-c367-5697-edcd-724d81cde23b&t=bc103e47-9b88-5c56-fcad-f3f2f0b2ed91
kimorris27
left a comment
I took a look at the pipeline run, and I thought it was interesting how there were exactly 100 mock-msi SPs. I think there are more than 100 and we're only getting 100 because the list response is paginated, and we're not checking the other pages.
Some more info here: https://learn.microsoft.com/en-us/graph/paging?tabs=http
TBH I'm not sure if going through all of the pages is worth the trouble. Since the clean pipeline runs on a schedule, we'll get all of the SPs eventually.
I would be cool with merging this without any changes, but I wanted to point this out in case anyone else has other thoughts.
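If going through the pages ever does become worthwhile, a rough sketch of how it could look with the PageIterator from msgraph-sdk-go-core (the approach the linked paging doc describes) is below. The exact generic parameter and factory names vary between SDK versions, so treat this as illustrative rather than code from this PR:

	// imports assumed:
	//   msgraphsdk  "github.com/microsoftgraph/msgraph-sdk-go"
	//   msgraphcore "github.com/microsoftgraph/msgraph-sdk-go-core"
	//   "github.com/microsoftgraph/msgraph-sdk-go/models"

	// listAllApplications walks every page of the applications list instead of
	// stopping at the first page (which is why exactly 100 mock-msi SPs showed up).
	func listAllApplications(ctx context.Context, client *msgraphsdk.GraphServiceClient) ([]models.Applicationable, error) {
		result, err := client.Applications().Get(ctx, nil)
		if err != nil {
			return nil, err
		}

		var apps []models.Applicationable
		iterator, err := msgraphcore.NewPageIterator[models.Applicationable](
			result,
			client.GetAdapter(),
			models.CreateApplicationCollectionResponseFromDiscriminatorValue,
		)
		if err != nil {
			return nil, err
		}

		err = iterator.Iterate(ctx, func(app models.Applicationable) bool {
			apps = append(apps, app)
			return true // returning false would stop early
		})
		return apps, err
	}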
Agreed. Bump it to 48hrs
+1 on this. I think we don't need to change it, but can you add a comment just outlining that this is the case and that we left it this way assuming it will eventually clear them out?
I realized after running the pipeline in dry run mode that using rc.graphClient.Applications().ByApplicationId(objectID).Delete(ctx, nil) would only delete the mock-msi and v4-e2e-* SPs, but not Disk Encryption Sets. These need to be deleted using rc.graphClient.ServicePrincipals().ByServicePrincipalId(objectID).Delete(ctx, nil), as no application object exists to delete for DESs. And if we use ServicePrincipals().Delete() for everything, the SPs would get deleted but leave orphaned application registrations.
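For what it's worth, the distinction could be captured in a small helper along these lines. This is a sketch only: hasApplication is an illustrative parameter, and, as discussed further down the thread, the DES case ends up being handled by deleting the containing resource groups instead:

	// deleteIdentity sketches the two delete paths described above: app-backed
	// identities (mock-msi and v4-e2e cluster SPs) are removed via the application
	// object, which also removes the backing service principal and avoids leaving
	// an orphaned app registration; standalone SPs (such as the ones Disk
	// Encryption Sets use) have no application object, so they must go through
	// the servicePrincipals endpoint directly. objectID is the Graph object id of
	// whichever object is being deleted (the application for app-backed
	// identities, the SP itself otherwise).
	func (rc *ResourceCleaner) deleteIdentity(ctx context.Context, objectID string, hasApplication bool) error {
		if hasApplication {
			return rc.graphClient.Applications().ByApplicationId(objectID).Delete(ctx, nil)
		}
		return rc.graphClient.ServicePrincipals().ByServicePrincipalId(objectID).Delete(ctx, nil)
	}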
a4d2cef to 1648f58
I'm still getting an error for deleting disk encryption sets in dry run mode.
For disk encryption sets, Caden suggested deleting the resource groups that contain these DESs instead of deleting individual DESs, which makes sense to me. There's already existing logic for cleaning up resource groups with certain prefixes.
The existing code does clean up RGs with those prefixes, but it also checks for a specific tag. I think we'll be fine once your changes merge, though, since we won't be checking that tag anymore.
/azp run ci
Azure Pipelines successfully started running 1 pipeline(s).
cadenmarchese
left a comment
This is functional as-is, but I have a few suggestions for readability.
pkg/util/purge/serviceprincipals.go (Outdated)

		return nil
	}

	func (rc *ResourceCleaner) cleanServicePrincipals(ctx context.Context, prefix string, ttl time.Duration) error {
I would consider breaking this function up at this point to make it easier on the reader. I see cleanServicePrincipals currently used for the following:
- Get entra applications matching x filter
- Evaluate whether or not they should be deleted
- Actually delete them
In my mind these are 3 separate functions - wdyt?
I agree, this function has three distinct responsibilities and would be much cleaner if separated. I've divided it into three functions:
- getApplicationsByPrefix - to list applications with the aro-v4-e2e or mock-msi- prefixes
- shouldDelete - to check for deletion eligibility
- cleanServicePrincipals - to actually perform the deletion
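For readers following along, the orchestrating function might end up looking roughly like this. It is a sketch under the assumptions that ResourceCleaner carries the log, graphClient, and dryRun fields referenced elsewhere in this thread and that shouldDelete returns a reason string; it is not the literal PR code:

	// cleanServicePrincipals ties the three pieces together: list matching
	// applications, decide eligibility, then delete (or just log in dryRun mode).
	func (rc *ResourceCleaner) cleanServicePrincipals(ctx context.Context, prefix string, ttl time.Duration) error {
		apps, err := rc.getApplicationsByPrefix(ctx, prefix)
		if err != nil {
			return err
		}

		for _, app := range apps {
			if app.GetDisplayName() == nil || app.GetId() == nil {
				rc.log.Warnf("SKIP: Application missing required fields")
				continue
			}
			displayName, objectID := *app.GetDisplayName(), *app.GetId()

			deleteIt, reason := rc.shouldDelete(ctx, app, ttl)
			if !deleteIt {
				rc.log.Infof("SKIP '%s': %s", displayName, reason)
				continue
			}

			if rc.dryRun {
				rc.log.Infof("DRY RUN: would delete '%s' (objectID: %s)", displayName, objectID)
				continue
			}

			if err := rc.graphClient.Applications().ByApplicationId(objectID).Delete(ctx, nil); err != nil {
				return err
			}
		}
		return nil
	}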
pkg/util/purge/serviceprincipals.go (Outdated)

	if app.GetDisplayName() != nil {
		displayName = *app.GetDisplayName()
	}

	if app.GetAppId() != nil {
		appID = *app.GetAppId()
	}

	if app.GetId() != nil {
		objectID = *app.GetId()
	}
It seems that we need displayName, appID, and objectID to execute the remaining code. Should we simplify these three ifs into a single check and continue if any of them come back missing?
Suggested change (replacing the three separate nil checks above):

	if app.GetDisplayName() == nil || app.GetAppId() == nil || app.GetId() == nil {
		rc.log.Warnf("SKIP: Application missing required fields")
		continue
	}
	displayName := *app.GetDisplayName()
	appID := *app.GetAppId()
	objectID := *app.GetId()
Then, we can get rid of line 80.
	if age < ttl {
		rc.log.Debugf("SKIP '%s': Age %v < TTL %v", displayName, age.Round(time.Hour), ttl)
		continue
We should execute this age check first in the for loop (unless there's something specific about the isMockMSI check that changes that behavior that I'm missing).
Related to that, I see we are checking age in another spot, on lines 102-104:
} else if createdDateTime == nil {
rc.log.Warnf("SKIP '%s' (objectID: %s): No createdDateTime", displayName, objectID)
continue
}
Can we merge this behavior into a single, top-level age check that does the continue if the SP is not yet eligible to be deleted? It seems that it doesn't matter if it's a mock MSI or a CSP: if it's not old enough, we'll leave it alone.
if we merge this behavior with something like:
if createdDateTime == nil || time.Since(*createdDateTime) < ttl {
rc.log.Infof(
"SKIP '%s' (objectID: %s): createdDateTime missing or age < TTL",
displayName, objectID,
)
continue
}
if createdDateTime is nil, then time.Since(*createdDateTime) will panic when it tries to dereference the nil pointer.
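A sketch of one way to keep a single, top-level eligibility check without the nil dereference, reusing the log lines already in the diff:

	// Guard the nil case separately before computing the age, so the merged
	// "missing or too young" skip cannot panic on a nil createdDateTime.
	if createdDateTime == nil {
		rc.log.Warnf("SKIP '%s' (objectID: %s): No createdDateTime", displayName, objectID)
		continue
	}
	if age := time.Since(*createdDateTime); age < ttl {
		rc.log.Debugf("SKIP '%s': Age %v < TTL %v", displayName, age.Round(time.Hour), ttl)
		continue
	}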
pkg/util/purge/serviceprincipals.go (Outdated)

	createdDateTime = app.GetCreatedDateTime()

	isMockMSI := strings.HasPrefix(displayName, "mock-msi-")
We use this var twice, but always as !isMockMSI, which makes things harder to read. Should we replace this var with an isE2eClusterServicePrincipal instead so we can wrap our CSP-specific logic?
For example:
if isE2eClusterServicePrincipal {
	resourceGroupName := determineResourceGroupName(displayName)
	shouldKeep, reason := rc.checkSPNeededBasedOnRGStatus(ctx, resourceGroupName, ttl)
	if shouldKeep {
		rc.log.Infof("SKIP '%s': %s", displayName, reason)
		continue
	}
}
It seems that we don't have any special handling for mock MSIs - only CSPs, so we can have a dedicated if for CSPs. And then we can avoid the else if stuff that's tougher to read.
pkg/util/purge/serviceprincipals.go (Outdated)

		return nil
	}

	func determineResourceGroupName(displayName string) string {
Conversely, we probably don't need this function. We can just use strings.TrimPrefix(displayName, "aro-") inline.
Which issue this PR addresses:
Fixes ARO-22824
What this PR does / why we need it:
Managed identities, cluster service principals, and role assignments are cleaned up as part of e2e cluster deletion, but this doesn't occur if the e2e job times out and doesn't make it to the cleanup step.
While running miwi E2E tests in the E2E Build and Release pipeline, we observed that managed identities and cluster service principals are not cleaned up when the E2E job times out. This PR adds a dedicated cleanup job in pkg/util/purge that removes orphaned service principals and the orphaned resource groups left behind by prod e2e clusters.
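For context, the new job might be invoked roughly like this. This is illustrative only: the entrypoint name is hypothetical, and the 48-hour TTL comes from the review discussion above rather than the final code:

	// cleanOrphanedIdentities runs the new service principal cleanup for both
	// prefixes discussed in the review. Sketch only; wiring into the existing
	// purge pipeline entrypoint is assumed, not shown.
	func (rc *ResourceCleaner) cleanOrphanedIdentities(ctx context.Context) error {
		for _, prefix := range []string{"aro-v4-e2e", "mock-msi-"} {
			if err := rc.cleanServicePrincipals(ctx, prefix, 48*time.Hour); err != nil {
				return err
			}
		}
		return nil
	}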
Test plan for issue:
Is there any documentation that needs to be updated for this PR?
no