fix: redacting user retirement data in lms#105
fix: redacting user retirement data in lms#105ktyagiapphelix2u wants to merge 7 commits intorelease-ulmofrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the user retirement cleanup process to support redaction of PII fields instead of deletion. The changes add configurable redacted values for username, email, and name fields that can be passed through the cleanup workflow.
Changes:
- Modified
bulk_cleanup_retirementsAPI method to accept optional redacted values for username, email, and name fields - Added three new CLI options (
--redacted_username,--redacted_email,--redacted_name) with default value of 'redacted' - Updated tests to verify the new parameters are passed correctly through the function call chain
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/user_retirement/utils/edx_api.py | Updated bulk_cleanup_retirements method signature to accept optional redacted field values and updated docstring to reflect redaction operation |
| scripts/user_retirement/retirement_archive_and_cleanup.py | Added CLI options for redacted field values and updated _cleanup_retirements_or_exit to pass these values to the API |
| scripts/user_retirement/tests/test_retirement_archive_and_cleanup.py | Updated test assertions to include the three new redacted parameters and fixed error message assertion |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
[question] Should this be combined in the same PR as https://github.com/openedx/openedx-platform/pull/37886/changes, only against edx/edx-platform if we plan on merging/testing there first?
We can but when we plan to merge, we will cherry pick that and merge both of them together. Sounds good to you ? or if you want we can change those files in this edx/edx-platform and merge both the PR's. |
5de32d5 to
5b3b4ba
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
scripts/user_retirement/tests/test_retirement_archive_and_cleanup.py
Outdated
Show resolved
Hide resolved
5b5dfb3 to
f527d86
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
scripts/user_retirement/retirement_archive_and_cleanup.py:291
- Documentation incomplete: The docstring for the
archive_and_cleanupfunction describes the cleanup process as "Deleting them from LMS (by username)" but doesn't mention the new redaction step that occurs before deletion. The docstring should be updated to reflect that PII fields are now redacted with configurable values before the records are deleted.
Consider updating step 3 to: "3- Redacting PII fields with configurable placeholder values, then deleting them from LMS (by username)"
"""
Cleans up UserRetirementStatus rows in LMS by:
1- Getting all rows currently in COMPLETE that were created --cool_off_days ago or more,
unless a specific timeframe is specified
2- Archiving them to S3 in an Athena-queryable format
3- Deleting them from LMS (by username)
"""
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
scripts/user_retirement/retirement_archive_and_cleanup.py:290
- Outdated documentation: The docstring states "3- Deleting them from LMS (by username)" but the implementation has changed to first redact PII fields before deletion. The docstring should be updated to reflect this change, for example: "3- Redacting PII fields and then deleting them from LMS (by username)"
"""
Cleans up UserRetirementStatus rows in LMS by:
1- Getting all rows currently in COMPLETE that were created --cool_off_days ago or more,
unless a specific timeframe is specified
2- Archiving them to S3 in an Athena-queryable format
3- Deleting them from LMS (by username)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Outdated
Show resolved
Hide resolved
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Outdated
Show resolved
Hide resolved
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Show resolved
Hide resolved
9a8db7c to
8a26e3f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@ktyagiapphelix2u: Please scroll to the top of the Conversation view for this PR to take care of all copilot PR comments as discussed. Ensure that has been completed before requesting a re-review. We can continue the review on this PR. Thank you! |
@robrap Everything is checked out you can look into it. |
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Outdated
Show resolved
Hide resolved
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Outdated
Show resolved
Hide resolved
| # Django optimizes by deleting the records using their IDs rather than re-filtering | ||
| for delete_query in delete_queries: | ||
| sql_lower = delete_query['sql'] | ||
| # Ensure it's a single DELETE statement for the UserRetirementStatus table |
There was a problem hiding this comment.
Doesn't len(delete_queries) == 1 above already ensure if Is a single DELETE query. Here, don't we just want to ensure that the delete query is FROM user_api_userretirementstatus, or something like that?
| sql_lower = delete_query['sql'] | ||
| # Ensure it's a single DELETE statement for the UserRetirementStatus table | ||
| assert 'DELETE FROM' in sql_lower and 'user_api_userretirementstatus' in sql_lower, ( | ||
| f"DELETE query doesn't match expected pattern: {sql_lower}" |
There was a problem hiding this comment.
Nit: Something like this would be more specific, right?
| f"DELETE query doesn't match expected pattern: {sql_lower}" | |
| f"DELETE query against unexpected table: {sql_lower}" |
| ``` | ||
|
|
||
| Deletes a batch of retirement requests by username. | ||
| Redacts a batch of retirement requests by redacting PII fields. |
There was a problem hiding this comment.
This is still being done by username, right? Also, we are still deleting, right? We are just adding redacting before the delete, right?
| Redacts a batch of retirement requests by redacting PII fields. | |
| Redacts and then deletes a batch of retirement requests by username. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Redact PII fields first, then delete. In case an ETL tool is syncing data | ||
| # to a downstream data warehouse, and treats the deletes as soft-deletes, | ||
| # the data will have first been redacted, protecting the sensitive PII. | ||
| retirements.update( | ||
| original_username=redacted_username, | ||
| original_email=redacted_email, | ||
| original_name=redacted_name | ||
| ) | ||
|
|
||
| # Delete using fresh filter by the redacted values to ensure a single bulk DELETE query | ||
| UserRetirementStatus.objects.filter( | ||
| original_username=redacted_username, | ||
| original_email=redacted_email, | ||
| original_name=redacted_name, | ||
| current_state=complete_state | ||
| ).delete() |
There was a problem hiding this comment.
The DELETE step filters only by the redacted field values + COMPLETE state, which can delete unrelated UserRetirementStatus rows that coincidentally already have those same values (e.g., a record with original_* already set to 'redacted'). To avoid over-deletion, delete the same rows you redacted by tracking their IDs (or another stable identifier) before the UPDATE and then deleting by that ID set (ideally in a transaction).
| # Redact PII fields first, then delete. In case an ETL tool is syncing data | |
| # to a downstream data warehouse, and treats the deletes as soft-deletes, | |
| # the data will have first been redacted, protecting the sensitive PII. | |
| retirements.update( | |
| original_username=redacted_username, | |
| original_email=redacted_email, | |
| original_name=redacted_name | |
| ) | |
| # Delete using fresh filter by the redacted values to ensure a single bulk DELETE query | |
| UserRetirementStatus.objects.filter( | |
| original_username=redacted_username, | |
| original_email=redacted_email, | |
| original_name=redacted_name, | |
| current_state=complete_state | |
| ).delete() | |
| # Track the specific rows to be redacted and deleted using their primary keys | |
| retirement_ids = list(retirements.values_list('id', flat=True)) | |
| # Redact PII fields first, then delete. In case an ETL tool is syncing data | |
| # to a downstream data warehouse, and treats the deletes as soft-deletes, | |
| # the data will have first been redacted, protecting the sensitive PII. | |
| with transaction.atomic(): | |
| UserRetirementStatus.objects.filter(id__in=retirement_ids).update( | |
| original_username=redacted_username, | |
| original_email=redacted_email, | |
| original_name=redacted_name, | |
| ) | |
| # Delete exactly the same rows that were redacted, avoiding over-deletion | |
| UserRetirementStatus.objects.filter(id__in=retirement_ids).delete() |
There was a problem hiding this comment.
The current code is probably ok, but this suggestion does seem more trustworthy. Can you just pull out the ids and update by those ids?
There was a problem hiding this comment.
Yeah that makes sense I’ll grab the IDs first and update/delete using those to make it safer
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Show resolved
Hide resolved
robrap
left a comment
There was a problem hiding this comment.
- New copilot comments come after each commit, and there are new ones now. Reminder to handle them appropriately.
- The new Github files view has features to review all Comments (you can filter out resolved comments). This may be helpful to see what has a response. Thanks.
| raise TypeError("Usernames should be an array.") | ||
|
|
||
| complete_state = RetirementState.objects.get(state_name="COMPLETE") | ||
| # Get the retirement records to delete |
There was a problem hiding this comment.
| # Get the retirement records to delete | |
| # Get the retirement records to redact and delete |
| # Redact PII fields first, then delete. In case an ETL tool is syncing data | ||
| # to a downstream data warehouse, and treats the deletes as soft-deletes, | ||
| # the data will have first been redacted, protecting the sensitive PII. | ||
| retirements.update( | ||
| original_username=redacted_username, | ||
| original_email=redacted_email, | ||
| original_name=redacted_name | ||
| ) | ||
|
|
||
| # Delete using fresh filter by the redacted values to ensure a single bulk DELETE query | ||
| UserRetirementStatus.objects.filter( | ||
| original_username=redacted_username, | ||
| original_email=redacted_email, | ||
| original_name=redacted_name, | ||
| current_state=complete_state | ||
| ).delete() |
There was a problem hiding this comment.
The current code is probably ok, but this suggestion does seem more trustworthy. Can you just pull out the ids and update by those ids?
Description
This PR adds a new option to the retirement cleanup script that lets you pass a custom placeholder value to replace user information. It then sends this value through to the LMS system so both systems can work together to redact user data.
Private Link Ticket
https://2u-internal.atlassian.net/browse/BOMS-293