Skip to content

fix: redacting user retirement data in lms#105

Open
ktyagiapphelix2u wants to merge 7 commits intorelease-ulmofrom
ktyagi/redaction
Open

fix: redacting user retirement data in lms#105
ktyagiapphelix2u wants to merge 7 commits intorelease-ulmofrom
ktyagi/redaction

Conversation

@ktyagiapphelix2u
Copy link

@ktyagiapphelix2u ktyagiapphelix2u commented Jan 30, 2026

Description

This PR adds a new option to the retirement cleanup script that lets you pass a custom placeholder value to replace user information. It then sends this value through to the LMS system so both systems can work together to redact user data.

Private Link Ticket

https://2u-internal.atlassian.net/browse/BOMS-293

Copilot AI review requested due to automatic review settings January 30, 2026 06:51
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the user retirement cleanup process to support redaction of PII fields instead of deletion. The changes add configurable redacted values for username, email, and name fields that can be passed through the cleanup workflow.

Changes:

  • Modified bulk_cleanup_retirements API method to accept optional redacted values for username, email, and name fields
  • Added three new CLI options (--redacted_username, --redacted_email, --redacted_name) with default value of 'redacted'
  • Updated tests to verify the new parameters are passed correctly through the function call chain

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
scripts/user_retirement/utils/edx_api.py Updated bulk_cleanup_retirements method signature to accept optional redacted field values and updated docstring to reflect redaction operation
scripts/user_retirement/retirement_archive_and_cleanup.py Added CLI options for redacted field values and updated _cleanup_retirements_or_exit to pass these values to the API
scripts/user_retirement/tests/test_retirement_archive_and_cleanup.py Updated test assertions to include the three new redacted parameters and fixed error message assertion

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@robrap robrap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] Should this be combined in the same PR as https://github.com/openedx/openedx-platform/pull/37886/changes, only against edx/edx-platform if we plan on merging/testing there first?

@ktyagiapphelix2u
Copy link
Author

[question] Should this be combined in the same PR as https://github.com/openedx/openedx-platform/pull/37886/changes, only against edx/edx-platform if we plan on merging/testing there first?

We can but when we plan to merge, we will cherry pick that and merge both of them together. Sounds good to you ? or if you want we can change those files in this edx/edx-platform and merge both the PR's.

Copilot AI review requested due to automatic review settings February 2, 2026 06:01
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

scripts/user_retirement/retirement_archive_and_cleanup.py:291

  • Documentation incomplete: The docstring for the archive_and_cleanup function describes the cleanup process as "Deleting them from LMS (by username)" but doesn't mention the new redaction step that occurs before deletion. The docstring should be updated to reflect that PII fields are now redacted with configurable values before the records are deleted.

Consider updating step 3 to: "3- Redacting PII fields with configurable placeholder values, then deleting them from LMS (by username)"

    """
    Cleans up UserRetirementStatus rows in LMS by:
    1- Getting all rows currently in COMPLETE that were created --cool_off_days ago or more,
        unless a specific timeframe is specified
    2- Archiving them to S3 in an Athena-queryable format
    3- Deleting them from LMS (by username)
    """

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings February 4, 2026 06:02
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

scripts/user_retirement/retirement_archive_and_cleanup.py:290

  • Outdated documentation: The docstring states "3- Deleting them from LMS (by username)" but the implementation has changed to first redact PII fields before deletion. The docstring should be updated to reflect this change, for example: "3- Redacting PII fields and then deleting them from LMS (by username)"
    """
    Cleans up UserRetirementStatus rows in LMS by:
    1- Getting all rows currently in COMPLETE that were created --cool_off_days ago or more,
        unless a specific timeframe is specified
    2- Archiving them to S3 in an Athena-queryable format
    3- Deleting them from LMS (by username)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings February 4, 2026 06:32
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@robrap
Copy link

robrap commented Feb 6, 2026

@ktyagiapphelix2u: Please scroll to the top of the Conversation view for this PR to take care of all copilot PR comments as discussed. Ensure that has been completed before requesting a re-review. We can continue the review on this PR. Thank you!

@ktyagiapphelix2u
Copy link
Author

@ktyagiapphelix2u: Please scroll to the top of the Conversation view for this PR to take care of all copilot PR comments as discussed. Ensure that has been completed before requesting a re-review. We can continue the review on this PR. Thank you!

@robrap Everything is checked out you can look into it.

# Django optimizes by deleting the records using their IDs rather than re-filtering
for delete_query in delete_queries:
sql_lower = delete_query['sql']
# Ensure it's a single DELETE statement for the UserRetirementStatus table
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't len(delete_queries) == 1 above already ensure if Is a single DELETE query. Here, don't we just want to ensure that the delete query is FROM user_api_userretirementstatus, or something like that?

sql_lower = delete_query['sql']
# Ensure it's a single DELETE statement for the UserRetirementStatus table
assert 'DELETE FROM' in sql_lower and 'user_api_userretirementstatus' in sql_lower, (
f"DELETE query doesn't match expected pattern: {sql_lower}"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Something like this would be more specific, right?

Suggested change
f"DELETE query doesn't match expected pattern: {sql_lower}"
f"DELETE query against unexpected table: {sql_lower}"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

```

Deletes a batch of retirement requests by username.
Redacts a batch of retirement requests by redacting PII fields.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still being done by username, right? Also, we are still deleting, right? We are just adding redacting before the delete, right?

Suggested change
Redacts a batch of retirement requests by redacting PII fields.
Redacts and then deletes a batch of retirement requests by username.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemeted

Copilot AI review requested due to automatic review settings February 10, 2026 06:52
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1060 to 1075
# Redact PII fields first, then delete. In case an ETL tool is syncing data
# to a downstream data warehouse, and treats the deletes as soft-deletes,
# the data will have first been redacted, protecting the sensitive PII.
retirements.update(
original_username=redacted_username,
original_email=redacted_email,
original_name=redacted_name
)

# Delete using fresh filter by the redacted values to ensure a single bulk DELETE query
UserRetirementStatus.objects.filter(
original_username=redacted_username,
original_email=redacted_email,
original_name=redacted_name,
current_state=complete_state
).delete()
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DELETE step filters only by the redacted field values + COMPLETE state, which can delete unrelated UserRetirementStatus rows that coincidentally already have those same values (e.g., a record with original_* already set to 'redacted'). To avoid over-deletion, delete the same rows you redacted by tracking their IDs (or another stable identifier) before the UPDATE and then deleting by that ID set (ideally in a transaction).

Suggested change
# Redact PII fields first, then delete. In case an ETL tool is syncing data
# to a downstream data warehouse, and treats the deletes as soft-deletes,
# the data will have first been redacted, protecting the sensitive PII.
retirements.update(
original_username=redacted_username,
original_email=redacted_email,
original_name=redacted_name
)
# Delete using fresh filter by the redacted values to ensure a single bulk DELETE query
UserRetirementStatus.objects.filter(
original_username=redacted_username,
original_email=redacted_email,
original_name=redacted_name,
current_state=complete_state
).delete()
# Track the specific rows to be redacted and deleted using their primary keys
retirement_ids = list(retirements.values_list('id', flat=True))
# Redact PII fields first, then delete. In case an ETL tool is syncing data
# to a downstream data warehouse, and treats the deletes as soft-deletes,
# the data will have first been redacted, protecting the sensitive PII.
with transaction.atomic():
UserRetirementStatus.objects.filter(id__in=retirement_ids).update(
original_username=redacted_username,
original_email=redacted_email,
original_name=redacted_name,
)
# Delete exactly the same rows that were redacted, avoiding over-deletion
UserRetirementStatus.objects.filter(id__in=retirement_ids).delete()

Copilot uses AI. Check for mistakes.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code is probably ok, but this suggestion does seem more trustworthy. Can you just pull out the ids and update by those ids?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense I’ll grab the IDs first and update/delete using those to make it safer

Copy link

@robrap robrap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. New copilot comments come after each commit, and there are new ones now. Reminder to handle them appropriately.
  2. The new Github files view has features to review all Comments (you can filter out resolved comments). This may be helpful to see what has a response. Thanks.

raise TypeError("Usernames should be an array.")

complete_state = RetirementState.objects.get(state_name="COMPLETE")
# Get the retirement records to delete
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Get the retirement records to delete
# Get the retirement records to redact and delete

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 1060 to 1075
# Redact PII fields first, then delete. In case an ETL tool is syncing data
# to a downstream data warehouse, and treats the deletes as soft-deletes,
# the data will have first been redacted, protecting the sensitive PII.
retirements.update(
original_username=redacted_username,
original_email=redacted_email,
original_name=redacted_name
)

# Delete using fresh filter by the redacted values to ensure a single bulk DELETE query
UserRetirementStatus.objects.filter(
original_username=redacted_username,
original_email=redacted_email,
original_name=redacted_name,
current_state=complete_state
).delete()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code is probably ok, but this suggestion does seem more trustworthy. Can you just pull out the ids and update by those ids?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants