Skip to content

Comments

WC2-875: Add command to clean-up instance duplicates#2747

Open
bmonjoie wants to merge 4 commits intodevelopfrom
fix/WC2-875_add_command_to_clean_up_duplicates
Open

WC2-875: Add command to clean-up instance duplicates#2747
bmonjoie wants to merge 4 commits intodevelopfrom
fix/WC2-875_add_command_to_clean_up_duplicates

Conversation

@bmonjoie
Copy link
Member

What problem is this PR solving?

Some instances have duplicates instances (Instance with the same UUID). This adds a command and a celery task to fix the issue.

Related JIRA tickets

WC2-875

Changes

The command looks for duplicate. Then, for each one, it tries to merge them the best it can.

How to test

With a dump of a database with duplicates. Either execute the command or launch the celery task.

@bmonjoie bmonjoie requested a review from madewulf February 16, 2026 15:25
@bmonjoie bmonjoie self-assigned this Feb 16, 2026
@bmonjoie bmonjoie added the bug Something isn't working label Feb 16, 2026
self.stdout.write(self.style.ERROR(" - file_name is None, that's odd!"))
continue
# We should use the content of the latest file received.
file_name = Instance.objects.filter(uuid__isnull=True, file_name=duplicate["file_name"]).order_by(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file_name vairable could be renamed file_name_qs to express that it's a queryset

Copy link
Member

@tdethier tdethier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test that because I don't have a WFP setup, but it looks good. I think there's just an order_by() missing

Comment on lines 55 to 72
has_content = Instance.objects.filter(
uuid=duplicate["uuid"], file_name=duplicate["file_name"], json__isnull=False
)
if has_content.count() == 1:
self.stdout.write(
f"{index}. Duplicate with uuid '{duplicate['uuid']}' has only one instance with content."
)
self._delete_instances(
uuid=duplicate["uuid"], file_name=duplicate["file_name"], first_id=has_content.first().id
)
single_content += 1
elif has_content.count() > 1:
self.stdout.write(
f"{index}. Duplicate with uuid '{duplicate['uuid']}' has {has_content.count()} instances with content."
)
# Based on the code of `Instance.import_data`, the first one is always the one that has the data.
self._delete_instances(
uuid=duplicate["uuid"], file_name=duplicate["file_name"], first_id=has_content.first().id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do has_content.first() but your has_content queryset doesn't have an .order_by(). I'm afraid you can't know for sure which one you're going to get here

Comment on lines +143 to +149
def _delete_instances(self, uuid: str, file_name: Union[str, None], first_id: str):
to_delete = Instance.objects.filter(uuid=uuid).exclude(id=first_id)
if file_name:
to_delete = to_delete.filter(file_name=file_name)
self.stdout.write(self.style.WARNING(f" - Deleting {to_delete.count()} instances and linked entities"))
Entity.objects.filter(attributes__in=to_delete).delete()
to_delete.delete()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these instances also have InstanceFile objects that need to be deleted? I'm not sure if that scenario can happen with the issue we've had

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we'll have InstanceFile as there are no attachment in WFP

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, good

multiple_content += 1
else:
self.stdout.write(f"{index}. Duplicate with uuid '{duplicate['uuid']}' has no instances with content.")
first = Instance.objects.filter(uuid=duplicate["uuid"], file_name=duplicate["file_name"]).first()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about .first() and .order_by()

list_filter = ("birth_date", "gender", "account", "guidelines", ProgrammeType)
list_display = ("id", "birth_date", "gender", "account", "guidelines")
actions = [create_uuid_index_action]
actions = [create_uuid_index_action, clean_up_duplicates_action, clean_up_duplicates_action_dry_run]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL

@bmonjoie bmonjoie requested a review from tdethier February 19, 2026 15:43
Copy link
Member

@tdethier tdethier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants