-
Notifications
You must be signed in to change notification settings - Fork 40
Fix Item Duplicate Detection Across Datetime Indexes #575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jonhealy1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a couple of things
stac_fastapi/elasticsearch/stac_fastapi/elasticsearch/database_logic.py
Outdated
Show resolved
Hide resolved
jonhealy1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bountx Thanks looks great
stac_fastapi/elasticsearch/stac_fastapi/elasticsearch/database_logic.py
Outdated
Show resolved
Hide resolved
stac_fastapi/elasticsearch/stac_fastapi/elasticsearch/database_logic.py
Outdated
Show resolved
Hide resolved
| alias_exists = await self.client.indices.exists_alias(name=alias) | ||
|
|
||
| if alias_exists: | ||
| return await check_item_exists_in_alias(self.client, alias, doc_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional, but would be great. I can see that its raising ConflictError, please consider adding a custom error that would have a relevant message error that this item already exists to improve the error quality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| raise ConflictError( | ||
| f"Item {item['id']} in collection {item['collection']} already exists" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider improving the error to return not only message, but more information such as which item ids are duplicated and in which conllection. Then you can assert easier the error in tests to check if the conflicted item id == the expected item id that is a duplicate:
raise ConflictError(
f"Item {item['id']} in collection {item['collection']} already exists",
item_id=item["id"],
collection=item["collection"],
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
…tence and duplicates
stac_fastapi/opensearch/stac_fastapi/opensearch/database_logic.py
Outdated
Show resolved
Hide resolved
stac_fastapi/opensearch/stac_fastapi/opensearch/database_logic.py
Outdated
Show resolved
Hide resolved
| raise ConflictError( | ||
| f"Item {item['id']} in collection {item['collection']} already exists" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the custom error could be applied here as well as in other ConflictError cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created extension of ConflicError class - ItemAlreadyExistsError
stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/database/utils.py
Outdated
Show resolved
Hide resolved
stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/database/utils.py
Outdated
Show resolved
Hide resolved
stac_fastapi/opensearch/stac_fastapi/opensearch/database_logic.py
Outdated
Show resolved
Hide resolved
| super().__init__(message) | ||
|
|
||
|
|
||
| async def check_item_exists_in_alias(client: Any, alias: str, doc_id: str) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional, please, consider using stricter type hints:
client: Union[AsyncElasticsearch, AsyncOpenSearch]
| return bool(resp["hits"]["total"]["value"]) | ||
|
|
||
|
|
||
| def check_item_exists_in_alias_sync(client: Any, alias: str, doc_id: str) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here.
|
@jonhealy1 I added small fixes that popped up during dev tests and they pass now |
jonhealy1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good nice work
|
@jonhealy1 ready to be merged! |
Description:
When ENABLE_DATETIME_INDEX_FILTERING=true, items with different
datetime values are stored in different datetime-partitioned indexes. The bulk insert methods (
bulk_async_prep_create_item,bulk_sync_prep_create_item) were usingclient.exists(index=alias)directly, which could fail to detect existing items stored in other datetime indexes.Scenario that caused duplicates:
Solution
Extracted the duplicate check logic from
async_prep_create_iteminto reusable helper methods that properly iterate through all indexes in the collection alias:_check_item_exists_in_collection(async)_check_item_exists_in_collection_sync(sync)Updated all three prep methods to use these helpers:
async_prep_create_itembulk_async_prep_create_itembulk_sync_prep_create_itemPR Checklist:
pre-commit run --all-files)make test)