-
-
Notifications
You must be signed in to change notification settings - Fork 230
Case API v.2: Upsert on external_id
#37118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
corehq/apps/hqcase/api/updates.py
Outdated
| domain, | ||
| data['external_id'], | ||
| raise_multiple=True, | ||
| ) # Raises CommCareCase.MultipleObjectsReturned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should return a 4xx error, as it's a client issue, not server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is that handled somewhere higher up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it is handled higher up, as you've demonstrated in tests. Carry on!
| If a case update includes a value for "external_id" and does not include | ||
| a value for "case_id", then the "create" field may be omitted, and the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through a range of reactions in response to this PR - first was disapproval ("it removes useful protection against creating duplicates!"), then I thought about the external_id constraint and about where this work was coming from and I realized it does make a lot of sense - maybe this integration doesn't have perfect knowledge of the state of our system.
Then I read your caveats in this doc and see that you too have some concerns. An alternative idea would be to require an explicit opt-in to this behavior. The check described here isn't super intuitive. Ideally we'd make create a ternary operator, but that's hard in JSON - know if we can verify that it's present and set to null? Alternatively, it could be a second flag - create_or_update: true (or upsert: true, whichever term is more broadly understood). Then the check could still require that the client specify one of the two flags. That'd let us be sure of the behavior the client expects, without impacting existing integrations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative idea would be to require an explicit opt-in to this behavior. ... Ideally we'd make create a ternary operator, but that's hard in JSON.
"create": null"create_or_update": true"upsert": true
I like the idea of making the behavior opt-in.
Adding another flag feels clumsy, and using "create" as a ternary value feels unintuitive. That's why I settled on neither. But you make a very good point, that we need some way to ensure that this behavior is what the client wants. If I had to choose, I would go with "create": null. It avoids having to draw up logic tables like
create |
create_or_update |
Action |
|---|---|---|
| True | (absent) | Create |
| True | True | Upsert |
| True | False | Create |
| False | (absent) | Update |
| False | True | Update? |
| False | False | Update? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah I was imagining that we'd require you to provide either create or create_or_update, but disallow both. I agree that it's clumsy - could use an enumerable but it's too late since this is already in-use - that'd have to be a v3 change. I also like "create": null - from a quick look at the code it doesn't seem too hard to implement either, basically just change
is_creation = data.pop('create', None)to
is_creation = data.pop('create', ...)and weave that in to what you've already done here.
Other thought - this applies only to bulk updates, right? Does the same issue also apply to individual updates? Think we need to carve out a similar exemption there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering the same thing. I settled on implementing this for bulk updates only, because YAGNI. But moving this change into _get_individual_update() might offer more consistent functionality.
What do you think of 77c8fec ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding my 2 cents, my vote is for "create": null . It seems like the cleanest and most robust design. It's explicit, which is better than the original "omit the flag" approach. But more importantly, it avoids the state-management complexity of adding a new "upsert": true flag . The API shouldn't require people to cross-validate two separate flags. Centralizing all three write-modes (true, false, null) onto the single "create" key is much cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I settled on implementing this for bulk updates only, because YAGNI
That is compelling, but I don't think I have strong opinions either way here (yet!)
corehq/apps/hqcase/api/updates.py
Outdated
| if 'create' not in data: | ||
| raise UserError("A 'create' flag is required for each update.") | ||
| updates.append(_get_individual_update(domain, data, user, is_creation)) | ||
| updates.append(_get_individual_update(domain, data, user, is_creation=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I prefer either
| updates.append(_get_individual_update(domain, data, user, is_creation=True)) | |
| updates.append(_get_individual_update(domain, data, user, is_creation=data.pop('create'))) |
or making the other caller of _get_individual_update insert create into data, so there's only one way to pass it in.
corehq/apps/hqcase/api/updates.py
Outdated
| # in bulk and individual updates to override `is_creation` for | ||
| # upsert on external_id when `data['create']` is `None` and | ||
| # `data["external_id"]` is set and data["case_id"] is not set. | ||
| if data['create'] is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possible way of doing this is making a JsonCaseUpsert (a sibling of JsonCaseCreation and JsonCaseUpdate). The new logic here is working at a different abstraction level than the other parts of the function.
The main thing that gives me pause is that this adds logic to look up cases by external ID, but the JsonCase* classes already support that, and they do the lookup in bulk. See test_update_by_external_id for reference. The main distinction between the two classes is how they handle case IDs and external IDs (and which fields are required).
I'm trying to think of a really clean way to do this and am coming up short - the set of required fields for an upsert varies depending on whether it's an update or an insert, so maybe dispatching to one of the two is the best course of action. Or maybe JsonCaseUpsert.get_caseblock does something like:
def get_caseblock(self, case_db):
case_id = case_db.get_by_external_id(self.external_id)
update_class = JsonCaseCreation if not case_id else JsonCaseUpdate
update = update_class.wrap(self)
return update.get_caseblock(self, case_db)Again, I'm not certain that would come out ideal, but I'll pass it back to you to get your take.
|
When you have a chance @esoergel could you check that the last commits, specifically --
-- implement the changes we discussed. (I'm also happy to squash this branch into a handful of commits and open a new PR if you think that will be easier to review.) |
| def test_get_bulk_duplicate_external_id(self): | ||
| result = get_bulk(self.domain, self.web_user, external_ids=['test2']) | ||
|
|
||
| # When multiple cases have the same external_id, only one is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, I guess if you try to get the same case twice, once by case ID and once by external ID, that'd probably show up twice, yeah?
| messaging_events, name="api_messaging_event_detail"), | ||
| url(r'v0\.6/case/bulk-fetch/$', case_api_bulk_fetch), | ||
| url(r'case/v2/bulk-fetch/$', case_api_bulk_fetch, name='case_api_bulk_fetch'), | ||
| # match v0.6/case/ AND v0.6/case/e0ad6c2e-514c-4c2b-85a7-da35bbeb1ff1/ trailing slash optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In retrospect, I wonder if I could've just added an additional URL pattern instead of a complicated regex
| url(r'v0\.6/case/ext/(?P<external_id>[\w\-]+)/$', case_api), | ||
| url(r'case/v2/ext/(?P<external_id>[\w\-]+)/$', case_api), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to go above the other two lines? Guessing not, because it has an extra slash? Does the [\w\-]+ character set match what we allow for external ID elsewhere? Do we allow ., for instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 We could extend the pattern and then URL-decode the value to allow any Unicode characters in an external ID. (Edit: Isn't it already URL decoded?)
| case_block = CaseBlock( | ||
| case_id=case_id, | ||
| case_type='player', | ||
| case_name='Irina Levitina', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
♟️
| assert res.status_code == 200 | ||
| assert res.json()['case_id'] == self.case_ids[0] | ||
|
|
||
| def test_get_single_case_by_external_id(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this file is called test_case_api_bulk_get, but this functionality is not part of that
| elif data['case_id'] != case['case_id']: | ||
| return JsonResponse( | ||
| {'error': 'The given value of "case_id" does not match the ' | ||
| 'existing value for the case identified by ' | ||
| f'external_id = "{external_id}". "case_id" is ' | ||
| 'read-only and cannot be modified.'}, | ||
| status=400, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea to call this out specifically
|
|
||
|
|
||
| @sharded | ||
| class TestIndividualUpdate(TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what this test covers, I don't think it needs to be a DB test - it doesn't actually submit any forms or create cases, so the tearDown is superfluous, and it looks like _get_individual_update doesn't actually even use the domain arg (must be leftover from some old code shuffling). All it needs user for is for the user ID, so that could easily be mocked, or switch the arg to user_id
| FormProcessorTestUtils.delete_all_cases(self.domain) | ||
| FormProcessorTestUtils.delete_all_xforms(self.domain) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This likewise doesn't actually submit any forms
| @privilege_enabled(privileges.API_ACCESS) | ||
| @flag_enabled("API_THROTTLE_WHITELIST") | ||
| @patch("corehq.apps.hqcase.api.updates.validate_update_permission", MagicMock()) | ||
| class TestCaseUpdateView(TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: test_case_update_api.py does view level testing with the django test client (against the case update API) - might be able to share setup by moving some of this there.
corehq/apps/hqcase/api/updates.py
Outdated
| if 'create' not in data: | ||
| raise UserError("A 'create' flag is required for each update.") | ||
| is_creation = data.pop('create') | ||
| if is_creation is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still feel like this earlier comment applies:
#37118 (comment)
This does a DB hit in a loop, and this function otherwise doesn't hit the DB. If you look at the current implementation of handle_case_update you can see the data pipeline I'd imagined for updates:
- Wrap the raw (mostly) JSON provided by the user, which does some basic validation
- Initialize a state object
case_dbthat will perform all necessary lookups as optimally as possible (interacting with the wrapped updates to know what to fetch) - Perform a location validation step that requires this state
- Get the caseblocks from the updates, using the
case_dbto do whatever ID lookups are necessary. This may also error if a case is missing.
The location validation step was a later addition - if I were to redo it, the DB lookups would happen on the stateful DB class, and the actual checks would happen on the update class, either in a new validate method or as part of get_caseblock.
I think ideally this would somehow fit into that pattern, where this function only wraps the JSON in an update class that will require a case_db instance to fully evaluate the caseblock. That way it fits the pattern of the other flows through this code, and uses the same bulk ID lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also, how do you feel about dropping the domain parameter of _get_individual_update(), _get_upsert_update() and _get_bulk_updates()?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you take a look at a1616f4 and let me know if that's in line with what you have in mind?
Yeah I love it - that's great!
dropping the domain parameter
Yeah sounds good to me - YAGNI and all that. Made a mention of it in this comment actually. Could also switch user to user_id while you're at it and save some trouble with test mocks.
esoergel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a bunch of open comments from my last review, but I think they're all pretty nitpicky and small. I like the big picture approach overall
| if 'case_id' in data: | ||
| raise UserError("UPSERT does not allow case_id to be specified") | ||
| if not data.get('external_id'): | ||
| raise UserError("UPSERT requires external_id to be specified") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these two are redundant with validation done in JsonCaseUpsert, right?
corehq/apps/hqcase/api/updates.py
Outdated
| if 'create' not in data: | ||
| raise UserError("A 'create' flag is required for each update.") | ||
| is_creation = data.pop('create') | ||
| if is_creation is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you take a look at a1616f4 and let me know if that's in line with what you have in mind?
Yeah I love it - that's great!
dropping the domain parameter
Yeah sounds good to me - YAGNI and all that. Made a mention of it in this comment actually. Could also switch user to user_id while you're at it and save some trouble with test mocks.
| """Handles UPSERT operations where create/update is determined at lookup time.""" | ||
| external_id = jsonobject.StringProperty(required=True) | ||
|
|
||
| # Required because we might create |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think this is fine from a user's perspective? We could in theory be a little more flexible and only require it if an update does turn out to be a create, though maybe there's no real need. Like, if they know it's not a create, then maybe they shouldn't be upserting. Is that about what you were thinking?
| case_type = jsonobject.StringProperty(required=True) | ||
| owner_id = jsonobject.StringProperty(required=True) | ||
|
|
||
| _is_case_creation = None # Determined when get_case_id() is called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This introduces a temporal coupling - attempting to access _is_case_creation before it's set could cause confusing issues. You could avoid this by raising an error, something like this:
| _is_case_creation = None # Determined when get_case_id() is called | |
| _raw_is_case_creation = ... # Determined when get_case_id() is called | |
| @property | |
| def _is_case_creation(self): | |
| if self._raw_is_case_creation is ...: | |
| raise Exception("_is_case_creation has not yet been initialized") | |
| return self._raw_is_case_creation |
Though I think this currently depends on that ordering - ids_to_find sees None for all upserts, and get_caseblock sees the final value. Another approach might be to use two different things for those - in fact, I see there's already a is_new_case property that just returns this value. Here's an idea:
class JsonCaseCreation:
is_new_case = True
class JsonCaseUpdate:
is_new_case = True
class JsonCaseUpsert:
def is_new_case(self):
... # Something like the suggestion above
ids_to_find = {
update.external_id for update in self.updates
if not isinstance(update, JsonCaseCreation) and not update.case_id
} | {...}
Technical Summary
Some context: SUPPORT-25970
(SaaS/CommCare ticket pending)
Adds upsert functionality on
external_idfor Case API v.2 Bulk Create/Update Cases.See the update to documentation for details and warnings.
Curious to hear your thoughts on this @esoergel
Safety Assurance
Safety story
Automated test coverage
Yes
QA Plan
QA planned. Ticket pending.
Rollback instructions
Labels & Review