DHIS2-20886: Build reliability into event hooks #22865
Conversation
…very by (1) persisting events into a partitioned outbox table and (2) sourcing the events from this table thanks to the new OutboxDrain class. To keep the implementation simple, the first failed event of an event hook is retried every so often: OutboxDrain will not progress onto the next event until the failed event is delivered (TODO: figure out what to do if the event backlog grows too big).

- replace thread-per-request model with non-blocking event loop
- use platform cache instead of adhoc cache to cache event hook targets

BREAKING CHANGE: delete all event hook targets except web given the absence of test coverage
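The drain semantics described above (deliver in order, block on the first failure, retry it until it succeeds) can be sketched as follows. This is an illustrative sketch only, not the actual OutboxDrain implementation; the names and signature are hypothetical, and the real class reads from the partitioned outbox table rather than an in-memory list.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the drain loop: events are consumed in order and
// the first failed event blocks progress, which yields at-least-once,
// in-order delivery at the cost of head-of-line blocking.
class OutboxDrainSketch {
    /**
     * Attempts to deliver events in order starting at {@code from}.
     * Returns the index of the first event that could not be delivered
     * (equal to events.size() if all succeeded), i.e. where the next
     * drain run should resume.
     */
    static int drain(List<String> events, int from, Predicate<String> deliver) {
        int next = from;
        while (next < events.size()) {
            if (!deliver.test(events.get(next))) {
                // Do not move past a failed event; it is retried on the
                // next run, so duplicates downstream are possible by design.
                break;
            }
            next++;
        }
        return next;
    }
}
```

Because a failed delivery is retried from the same position, a consumer may see the same event more than once, which is why the backlog can grow if one event keeps failing.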
Firstly, nice diagrams showing the changes and aiming to improve this feature 👍
...ervices/dhis-service-event-hook/src/main/java/org/hisp/dhis/eventhook/OutboxRotationJob.java
...migration/src/main/resources/org/hisp/dhis/db/migration/2.43/V2_43_46__create_outbox_log.sql
- point to the next message to be processed instead of the last processed message
- do not delete partitions when they contain undelivered events, and prevent new partitions from being created when a threshold has been reached: warnings are logged when this state is reached
- swallow exceptions that happen when persisting an outbox message so that the web request is not interrupted
- have EventHookListener join the web request transaction in order to avoid a situation where the event is persisted but the request transaction was rolled back
- roll back and skip failed event hooks in EventHookListener
- rename OutboxLog to EventHookOutboxLog and the corresponding table
- fix: solved issue where handlers are added twice during reload in EventHookSer
- test: rewritten EventHookListenerTest as an integration test
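The "swallow exceptions when persisting an outbox message" change above can be sketched as below. This is a simplified, hypothetical wrapper, not the real EventHookListener (which joins the web request's transaction rather than catching around a plain callback): the point is only that an outbox write failure must never interrupt the web request.

```java
// Hypothetical sketch: an outbox persistence failure is logged and
// swallowed so the surrounding web request completes normally.
class SafeOutboxWriter {
    /** Returns true if the persist callback ran without throwing. */
    static boolean tryPersist(Runnable persist) {
        try {
            persist.run();
            return true;
        } catch (RuntimeException ex) {
            // Swallow: log and carry on; the request is not interrupted,
            // at the cost of possibly losing this one event.
            System.err.println("outbox persist failed: " + ex.getMessage());
            return false;
        }
    }
}
```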
Yes, there can be duplicates, which is by design. We can look into adding IDs like logical timestamps at a later stage to help with de-duplication, but for now the plan is to make it clear in the docs that consumers should expect duplicates.
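Until IDs or logical timestamps exist, consumer-side de-duplication is the consumer's job. A minimal sketch, assuming (hypothetically) that each delivered event carries some unique identifier the consumer can track:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal consumer-side de-duplication sketch for at-least-once delivery.
// Assumes a per-event unique id, which the PR does not yet provide; this
// is illustrative, not part of the event hook API.
class DedupingConsumer {
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the event is new and was processed, false if duplicate. */
    boolean accept(String eventId) {
        return seen.add(eventId);
    }
}
```

In practice the seen-set would need to be bounded or persisted, but the idea is the same: make event handling idempotent.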
The last commit adds a change so that it spits out a warning when there's a message that can't be delivered for one reason or another. The build-up is avoided by preventing new partitions from being created once a certain threshold is reached. Of course, this means that new events will be discarded. I'd like, if possible, to avoid max retries, because that would mean an event can be dropped, which gets us in a pickle due to ordering (very relevant for Tracker).
It's fair to say that implementing retries while also ensuring that the correct order is kept would have been much more complicated to get right.
Possibly. When we start supporting event hooks for Tracker, we need to assume that there can be many creates and updates happening in a short amount of time.
Yes. For sure, https://docs.dhis2.org/en/develop/using-the-api/dhis-core-version-master/event-hooks.html will need to be updated.
I discussed this with @KaiVandivier and @amcgee. I believe we are in agreement. There is a need for web hooks as I keep hearing from the field. The need for JMS/Kafka hooks is less apparent. Luckily for us, event hooks are totally broken in v42 (https://dhis2.atlassian.net/browse/DHIS2-20744) and I don't think we ever really promoted this feature.
I've expanded test coverage, but yes, we can do better. Regarding performance, I'll explore how we can benchmark this, though that will be in a separate PR. In the meantime, I'll try to benchmark it manually.
add extra partitions during rotation as a safeguard
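The rotation safeguard can be sketched roughly as below. This is an illustrative model only: the real OutboxRotationJob rotates PostgreSQL partitions, not an in-memory queue, and the names here are invented. It shows the two rules from the discussion: old partitions are dropped only when all their events are delivered, and new partitions stop being created at a threshold (past which new events are discarded and a warning is logged).

```java
import java.util.Deque;

// Hypothetical rotation sketch: each queue element models one partition,
// true = all of its events have been delivered.
class RotationSketch {
    /**
     * Drops fully delivered leading partitions, then creates up to
     * {@code extra} new partitions as a safeguard without exceeding
     * {@code maxPartitions}. Returns how many partitions were created.
     */
    static int rotate(Deque<Boolean> partitions, int maxPartitions, int extra) {
        // Drop only partitions with no undelivered events.
        while (!partitions.isEmpty() && partitions.peekFirst()) {
            partitions.removeFirst();
        }
        int created = 0;
        while (created < extra && partitions.size() < maxPartitions) {
            partitions.addLast(false); // new, not yet delivered
            created++;
        }
        // Beyond the threshold nothing is created: new events would be
        // discarded and a warning logged (per the discussion above).
        return created;
    }
}
```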



This PR builds at-least-once semantics into event hook delivery. Currently, an event hook is not reliable because it only offers at-most-once delivery guarantees. Furthermore, it might not scale well since it implements a thread-per-request model to emit events. An attempt to illustrate the current design is shown below:
The proposed design is to (1) persist events into a partitioned outbox table and (2) source the events from this table via the new OutboxDrain class. Partitioning allows us to avoid deletes since old partitions can simply be dropped. An additional table called OutboxLog keeps track of the last processed event. This is to avoid updating individual event records as "processed". Overall, the result should be minimal impact on the cost of autovacuuming. An attempt to illustrate the new design is shown below:

To keep the implementation simple, the first failed event of an event hook is retried every so often: OutboxDrain will not progress onto the next event until the failed event is delivered (TODO: figure out what to do if the event backlog grows too big).

Additionally, this PR:
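The OutboxLog idea, reduced to its essence: instead of flagging every delivered event row as processed (which churns many rows and feeds autovacuum), the drain advances a single pointer. A minimal sketch with invented names, not the actual DHIS2 schema:

```java
// Hypothetical single-pointer progress tracking: one row/field is updated
// after each delivery instead of marking each event record individually.
class OutboxLogSketch {
    private long nextEventId; // points to the NEXT event to process

    OutboxLogSketch(long start) {
        this.nextEventId = start;
    }

    long next() {
        return nextEventId;
    }

    /** Called after a successful delivery: one single-row update. */
    void advanceTo(long next) {
        this.nextEventId = next;
    }
}
```

Combined with partitioning (drop old partitions instead of deleting rows), no outbox rows are ever updated or deleted individually.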