DHIS2-20886: Build reliability into event hooks #22865

Open

cjmamo wants to merge 4 commits into master from event-hook-reliability

Conversation

@cjmamo (Contributor) commented Jan 29, 2026

This PR builds at-least-once semantics into event hook delivery. Currently, an event hook is not reliable because it only offers at-most-once delivery guarantees. Furthermore, it might not scale well since it implements a thread-per-request model to emit events. An attempt to illustrate the current design is shown below:

[Diagram: event-hook-before (current thread-per-request design)]

The proposed design is to:

  1. Persist events into an append-only partitioned outbox table (one outbox table is created per event hook), and
  2. Source the events in batches from the outbox table thanks to the new OutboxDrain class.

Partitioning allows us to avoid deletes since old partitions can simply be dropped. An additional table called OutboxLog keeps track of the last processed event, which avoids having to mark individual event records as "processed". Overall, the design should have minimal impact on the cost of autovacuuming. An attempt to illustrate the new design is shown below:

[Diagram: event-hook-after (proposed outbox design)]

To keep the implementation simple, the first failed event of an event hook is retried every so often: OutboxDrain will not progress to the next event until the failed event is delivered (TODO: figure out what to do if the event backlog grows too big).
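For illustration, a minimal sketch of this drain loop follows. Only OutboxDrain and OutboxLog are names from this PR; the storage API (fetchBatchAfter, advanceLogTo, lastProcessedId) is hypothetical:

```java
import java.util.List;
import java.util.function.Predicate;

/**
 * Minimal sketch of an at-least-once outbox drain loop. All names
 * except OutboxDrain/OutboxLog are hypothetical, for illustration only.
 */
class OutboxDrainSketch {

  record OutboxEvent(long id, String payload) {}

  interface OutboxStore {
    /** Events with id greater than the watermark, in id order. */
    List<OutboxEvent> fetchBatchAfter(long lastProcessedId, int batchSize);

    /** Record in the log table the id of the last delivered event. */
    void advanceLogTo(long eventId);

    /** Read the watermark back from the log table. */
    long lastProcessedId();
  }

  private final OutboxStore store;

  OutboxDrainSketch(OutboxStore store) {
    this.store = store;
  }

  /** One drain pass, meant to be invoked periodically by a scheduler. */
  void drainOnce(Predicate<OutboxEvent> deliver) {
    long watermark = store.lastProcessedId();
    for (OutboxEvent event : store.fetchBatchAfter(watermark, 100)) {
      if (!deliver.test(event)) {
        // Head-of-line retry: stop and retry this same event on the
        // next pass, preserving event order.
        return;
      }
      // Advancing the watermark only after a successful delivery is
      // what yields at-least-once semantics: a crash between delivery
      // and this update causes a redelivery, never a loss.
      store.advanceLogTo(event.id());
    }
  }
}
```

Because the watermark only advances after a successful delivery, a crash between delivering an event and updating the log redelivers that event, which is where the at-least-once (rather than exactly-once) guarantee comes from.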

Additionally, this PR:

  1. Replaces the thread-per-request model with a non-blocking event loop for HTTP requests (see the sketch after this list)
  2. Retires all event hook targets except web given the absence of test coverage
  3. Uses platform cache instead of adhoc cache to cache event hook targets
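
As a sketch of what the non-blocking model in point 1 could look like, Java's built-in java.net.http.HttpClient multiplexes many in-flight requests over a small pool via sendAsync rather than parking a thread per webhook call. This is an illustration only (the URL and payload handling are placeholders), not the code in this PR:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

class NonBlockingEmitSketch {

  // One shared client multiplexes all in-flight webhook calls instead
  // of dedicating a blocked thread to each request.
  private static final HttpClient CLIENT = HttpClient.newHttpClient();

  static CompletableFuture<Integer> emit(String targetUrl, String jsonPayload) {
    HttpRequest request = HttpRequest.newBuilder(URI.create(targetUrl))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
        .build();
    // sendAsync returns immediately; the status code is delivered on
    // the client's executor when the response arrives.
    return CLIENT.sendAsync(request, HttpResponse.BodyHandlers.discarding())
        .thenApply(HttpResponse::statusCode);
  }
}
```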

@cjmamo requested a review from a team on January 29, 2026 at 07:36
@cjmamo self-assigned this on Jan 29, 2026
@cjmamo requested a review from bobjolliffe on January 29, 2026 at 09:24
@cjmamo changed the title from "Build reliability into event hooks" to "DRAFT: Build reliability into event hooks" on Jan 29, 2026
@cjmamo requested a review from mortenoh on January 29, 2026 at 09:26
@cjmamo force-pushed the event-hook-reliability branch 4 times, most recently from b480f0a to a9457db, on February 2, 2026 at 12:19
@cjmamo force-pushed the event-hook-reliability branch 2 times, most recently from df03c06 to dc40561, on February 9, 2026 at 11:56
@cjmamo marked this pull request as ready for review on February 9, 2026 at 11:58
@cjmamo changed the title from "DRAFT: Build reliability into event hooks" to "Build reliability into event hooks" on Feb 9, 2026
@cjmamo force-pushed the event-hook-reliability branch from dc40561 to 70adbf5 on February 9, 2026 at 12:14
@cjmamo changed the title from "Build reliability into event hooks" to "DHIS2-20886: Build reliability into event hooks" on Feb 9, 2026
@cjmamo force-pushed the event-hook-reliability branch 5 times, most recently from c52d9e7 to d9b1818, on February 10, 2026 at 10:33
@cjmamo requested a review from a team on February 10, 2026 at 10:33
@cjmamo force-pushed the event-hook-reliability branch 3 times, most recently from 9ecd5a9 to fdfef43, on February 10, 2026 at 17:17
Build reliability into event hook delivery by (1) persisting events into a partitioned outbox table and (2) sourcing the events from this table thanks to the new OutboxDrain class. To keep the implementation simple, the first failed event of an event hook is retried every so often: OutboxDrain will not progress to the next event until the failed event is delivered (TODO: figure out what to do if the event backlog grows too big).

replace thread-per-request model with non-blocking event loop

use platform cache instead of adhoc cache to cache event hook targets

BREAKING CHANGE: delete all event hook targets except web given the absence of test coverage
@cjmamo force-pushed the event-hook-reliability branch from fdfef43 to 6643674 on February 10, 2026 at 17:19
@david-mackessy (Contributor) commented Feb 11, 2026

Firstly, nice diagrams showing the changes and aiming to improve this feature 👍
A few comments (keep in mind I am not familiar with this feature at all):

  1. there is a distinct lack of Javadocs on (i) new classes and (ii) new methods
    • Javadocs that give an overview of what something does and the decisions made really help when trying to understand the code
    • what it does / how it does it / why it does it (writer's discretion)
  2. it's not clear what an emitted event can look like; is there potential for duplicate messages?
    • consumers need to be idempotent
    • add an id to messages (an easy way for consumers to know if a message has already been processed)
  3. head-of-line blocking (I note the TODO for this)
    • a poison message can block the entire queue
    • backlog build-up
    • max retries?
    • any way to get around this? does it need manual intervention?
  4. the PR description notes:
    • it might not scale well ...
      • if this said it does not scale well ... with accompanying proof, that would certainly give more weight to the change
    • To keep the implementation simple
      • it looks far less simple than the previous implementation
      • not saying it's not better, it just looks far less simple
  5. is this feature viewed as a part of the system that is used at high volume by implementations?
    • reports of poor performance at high volume?
  6. is the change being accompanied by docs?
  7. what are the consequences for implementations that currently rely on kafka/jms hooks?
    • was this discussed & agreed to be dropped?
    • erodes implementations' trust in newly-added features (introduced in v40? removed in v43?)
    • not sure a lack of test coverage is a good enough reason on its own to remove something (a lot of the code base would be removed with that reasoning 😂, which might not be a bad thing)
  8. there is no mention of testing in the PR description
    • given that delivery of messages & potential performance issues seem to be the main reasons for the change, I would expect tests showing previous (poor behaviour) vs new (improved behaviour)

point to the next message to be processed instead of the last processed message

do not delete partitions when they contain undelivered events and prevent new partitions from being created when a threshold has been reached: warnings are logged when this state is reached

swallow exceptions that happen when persisting an outbox message so that the web request is not interrupted

have EventHookListener join the web request transaction in order to avoid a situation where the event is persisted but the request transaction was rolled back

rollback and skip failed event hooks in EventHookListener

rename OutboxLog to EventHookOutboxLog and corresponding table

fix: solved issue where handlers are added twice during reload in EventHookSer

test: rewritten EventHookListenerTest as an integration test
@cjmamo (Contributor, Author) commented Feb 12, 2026

it's not clear what an emitted event can look like; is there potential for duplicate messages?

  • consumers need to be idempotent
  • add an id to messages (an easy way for consumers to know if a message has already been processed)

Yes, there can be duplicates, which is by design. We can look into adding IDs such as logical timestamps at a later stage to help with de-duplication, but for now the plan is to make it clear in the docs that consumers should expect duplicates.
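
For illustration, here is a consumer-side de-duplication sketch. It assumes the consumer can derive some stable key from the payload; the eventKey parameter is hypothetical until proper IDs are added:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of an idempotent consumer that drops redeliveries. */
class IdempotentConsumerSketch {

  // Keys of messages already handled. A real consumer would persist
  // or bound this set rather than grow it in memory forever.
  private final Map<String, Boolean> seen = new ConcurrentHashMap<>();

  void onMessage(String eventKey, Runnable handler) {
    // putIfAbsent returns null only for the first delivery of a key,
    // so redeliveries from the at-least-once drain are ignored.
    if (seen.putIfAbsent(eventKey, Boolean.TRUE) == null) {
      handler.run();
    }
  }
}
```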

head-of-line blocking (I note the TODO for this)

  • a poison message can block the entire queue
  • backlog build-up
  • max retries?
  • any way to get around this? does it need manual intervention?

The last commit adds a change so that a warning is logged when there's a message that can't be delivered for one reason or another. The build-up is avoided by preventing new partitions from being created once a certain threshold is reached; of course, this means that new events will then be discarded. If possible, I'd like to avoid a max-retry limit, because dropping an event would get us in a pickle due to ordering (very relevant for Tracker).
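
A rough sketch of that safeguard; the threshold value and all names here are illustrative, not the actual implementation:

```java
import java.util.logging.Logger;

/** Sketch of the partition-rotation safeguard described above. */
class PartitionRotationSketch {

  private static final Logger LOG = Logger.getLogger("outbox");
  private static final int MAX_PARTITIONS = 10; // illustrative threshold

  interface PartitionStore {
    int partitionCount();
    boolean oldestPartitionFullyDelivered();
    void createPartition();
    void dropOldestPartition();
  }

  void rotate(PartitionStore store) {
    // Never drop a partition that still holds undelivered events.
    if (store.oldestPartitionFullyDelivered()) {
      store.dropOldestPartition();
    }
    if (store.partitionCount() >= MAX_PARTITIONS) {
      // Backlog cap reached: stop rotating in new partitions, which
      // means new events are discarded until the backlog drains.
      LOG.warning("Outbox partition threshold reached; new events will be discarded");
      return;
    }
    store.createPartition();
  }
}
```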

To keep the implementation simple
- it looks far less simple than the previous implementation
- not saying it's not better, it just looks far less simple

It's fair to say that implementing retries while also ensuring that the correct order is kept would have been much more complicated to get right.

is this feature viewed as a part of the system that is used at high volume by implementations?

  • reports of poor performance at high volume?

Possibly. When we start supporting event hooks for Tracker, we need to assume that there can be many creates and updates happening in a short amount of time.

is the change being accompanied by docs?

Yes. For sure, https://docs.dhis2.org/en/develop/using-the-api/dhis-core-version-master/event-hooks.html will need to be updated.

what are the consequences for implementations that currently rely on kafka/jms hooks?

  • was this discussed & agreed to be dropped?
  • erodes implementations' trust in newly-added features (introduced in v40? removed in v43?)
  • not sure a lack of test coverage is a good enough reason on its own to remove something (a lot of the code base would be removed with that reasoning 😂, which might not be a bad thing)

I discussed this with @KaiVandivier and @amcgee. I believe we are in agreement. There is a need for web hooks as I keep hearing from the field. The need for JMS/Kafka hooks is less apparent. Luckily for us, event hooks are totally broken in v42 (https://dhis2.atlassian.net/browse/DHIS2-20744) and I don't think we ever really promoted this feature.

there is no mention of testing in the PR description:

  • given that delivery of messages & potential performance issues seem to be the main reasons for the change, I would expect tests showing previous (poor behaviour) vs new (improved behaviour)

I've expanded test coverage, but yes, we can do better. Wrt performance, I'll explore how we can benchmark this, though that will be in a separate PR. In the meantime, I'll try to benchmark it manually.

add extra partitions during rotation as a safeguard