DHIS2-20886: Build reliability into event hooks #22865

Open

cjmamo wants to merge 4 commits into master from event-hook-reliability

Conversation

@cjmamo (Contributor) commented Jan 29, 2026

This PR builds at-least-once semantics into event hook delivery. Currently, an event hook is not reliable because it only offers at-most-once delivery guarantees. Furthermore, it might not scale well since it implements a thread-per-request model to emit events. An attempt to illustrate the current design is shown below:

[Diagram: event-hook-before (current thread-per-request design)]

The proposed design is to:

  1. Persist events into an append-only partitioned outbox table (one outbox table is created per event hook), and
  2. Source the events in batches from the outbox table thanks to the new OutboxDrain class.

Partitioning allows us to avoid deletes since old partitions can simply be dropped. An additional table called OutboxLog keeps track of the last processed event, which avoids having to mark individual event records as "processed". Overall, the design should have minimal impact on the cost of autovacuuming. An attempt to illustrate the new design is shown below:

[Diagram: event-hook-after (proposed outbox design)]

To keep the implementation simple, the first failed event of an event hook is retried every so often: OutboxDrain will not progress to the next event until the failed event is delivered (TODO: figure out what to do if the event backlog grows too big).
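For illustration, a minimal sketch of this drain loop follows. Only OutboxDrain and OutboxLog are names from this PR; the storage API (fetchBatchAfter, advanceLogTo, lastProcessedId) is hypothetical:

```java
import java.util.List;
import java.util.function.Predicate;

/**
 * Minimal sketch of an at-least-once outbox drain loop. All names
 * except OutboxDrain/OutboxLog are hypothetical, for illustration only.
 */
class OutboxDrainSketch {

  record OutboxEvent(long id, String payload) {}

  interface OutboxStore {
    /** Events with id greater than the watermark, in id order. */
    List<OutboxEvent> fetchBatchAfter(long lastProcessedId, int batchSize);

    /** Record in the log table the id of the last delivered event. */
    void advanceLogTo(long eventId);

    /** Read the watermark back from the log table. */
    long lastProcessedId();
  }

  private final OutboxStore store;

  OutboxDrainSketch(OutboxStore store) {
    this.store = store;
  }

  /** One drain pass, meant to be invoked periodically by a scheduler. */
  void drainOnce(Predicate<OutboxEvent> deliver) {
    long watermark = store.lastProcessedId();
    for (OutboxEvent event : store.fetchBatchAfter(watermark, 100)) {
      if (!deliver.test(event)) {
        // Head-of-line retry: stop and retry this same event on the
        // next pass, preserving event order.
        return;
      }
      // Advancing the watermark only after a successful delivery is
      // what yields at-least-once semantics: a crash between delivery
      // and this update causes a redelivery, never a loss.
      store.advanceLogTo(event.id());
    }
  }
}
```

Because the watermark only advances after a successful delivery, a crash between delivering an event and updating the log redelivers that event, which is where the at-least-once (rather than exactly-once) guarantee comes from.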

Additionally, this PR:

  1. Replaces the thread-per-request model with a non-blocking event loop for HTTP requests (see the sketch after this list)
  2. Retires all event hook targets except web given the absence of test coverage
  3. Uses platform cache instead of adhoc cache to cache event hook targets
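
As a sketch of what the non-blocking model in point 1 could look like, Java's built-in java.net.http.HttpClient multiplexes many in-flight requests over a small pool via sendAsync rather than parking a thread per webhook call. This is an illustration only (the URL and payload handling are placeholders), not the code in this PR:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

class NonBlockingEmitSketch {

  // One shared client multiplexes all in-flight webhook calls instead
  // of dedicating a blocked thread to each request.
  private static final HttpClient CLIENT = HttpClient.newHttpClient();

  static CompletableFuture<Integer> emit(String targetUrl, String jsonPayload) {
    HttpRequest request = HttpRequest.newBuilder(URI.create(targetUrl))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
        .build();
    // sendAsync returns immediately; the status code is delivered on
    // the client's executor when the response arrives.
    return CLIENT.sendAsync(request, HttpResponse.BodyHandlers.discarding())
        .thenApply(HttpResponse::statusCode);
  }
}
```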

@cjmamo requested a review from a team on January 29, 2026 at 07:36
@cjmamo self-assigned this on Jan 29, 2026
@cjmamo requested a review from bobjolliffe on January 29, 2026 at 09:24
@cjmamo changed the title from "Build reliability into event hooks" to "DRAFT: Build reliability into event hooks" on Jan 29, 2026
@cjmamo requested a review from mortenoh on January 29, 2026 at 09:26
@cjmamo force-pushed the event-hook-reliability branch 4 times, most recently from b480f0a to a9457db, on February 2, 2026 at 12:19
@cjmamo force-pushed the event-hook-reliability branch 2 times, most recently from df03c06 to dc40561, on February 9, 2026 at 11:56
@cjmamo marked this pull request as ready for review on February 9, 2026 at 11:58
@cjmamo changed the title from "DRAFT: Build reliability into event hooks" to "Build reliability into event hooks" on Feb 9, 2026
@cjmamo force-pushed the event-hook-reliability branch from dc40561 to 70adbf5 on February 9, 2026 at 12:14
@cjmamo changed the title from "Build reliability into event hooks" to "DHIS2-20886: Build reliability into event hooks" on Feb 9, 2026
@cjmamo force-pushed the event-hook-reliability branch 5 times, most recently from c52d9e7 to d9b1818, on February 10, 2026 at 10:33
@cjmamo requested a review from a team on February 10, 2026 at 10:33
@cjmamo force-pushed the event-hook-reliability branch 3 times, most recently from 9ecd5a9 to fdfef43, on February 10, 2026 at 17:17
Build reliability into event hook delivery by (1) persisting events into a partitioned outbox table and (2) sourcing the events from this table thanks to the new OutboxDrain class. To keep the implementation simple, the first failed event of an event hook is retried every so often: OutboxDrain will not progress to the next event until the failed event is delivered (TODO: figure out what to do if the event backlog grows too big).

replace thread-per-request model with non-blocking event loop

use platform cache instead of adhoc cache to cache event hook targets

BREAKING CHANGE: delete all event hook targets except web given the absence of test coverage
@cjmamo force-pushed the event-hook-reliability branch from fdfef43 to 6643674 on February 10, 2026 at 17:19
@david-mackessy (Contributor) commented Feb 11, 2026

Firstly, nice diagrams showing the changes and aiming to improve this feature 👍
A few comments (keep in mind I am not familiar with this feature at all):

  1. there is a distinct lack of Javadocs on (i) new classes and (ii) new methods
    • Javadocs that give an overview of what something does and the decisions made really help when trying to understand the code
    • what it does / how it does it / why it does it (writer's discretion)
  2. it's not clear what an emitted event can look like; is there potential for duplicate messages?
    • consumers need to be idempotent
    • add an id to messages (an easy way for consumers to know if a message has already been processed)
  3. head-of-line blocking (I note the TODO for this)
    • a poison message can block the entire queue
    • backlog build-up
    • max retries?
    • any way to get around this? does it need manual intervention?
  4. the PR description notes:
    • it might not scale well ...
      • if this said it does not scale well ... with accompanying proof, that would certainly give more weight to the change
    • To keep the implementation simple
      • it looks far less simple than the previous implementation
      • not saying it's not better, it just looks far less simple
  5. is this feature viewed as a part of the system that is used at high volume by implementations?
    • reports of poor performance at high volume?
  6. is the change being accompanied by docs?
  7. what are the consequences for implementations that currently rely on kafka/jms hooks?
    • was this discussed & agreed to be dropped?
    • erodes implementations' trust in newly-added features (introduced in v40? removed in v43?)
    • not sure a lack of test coverage is a good enough reason on its own to remove something (a lot of the code base would be removed with that reasoning 😂, which might not be a bad thing)
  8. there is no mention of testing in the PR description
    • given that delivery of messages & potential performance issues seem to be the main reasons for the change, I would expect tests showing previous (poor behaviour) vs new (improved behaviour)

point to the next message to be processed instead of the last processed message

do not delete partitions when they contain undelivered events and prevent new partitions from being created when a threshold has been reached: warnings are logged when this state is reached

swallow exceptions that happen when persisting an outbox message so that the web request is not interrupted

have EventHookListener join the web request transaction in order to avoid a situation where the event is persisted but the request transaction was rolled back

rollback and skip failed event hooks in EventHookListener

rename OutboxLog to EventHookOutboxLog and corresponding table

fix: solved issue where handlers are added twice during reload in EventHookSer

test: rewritten EventHookListenerTest as an integration test
@cjmamo (Contributor, Author) commented Feb 12, 2026

it's not clear what an emitted event can look like; is there potential for duplicate messages?

  • consumers need to be idempotent
  • add an id to messages (an easy way for consumers to know if a message has already been processed)

Yes, there can be duplicates, which is by design. We can look into adding IDs such as logical timestamps at a later stage to help with de-duplication, but for now the plan is to make it clear in the docs that consumers should expect duplicates.
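
For illustration, here is a consumer-side de-duplication sketch. It assumes the consumer can derive some stable key from the payload; the eventKey parameter is hypothetical until proper IDs are added:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of an idempotent consumer that drops redeliveries. */
class IdempotentConsumerSketch {

  // Keys of messages already handled. A real consumer would persist
  // or bound this set rather than grow it in memory forever.
  private final Map<String, Boolean> seen = new ConcurrentHashMap<>();

  void onMessage(String eventKey, Runnable handler) {
    // putIfAbsent returns null only for the first delivery of a key,
    // so redeliveries from the at-least-once drain are ignored.
    if (seen.putIfAbsent(eventKey, Boolean.TRUE) == null) {
      handler.run();
    }
  }
}
```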

head-of-line blocking (I note the TODO for this)

  • a poison message can block the entire queue
  • backlog build-up
  • max retries?
  • any way to get around this? does it need manual intervention?

The last commit adds a change so that a warning is logged when there's a message that can't be delivered for one reason or another. The build-up is avoided by preventing new partitions from being created once a certain threshold is reached; of course, this means that new events will then be discarded. If possible, I'd like to avoid a max-retry limit, because dropping an event would get us in a pickle due to ordering (very relevant for Tracker).
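
A rough sketch of that safeguard; the threshold value and all names here are illustrative, not the actual implementation:

```java
import java.util.logging.Logger;

/** Sketch of the partition-rotation safeguard described above. */
class PartitionRotationSketch {

  private static final Logger LOG = Logger.getLogger("outbox");
  private static final int MAX_PARTITIONS = 10; // illustrative threshold

  interface PartitionStore {
    int partitionCount();
    boolean oldestPartitionFullyDelivered();
    void createPartition();
    void dropOldestPartition();
  }

  void rotate(PartitionStore store) {
    // Never drop a partition that still holds undelivered events.
    if (store.oldestPartitionFullyDelivered()) {
      store.dropOldestPartition();
    }
    if (store.partitionCount() >= MAX_PARTITIONS) {
      // Backlog cap reached: stop rotating in new partitions, which
      // means new events are discarded until the backlog drains.
      LOG.warning("Outbox partition threshold reached; new events will be discarded");
      return;
    }
    store.createPartition();
  }
}
```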

To keep the implementation simple
- it looks far less simple than the previous implementation
- not saying it's not better, it just looks far less simple

It's fair to say that implementing retries while also ensuring that the correct order is kept would have been much more complicated to get right.

is this feature viewed as a part of the system that is used at high volume by implementations?

  • reports of poor performance at high volume?

Possibly. When we start supporting event hooks for Tracker, we need to assume that there can be many creates and updates happening in a short amount of time.

is the change being accompanied by docs?

Yes. For sure, https://docs.dhis2.org/en/develop/using-the-api/dhis-core-version-master/event-hooks.html will need to be updated.

what are the consequences for implementations that currently rely on kafka/jms hooks?

  • was this discussed & agreed to be dropped?
  • erodes implementations' trust in newly-added features (introduced in v40? removed in v43?)
  • not sure a lack of test coverage is a good enough reason on its own to remove something (a lot of the code base would be removed with that reasoning 😂, which might not be a bad thing)

I discussed this with @KaiVandivier and @amcgee. I believe we are in agreement. There is a need for web hooks as I keep hearing from the field. The need for JMS/Kafka hooks is less apparent. Luckily for us, event hooks are totally broken in v42 (https://dhis2.atlassian.net/browse/DHIS2-20744) and I don't think we ever really promoted this feature.

there is no mention of testing in the PR description:

  • given that delivery of messages & potential performance issues seem to be the main reasons for the change, I would expect tests showing previous (poor behaviour) vs new (improved behaviour)

I've expanded test coverage, but yes, we can do better. Wrt performance, I'll explore how we can benchmark this, though that will be in a separate PR. In the meantime, I'll try to benchmark it manually.

add extra partitions during rotation as a safeguard