
Conversation

@enyst (Collaborator) commented Jan 30, 2026

Summary

  • Treat REST as the source of truth for remote events after a run finishes.
  • Add a post-run reconciliation loop to fetch missed events without depending on WebSocket timing.
  • Refactor remote event fetching to support paged incremental syncs (a minimal sketch follows below).
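
For orientation, here is a minimal sketch of what a paged incremental sync looks like. It assumes a hypothetical fetch_page helper standing in for the REST call against /events/search, and a {"items": ..., "next_page_id": ...} response shape inferred from the pagination discussion later in this thread; neither name is the actual client API.

def sync_events_from(fetch_page, page_id):
    """Fetch all events newer than page_id, following pagination cursors (sketch)."""
    events = []
    while True:
        # fetch_page is an illustrative stand-in for the REST client call
        data = fetch_page(page_id=page_id)
        # Pagination may be inclusive of the cursor event; drop it defensively
        events.extend(e for e in data.get("items", []) if e.get("id") != page_id)
        page_id = data.get("next_page_id")
        if not page_id:
            break
    return events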

Testing

  • uv run pytest tests/sdk/conversation/remote/test_remote_conversation.py
  • uv run pytest tests/cross/test_remote_conversation_live_server.py::test_post_run_reconcile_needed_under_ws_callback_lag



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:a0d22b0-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-a0d22b0-python \
  ghcr.io/openhands/agent-server:a0d22b0-python

All tags pushed for this build

ghcr.io/openhands/agent-server:a0d22b0-golang-amd64
ghcr.io/openhands/agent-server:a0d22b0-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:a0d22b0-golang-arm64
ghcr.io/openhands/agent-server:a0d22b0-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:a0d22b0-java-amd64
ghcr.io/openhands/agent-server:a0d22b0-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:a0d22b0-java-arm64
ghcr.io/openhands/agent-server:a0d22b0-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:a0d22b0-python-amd64
ghcr.io/openhands/agent-server:a0d22b0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:a0d22b0-python-arm64
ghcr.io/openhands/agent-server:a0d22b0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:a0d22b0-golang
ghcr.io/openhands/agent-server:a0d22b0-java
ghcr.io/openhands/agent-server:a0d22b0-python

About Multi-Architecture Support

  • Each variant tag (e.g., a0d22b0-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., a0d22b0-python-amd64) are also available if needed

@enyst enyst changed the title from "DRAFT: fix: reconcile remote events after run" to "fix: reconcile remote events after run" Jan 30, 2026
@enyst enyst marked this pull request as ready for review January 30, 2026 23:20
@all-hands-bot (Collaborator) left a comment

Solid refactoring of event reconciliation logic! Found a few important issues around error handling and edge cases that should be addressed before merge.

@smolpaws

🔴 /codereview-roasted: Mostly good, but watch the hidden costs

Taste Rating: 🟡 Acceptable — The core idea (REST is truth, WS is best-effort) is sound. The implementation is simple and direct. But you’re paying for it with extra REST churn and fuzzy completeness guarantees.


[CRITICAL ISSUES] (Must fix)

None. This doesn’t break userspace and it actually fixes the hang/WS race.


[IMPROVEMENT OPPORTUNITIES] (Should fix)

[remote_conversation.py, Lines ~901–923]
Reconcile loop cost: You always do a full REST reconcile + up to 5 paged reconciles. That’s potentially a lot of REST calls per run, even when WS delivered everything. It’s fine for correctness, but it’s not cheap.
Suggestion: consider a short-circuit (e.g., skip the full reconcile if no WS delay signals or if last_event_id didn’t move during polling).
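
A minimal sketch of that short-circuit, with illustrative names throughout (ws_lag_detected and last_event_id_at_run_start are made up for this sketch, not real SDK attributes):

def should_reconcile(conversation):
    # Reconcile if the WebSocket reported delivery lag during the run...
    if conversation.ws_lag_detected:  # illustrative flag
        return True
    # ...or if the server-side cursor advanced past the WS-delivered tail.
    last_id = conversation.state.events.get_last_event_id()
    return last_id != conversation.last_event_id_at_run_start  # illustrative snapshot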

[remote_conversation.py, Lines ~304–314]
Pagination ambiguity workaround: Dropping the page_id event by filtering is safe, but it assumes IDs are unique and that the server won’t return duplicates besides the cursor. That’s probably fine, but it’s still a “best effort” approach.
Suggestion: document the pagination semantics or add a helper to the server API so the client doesn’t guess.
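
The filter itself is a one-liner; a sketch of the defensive drop, where page_id is the cursor sent with the request and items is the returned page:

# Removes at most one event: the one whose id equals the cursor.
# Correct whether the server's pagination is inclusive or exclusive.
new_events = [e for e in items if e.get("id") != page_id]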


[STYLE NOTES]

[remote_conversation.py, Lines ~904–923]
The warning on max_cycles is good, but the threshold is still arbitrary. It would be cleaner to cap by elapsed time only (already tracked) and log when deadline hits, not just cycle count.


VERDICT: ✅ Worth merging

It removes the WS-only liveness dependency without forcing a redesign. It’s not elegant, but it’s pragmatic. Just keep an eye on REST overhead and make pagination semantics explicit in the long run.

Key insight: This is a correctness-first fix, not a throughput fix. That’s fine — but own the tradeoff.

@enyst (Collaborator, Author) commented Jan 31, 2026

@OpenHands Get yourself gh if you don't have it yet. Then read all comments and inline reviews and let's fix them.

Consider this first:

[remote_conversation.py, Lines ~901–923]
Reconcile loop cost: You always do a full REST reconcile + up to 5 paged reconciles. That’s potentially a lot of REST calls per run, even when WS delivered everything. It’s fine for correctness, but it’s not cheap.
Suggestion: consider a short-circuit (e.g., skip the full reconcile if no WS delay signals or if last_event_id didn’t move during polling).

Is this really so? A full REST reconcile plus up to 5 paged ones is unacceptable. Why not use last_event_id?

Note: history can have a lot of events, e.g. 60k events each up to 30k tokens!

Please reconsider and use the REST fallback carefully. We don't want to lose events, but it has to be best effort, while the resilience and robustness of the application (well, the SDK and any client app) are far more important. Note that the first ~10 and last ~10 events are relied upon, and we can have the rest as best as possible.

Then, once you solve this important problem, please look at the other issues raised in comments and inline threads. Note that they are raised by AIs, so they might be wrong; we need to validate each claim/suggestion. If you think something needs fixing, fix it; if not, just reply with your reasoning.

Commit (run precommit too) and push.

@enyst enyst marked this pull request as draft January 31, 2026 22:53
openhands-ai bot commented Jan 31, 2026

I'm on it! enyst can track my progress at all-hands.dev

openhands-ai bot commented Jan 31, 2026

Checklist review:

  • Addressed the REST reconcile cost concerns by switching to incremental (page_id) reconciliation after WS readiness and post-run, avoiding full-history pulls.
  • Kept pagination handling safe by explicitly dropping any returned page_id event (inclusive pagination).
  • Adjusted warnings to be time-based with a max-cycles safety guard.
  • Ran pre-commit and requested tests; committed and pushed.
  • Changes are concise and limited to remote_conversation.py.

Git status/diff:

  • Working tree clean; last commit modifies only remote_conversation.py with 22 insertions / 8 deletions.

Summary of new changes since last update:

  • Post-run reconciliation now uses page_id-only reconciliation (no initial full REST fetch), exits early on no new events, and logs deadline vs cycle-cap warnings.
  • WS-ready reconciliation now uses last_event_id (incremental) when available.

Already pushed to fix/ws-rest-event-sync.


openhands-ai bot commented Jan 31, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1865 at branch `fix/ws-rest-event-sync`

Feel free to include any additional details that might help me get this PR into a better state.


@smolpaws

WebSocket-first with REST backstop (post-run tail reconcile)

Below is a sequence view of the current flow. I’m calling it WebSocket-first with REST backstop because WS is the primary stream and REST only fills gaps when timing goes sideways.

sequenceDiagram
    participant Client
    participant Server
    participant WS as WebSocket
    participant REST as REST /events/search

    Client->>REST: Initial full sync (startup only)
    Client->>WS: Subscribe to events
    WS-->>Client: ConversationStateUpdateEvent (ready signal)
    Client->>REST: Reconcile from last_event_id (handshake gap)

    Note over Client,Server: Normal run
    Client->>Server: POST /run
    loop during run
        WS-->>Client: Event stream (Action/Observation/Message)
    end

    Server-->>Client: run completes (status polling)
    Client->>REST: Post-run reconcile from last_event_id (tail backstop)
    alt no new events
        Client-->>Client: exit quickly (no extra calls)
    else new events appear
        Client->>REST: additional page_id reconciles (bounded)
    end

Execution flow (blunt version)

  • WS is the main pipe. It’s fast and cheap but not perfectly reliable at the end of a run (WS callback lag can drop tail events).
  • REST is the backstop. We do one initial full sync at construction, then incremental reconciles using page_id (last_event_id) after WS is ready and after the run finishes.
  • Post-run reconcile is bounded: it stops when nothing new shows up, or after a short cycle cap / timeout (see the sketch below).
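
Put together, the bounded loop is roughly the following sketch. The assumption that reconcile returns the count of newly fetched events is illustrative, not the actual return type:

import time

def finalize_events_after_run(events, timeout, settle_interval=0.2, max_cycles=5):
    """Post-run tail backstop: stop on quiet, deadline, or cycle cap (sketch)."""
    deadline = time.monotonic() + timeout
    cycles = 0
    while time.monotonic() < deadline and cycles < max_cycles:
        cycles += 1
        new_count = events.reconcile(page_id=events.get_last_event_id())  # assumed return value
        if new_count == 0:
            return  # common case: WS delivered everything, exit quickly
        time.sleep(settle_interval)  # give late tail writes a moment to settle
    # Falling through means the deadline or cycle cap was hit (warned in the real code)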

Problem solved

  • We were missing “tail” events when the run finished before the WS callback delivered them. That caused incomplete event logs.
  • The new flow ensures we don’t drop tail events while keeping REST usage best‑effort and bounded.

Tradeoffs (also blunt)

  • Pros: No more WS tail race; REST is only used when necessary, and only incrementally.
  • Cons: Still extra REST calls after a run (but bounded). If the server has huge histories, REST reconcile can still be heavy if the WS missed a lot, but we’re no longer doing full-history pulls every time.
  • Reality check: This is a correctness/robustness fix, not a throughput fix. It’s cheap in the common case, but it can cost more when the WS misbehaves. That’s the price of not losing events.

@enyst enyst marked this pull request as ready for review February 1, 2026 00:15
@all-hands-bot (Collaborator) left a comment

Overall solid refactoring that improves event synchronization reliability. The incremental reconciliation approach is well-designed. Main concerns are around error handling transparency and edge case documentation.

"Performing reconciliation sync for conversation %s (page_id=%s)",
self._conversation_id,
page_id,
)

🟡 Suggestion: The ignore_errors=True here means reconciliation failures are silent. Since this is called during _finalize_events_after_run, consider whether you want to:

  • Add retry logic for transient failures
  • Distinguish between transient network errors and permanent failures
  • Surface errors to the caller somehow (return value, exception, callback); a sketch of one option follows
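
For example, a hedged sketch combining retries with surfacing the outcome; treating ConnectionError/TimeoutError as the transient class is an assumption, not the SDK's actual exception taxonomy:

import logging
import time

logger = logging.getLogger(__name__)

def reconcile_with_retry(reconcile, attempts=3, backoff=0.5):
    """Retry reconciliation on transient errors and report failure to the caller."""
    for attempt in range(1, attempts + 1):
        try:
            reconcile()
            return True
        except (ConnectionError, TimeoutError) as exc:  # assumed transient
            logger.warning("Reconcile attempt %d/%d failed: %s", attempt, attempts, exc)
            time.sleep(backoff * attempt)  # linear backoff between retries
    return False  # surfaced to the caller instead of silently swallowed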

Comment on lines 308 to +311

if not data.get("next_page_id"):
    break
page_id = data["next_page_id"]
if page_id:
    # The API may return the page_id event again depending on whether
    # pagination is inclusive. Drop it to avoid duplicates regardless

🟡 Suggestion: The comment indicates uncertainty about whether the API returns the page_id event (inclusive vs exclusive pagination). Consider:

  1. Document the actual API behavior - test and confirm whether pagination is inclusive/exclusive
  2. Add a test case that verifies this filtering works correctly for both scenarios
  3. If this uncertainty is due to API versioning, consider checking the API version or making the behavior explicit in API documentation

This filtering is safe (filters 0 or 1 events) but the uncertainty suggests potential for bugs if API behavior changes.

    max_cycles: int = 5,
) -> None:
    """Ensure REST-backed events are fully synced after run completion."""
    if timeout <= 0:

🟢 Nit: This early return when timeout <= 0 is reasonable, but consider:

  • Should we log a debug message here?
  • Should negative timeouts raise ValueError, or is silent no-op acceptable?
  • Document this behavior in the docstring (a sketch of both options follows)
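
Both options are cheap; a sketch showing them side by side, with logger assumed to be the module logger:

def finalize(timeout):
    """Sketch: timeout <= 0 is a documented no-op (or raise, per the stricter option)."""
    if timeout <= 0:
        # Option A: no-op, but leave a trace for debugging
        logger.debug("Skipping post-run reconcile (timeout=%.1fs)", timeout)
        return
        # Option B: raise ValueError("timeout must be positive") for negatives
    ...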

Comment on lines +928 to +937
    logger.warning(
        "Event reconciliation reached deadline (timeout=%.1fs); "
        "results may be incomplete",
        timeout,
    )
elif cycles >= max_cycles:
    logger.warning(
        "Event reconciliation hit max cycles (%d); results may be incomplete",
        max_cycles,
    )

🟡 Suggestion: The elif structure means only one warning is shown if both timeout AND max_cycles are hit simultaneously. This could be confusing for debugging. Consider:

Suggested change

Replace the elif chain:

    logger.warning(
        "Event reconciliation reached deadline (timeout=%.1fs); "
        "results may be incomplete",
        timeout,
    )
elif cycles >= max_cycles:
    logger.warning(
        "Event reconciliation hit max cycles (%d); results may be incomplete",
        max_cycles,
    )

with two independent checks:

if time.monotonic() >= deadline:
    logger.warning(
        "Event reconciliation reached deadline (timeout=%.1fs); "
        "results may be incomplete",
        timeout,
    )
if cycles >= max_cycles:
    logger.warning(
        "Event reconciliation hit max cycles (%d); results may be incomplete",
        max_cycles,
    )

This way both warnings are shown if both conditions are true, giving complete diagnostic information.

def _finalize_events_after_run(
    self,
    timeout: float,
    settle_interval: float = 0.2,

🟢 Nit: The hardcoded settle_interval=0.2 and max_cycles=5 values seem reasonable, but consider:

  • Making these class-level constants with descriptive names (e.g., DEFAULT_SETTLE_INTERVAL, MAX_RECONCILE_CYCLES)
  • This makes it easier to tune these values in tests or for different deployment scenarios
  • Could add an environment variable override for testing edge cases (see the sketch below)
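
A sketch of that shape; the environment variable names are made up for illustration:

import os

class RemoteConversation:  # illustrative placement only
    # Tunable defaults, overridable via env vars for tests or unusual deployments
    DEFAULT_SETTLE_INTERVAL = float(os.environ.get("OH_RECONCILE_SETTLE_INTERVAL", "0.2"))
    MAX_RECONCILE_CYCLES = int(os.environ.get("OH_RECONCILE_MAX_CYCLES", "5"))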

Comment on lines +752 to +756
last_event_id = self._state.events.get_last_event_id()
if last_event_id:
    self._state.events.reconcile(page_id=last_event_id)
else:
    self._state.events.reconcile()

✅ Nice: Good improvement to use incremental reconciliation from the last event ID instead of fetching all events again. This reduces unnecessary data transfer and processing.
