
Conversation

@enyst (Collaborator) commented Jan 30, 2026

Summary

  • Treat REST as the source of truth for remote events after a run finishes.
  • Add a post-run reconciliation loop to fetch missed events without depending on WebSocket timing.
  • Refactor remote event fetching to support paged incremental syncs (a minimal sketch follows below).
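
For orientation, here is a minimal sketch of what a paged incremental sync looks like. It assumes a hypothetical fetch_page helper standing in for the REST call against /events/search, and a {"items": ..., "next_page_id": ...} response shape inferred from the pagination discussion later in this thread; neither name is the actual client API.

def sync_events_from(fetch_page, page_id):
    """Fetch all events newer than page_id, following pagination cursors (sketch)."""
    events = []
    while True:
        # fetch_page is an illustrative stand-in for the REST client call
        data = fetch_page(page_id=page_id)
        # Pagination may be inclusive of the cursor event; drop it defensively
        events.extend(e for e in data.get("items", []) if e.get("id") != page_id)
        page_id = data.get("next_page_id")
        if not page_id:
            break
    return events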

Testing

  • uv run pytest tests/sdk/conversation/remote/test_remote_conversation.py
  • uv run pytest tests/cross/test_remote_conversation_live_server.py::test_post_run_reconcile_needed_under_ws_callback_lag



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:a0d22b0-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-a0d22b0-python \
  ghcr.io/openhands/agent-server:a0d22b0-python

All tags pushed for this build

ghcr.io/openhands/agent-server:a0d22b0-golang-amd64
ghcr.io/openhands/agent-server:a0d22b0-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:a0d22b0-golang-arm64
ghcr.io/openhands/agent-server:a0d22b0-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:a0d22b0-java-amd64
ghcr.io/openhands/agent-server:a0d22b0-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:a0d22b0-java-arm64
ghcr.io/openhands/agent-server:a0d22b0-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:a0d22b0-python-amd64
ghcr.io/openhands/agent-server:a0d22b0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:a0d22b0-python-arm64
ghcr.io/openhands/agent-server:a0d22b0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:a0d22b0-golang
ghcr.io/openhands/agent-server:a0d22b0-java
ghcr.io/openhands/agent-server:a0d22b0-python

About Multi-Architecture Support

  • Each variant tag (e.g., a0d22b0-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., a0d22b0-python-amd64) are also available if needed

@enyst enyst changed the title from "DRAFT: fix: reconcile remote events after run" to "fix: reconcile remote events after run" Jan 30, 2026
@enyst enyst marked this pull request as ready for review January 30, 2026 23:20
@all-hands-bot (Collaborator) left a comment

Solid refactoring of event reconciliation logic! Found a few important issues around error handling and edge cases that should be addressed before merge.

@smolpaws

🔴 /codereview-roasted: Mostly good, but watch the hidden costs

Taste Rating: 🟡 Acceptable — The core idea (REST is truth, WS is best-effort) is sound. The implementation is simple and direct. But you’re paying for it with extra REST churn and fuzzy completeness guarantees.


[CRITICAL ISSUES] (Must fix)

None. This doesn’t break userspace and it actually fixes the hang/WS race.


[IMPROVEMENT OPPORTUNITIES] (Should fix)

[remote_conversation.py, Lines ~901–923]
Reconcile loop cost: You always do a full REST reconcile + up to 5 paged reconciles. That’s potentially a lot of REST calls per run, even when WS delivered everything. It’s fine for correctness, but it’s not cheap.
Suggestion: consider a short-circuit (e.g., skip the full reconcile if no WS delay signals or if last_event_id didn’t move during polling).
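
A minimal sketch of that short-circuit, with illustrative names throughout (ws_lag_detected and last_event_id_at_run_start are made up for this sketch, not real SDK attributes):

def should_reconcile(conversation):
    # Reconcile if the WebSocket reported delivery lag during the run...
    if conversation.ws_lag_detected:  # illustrative flag
        return True
    # ...or if the server-side cursor advanced past the WS-delivered tail.
    last_id = conversation.state.events.get_last_event_id()
    return last_id != conversation.last_event_id_at_run_start  # illustrative snapshot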

[remote_conversation.py, Lines ~304–314]
Pagination ambiguity workaround: Dropping the page_id event by filtering is safe, but it assumes IDs are unique and that the server won’t return duplicates besides the cursor. That’s probably fine, but it’s still a “best effort” approach.
Suggestion: document the pagination semantics or add a helper to the server API so the client doesn’t guess.
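
The filter itself is a one-liner; a sketch of the defensive drop, where page_id is the cursor sent with the request and items is the returned page:

# Removes at most one event: the one whose id equals the cursor.
# Correct whether the server's pagination is inclusive or exclusive.
new_events = [e for e in items if e.get("id") != page_id]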


[STYLE NOTES]

[remote_conversation.py, Lines ~904–923]
The warning on max_cycles is good, but the threshold is still arbitrary. It would be cleaner to cap by elapsed time only (already tracked) and log when deadline hits, not just cycle count.


VERDICT: ✅ Worth merging

It removes the WS-only liveness dependency without forcing a redesign. It’s not elegant, but it’s pragmatic. Just keep an eye on REST overhead and make pagination semantics explicit in the long run.

Key insight: This is a correctness-first fix, not a throughput fix. That’s fine — but own the tradeoff.

@enyst (Collaborator, Author) commented Jan 31, 2026

@OpenHands Get yourself gh if you don't have it yet. Then read all comments and inline reviews and let's fix them.

Consider this first:

[remote_conversation.py, Lines ~901–923]
Reconcile loop cost: You always do a full REST reconcile + up to 5 paged reconciles. That’s potentially a lot of REST calls per run, even when WS delivered everything. It’s fine for correctness, but it’s not cheap.
Suggestion: consider a short-circuit (e.g., skip the full reconcile if no WS delay signals or if last_event_id didn’t move during polling).

Is this really so? A full REST reconcile plus up to 5 paged ones is unacceptable. Why not use last_event_id?

Note: history can have a lot of events, e.g. 60k events each up to 30k tokens!

Please reconsider and use the REST fallback carefully. We don't want to lose events, but it has to be best effort, while the resilience and robustness of the application (well, the SDK and any client app) are far more important. Note that the first ~10 and last ~10 events are relied upon, and we can have the rest as best as possible.

Then, once you solve this important problem, please look at the other issues raised in comments and inline threads. Note that they are raised by AIs, so they might be wrong; we need to validate each claim/suggestion. If you think something needs fixing, fix it; if not, just reply with your reasoning.

Commit (run precommit too) and push.

@enyst enyst marked this pull request as draft January 31, 2026 22:53
openhands-ai bot commented Jan 31, 2026

I'm on it! enyst can track my progress at all-hands.dev

openhands-ai bot commented Jan 31, 2026

Checklist review:

  • Addressed the REST reconcile cost concerns by switching to incremental (page_id) reconciliation after WS readiness and post-run, avoiding full-history pulls.
  • Kept pagination handling safe by explicitly dropping any returned page_id event (inclusive pagination).
  • Adjusted warnings to be time-based with a max-cycles safety guard.
  • Ran pre-commit and requested tests; committed and pushed.
  • Changes are concise and limited to remote_conversation.py.

Git status/diff:

  • Working tree clean; last commit modifies only remote_conversation.py with 22 insertions / 8 deletions.

Summary of new changes since last update:

  • Post-run reconciliation now uses page_id-only reconciliation (no initial full REST fetch), exits early on no new events, and logs deadline vs cycle-cap warnings.
  • WS-ready reconciliation now uses last_event_id (incremental) when available.

Already pushed to fix/ws-rest-event-sync.


openhands-ai bot commented Jan 31, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1865 at branch `fix/ws-rest-event-sync`

Feel free to include any additional details that might help me get this PR into a better state.


@smolpaws

WebSocket-first with REST backstop (post-run tail reconcile)

Below is a sequence view of the current flow. I’m calling it WebSocket-first with REST backstop because WS is the primary stream and REST only fills gaps when timing goes sideways.

sequenceDiagram
    participant Client
    participant Server
    participant WS as WebSocket
    participant REST as REST /events/search

    Client->>REST: Initial full sync (startup only)
    Client->>WS: Subscribe to events
    WS-->>Client: ConversationStateUpdateEvent (ready signal)
    Client->>REST: Reconcile from last_event_id (handshake gap)

    Note over Client,Server: Normal run
    Client->>Server: POST /run
    loop during run
        WS-->>Client: Event stream (Action/Observation/Message)
    end

    Server-->>Client: run completes (status polling)
    Client->>REST: Post-run reconcile from last_event_id (tail backstop)
    alt no new events
        Client-->>Client: exit quickly (no extra calls)
    else new events appear
        Client->>REST: additional page_id reconciles (bounded)
    end

Execution flow (blunt version)

  • WS is the main pipe. It’s fast and cheap but not perfectly reliable at the end of a run (WS callback lag can drop tail events).
  • REST is the backstop. We do one initial full sync at construction, then incremental reconciles using page_id (last_event_id) after WS is ready and after the run finishes.
  • Post-run reconcile is bounded: it stops when nothing new shows up, or after a short cycle cap / timeout (see the sketch below).
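
Put together, the bounded loop is roughly the following sketch. The assumption that reconcile returns the count of newly fetched events is illustrative, not the actual return type:

import time

def finalize_events_after_run(events, timeout, settle_interval=0.2, max_cycles=5):
    """Post-run tail backstop: stop on quiet, deadline, or cycle cap (sketch)."""
    deadline = time.monotonic() + timeout
    cycles = 0
    while time.monotonic() < deadline and cycles < max_cycles:
        cycles += 1
        new_count = events.reconcile(page_id=events.get_last_event_id())  # assumed return value
        if new_count == 0:
            return  # common case: WS delivered everything, exit quickly
        time.sleep(settle_interval)  # give late tail writes a moment to settle
    # Falling through means the deadline or cycle cap was hit (warned in the real code)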

Problem solved

  • We were missing “tail” events when the run finished before the WS callback delivered them. That caused incomplete event logs.
  • The new flow ensures we don’t drop tail events while keeping REST usage best‑effort and bounded.

Tradeoffs (also blunt)

  • Pros: No more WS tail race; REST is only used when necessary, and only incrementally.
  • Cons: Still extra REST calls after a run (but bounded). If the server has huge histories, REST reconcile can still be heavy if the WS missed a lot, but we’re no longer doing full-history pulls every time.
  • Reality check: This is a correctness/robustness fix, not a throughput fix. It’s cheap in the common case, but it can cost more when the WS misbehaves. That’s the price of not losing events.

@enyst enyst marked this pull request as ready for review February 1, 2026 00:15
@all-hands-bot (Collaborator) left a comment

Overall solid refactoring that improves event synchronization reliability. The incremental reconciliation approach is well-designed. Main concerns are around error handling transparency and edge case documentation.

"Performing reconciliation sync for conversation %s (page_id=%s)",
self._conversation_id,
page_id,
)

🟡 Suggestion: The ignore_errors=True here means reconciliation failures are silent. Since this is called during _finalize_events_after_run, consider whether you want to:

  • Add retry logic for transient failures
  • Distinguish between transient network errors and permanent failures
  • Surface errors to the caller somehow (return value, exception, callback); a sketch of one option follows
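
For example, a hedged sketch combining retries with surfacing the outcome; treating ConnectionError/TimeoutError as the transient class is an assumption, not the SDK's actual exception taxonomy:

import logging
import time

logger = logging.getLogger(__name__)

def reconcile_with_retry(reconcile, attempts=3, backoff=0.5):
    """Retry reconciliation on transient errors and report failure to the caller."""
    for attempt in range(1, attempts + 1):
        try:
            reconcile()
            return True
        except (ConnectionError, TimeoutError) as exc:  # assumed transient
            logger.warning("Reconcile attempt %d/%d failed: %s", attempt, attempts, exc)
            time.sleep(backoff * attempt)  # linear backoff between retries
    return False  # surfaced to the caller instead of silently swallowed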

Comment on lines 308 to +311

if not data.get("next_page_id"):
    break
page_id = data["next_page_id"]
if page_id:
    # The API may return the page_id event again depending on whether
    # pagination is inclusive. Drop it to avoid duplicates regardless

🟡 Suggestion: The comment indicates uncertainty about whether the API returns the page_id event (inclusive vs exclusive pagination). Consider:

  1. Document the actual API behavior - test and confirm whether pagination is inclusive/exclusive
  2. Add a test case that verifies this filtering works correctly for both scenarios
  3. If this uncertainty is due to API versioning, consider checking the API version or making the behavior explicit in API documentation

This filtering is safe (filters 0 or 1 events) but the uncertainty suggests potential for bugs if API behavior changes.

    max_cycles: int = 5,
) -> None:
    """Ensure REST-backed events are fully synced after run completion."""
    if timeout <= 0:

🟢 Nit: This early return when timeout <= 0 is reasonable, but consider:

  • Should we log a debug message here?
  • Should negative timeouts raise ValueError, or is silent no-op acceptable?
  • Document this behavior in the docstring (a sketch of both options follows)
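
Both options are cheap; a sketch showing them side by side, with logger assumed to be the module logger:

def finalize(timeout):
    """Sketch: timeout <= 0 is a documented no-op (or raise, per the stricter option)."""
    if timeout <= 0:
        # Option A: no-op, but leave a trace for debugging
        logger.debug("Skipping post-run reconcile (timeout=%.1fs)", timeout)
        return
        # Option B: raise ValueError("timeout must be positive") for negatives
    ...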

Comment on lines +928 to +937
    logger.warning(
        "Event reconciliation reached deadline (timeout=%.1fs); "
        "results may be incomplete",
        timeout,
    )
elif cycles >= max_cycles:
    logger.warning(
        "Event reconciliation hit max cycles (%d); results may be incomplete",
        max_cycles,
    )

🟡 Suggestion: The elif structure means only one warning is shown if both timeout AND max_cycles are hit simultaneously. This could be confusing for debugging. Consider:

Suggested change

Replace the elif chain:

    logger.warning(
        "Event reconciliation reached deadline (timeout=%.1fs); "
        "results may be incomplete",
        timeout,
    )
elif cycles >= max_cycles:
    logger.warning(
        "Event reconciliation hit max cycles (%d); results may be incomplete",
        max_cycles,
    )

with two independent checks:

if time.monotonic() >= deadline:
    logger.warning(
        "Event reconciliation reached deadline (timeout=%.1fs); "
        "results may be incomplete",
        timeout,
    )
if cycles >= max_cycles:
    logger.warning(
        "Event reconciliation hit max cycles (%d); results may be incomplete",
        max_cycles,
    )

This way both warnings are shown if both conditions are true, giving complete diagnostic information.

def _finalize_events_after_run(
    self,
    timeout: float,
    settle_interval: float = 0.2,

🟢 Nit: The hardcoded settle_interval=0.2 and max_cycles=5 values seem reasonable, but consider:

  • Making these class-level constants with descriptive names (e.g., DEFAULT_SETTLE_INTERVAL, MAX_RECONCILE_CYCLES)
  • This makes it easier to tune these values in tests or for different deployment scenarios
  • Could add an environment variable override for testing edge cases (see the sketch below)
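
A sketch of that shape; the environment variable names are made up for illustration:

import os

class RemoteConversation:  # illustrative placement only
    # Tunable defaults, overridable via env vars for tests or unusual deployments
    DEFAULT_SETTLE_INTERVAL = float(os.environ.get("OH_RECONCILE_SETTLE_INTERVAL", "0.2"))
    MAX_RECONCILE_CYCLES = int(os.environ.get("OH_RECONCILE_MAX_CYCLES", "5"))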

Comment on lines +752 to +756
last_event_id = self._state.events.get_last_event_id()
if last_event_id:
    self._state.events.reconcile(page_id=last_event_id)
else:
    self._state.events.reconcile()

✅ Nice: Good improvement to use incremental reconciliation from the last event ID instead of fetching all events again. This reduces unnecessary data transfer and processing.
