-
Notifications
You must be signed in to change notification settings - Fork 130
fix: reconcile remote events after run #1865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solid refactoring of event reconciliation logic! Found a few important issues around error handling and edge cases that should be addressed before merge.
openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py
Outdated
Show resolved
Hide resolved
🔴 /codereview-roasted: Mostly good, but watch the hidden costsTaste Rating: 🟡 Acceptable — The core idea (REST is truth, WS is best-effort) is sound. The implementation is simple and direct. But you’re paying for it with extra REST churn and fuzzy completeness guarantees. [CRITICAL ISSUES] (Must fix)None. This doesn’t break userspace and it actually fixes the hang/WS race. [IMPROVEMENT OPPORTUNITIES] (Should fix)[remote_conversation.py, Lines ~901–923] [remote_conversation.py, Lines ~304–314] [STYLE NOTES][remote_conversation.py, Lines ~904–923] VERDICT: ✅ Worth mergingIt removes the WS-only liveness dependency without forcing a redesign. It’s not elegant, but it’s pragmatic. Just keep an eye on REST overhead and make pagination semantics explicit in the long run. Key insight: This is a correctness-first fix, not a throughput fix. That’s fine — but own the tradeoff. |
|
@OpenHands Get yourself Consider this first:
Is this really so? A full REST reconcile plus 5 paged is unacceptable. Why not use last_event_id ? Note: history can have a lot of events, e.g. 60k events each up to 30k tokens! Please reconsider and use carefully the REST fallback. We don't want to lose events, but it has to be best effort, while resilience and robustness of the application (well the sdk and any client app) is far more important. Note that first ~10 and last ~10 events are relied upon, and we can have the rest as best as possible. Then, once you solve this important problem, please look at the other issues raised in comments and inline threads. Note that they are raised by AIs, so they might be wrong, we need to validate each claim/suggestion. If you think it needs fixed, fix it, if not just reply with your reasoning. Commit (run precommit too) and push. |
|
I'm on it! enyst can track my progress at all-hands.dev |
Co-authored-by: openhands <[email protected]>
|
Checklist review:
Git status/diff:
Summary of new changes since last update:
Already pushed to |
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
WebSocket-first with REST backstop (post-run tail reconcile)Below is a sequence view of the current flow. I’m calling it WebSocket-first with REST backstop because WS is the primary stream and REST only fills gaps when timing goes sideways. sequenceDiagram
participant Client
participant Server
participant WS as WebSocket
participant REST as REST /events/search
Client->>REST: Initial full sync (startup only)
Client->>WS: Subscribe to events
WS-->>Client: ConversationStateUpdateEvent (ready signal)
Client->>REST: Reconcile from last_event_id (handshake gap)
Note over Client,Server: Normal run
Client->>Server: POST /run
loop during run
WS-->>Client: Event stream (Action/Observation/Message)
end
Server-->>Client: run completes (status polling)
Client->>REST: Post-run reconcile from last_event_id (tail backstop)
alt no new events
Client-->>Client: exit quickly (no extra calls)
else new events appear
Client->>REST: additional page_id reconciles (bounded)
end
Execution flow (blunt version)
Problem solved
Tradeoffs (also blunt)
|
all-hands-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall solid refactoring that improves event synchronization reliability. The incremental reconciliation approach is well-designed. Main concerns are around error handling transparency and edge case documentation.
| "Performing reconciliation sync for conversation %s (page_id=%s)", | ||
| self._conversation_id, | ||
| page_id, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 Suggestion: The ignore_errors=True here means reconciliation failures are silent. Since this is called during _finalize_events_after_run, consider whether you want to:
- Add retry logic for transient failures
- Distinguish between transient network errors and permanent failures
- Surface errors to the caller somehow (return value, exception, callback)
|
|
||
| if not data.get("next_page_id"): | ||
| break | ||
| page_id = data["next_page_id"] | ||
| if page_id: | ||
| # The API may return the page_id event again depending on whether | ||
| # pagination is inclusive. Drop it to avoid duplicates regardless |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 Suggestion: The comment indicates uncertainty about whether the API returns the page_id event (inclusive vs exclusive pagination). Consider:
- Document the actual API behavior - test and confirm whether pagination is inclusive/exclusive
- Add a test case that verifies this filtering works correctly for both scenarios
- If this uncertainty is due to API versioning, consider checking the API version or making the behavior explicit in API documentation
This filtering is safe (filters 0 or 1 events) but the uncertainty suggests potential for bugs if API behavior changes.
| max_cycles: int = 5, | ||
| ) -> None: | ||
| """Ensure REST-backed events are fully synced after run completion.""" | ||
| if timeout <= 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟢 Nit: This early return when timeout <= 0 is reasonable, but consider:
- Should we log a debug message here?
- Should negative timeouts raise ValueError, or is silent no-op acceptable?
- Document this behavior in the docstring
| logger.warning( | ||
| "Event reconciliation reached deadline (timeout=%.1fs); " | ||
| "results may be incomplete", | ||
| timeout, | ||
| ) | ||
| elif cycles >= max_cycles: | ||
| logger.warning( | ||
| "Event reconciliation hit max cycles (%d); results may be incomplete", | ||
| max_cycles, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 Suggestion: The elif structure means only one warning is shown if both timeout AND max_cycles are hit simultaneously. This could be confusing for debugging. Consider:
| logger.warning( | |
| "Event reconciliation reached deadline (timeout=%.1fs); " | |
| "results may be incomplete", | |
| timeout, | |
| ) | |
| elif cycles >= max_cycles: | |
| logger.warning( | |
| "Event reconciliation hit max cycles (%d); results may be incomplete", | |
| max_cycles, | |
| ) | |
| if time.monotonic() >= deadline: | |
| logger.warning( | |
| "Event reconciliation reached deadline (timeout=%.1fs); " | |
| "results may be incomplete", | |
| timeout, | |
| ) | |
| if cycles >= max_cycles: | |
| logger.warning( | |
| "Event reconciliation hit max cycles (%d); results may be incomplete", | |
| max_cycles, | |
| ) |
This way both warnings are shown if both conditions are true, giving complete diagnostic information.
| def _finalize_events_after_run( | ||
| self, | ||
| timeout: float, | ||
| settle_interval: float = 0.2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟢 Nit: The hardcoded settle_interval=0.2 and max_cycles=5 values seem reasonable, but consider:
- Making these class-level constants with descriptive names (e.g.,
DEFAULT_SETTLE_INTERVAL,MAX_RECONCILE_CYCLES) - This makes it easier to tune these values in tests or for different deployment scenarios
- Could add an environment variable override for testing edge cases
| last_event_id = self._state.events.get_last_event_id() | ||
| if last_event_id: | ||
| self._state.events.reconcile(page_id=last_event_id) | ||
| else: | ||
| self._state.events.reconcile() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Nice: Good improvement to use incremental reconciliation from the last event ID instead of fetching all events again. This reduces unnecessary data transfer and processing.
Summary
Testing
uv run pytest tests/sdk/conversation/remote/test_remote_conversation.pyuv run pytest tests/cross/test_remote_conversation_live_server.py::test_post_run_reconcile_needed_under_ws_callback_lag@enyst can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:a0d22b0-pythonRun
All tags pushed for this build
About Multi-Architecture Support
a0d22b0-python) is a multi-arch manifest supporting both amd64 and arm64a0d22b0-python-amd64) are also available if needed