Fix unhandled ConversationRunError causing non-JSON logs #1680

neubig · 2026-01-09T21:31:50Z

Problem

Non-JSON formatted logs were appearing in Datadog for service:agent-server. Investigation revealed that exceptions from conversation.run() were not being caught, causing Python's default exception handler to print tracebacks directly to stderr as plain text, bypassing the JSON logger.

Example of the problematic logs (each line appearing as a separate log entry with Logger: unknown):

Traceback (most recent call last):
File "openhands/sdk/conversation/impl/local_conversation.py", line 411, in run
...
openhands.sdk.conversation.exceptions.ConversationRunError: Conversation run failed for id=...

Root Cause

Two places in event_service.py were calling conversation.run() without proper exception handling:

Line 314 (send_message): loop.run_in_executor(None, self._conversation.run) - fire-and-forget call with NO exception handling
Lines 499-500 (run method): Had a try/except but used logger.error(f"...") which doesn't include the traceback

Changes

event_service.py send_message(): Wrapped the fire-and-forget run_in_executor call in an async task with exception handling. This preserves the original fire-and-forget behavior while adding proper error logging.
event_service.py run(): Changed logger.error(f"Error during conversation run: {e}") to logger.exception("Error during conversation run") to include the full traceback in structured JSON logs.
Updated test: Modified test_send_message_with_run_true_agent_idle to work with the new async wrapper pattern.

Result

After this fix:

All exceptions from conversation.run() will be caught and logged through the JSON logger
Full tracebacks will be included in the structured log output
The original fire-and-forget behavior is preserved (no behavioral changes)

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.12-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:aee005c-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-aee005c-python \
  ghcr.io/openhands/agent-server:aee005c-python

All tags pushed for this build

ghcr.io/openhands/agent-server:aee005c-golang-amd64
ghcr.io/openhands/agent-server:aee005c-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:aee005c-golang-arm64
ghcr.io/openhands/agent-server:aee005c-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:aee005c-java-amd64
ghcr.io/openhands/agent-server:aee005c-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:aee005c-java-arm64
ghcr.io/openhands/agent-server:aee005c-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:aee005c-python-amd64
ghcr.io/openhands/agent-server:aee005c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:aee005c-python-arm64
ghcr.io/openhands/agent-server:aee005c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:aee005c-golang
ghcr.io/openhands/agent-server:aee005c-java
ghcr.io/openhands/agent-server:aee005c-python

About Multi-Architecture Support

Each variant tag (e.g., aee005c-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., aee005c-python-amd64) are also available if needed

The core issue was that ConversationRunError exceptions were not being caught, causing Python's default exception handler to print tracebacks directly to stderr as plain text, bypassing the JSON logger. Changes: 1. event_service.py line 314: Changed from fire-and-forget loop.run_in_executor() to await self.run() which has proper exception handling 2. event_service.py line 499-500: Changed logger.error() to logger.exception() to include full traceback in structured logs 3. Updated test to match new send_message behavior Co-authored-by: openhands <[email protected]>

github-actions · 2026-01-09T21:41:42Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-agent-server/openhands/agent_server
event_service.py	320	164	48%	55–56, 75–77, 81–86, 89–92, 107, 123, 127, 131–132, 139, 141, 148–149, 157–160, 167–169, 186, 210–211, 214–215, 217–219, 221, 226, 229–230, 233–235, 238, 242–244, 246, 248, 259–262, 275–276, 279–280, 283, 286–288, 291–292, 295–296, 300, 303, 307, 311–312, 314, 316–320, 328, 345–346, 363, 365, 369–371, 375, 384–385, 387, 391, 397, 399, 407–412, 461, 463–466, 475, 491, 498, 502, 513–514, 524–527, 529–530, 534, 536, 540–543, 548–550, 552, 556–559, 563–566, 574–577, 596–597, 599–606, 608–609, 618–619, 621–622, 629–630, 632–633, 637, 643, 653–654, 661
TOTAL	14843	7116	52%

The core issue was that exceptions from conversation.run() were not being caught, causing Python's default exception handler to print tracebacks directly to stderr as plain text, bypassing the JSON logger. Changes: 1. event_service.py send_message(): Wrapped the fire-and-forget run_in_executor call in an async task with exception handling. Uses loop.create_task() to ensure the task runs on the same event loop. This preserves the original fire-and-forget behavior while adding proper error logging. 2. event_service.py run(): Changed logger.error() to logger.exception() to include full traceback in structured logs. 3. Updated test to work with the new async wrapper pattern. Co-authored-by: openhands <[email protected]>

all-hands-bot

The exception handling improvements look good overall, but there are a few concerns around task tracking, test reliability, and test coverage. See inline comments for details.

all-hands-bot · 2026-01-09T21:55:41Z

openhands-agent-server/openhands/agent_server/event_service.py

+                except Exception:
+                    logger.exception("Error during conversation run from send_message")
+
+            loop.create_task(_run_with_error_handling())


Issue: Untracked fire-and-forget task

The task created here is not stored anywhere, unlike the similar pattern in the run() method (line 515) which stores it in self._run_task. This creates several issues:

Cannot wait for this task during shutdown (see close() method at line 575)

Task could be cancelled during service teardown without proper cleanup

No visibility into whether the task completed or failed

No way to prevent multiple concurrent runs from send_message() (vs. the check at line 495 for run())

Suggestion: Consider either:

Store these tasks in a set/list (e.g., self._background_tasks) and clean them up in close()

At minimum, add a comment explaining why this is intentionally fire-and-forget without tracking

Let's go with 2.

all-hands-bot · 2026-01-09T21:55:41Z

tests/agent_server/test_event_service.py

-            # Verify run was called via executor since agent is idle
-            mock_loop.run_in_executor.assert_any_call(None, conversation.run)
+        # Give the background task a chance to run
+        await asyncio.sleep(0.01)


Issue: Flaky timing-based test

Using asyncio.sleep(0.01) creates a timing-dependent test that could be flaky on slow CI systems or under heavy load. 10ms may not be enough time for the executor to run the task on an overloaded system.

Suggestion: Consider a more deterministic approach, such as mocking loop.create_task to capture and directly await the created task, or using a longer timeout with asyncio.wait_for().

No mocking, but if you think that wait_for is a better option go for it.

all-hands-bot · 2026-01-09T21:55:41Z

tests/agent_server/test_event_service.py

            assert mock_loop.run_in_executor.call_count == 1  # Only send_message call

    @pytest.mark.asyncio
    async def test_send_message_with_run_true_agent_idle(self, event_service):


Issue: Missing exception handling test

This test does not verify that exceptions from conversation.run() are properly caught and logged. There's a similar test at lines 1105-1125 that tests the run() method's exception handling, but no equivalent for send_message().

Suggestion: Add a test case where conversation.run.side_effect = RuntimeError("Test error") to verify that:

The exception is caught

The exception is logged via logger.exception()

The service continues functioning after the error

Yes, check that please.

neubig · 2026-01-10T01:20:11Z

@OpenHands see the review comments and my follow ups, revise, update the PR, and make sure that CI passes.

openhands-ai · 2026-01-10T01:20:19Z

I'm on it! neubig can track my progress at all-hands.dev

- Add comment explaining fire-and-forget pattern for send_message() task - Replace asyncio.sleep with asyncio.wait_for for more deterministic testing - Add test for exception handling in send_message() with run=True Co-authored-by: openhands <[email protected]>

openhands-ai · 2026-01-10T01:24:05Z

I've addressed all the PR review comments and pushed the changes to the branch. Here's a summary of what was done:

Changes Made (PR #1680)

1. ✅ Added comment explaining fire-and-forget pattern

File: openhands-agent-server/openhands/agent_server/event_service.py

Added a comment above loop.create_task(_run_with_error_handling()) explaining why the task is intentionally not tracked (per neubig's choice of option 2):

# Fire-and-forget: This task is intentionally not tracked because
# send_message() is designed to return immediately after queuing the
# message. The conversation run happens in the background and any
# errors are logged. Unlike the run() method which is explicitly
# awaited, this pattern allows clients to send messages without
# blocking on the full conversation execution.

2. ✅ Improved test timing with asyncio.wait_for

File: tests/agent_server/test_event_service.py

Replaced the flaky asyncio.sleep(0.01) with a more deterministic polling approach using asyncio.wait_for():

async def wait_for_run_called():
    while not conversation.run.called:
        await asyncio.sleep(0.001)

await asyncio.wait_for(wait_for_run_called(), timeout=1.0)

3. ✅ Added exception handling test for send_message()

File: tests/agent_server/test_event_service.py

Added new test test_send_message_with_run_true_logs_exception that verifies:

Exceptions from conversation.run() are caught
Exceptions are logged via logger.exception()
The service continues functioning after the error

Verification

All 57 tests in test_event_service.py pass
Pre-commit hooks (ruff format, ruff lint, pycodestyle, pyright) all pass
Changes have been pushed to origin/fix-conversation-run-exception-handling

The PR is now updated and ready for CI to run.

View full conversation

neubig force-pushed the fix-conversation-run-exception-handling branch from f516fe1 to 79c564b Compare January 9, 2026 21:39

neubig force-pushed the fix-conversation-run-exception-handling branch from 3ad7729 to dea903e Compare January 9, 2026 21:48

Merge branch 'main' into fix-conversation-run-exception-handling

2d83402

neubig marked this pull request as ready for review January 9, 2026 21:52

all-hands-bot reviewed Jan 9, 2026

View reviewed changes

neubig requested a review from tofarr January 14, 2026 13:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix unhandled ConversationRunError causing non-JSON logs #1680

Fix unhandled ConversationRunError causing non-JSON logs #1680

Uh oh!

neubig commented Jan 9, 2026 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Jan 9, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot Jan 9, 2026

Uh oh!

neubig Jan 10, 2026

Uh oh!

all-hands-bot Jan 9, 2026

Uh oh!

neubig Jan 10, 2026

Uh oh!

all-hands-bot Jan 9, 2026

Uh oh!

neubig Jan 10, 2026

Uh oh!

neubig commented Jan 10, 2026

Uh oh!

openhands-ai bot commented Jan 10, 2026

Uh oh!

openhands-ai bot commented Jan 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fix unhandled ConversationRunError causing non-JSON logs #1680

Are you sure you want to change the base?

Fix unhandled ConversationRunError causing non-JSON logs #1680

Uh oh!

Conversation

neubig commented Jan 9, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Cause

Changes

Result

Uh oh!

github-actions bot commented Jan 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

all-hands-bot Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

neubig Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

neubig Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

neubig Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

neubig commented Jan 10, 2026

Uh oh!

openhands-ai bot commented Jan 10, 2026

Uh oh!

openhands-ai bot commented Jan 10, 2026

Changes Made (PR #1680)

1. ✅ Added comment explaining fire-and-forget pattern

2. ✅ Improved test timing with asyncio.wait_for

3. ✅ Added exception handling test for send_message()

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

neubig commented Jan 9, 2026 •

edited by github-actions bot

Loading

github-actions bot commented Jan 9, 2026 •

edited

Loading