-
Notifications
You must be signed in to change notification settings - Fork 109
Fix unhandled ConversationRunError causing non-JSON logs #1680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The core issue was that ConversationRunError exceptions were not being caught, causing Python's default exception handler to print tracebacks directly to stderr as plain text, bypassing the JSON logger. Changes: 1. event_service.py line 314: Changed from fire-and-forget loop.run_in_executor() to await self.run() which has proper exception handling 2. event_service.py line 499-500: Changed logger.error() to logger.exception() to include full traceback in structured logs 3. Updated test to match new send_message behavior Co-authored-by: openhands <[email protected]>
f516fe1 to
79c564b
Compare
Coverage Report •
|
||||||||||||||||||||
The core issue was that exceptions from conversation.run() were not being caught, causing Python's default exception handler to print tracebacks directly to stderr as plain text, bypassing the JSON logger. Changes: 1. event_service.py send_message(): Wrapped the fire-and-forget run_in_executor call in an async task with exception handling. Uses loop.create_task() to ensure the task runs on the same event loop. This preserves the original fire-and-forget behavior while adding proper error logging. 2. event_service.py run(): Changed logger.error() to logger.exception() to include full traceback in structured logs. 3. Updated test to work with the new async wrapper pattern. Co-authored-by: openhands <[email protected]>
3ad7729 to
dea903e
Compare
all-hands-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception handling improvements look good overall, but there are a few concerns around task tracking, test reliability, and test coverage. See inline comments for details.
| except Exception: | ||
| logger.exception("Error during conversation run from send_message") | ||
|
|
||
| loop.create_task(_run_with_error_handling()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: Untracked fire-and-forget task
The task created here is not stored anywhere, unlike the similar pattern in the run() method (line 515) which stores it in self._run_task. This creates several issues:
- Cannot wait for this task during shutdown (see
close()method at line 575) - Task could be cancelled during service teardown without proper cleanup
- No visibility into whether the task completed or failed
- No way to prevent multiple concurrent runs from
send_message()(vs. the check at line 495 forrun())
Suggestion: Consider either:
- Store these tasks in a set/list (e.g.,
self._background_tasks) and clean them up inclose() - At minimum, add a comment explaining why this is intentionally fire-and-forget without tracking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with 2.
| # Verify run was called via executor since agent is idle | ||
| mock_loop.run_in_executor.assert_any_call(None, conversation.run) | ||
| # Give the background task a chance to run | ||
| await asyncio.sleep(0.01) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: Flaky timing-based test
Using asyncio.sleep(0.01) creates a timing-dependent test that could be flaky on slow CI systems or under heavy load. 10ms may not be enough time for the executor to run the task on an overloaded system.
Suggestion: Consider a more deterministic approach, such as mocking loop.create_task to capture and directly await the created task, or using a longer timeout with asyncio.wait_for().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No mocking, but if you think that wait_for is a better option go for it.
| assert mock_loop.run_in_executor.call_count == 1 # Only send_message call | ||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_send_message_with_run_true_agent_idle(self, event_service): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: Missing exception handling test
This test does not verify that exceptions from conversation.run() are properly caught and logged. There's a similar test at lines 1105-1125 that tests the run() method's exception handling, but no equivalent for send_message().
Suggestion: Add a test case where conversation.run.side_effect = RuntimeError("Test error") to verify that:
- The exception is caught
- The exception is logged via
logger.exception() - The service continues functioning after the error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, check that please.
|
@OpenHands see the review comments and my follow ups, revise, update the PR, and make sure that CI passes. |
|
I'm on it! neubig can track my progress at all-hands.dev |
- Add comment explaining fire-and-forget pattern for send_message() task - Replace asyncio.sleep with asyncio.wait_for for more deterministic testing - Add test for exception handling in send_message() with run=True Co-authored-by: openhands <[email protected]>
|
I've addressed all the PR review comments and pushed the changes to the branch. Here's a summary of what was done: Changes Made (PR #1680)1. ✅ Added comment explaining fire-and-forget patternFile: Added a comment above # Fire-and-forget: This task is intentionally not tracked because
# send_message() is designed to return immediately after queuing the
# message. The conversation run happens in the background and any
# errors are logged. Unlike the run() method which is explicitly
# awaited, this pattern allows clients to send messages without
# blocking on the full conversation execution.2. ✅ Improved test timing with asyncio.wait_forFile: Replaced the flaky async def wait_for_run_called():
while not conversation.run.called:
await asyncio.sleep(0.001)
await asyncio.wait_for(wait_for_run_called(), timeout=1.0)3. ✅ Added exception handling test for send_message()File: Added new test
Verification
The PR is now updated and ready for CI to run. |
Problem
Non-JSON formatted logs were appearing in Datadog for
service:agent-server. Investigation revealed that exceptions fromconversation.run()were not being caught, causing Python's default exception handler to print tracebacks directly to stderr as plain text, bypassing the JSON logger.Example of the problematic logs (each line appearing as a separate log entry with
Logger: unknown):Root Cause
Two places in
event_service.pywere callingconversation.run()without proper exception handling:loop.run_in_executor(None, self._conversation.run)- fire-and-forget call with NO exception handlinglogger.error(f"...")which doesn't include the tracebackChanges
event_service.py send_message(): Wrapped the fire-and-forget
run_in_executorcall in an async task with exception handling. This preserves the original fire-and-forget behavior while adding proper error logging.event_service.py run(): Changed
logger.error(f"Error during conversation run: {e}")tologger.exception("Error during conversation run")to include the full traceback in structured JSON logs.Updated test: Modified
test_send_message_with_run_true_agent_idleto work with the new async wrapper pattern.Result
After this fix:
conversation.run()will be caught and logged through the JSON loggerAgent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:aee005c-pythonRun
All tags pushed for this build
About Multi-Architecture Support
aee005c-python) is a multi-arch manifest supporting both amd64 and arm64aee005c-python-amd64) are also available if needed