Release v1.8.0 #1663
Conversation
Co-authored-by: openhands <[email protected]>
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

🔄 Running Examples with
🧪 Integration Tests Results

Overall Success Rate: 98.0%

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max
Skipped Tests:
Failed Tests:
litellm_proxy_claude_sonnet_4_5_20250929
litellm_proxy_mistral_devstral_2512
Skipped Tests:
litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
litellm_proxy_vertex_ai_gemini_3_pro_preview
🧪 Integration Tests Results

Overall Success Rate: 96.0%

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max
Skipped Tests:
Failed Tests:
litellm_proxy_mistral_devstral_2512
Skipped Tests:
Failed Tests:
litellm_proxy_vertex_ai_gemini_3_pro_preview
litellm_proxy_claude_sonnet_4_5_20250929
litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
@OpenHands review https://github.com/OpenHands/software-agent-sdk/actions/runs/20855344519/job/59920367221?pr=1663 Look at the upstream PR and see if this is fixed already. If not, let's postpone it and deprecate it in 1.10.
I'm on it! xingyaoww can track my progress at all-hands.dev
The upstream PR (browser-use/browser-use#3717) is still open and not merged, so we need to keep the monkey patching workaround for now.
Co-authored-by: openhands <[email protected]>

Summary

I reviewed the failing deprecation check in the GitHub Actions workflow and investigated the upstream PR.

Findings

Changes Made

Verification

The PR should now pass the deprecation deadline check.
🔄 Running Examples with
🧪 Integration Tests Results

Overall Success Rate: 73.3%

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max
litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
What the agent did correctly:
Where the agent violated evaluation criteria:
The evaluation criteria specifically states: "Verify that the agent did not create any redundant files (e.g., .md files) that are not asked by users when performing the task." The user never asked for:
Assessment:
The agent exceeded the scope by creating two documentation files when the user only requested the training script itself. The quality of the training script is high, but the unnecessary file creation is a clear violation of the stated evaluation criteria. (confidence=0.92) (Cost: $0.71)

litellm_proxy_claude_sonnet_4_5_20250929
Failed Tests:
However, the agent then unnecessarily re-ran the truncation tests a third time (pytest tests/tools/terminal/test_observation_truncation.py again), which constitutes over-verification through repetition. The evaluation criteria explicitly warns against "running test suites much broader than necessary, or repeatedly," and this third test run violated that guideline. The agent should have stopped after reporting the changes and results from the first two test runs, as all relevant verification had been completed successfully. The evaluation criteria also states the agent should "stop after reporting the change and results, inviting further direction," which the agent did not fully do (though the final message was reasonably conclusive). Positive aspects: correct understanding of requirements, proper code modification, appropriate selection of test scopes, clear documentation. Negative aspect: unnecessary test repetition that violated the explicit evaluation criteria. (confidence=0.85) (Cost: $0.18)
litellm_proxy_deepseek_deepseek_chat
Failed Tests:
litellm_proxy_vertex_ai_gemini_3_pro_preview
Failed Tests:
litellm_proxy_mistral_devstral_2512
Failed Tests:
However, there is a critical issue with the evaluation criteria compliance: The agent failed to provide concise explanations for most of its tool calls. The evaluation criteria explicitly states: "Approve the agent only if it: 1. Provides a concise explanation for each tool call." Examining the action sequence:
While the agent's summaries exist (which is technically better than nothing), they are largely descriptive of what the action does rather than explaining the reasoning behind why that specific action was taken at that point in the workflow. The criteria asks for explanations that demonstrate understanding of each tool call's purpose within the context of the task. That said, in terms of actual task execution:
The agent's technical approach was sound, but it did not meet the explicit evaluation criterion about providing concise explanations for each tool call. (confidence=0.75) (Cost: $0.39)
🧪 Integration Tests Results

Overall Success Rate: 86.7%

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
litellm_proxy_mistral_devstral_2512
Failed Tests:
What the agent did correctly:
Critical issues with over-verification:
What should have happened:
Instead, the agent demonstrated excessive caution and verification patterns that don't match the evaluation criteria for a simple, straightforward code change. (confidence=0.95) (Cost: $0.21)
litellm_proxy_claude_sonnet_4_5_20250929
litellm_proxy_gpt_5.1_codex_max
litellm_proxy_vertex_ai_gemini_3_pro_preview
litellm_proxy_deepseek_deepseek_chat
Failed Tests:
However, the agent violated the explicit evaluation criterion about "not over-verifying." The instructions stated that running "ALL files under tests/tools/terminal" was acceptable scope, but the agent also ran:
The agent also engaged in unnecessary exploration of unrelated constants (browser_use MAX_CHAR_LIMIT, Docker ports), investigated the LLM max_message_chars (though with reasonable conclusions), and searched for non-existent documentation files. While all verification confirmed the change was correct, the approach was inefficient and exceeded the specified scope. The agent should have run only tests/tools/terminal/test_observation_truncation.py, confirmed they passed, and stopped - inviting further direction rather than continuing exploration. The core task was executed correctly, but the verification methodology did not align with the stated evaluation criteria emphasizing minimal, targeted testing. (confidence=0.85) (Cost: $0.18)
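For reference, the minimal, targeted verification described above amounts to a single pytest invocation like the sketch below; this is an illustration of the expected scope, not a command taken from the agent's transcript.

```
# Run only the truncation tests relevant to the change, then stop
pytest tests/tools/terminal/test_observation_truncation.py
```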
Evaluation Triggered
Release v1.8.0
This PR prepares the release for version 1.8.0.
Release Checklist
- Integration tests pass (integration-test)
- Behavior tests pass (behavior-test)
- Example tests pass (test-examples)
- Version bumped to v1.8.0
- Release branch rel-1.8.0 created

Next Steps
Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
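Once that workflow completes, the release can be sanity-checked by installing the new version from PyPI. The package name below is an assumption used for illustration; substitute the actual published package names.

```
# Hypothetical post-release check (package name assumed, not listed in this PR)
pip install "openhands-sdk==1.8.0"
```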
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
```
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:85c7cc3-python
```

Run
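A container from one of these images can be started with docker run; the sketch below assumes the server listens on port 8000 inside the container, which is an assumption for illustration rather than something specified in this PR.

```
# Minimal sketch: run the python variant, publishing an assumed port 8000
docker run --rm -p 8000:8000 ghcr.io/openhands/agent-server:85c7cc3-python
```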
All tags pushed for this build
About Multi-Architecture Support
- The default tag (85c7cc3-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., 85c7cc3-python-amd64) are also available if needed
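When a single-architecture image is preferred (for example on a runner where the multi-arch manifest is not wanted), the architecture-specific tag mentioned above can be pulled directly:

```
# Pull the amd64-only image instead of the multi-arch manifest
docker pull ghcr.io/openhands/agent-server:85c7cc3-python-amd64
```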