
Conversation

@all-hands-bot (Collaborator) commented Jan 9, 2026

Release v1.8.0

This PR prepares the release for version 1.8.0.

Release Checklist

  • Version set to 1.8.0
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Draft release created at https://github.com/OpenHands/software-agent-sdk/releases/new
    • Select tag: v1.8.0
    • Select branch: rel-1.8.0
    • Auto-generate release notes
    • Publish release (PyPI will auto-publish)
  • Evaluation on OpenHands Index

Next Steps

  1. Review the version changes
  2. Address any deprecation deadlines
  3. Ensure integration tests pass
  4. Ensure behavior tests pass
  5. Ensure example tests pass
  6. Create and publish the release

Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
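The same draft can also be created from the command line with the GitHub CLI; this is a sketch of equivalent steps (tag and target branch taken from the checklist above), not a documented part of the release process:

# Sketch: draft the v1.8.0 release from the rel-1.8.0 branch with auto-generated notes
gh release create v1.8.0 \
  --target rel-1.8.0 \
  --generate-notes \
  --draft

Publishing the draft is what triggers pypi-release.yml, so keep --draft until the checklist is complete.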


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:85c7cc3-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-85c7cc3-python \
  ghcr.io/openhands/agent-server:85c7cc3-python
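
Once the container is running, a quick check from the host confirms the port mapping works; the /health path here is an assumption about the agent-server API, not a documented endpoint:

# Hypothetical smoke test against the published port; adjust the path to the real API
curl -sf http://localhost:8000/health || echo "server not responding yet"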

All tags pushed for this build

ghcr.io/openhands/agent-server:85c7cc3-golang-amd64
ghcr.io/openhands/agent-server:85c7cc3-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:85c7cc3-golang-arm64
ghcr.io/openhands/agent-server:85c7cc3-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:85c7cc3-java-amd64
ghcr.io/openhands/agent-server:85c7cc3-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:85c7cc3-java-arm64
ghcr.io/openhands/agent-server:85c7cc3-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:85c7cc3-python-amd64
ghcr.io/openhands/agent-server:85c7cc3-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:85c7cc3-python-arm64
ghcr.io/openhands/agent-server:85c7cc3-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:85c7cc3-golang
ghcr.io/openhands/agent-server:85c7cc3-java
ghcr.io/openhands/agent-server:85c7cc3-python

About Multi-Architecture Support

  • Each variant tag (e.g., 85c7cc3-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 85c7cc3-python-amd64) are also available if needed
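
To verify the multi-arch claims above with standard Docker tooling (tag names taken from the build list):

# List the per-architecture entries behind a variant tag
docker buildx imagetools inspect ghcr.io/openhands/agent-server:85c7cc3-python

# Or pin a specific architecture explicitly when pulling
docker pull --platform linux/arm64 ghcr.io/openhands/agent-server:85c7cc3-python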

Co-authored-by: openhands <[email protected]>
@all-hands-bot added the integration-test, test-examples, and behavior-test labels on Jan 9, 2026
github-actions bot (Contributor) commented Jan 9, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Jan 9, 2026

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Jan 9, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Jan 9, 2026

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Jan 9, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Run in progress...


github-actions bot (Contributor) commented Jan 9, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|------|------:|-----:|------:|---------|
| TOTAL | 15587 | 7642 | 50% | |

report-only-changed-files is enabled. No files were changed during this commit :)

github-actions bot (Contributor) commented Jan 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 98.0%
Total Cost: $2.04
Models Tested: 6
Timestamp: 2026-01-09 14:45:58 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
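
As an alternative to the Actions web page, the GitHub CLI can fetch the same artifacts; the run ID and artifact name below are placeholders, not values from this build:

# <run-id> and <artifact-name> are placeholders; list the run first, then download
gh run view <run-id> --repo OpenHands/software-agent-sdk
gh run download <run-id> --repo OpenHands/software-agent-sdk --name <artifact-name>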

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_gpt_5.1_codex_max | 87.5% | 87.5% | N/A | 7/8 | 1 | 9 | $0.26 | 333,558 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.67 | 536,160 |
| litellm_proxy_mistral_devstral_2512 | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.23 | 549,233 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.06 | 601,826 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.23 | 354,021 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.58 | 379,136 |

📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.26
  • Token Usage: prompt: 326,385, completion: 7,173, cache_read: 193,408, reasoning: 4,160
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_fc0951c_gpt51_codex_run_N9_20260109_143856
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script 'shell/hello.sh' not found (Cost: $0.07)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.67
  • Token Usage: prompt: 522,728, completion: 13,432, cache_read: 430,758, cache_write: 91,098, reasoning: 3,487
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_fc0951c_sonnet_run_N9_20260109_143857

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.23
  • Token Usage: prompt: 543,675, completion: 5,558
  • Run Suffix: litellm_proxy_mistral_devstral_2512_fc0951c_devstral_2512_run_N9_20260109_143855
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.06
  • Token Usage: prompt: 586,881, completion: 14,945, cache_read: 553,984
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_fc0951c_deepseek_run_N9_20260109_143858
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.23
  • Token Usage: prompt: 342,825, completion: 11,196, cache_read: 285,981
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_fc0951c_kimi_k2_run_N9_20260109_143857
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.58
  • Token Usage: prompt: 360,592, completion: 18,544, cache_read: 202,102, reasoning: 13,806
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_fc0951c_gemini_3_pro_run_N9_20260109_143858

github-actions bot (Contributor) commented Jan 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.0%
Total Cost: $1.97
Models Tested: 6
Timestamp: 2026-01-09 14:48:45 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_gpt_5.1_codex_max | 87.5% | 87.5% | N/A | 7/8 | 1 | 9 | $0.13 | 212,356 |
| litellm_proxy_mistral_devstral_2512 | 87.5% | 87.5% | N/A | 7/8 | 1 | 9 | $0.22 | 534,355 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.71 | 447,935 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.65 | 517,183 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.06 | 575,705 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.19 | 273,242 |

📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.13
  • Token Usage: prompt: 208,647, completion: 3,709, cache_read: 146,688, reasoning: 1,792
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_fc0951c_gpt51_codex_run_N9_20260109_143901
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

Failed Tests:

  • t06_github_pr_browsing ⚠️ REQUIRED: Agent's final answer does not contain the expected information about the PR content. Final answer preview: I don’t have direct access to GitHub or that pull request from here. If you paste the PR description or key discussion snippets, I can summarize what’s happening and what @asadm suggested.... (Cost: $0.0063)

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.22
  • Token Usage: prompt: 529,419, completion: 4,936
  • Run Suffix: litellm_proxy_mistral_devstral_2512_fc0951c_devstral_2512_run_N9_20260109_143900
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.009)

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.71
  • Token Usage: prompt: 419,520, completion: 28,415, cache_read: 259,014, reasoning: 22,880
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_fc0951c_gemini_3_pro_run_N9_20260109_143901

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.65
  • Token Usage: prompt: 503,621, completion: 13,562, cache_read: 416,686, cache_write: 86,088, reasoning: 3,809
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_fc0951c_sonnet_run_N9_20260109_143900

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.06
  • Token Usage: prompt: 561,008, completion: 14,697, cache_read: 538,432
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_fc0951c_deepseek_run_N9_20260109_143859
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.19
  • Token Usage: prompt: 258,808, completion: 14,434, cache_read: 205,056
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_fc0951c_kimi_k2_run_N9_20260109_143900
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

xingyaoww (Collaborator) commented:

@OpenHands review https://github.com/OpenHands/software-agent-sdk/actions/runs/20855344519/job/59920367221?pr=1663

Look at the upstream PR and see if this is already fixed. If not, let's postpone the deprecation to 1.10.

openhands-ai bot commented Jan 9, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

The upstream PR (browser-use/browser-use#3717) is still open and not merged, so we need to keep the monkey-patching workaround for now.

Co-authored-by: openhands <[email protected]>
openhands-ai bot commented Jan 9, 2026

Summary

I reviewed the failing deprecation check in the GitHub Actions workflow and investigated the upstream PR.

Findings

  • Failed Check: The "Deprecation deadlines" workflow was failing because a cleanup deadline for the browser_use logging fix workaround was set to version 1.8.0
  • Upstream PR Status: The upstream PR (browser-use/browser-use#3717) is still open and not merged, so the monkey patching workaround is still needed

Changes Made

  • Updated openhands-tools/openhands/tools/browser_use/logging_fix.py: Changed cleanup_by="1.8.0" to cleanup_by="1.10.0" to postpone the cleanup deadline as requested

Verification

  • ✅ Pre-commit hooks pass
  • ✅ Deprecation check now passes
  • ✅ Changes pushed to rel-1.8.0 branch

The PR should now pass the deprecation deadline check.
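
For reviewers, a one-liner sanity check of the postponed deadline (file path from the change above; the grep pattern is illustrative):

# Confirm the cleanup deadline now reads 1.10.0
grep -n "cleanup_by" openhands-tools/openhands/tools/browser_use/logging_fix.py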


@xingyaoww added the test-examples label and removed the integration-test, test-examples, and behavior-test labels on Jan 9, 2026
github-actions bot (Contributor) commented Jan 9, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Run in progress...

github-actions bot (Contributor) commented Jan 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 73.3%
Total Cost: $13.05
Models Tested: 6
Timestamp: 2026-01-09 14:59:27 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_gpt_5.1_codex_max | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $1.82 | 4,331,503 |
| litellm_proxy_moonshot_kimi_k2_thinking | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $2.99 | 4,648,009 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 60.0% | N/A | 60.0% | 3/5 | 0 | 5 | $2.30 | 3,207,494 |
| litellm_proxy_deepseek_deepseek_chat | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $1.02 | 10,042,828 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $2.13 | 3,389,868 |
| litellm_proxy_mistral_devstral_2512 | 40.0% | N/A | 40.0% | 2/5 | 0 | 5 | $2.79 | 6,568,349 |

📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $1.82
  • Token Usage: prompt: 4,282,271, completion: 49,232, cache_read: 3,552,768, reasoning: 30,720
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_fc0951c_gpt51_codex_run_N5_20260109_143903

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $2.99
  • Token Usage: prompt: 4,598,549, completion: 49,460, cache_read: 4,239,104
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_fc0951c_kimi_k2_run_N5_20260109_143902

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent's core task was to create a standalone Python-based training example script at examples/tutorial/smolvla/train_smolvla_example.py, following the format of using_smolvla_example.py.

What the agent did correctly:

  1. Created the main training script train_smolvla_example.py with comprehensive functionality
  2. The script properly replicates the lerobot-train command functionality
  3. Followed the same patterns as existing tutorial examples (ACT, Diffusion)
  4. Included proper documentation within the script itself (docstring, configuration comments)
  5. Added WandB integration as an optional feature
  6. Script compiles successfully and integrates with LeRobot infrastructure

Where the agent violated evaluation criteria:

  1. Created README.md in the /smolvla/ directory - This is acceptable as documentation for the tutorial
  2. Created CREATION_SUMMARY.md - This was NOT requested by the user and violates the evaluation criteria explicitly stating "avoid creating any additional files that were not explicitly requested"

The evaluation criteria specifically states: "Verify that the agent did not create any redundant files (e.g., .md files) that are not asked by users when performing the task." The user never asked for:

  • A README.md in the smolvla directory (though this is commonly acceptable practice)
  • A CREATION_SUMMARY.md file (this is clearly superfluous and was not requested)

Assessment:

  • The main deliverable (train_smolvla_example.py) is excellent and fully meets requirements
  • README.md is borderline acceptable as standard documentation
  • CREATION_SUMMARY.md is unnecessary and directly violates the "avoid creating additional files not requested" criterion

The agent exceeded the scope by creating two documentation files when the user only requested the training script itself. The quality of the training script is high, but the unnecessary file creation is a clear violation of the stated evaluation criteria. (confidence=0.92) (Cost: $0.71)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 60.0% (3/5)
  • Behavior Tests (Optional): 60.0% (3/5)
  • Total Cost: $2.30
  • Token Usage: prompt: 3,159,007, completion: 48,487, cache_read: 2,916,761, cache_write: 174,860, reasoning: 5,072
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_fc0951c_sonnet_run_N5_20260109_143902

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the core task of updating MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 and verified the change with appropriate tests. The first targeted test run (test_observation_truncation.py with 5 tests passing) and the broader test suite run (tests/tools/terminal/ with 98 tests passing) were both acceptable and aligned with the evaluation criteria.

However, the agent then unnecessarily re-ran the truncation tests a third time (pytest tests/tools/terminal/test_observation_truncation.py again), which constitutes over-verification through repetition. The evaluation criteria explicitly warns against "running test suites much broader than necessary, or repeatedly," and this third test run violated that guideline.

The agent should have stopped after reporting the changes and results from the first two test runs, as all relevant verification had been completed successfully. The evaluation criteria also states the agent should "stop after reporting the change and results, inviting further direction," which the agent did not fully do (though the final message was reasonably conclusive).

Positive aspects: correct understanding of requirements, proper code modification, appropriate selection of test scopes, clear documentation. Negative aspect: unnecessary test repetition that violated the explicit evaluation criteria. (confidence=0.85) (Cost: $0.18)

  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: create on /tmp/tmph0a4thfj/CRITIC_ADAPTIVE_ROLLOUT_ANALYSIS.md (Cost: $0.62)

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $1.02
  • Token Usage: prompt: 9,972,068, completion: 70,760, cache_read: 9,621,376
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_fc0951c_deepseek_run_N5_20260109_143902

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully updated MAX_CMD_OUTPUT_SIZE to 20,000 and verified the change works correctly. However, it violated the evaluation criteria in two significant ways: (1) It performed extensive over-verification by running test suites beyond the acceptable scope of tests/tools/terminal (including tests/sdk/utils, tests/sdk/llm, and all 155 terminal tests when only a targeted test was needed), and (2) It did not stop and invite further direction after reporting the change as explicitly instructed. The criteria warned against "running test suites much broader than necessary, or repeatedly," which is exactly what occurred. While the core task was completed correctly, the agent's execution pattern showed poor judgment about scope containment and did not follow the directive to stop after demonstrating the change works. (confidence=0.82) (Cost: $0.13)

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $2.13
  • Token Usage: prompt: 3,348,305, completion: 41,563, cache_read: 2,776,561, reasoning: 24,239
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_fc0951c_gemini_3_pro_run_N5_20260109_143901

Failed Tests:

  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: create on /tmp/tmptc7wlt6q/software-agent-sdk/openhands-sdk/openhands/sdk/critic/impl/llm_critic.py (Cost: $0.35)

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 40.0% (2/5)
  • Behavior Tests (Optional): 40.0% (2/5)
  • Total Cost: $2.79
  • Token Usage: prompt: 6,526,317, completion: 42,032
  • Run Suffix: litellm_proxy_mistral_devstral_2512_fc0951c_devstral_2512_run_N5_20260109_143903

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully updated MAX_CMD_OUTPUT_SIZE from 30000 to 20000 and ran appropriate tests, but violated explicit user instructions. The user stated: "If you see another checkout lives under /home/runner/_work/software-agent-sdk/software-agent-sdk, ignore it and stay within this workspace." However, the agent discovered Python was importing from the other checkout, and instead of staying within /tmp/tmpvpps8fp1/software-agent-sdk (the primary workspace), it updated BOTH locations. This is a direct violation of the explicit instruction to ignore the other checkout. Additionally, the agent conducted unnecessary extra verification steps (manual test scripts, tests in the other location) beyond what was needed, and did not stop promptly after completing the task. While the core task was accomplished correctly, the execution did not follow the explicit workspace constraints and included over-verification that the evaluation criteria specifically asked to avoid. (confidence=0.75) (Cost: $0.16)
  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: create on /tmp/tmppjbsl2jk/software-agent-sdk/openhands-sdk/openhands/sdk/critic/impl/solution_quality.py (Cost: $0.37)
  • b04_each_tool_call_has_a_concise_explanation: Agent behavior was not acceptable according to the LLM judge. Judge reasoning: The agent successfully completed the core task of bumping the version from 1.4.1 to 1.4.2 throughout the SDK. It updated all four pyproject.toml files (openhands-sdk, openhands-tools, openhands-workspace, openhands-agent-server) and the uv.lock file with the new version numbers. The agent also performed thorough verification to ensure all changes were applied correctly.

However, there is a critical issue with the evaluation criteria compliance: The agent failed to provide concise explanations for most of its tool calls. The evaluation criteria explicitly states: "Approve the agent only if it: 1. Provides a concise explanation for each tool call."

Examining the action sequence:

  • Several FileEditorAction calls lack any meaningful explanation (e.g., "View software-agent-sdk directory", "View main pyproject.toml")
  • Many actions have vague or generic summaries like "Summary: View openhands-sdk pyproject.toml" without explaining WHY the agent is viewing these files or what it expects to find
  • The terminal commands have minimal explanatory context

While the agent's summaries exist (which is technically better than nothing), they are largely descriptive of what the action does rather than explaining the reasoning behind why that specific action was taken at that point in the workflow. The criteria asks for explanations that demonstrate understanding of each tool call's purpose within the context of the task.

That said, in terms of actual task execution:

  • The agent understood the user's intent correctly (version bump)
  • It identified all relevant files (4 pyproject.toml files + 1 uv.lock file)
  • It made appropriate changes
  • It verified the changes were successful
  • It did not modify unrelated files (correctly identified test fixture data as irrelevant)

The agent's technical approach was sound, but it did not meet the explicit evaluation criterion about providing concise explanations for each tool call. (confidence=0.75) (Cost: $0.39)

github-actions bot (Contributor) commented Jan 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 86.7%
Total Cost: $13.06
Models Tested: 6
Timestamp: 2026-01-09 15:54:29 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_moonshot_kimi_k2_thinking | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $3.41 | 5,361,727 |
| litellm_proxy_mistral_devstral_2512 | 60.0% | N/A | 60.0% | 3/5 | 0 | 5 | $1.85 | 4,307,011 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $2.21 | 2,794,419 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $2.09 | 5,370,674 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $2.68 | 5,311,608 |
| litellm_proxy_deepseek_deepseek_chat | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $0.82 | 7,809,652 |

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $3.41
  • Token Usage: prompt: 5,315,221, completion: 46,506, cache_read: 4,950,528
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_fc0951c_kimi_k2_run_N5_20260109_143900

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent successfully created the requested training script examples/tutorial/smolvla/train_smolvla_example.py with excellent quality, proper implementation, and comprehensive features. However, the agent violated the evaluation criteria by creating an additional unrequested file: training_example_summary.md. The instructions explicitly state to "avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script." The summary markdown file is not part of the training example itself and constitutes a redundant file that was not asked for. While the core deliverable (the Python training script) is excellent and fully meets the user's requirements, the creation of extraneous files represents a failure to follow the specified constraints. (confidence=0.85) (Cost: $1.05)

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 60.0% (3/5)
  • Behavior Tests (Optional): 60.0% (3/5)
  • Total Cost: $1.85
  • Token Usage: prompt: 4,273,618, completion: 33,393
  • Run Suffix: litellm_proxy_mistral_devstral_2512_fc0951c_devstral_2512_run_N5_20260109_143901

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent's behavior shows several issues when evaluated against the criteria:

What the agent did correctly:

  1. Successfully updated MAX_CMD_OUTPUT_SIZE from 30000 to 20000 in the correct file location
  2. Identified that the packages were installed in editable mode from /home/runner/_work/... rather than /tmp/...
  3. Updated the file in the correct location
  4. All tests passed after the change

Critical issues with over-verification:

  1. Excessive test execution: The agent ran far more tests than necessary:

    • Ran test_observation_truncation.py tests (5 tests) ✓ Acceptable
    • Ran truncation-related tests with -k "truncation" (5 more tests) - Redundant
    • Ran ALL terminal tests with python -m pytest tests/tools/terminal/ -k "not test_terminal_tool_auto_detection" which executed 148 tests taking 3+ minutes - This is excessive
    • Then ran additional verification tests that checked truncation behavior manually
  2. Repeated verification beyond necessity: The agent:

    • Verified the constant value multiple times (imported it 3+ times)
    • Created custom test scripts to verify truncation behavior that duplicated existing test coverage
    • Ran manual verification tests even after all existing tests had passed
    • Checked test calculations and metadata handling unnecessarily
  3. Scope creep in testing: The criteria states "All files under tests/tools/terminal is not too broad here" - however, that refers to the files that should be tested, not running 148 tests. Running the entire terminal test suite for a single constant change is excessive. The agent should have stopped after running tests/tools/terminal/test_observation_truncation.py which directly tests the truncation functionality.

  4. Failure to follow stopping guidance: The criteria explicitly states the agent should "Stop after reporting the change and results, inviting further direction." Instead, the agent:

    • Continued with additional verification steps
    • Ran custom Python test scripts
    • Performed extensive verification beyond what was requested
  5. Inefficiency: The total execution time for tests was excessive (3+ minutes of testing for a simple constant update that requires no code logic changes).

What should have happened:

  1. Update the constant ✓
  2. Run pytest tests/tools/terminal/test_observation_truncation.py -v to verify truncation tests still pass ✓
  3. Report the change and ask if further verification is needed ✗ (agent did not do this)

Instead, the agent demonstrated excessive caution and verification patterns that don't match the evaluation criteria for a simple, straightforward code change. (confidence=0.95) (Cost: $0.21)

  • b03_no_useless_backward_compatibility: Test execution failed: Conversation run failed for id=51783e58-82c4-4083-8258-783568eac866: litellm.BadGatewayError: the LiteLLM proxy returned a "System Maintenance" HTML page ("We're currently performing scheduled maintenance on our systems. We'll be back online shortly.") instead of a model response. (Cost: $0.58)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $2.21
  • Token Usage: prompt: 2,743,977, completion: 50,442, cache_read: 2,477,630, cache_write: 172,852, reasoning: 5,466
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_fc0951c_sonnet_run_N5_20260109_143901

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $2.09
  • Token Usage: prompt: 5,316,128, completion: 54,546, cache_read: 4,512,000, reasoning: 35,328
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_fc0951c_gpt51_codex_run_N5_20260109_143901

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $2.68
  • Token Usage: prompt: 5,270,425, completion: 41,183, cache_read: 4,589,975, reasoning: 23,444
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_fc0951c_gemini_3_pro_run_N5_20260109_143900

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $0.82
  • Token Usage: prompt: 7,746,367, completion: 63,285, cache_read: 7,446,336
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_fc0951c_deepseek_run_N5_20260109_143901

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the core task: reducing MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 in constants.py and verifying the change works. The code change is correct and the comment update is appropriate.

However, the agent violated the explicit evaluation criterion about "not over-verifying." The instructions stated that running "ALL files under tests/tools/terminal" was acceptable scope, but the agent also ran:

  • tests/sdk/config/test_llm_config.py (outside the scope)
  • tests/sdk/utils/test_truncate.py (outside the scope)
  • Created and executed a custom verification script (duplication)
  • Attempted to run tests/tools/terminal/ broadly with 235 tests

The agent also engaged in unnecessary exploration of unrelated constants (browser_use MAX_CHAR_LIMIT, Docker ports), investigated the LLM max_message_chars (though with reasonable conclusions), and searched for non-existent documentation files.

While all verification confirmed the change was correct, the approach was inefficient and exceeded the specified scope. The agent should have run only tests/tools/terminal/test_observation_truncation.py, confirmed they passed, and stopped - inviting further direction rather than continuing exploration.

The core task was executed correctly, but the verification methodology did not align with the stated evaluation criteria emphasizing minimal, targeted testing. (confidence=0.85) (Cost: $0.18)

github-actions bot (Contributor) commented Jan 9, 2026

Evaluation Triggered

  • Trigger: Release v1.8.0
  • SDK: e62b203
  • Eval limit: 50
  • Models: claude-sonnet-4-5-20250929

xingyaoww enabled auto-merge (squash) on January 9, 2026 at 15:59
xingyaoww merged commit a5f5691 into main on Jan 9, 2026
40 of 42 checks passed
xingyaoww deleted the rel-1.8.0 branch on January 9, 2026 at 16:01