
Conversation

@all-hands-bot (Collaborator) commented Jan 9, 2026

Release v1.8.0

This PR prepares the release for version 1.8.0.

Release Checklist

  • Version set to 1.8.0
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Draft release created at https://github.com/OpenHands/software-agent-sdk/releases/new
    • Select tag: v1.8.0
    • Select branch: rel-1.8.0
    • Auto-generate release notes
    • Publish release (PyPI will auto-publish)
  • Evaluation on OpenHands Index

Next Steps

  1. Review the version changes
  2. Address any deprecation deadlines
  3. Ensure integration tests pass
  4. Ensure behavior tests pass
  5. Ensure example tests pass
  6. Create and publish the release

Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
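The same draft can also be created from the command line with the GitHub CLI; this is a sketch of equivalent steps (tag and target branch taken from the checklist above), not a documented part of the release process:

# Sketch: draft the v1.8.0 release from the rel-1.8.0 branch with auto-generated notes
gh release create v1.8.0 \
  --target rel-1.8.0 \
  --generate-notes \
  --draft

Publishing the draft is what triggers pypi-release.yml, so keep --draft until the checklist is complete.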


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:85c7cc3-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-85c7cc3-python \
  ghcr.io/openhands/agent-server:85c7cc3-python
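
Once the container is running, a quick check from the host confirms the port mapping works; the /health path here is an assumption about the agent-server API, not a documented endpoint:

# Hypothetical smoke test against the published port; adjust the path to the real API
curl -sf http://localhost:8000/health || echo "server not responding yet"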

All tags pushed for this build

ghcr.io/openhands/agent-server:85c7cc3-golang-amd64
ghcr.io/openhands/agent-server:85c7cc3-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:85c7cc3-golang-arm64
ghcr.io/openhands/agent-server:85c7cc3-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:85c7cc3-java-amd64
ghcr.io/openhands/agent-server:85c7cc3-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:85c7cc3-java-arm64
ghcr.io/openhands/agent-server:85c7cc3-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:85c7cc3-python-amd64
ghcr.io/openhands/agent-server:85c7cc3-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:85c7cc3-python-arm64
ghcr.io/openhands/agent-server:85c7cc3-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:85c7cc3-golang
ghcr.io/openhands/agent-server:85c7cc3-java
ghcr.io/openhands/agent-server:85c7cc3-python

About Multi-Architecture Support

  • Each variant tag (e.g., 85c7cc3-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 85c7cc3-python-amd64) are also available if needed
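
To verify the multi-arch claims above with standard Docker tooling (tag names taken from the build list):

# List the per-architecture entries behind a variant tag
docker buildx imagetools inspect ghcr.io/openhands/agent-server:85c7cc3-python

# Or pin a specific architecture explicitly when pulling
docker pull --platform linux/arm64 ghcr.io/openhands/agent-server:85c7cc3-python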

Co-authored-by: openhands <[email protected]>
@all-hands-bot added the integration-test, test-examples, and behavior-test labels on Jan 9, 2026
github-actions bot (Contributor) commented Jan 9, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Jan 9, 2026

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Jan 9, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Jan 9, 2026

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Jan 9, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Run in progress...


github-actions bot (Contributor) commented Jan 9, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|------|------:|-----:|------:|---------|
| TOTAL | 15587 | 7642 | 50% | |

report-only-changed-files is enabled. No files were changed during this commit :)

github-actions bot (Contributor) commented Jan 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 98.0%
Total Cost: $2.04
Models Tested: 6
Timestamp: 2026-01-09 14:45:58 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
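
As an alternative to the Actions web page, the GitHub CLI can fetch the same artifacts; the run ID and artifact name below are placeholders, not values from this build:

# <run-id> and <artifact-name> are placeholders; list the run first, then download
gh run view <run-id> --repo OpenHands/software-agent-sdk
gh run download <run-id> --repo OpenHands/software-agent-sdk --name <artifact-name>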

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_gpt_5.1_codex_max | 87.5% | 87.5% | N/A | 7/8 | 1 | 9 | $0.26 | 333,558 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.67 | 536,160 |
| litellm_proxy_mistral_devstral_2512 | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.23 | 549,233 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.06 | 601,826 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.23 | 354,021 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.58 | 379,136 |

📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.26
  • Token Usage: prompt: 326,385, completion: 7,173, cache_read: 193,408, reasoning: 4,160
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_fc0951c_gpt51_codex_run_N9_20260109_143856
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script 'shell/hello.sh' not found (Cost: $0.07)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.67
  • Token Usage: prompt: 522,728, completion: 13,432, cache_read: 430,758, cache_write: 91,098, reasoning: 3,487
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_fc0951c_sonnet_run_N9_20260109_143857

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.23
  • Token Usage: prompt: 543,675, completion: 5,558
  • Run Suffix: litellm_proxy_mistral_devstral_2512_fc0951c_devstral_2512_run_N9_20260109_143855
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.06
  • Token Usage: prompt: 586,881, completion: 14,945, cache_read: 553,984
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_fc0951c_deepseek_run_N9_20260109_143858
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.23
  • Token Usage: prompt: 342,825, completion: 11,196, cache_read: 285,981
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_fc0951c_kimi_k2_run_N9_20260109_143857
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.58
  • Token Usage: prompt: 360,592, completion: 18,544, cache_read: 202,102, reasoning: 13,806
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_fc0951c_gemini_3_pro_run_N9_20260109_143858

github-actions bot (Contributor) commented Jan 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.0%
Total Cost: $1.97
Models Tested: 6
Timestamp: 2026-01-09 14:48:45 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_gpt_5.1_codex_max | 87.5% | 87.5% | N/A | 7/8 | 1 | 9 | $0.13 | 212,356 |
| litellm_proxy_mistral_devstral_2512 | 87.5% | 87.5% | N/A | 7/8 | 1 | 9 | $0.22 | 534,355 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.71 | 447,935 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.65 | 517,183 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.06 | 575,705 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.19 | 273,242 |

📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.13
  • Token Usage: prompt: 208,647, completion: 3,709, cache_read: 146,688, reasoning: 1,792
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_fc0951c_gpt51_codex_run_N9_20260109_143901
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

Failed Tests:

  • t06_github_pr_browsing ⚠️ REQUIRED: Agent's final answer does not contain the expected information about the PR content. Final answer preview: I don’t have direct access to GitHub or that pull request from here. If you paste the PR description or key discussion snippets, I can summarize what’s happening and what @asadm suggested.... (Cost: $0.0063)

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.22
  • Token Usage: prompt: 529,419, completion: 4,936
  • Run Suffix: litellm_proxy_mistral_devstral_2512_fc0951c_devstral_2512_run_N9_20260109_143900
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.009)

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.71
  • Token Usage: prompt: 419,520, completion: 28,415, cache_read: 259,014, reasoning: 22,880
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_fc0951c_gemini_3_pro_run_N9_20260109_143901

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.65
  • Token Usage: prompt: 503,621, completion: 13,562, cache_read: 416,686, cache_write: 86,088, reasoning: 3,809
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_fc0951c_sonnet_run_N9_20260109_143900

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.06
  • Token Usage: prompt: 561,008, completion: 14,697, cache_read: 538,432
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_fc0951c_deepseek_run_N9_20260109_143859
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.19
  • Token Usage: prompt: 258,808, completion: 14,434, cache_read: 205,056
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_fc0951c_kimi_k2_run_N9_20260109_143900
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

xingyaoww (Collaborator) commented:

@OpenHands review https://github.com/OpenHands/software-agent-sdk/actions/runs/20855344519/job/59920367221?pr=1663

Look at the upstream PR and see if this is already fixed. If not, let's postpone the deprecation to 1.10.

openhands-ai bot commented Jan 9, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

The upstream PR (browser-use/browser-use#3717) is still open and not merged, so we need to keep the monkey-patching workaround for now.

Co-authored-by: openhands <[email protected]>
openhands-ai bot commented Jan 9, 2026

Summary

I reviewed the failing deprecation check in the GitHub Actions workflow and investigated the upstream PR.

Findings

  • Failed Check: The "Deprecation deadlines" workflow was failing because a cleanup deadline for the browser_use logging fix workaround was set to version 1.8.0
  • Upstream PR Status: The upstream PR (browser-use/browser-use#3717) is still open and not merged, so the monkey patching workaround is still needed

Changes Made

  • Updated openhands-tools/openhands/tools/browser_use/logging_fix.py: Changed cleanup_by="1.8.0" to cleanup_by="1.10.0" to postpone the cleanup deadline as requested

Verification

  • ✅ Pre-commit hooks pass
  • ✅ Deprecation check now passes
  • ✅ Changes pushed to rel-1.8.0 branch

The PR should now pass the deprecation deadline check.
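
For reviewers, a one-liner sanity check of the postponed deadline (file path from the change above; the grep pattern is illustrative):

# Confirm the cleanup deadline now reads 1.10.0
grep -n "cleanup_by" openhands-tools/openhands/tools/browser_use/logging_fix.py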


@xingyaoww added the test-examples label and removed the integration-test, test-examples, and behavior-test labels on Jan 9, 2026
github-actions bot (Contributor) commented Jan 9, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Run in progress...

github-actions bot (Contributor) commented Jan 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 73.3%
Total Cost: $13.05
Models Tested: 6
Timestamp: 2026-01-09 14:59:27 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_gpt_5.1_codex_max | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $1.82 | 4,331,503 |
| litellm_proxy_moonshot_kimi_k2_thinking | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $2.99 | 4,648,009 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 60.0% | N/A | 60.0% | 3/5 | 0 | 5 | $2.30 | 3,207,494 |
| litellm_proxy_deepseek_deepseek_chat | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $1.02 | 10,042,828 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $2.13 | 3,389,868 |
| litellm_proxy_mistral_devstral_2512 | 40.0% | N/A | 40.0% | 2/5 | 0 | 5 | $2.79 | 6,568,349 |

📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $1.82
  • Token Usage: prompt: 4,282,271, completion: 49,232, cache_read: 3,552,768, reasoning: 30,720
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_fc0951c_gpt51_codex_run_N5_20260109_143903

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $2.99
  • Token Usage: prompt: 4,598,549, completion: 49,460, cache_read: 4,239,104
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_fc0951c_kimi_k2_run_N5_20260109_143902

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent's core task was to create a standalone Python-based training example script at examples/tutorial/smolvla/train_smolvla_example.py, following the format of using_smolvla_example.py.

What the agent did correctly:

  1. Created the main training script train_smolvla_example.py with comprehensive functionality
  2. The script properly replicates the lerobot-train command functionality
  3. Followed the same patterns as existing tutorial examples (ACT, Diffusion)
  4. Included proper documentation within the script itself (docstring, configuration comments)
  5. Added WandB integration as an optional feature
  6. Script compiles successfully and integrates with LeRobot infrastructure

Where the agent violated evaluation criteria:

  1. Created README.md in the /smolvla/ directory - This is acceptable as documentation for the tutorial
  2. Created CREATION_SUMMARY.md - This was NOT requested by the user and violates the evaluation criteria explicitly stating "avoid creating any additional files that were not explicitly requested"

The evaluation criteria specifically states: "Verify that the agent did not create any redundant files (e.g., .md files) that are not asked by users when performing the task." The user never asked for:

  • A README.md in the smolvla directory (though this is commonly acceptable practice)
  • A CREATION_SUMMARY.md file (this is clearly superfluous and was not requested)

Assessment:

  • The main deliverable (train_smolvla_example.py) is excellent and fully meets requirements
  • README.md is borderline acceptable as standard documentation
  • CREATION_SUMMARY.md is unnecessary and directly violates the "avoid creating additional files not requested" criterion

The agent exceeded the scope by creating two documentation files when the user only requested the training script itself. The quality of the training script is high, but the unnecessary file creation is a clear violation of the stated evaluation criteria. (confidence=0.92) (Cost: $0.71)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 60.0% (3/5)
  • Behavior Tests (Optional): 60.0% (3/5)
  • Total Cost: $2.30
  • Token Usage: prompt: 3,159,007, completion: 48,487, cache_read: 2,916,761, cache_write: 174,860, reasoning: 5,072
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_fc0951c_sonnet_run_N5_20260109_143902

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the core task of updating MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 and verified the change with appropriate tests. The first targeted test run (test_observation_truncation.py with 5 tests passing) and the broader test suite run (tests/tools/terminal/ with 98 tests passing) were both acceptable and aligned with the evaluation criteria.

However, the agent then unnecessarily re-ran the truncation tests a third time (pytest tests/tools/terminal/test_observation_truncation.py again), which constitutes over-verification through repetition. The evaluation criteria explicitly warns against "running test suites much broader than necessary, or repeatedly," and this third test run violated that guideline.

The agent should have stopped after reporting the changes and results from the first two test runs, as all relevant verification had been completed successfully. The evaluation criteria also states the agent should "stop after reporting the change and results, inviting further direction," which the agent did not fully do (though the final message was reasonably conclusive).

Positive aspects: correct understanding of requirements, proper code modification, appropriate selection of test scopes, clear documentation. Negative aspect: unnecessary test repetition that violated the explicit evaluation criteria. (confidence=0.85) (Cost: $0.18)

  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: create on /tmp/tmph0a4thfj/CRITIC_ADAPTIVE_ROLLOUT_ANALYSIS.md (Cost: $0.62)

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $1.02
  • Token Usage: prompt: 9,972,068, completion: 70,760, cache_read: 9,621,376
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_fc0951c_deepseek_run_N5_20260109_143902

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully updated MAX_CMD_OUTPUT_SIZE to 20,000 and verified the change works correctly. However, it violated the evaluation criteria in two significant ways: (1) It performed extensive over-verification by running test suites beyond the acceptable scope of tests/tools/terminal (including tests/sdk/utils, tests/sdk/llm, and all 155 terminal tests when only a targeted test was needed), and (2) It did not stop and invite further direction after reporting the change as explicitly instructed. The criteria warned against "running test suites much broader than necessary, or repeatedly," which is exactly what occurred. While the core task was completed correctly, the agent's execution pattern showed poor judgment about scope containment and did not follow the directive to stop after demonstrating the change works. (confidence=0.82) (Cost: $0.13)

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $2.13
  • Token Usage: prompt: 3,348,305, completion: 41,563, cache_read: 2,776,561, reasoning: 24,239
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_fc0951c_gemini_3_pro_run_N5_20260109_143901

Failed Tests:

  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: create on /tmp/tmptc7wlt6q/software-agent-sdk/openhands-sdk/openhands/sdk/critic/impl/llm_critic.py (Cost: $0.35)

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 40.0% (2/5)
  • Behavior Tests (Optional): 40.0% (2/5)
  • Total Cost: $2.79
  • Token Usage: prompt: 6,526,317, completion: 42,032
  • Run Suffix: litellm_proxy_mistral_devstral_2512_fc0951c_devstral_2512_run_N5_20260109_143903

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully updated MAX_CMD_OUTPUT_SIZE from 30000 to 20000 and ran appropriate tests, but violated explicit user instructions. The user stated: "If you see another checkout lives under /home/runner/_work/software-agent-sdk/software-agent-sdk, ignore it and stay within this workspace." However, the agent discovered Python was importing from the other checkout, and instead of staying within /tmp/tmpvpps8fp1/software-agent-sdk (the primary workspace), it updated BOTH locations. This is a direct violation of the explicit instruction to ignore the other checkout. Additionally, the agent conducted unnecessary extra verification steps (manual test scripts, tests in the other location) beyond what was needed, and did not stop promptly after completing the task. While the core task was accomplished correctly, the execution did not follow the explicit workspace constraints and included over-verification that the evaluation criteria specifically asked to avoid. (confidence=0.75) (Cost: $0.16)
  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: create on /tmp/tmppjbsl2jk/software-agent-sdk/openhands-sdk/openhands/sdk/critic/impl/solution_quality.py (Cost: $0.37)
  • b04_each_tool_call_has_a_concise_explanation: Agent behavior was not acceptable according to the LLM judge. Judge reasoning: The agent successfully completed the core task of bumping the version from 1.4.1 to 1.4.2 throughout the SDK. It updated all four pyproject.toml files (openhands-sdk, openhands-tools, openhands-workspace, openhands-agent-server) and the uv.lock file with the new version numbers. The agent also performed thorough verification to ensure all changes were applied correctly.

However, there is a critical issue with the evaluation criteria compliance: The agent failed to provide concise explanations for most of its tool calls. The evaluation criteria explicitly states: "Approve the agent only if it: 1. Provides a concise explanation for each tool call."

Examining the action sequence:

  • Several FileEditorAction calls lack any meaningful explanation (e.g., "View software-agent-sdk directory", "View main pyproject.toml")
  • Many actions have vague or generic summaries like "Summary: View openhands-sdk pyproject.toml" without explaining WHY the agent is viewing these files or what it expects to find
  • The terminal commands have minimal explanatory context

While the agent's summaries exist (which is technically better than nothing), they are largely descriptive of what the action does rather than explaining the reasoning behind why that specific action was taken at that point in the workflow. The criteria asks for explanations that demonstrate understanding of each tool call's purpose within the context of the task.

That said, in terms of actual task execution:

  • The agent understood the user's intent correctly (version bump)
  • It identified all relevant files (4 pyproject.toml files + 1 uv.lock file)
  • It made appropriate changes
  • It verified the changes were successful
  • It did not modify unrelated files (correctly identified test fixture data as irrelevant)

The agent's technical approach was sound, but it did not meet the explicit evaluation criterion about providing concise explanations for each tool call. (confidence=0.75) (Cost: $0.39)

github-actions bot (Contributor) commented Jan 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 86.7%
Total Cost: $13.06
Models Tested: 6
Timestamp: 2026-01-09 15:54:29 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
|-------|---------|------------------------|---------------------|--------------|---------|-------|------|--------|
| litellm_proxy_moonshot_kimi_k2_thinking | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $3.41 | 5,361,727 |
| litellm_proxy_mistral_devstral_2512 | 60.0% | N/A | 60.0% | 3/5 | 0 | 5 | $1.85 | 4,307,011 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $2.21 | 2,794,419 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $2.09 | 5,370,674 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $2.68 | 5,311,608 |
| litellm_proxy_deepseek_deepseek_chat | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $0.82 | 7,809,652 |

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $3.41
  • Token Usage: prompt: 5,315,221, completion: 46,506, cache_read: 4,950,528
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_fc0951c_kimi_k2_run_N5_20260109_143900

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent successfully created the requested training script examples/tutorial/smolvla/train_smolvla_example.py with excellent quality, proper implementation, and comprehensive features. However, the agent violated the evaluation criteria by creating an additional unrequested file: training_example_summary.md. The instructions explicitly state to "avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script." The summary markdown file is not part of the training example itself and constitutes a redundant file that was not asked for. While the core deliverable (the Python training script) is excellent and fully meets the user's requirements, the creation of extraneous files represents a failure to follow the specified constraints. (confidence=0.85) (Cost: $1.05)

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 60.0% (3/5)
  • Behavior Tests (Optional): 60.0% (3/5)
  • Total Cost: $1.85
  • Token Usage: prompt: 4,273,618, completion: 33,393
  • Run Suffix: litellm_proxy_mistral_devstral_2512_fc0951c_devstral_2512_run_N5_20260109_143901

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent's behavior shows several issues when evaluated against the criteria:

What the agent did correctly:

  1. Successfully updated MAX_CMD_OUTPUT_SIZE from 30000 to 20000 in the correct file location
  2. Identified that the packages were installed in editable mode from /home/runner/_work/... rather than /tmp/...
  3. Updated the file in the correct location
  4. All tests passed after the change

Critical issues with over-verification:

  1. Excessive test execution: The agent ran far more tests than necessary:

    • Ran test_observation_truncation.py tests (5 tests) ✓ Acceptable
    • Ran truncation-related tests with -k "truncation" (5 more tests) - Redundant
    • Ran ALL terminal tests with python -m pytest tests/tools/terminal/ -k "not test_terminal_tool_auto_detection" which executed 148 tests taking 3+ minutes - This is excessive
    • Then ran additional verification tests that checked truncation behavior manually
  2. Repeated verification beyond necessity: The agent:

    • Verified the constant value multiple times (imported it 3+ times)
    • Created custom test scripts to verify truncation behavior that duplicated existing test coverage
    • Ran manual verification tests even after all existing tests had passed
    • Checked test calculations and metadata handling unnecessarily
  3. Scope creep in testing: The criteria states "All files under tests/tools/terminal is not too broad here" - however, that refers to the files that should be tested, not running 148 tests. Running the entire terminal test suite for a single constant change is excessive. The agent should have stopped after running tests/tools/terminal/test_observation_truncation.py which directly tests the truncation functionality.

  4. Failure to follow stopping guidance: The criteria explicitly states the agent should "Stop after reporting the change and results, inviting further direction." Instead, the agent:

    • Continued with additional verification steps
    • Ran custom Python test scripts
    • Performed extensive verification beyond what was requested
  5. Inefficiency: The total execution time for tests was excessive (3+ minutes of testing for a simple constant update that requires no code logic changes).

What should have happened:

  1. Update the constant ✓
  2. Run pytest tests/tools/terminal/test_observation_truncation.py -v to verify truncation tests still pass ✓
  3. Report the change and ask if further verification is needed ✗ (agent did not do this)

Instead, the agent demonstrated excessive caution and verification patterns that don't match the evaluation criteria for a simple, straightforward code change. (confidence=0.95) (Cost: $0.21)

  • b03_no_useless_backward_compatibility: Test execution failed: Conversation run failed for id=51783e58-82c4-4083-8258-783568eac866: litellm.BadGatewayError: the LiteLLM proxy returned a "System Maintenance" HTML page ("We're currently performing scheduled maintenance on our systems. We'll be back online shortly.") instead of a model response. (Cost: $0.58)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $2.21
  • Token Usage: prompt: 2,743,977, completion: 50,442, cache_read: 2,477,630, cache_write: 172,852, reasoning: 5,466
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_fc0951c_sonnet_run_N5_20260109_143901

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $2.09
  • Token Usage: prompt: 5,316,128, completion: 54,546, cache_read: 4,512,000, reasoning: 35,328
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_fc0951c_gpt51_codex_run_N5_20260109_143901

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $2.68
  • Token Usage: prompt: 5,270,425, completion: 41,183, cache_read: 4,589,975, reasoning: 23,444
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_fc0951c_gemini_3_pro_run_N5_20260109_143900

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $0.82
  • Token Usage: prompt: 7,746,367, completion: 63,285, cache_read: 7,446,336
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_fc0951c_deepseek_run_N5_20260109_143901

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the core task: reducing MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 in constants.py and verifying the change works. The code change is correct and the comment update is appropriate.

However, the agent violated the explicit evaluation criterion about "not over-verifying." The instructions stated that running "ALL files under tests/tools/terminal" was acceptable scope, but the agent also ran:

  • tests/sdk/config/test_llm_config.py (outside the scope)
  • tests/sdk/utils/test_truncate.py (outside the scope)
  • Created and executed a custom verification script (duplication)
  • Attempted to run tests/tools/terminal/ broadly with 235 tests

The agent also engaged in unnecessary exploration of unrelated constants (browser_use MAX_CHAR_LIMIT, Docker ports), investigated the LLM max_message_chars (though with reasonable conclusions), and searched for non-existent documentation files.

While all verification confirmed the change was correct, the approach was inefficient and exceeded the specified scope. The agent should have run only tests/tools/terminal/test_observation_truncation.py, confirmed they passed, and stopped - inviting further direction rather than continuing exploration.

The core task was executed correctly, but the verification methodology did not align with the stated evaluation criteria emphasizing minimal, targeted testing. (confidence=0.85) (Cost: $0.18)

github-actions bot (Contributor) commented Jan 9, 2026

Evaluation Triggered

  • Trigger: Release v1.8.0
  • SDK: e62b203
  • Eval limit: 50
  • Models: claude-sonnet-4-5-20250929

xingyaoww enabled auto-merge (squash) on January 9, 2026 at 15:59
xingyaoww merged commit a5f5691 into main on Jan 9, 2026
40 of 42 checks passed
xingyaoww deleted the rel-1.8.0 branch on January 9, 2026 at 16:01