Conversation

@xingyaoww (Collaborator) commented Dec 5, 2025

Summary

This PR implements an optional action summary feature that allows the LLM to provide brief (~10 word) summaries with each tool call, describing what the agent is doing. The implementation follows the same pattern as the existing security risk analyzer feature.

Motivation

Agent actions can sometimes be opaque to users and developers. This feature provides transparency by adding human-readable summaries to each action, making it easier to:

  • Track what the agent is doing at each step
  • Debug agent behavior
  • Understand agent decisions
  • Maintain audit trails
  • Improve user experience

Changes

Core Implementation

  1. ActionEvent Enhancement (action.py)

    • Added summary: str | None field to store summaries
    • Updated visualize property to display summary when present
  2. Conversation Configuration

    • Added enable_action_summaries: bool field to ConversationState (default: False)
    • Added parameter to LocalConversation.__init__() for initialization-time configuration
    • Configuration is set at conversation creation time only
  3. Tool Schema Enhancement (tool.py)

    • Created _create_action_type_with_summary() helper function
    • Added add_summary_prediction: bool parameter to schema methods
    • Dynamically injects a summary field into tool schemas when enabled (a sketch follows this list)
  4. LLM Integration (llm.py, router/base.py, agent/utils.py)

    • Added add_summary_prediction: bool parameter throughout the LLM pipeline
    • Propagates the flag from conversation state to tool schemas
  5. Agent Processing (agent/agent.py)

    • Created _extract_summary() method with comprehensive validation
    • Validates summary is a string, not empty, and present when required
    • Integrated summary extraction into the action event workflow
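
For reference, a minimal sketch of how the schema injection in item 3 could work, assuming Pydantic's create_model; the field description is paraphrased from this PR, and the actual helper in tool.py may differ in details:

from pydantic import Field, create_model

_action_types_with_summary: dict[type, type] = {}

def _create_action_type_with_summary(action_type: type) -> type:
    # Build (and cache) a subclass of the action type that carries an
    # optional "summary" field, so the LLM is prompted for it per call.
    if action_type not in _action_types_with_summary:
        _action_types_with_summary[action_type] = create_model(
            f"{action_type.__name__}WithSummary",
            __base__=action_type,
            summary=(
                str | None,
                Field(
                    default=None,
                    description="Brief (~10 word) summary of what this action does.",
                ),
            ),
        )
    return _action_types_with_summary[action_type]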

Testing

Added 24 comprehensive tests across 3 test files:

  • 13 tests for summary extraction and validation (test_extract_summary.py)
  • 6 tests for tool schema enhancement (test_to_responses_tool_summary.py)
  • 5 tests for ActionEvent integration (test_action_event_summary.py)

All tests pass and all pre-commit hooks pass (ruff, pyright, pycodestyle, etc.).

Usage

Enable action summaries at conversation creation:

conversation = LocalConversation(
    agent=agent,
    workspace="/tmp",
    enable_action_summaries=True  # Enable at initialization
)

Example Output

When enabled, ActionEvents include summaries in their visualization:

🛠️  ACTION (agent)
Tool: str_replace_editor
Summary: editing configuration file for deployment settings
Security: MEDIUM
Arguments: {...}

Design Decisions

  1. Pattern Consistency: Follows the exact pattern used by the security risk analyzer for consistency and maintainability

  2. Backward Compatibility: Feature is disabled by default; no breaking changes to existing APIs

  3. Validation Strategy: Strict validation when enabled (must be present, must be string, must not be empty)

  4. Configuration Simplicity: Set at initialization time only - no dynamic toggling needed

  5. Composability: Works alongside security risk analysis - both fields can be present simultaneously

Comparison with Security Risk Feature

Aspect       Security Risk                  Action Summary
Field Type   SecurityRisk enum              str | None
Flag Name    add_security_risk_prediction   add_summary_prediction
State Field  Via SecurityAnalyzer           enable_action_summaries
Validation   Enum validation                String type + content validation
Can Combine  ✅ Yes - both fields can be present

Files Changed

  • openhands-sdk/openhands/sdk/event/llm_convertible/action.py
  • openhands-sdk/openhands/sdk/conversation/state.py
  • openhands-sdk/openhands/sdk/conversation/base.py
  • openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py
  • openhands-sdk/openhands/sdk/tool/tool.py
  • openhands-sdk/openhands/sdk/llm/llm.py
  • openhands-sdk/openhands/sdk/llm/router/base.py
  • openhands-sdk/openhands/sdk/agent/agent.py
  • openhands-sdk/openhands/sdk/agent/utils.py
  • tests/sdk/agent/test_extract_summary.py (new)
  • tests/sdk/tool/test_to_responses_tool_summary.py (new)
  • tests/sdk/event/test_action_event_summary.py (new)
  • IMPLEMENTATION_SUMMARY.md (new - internal documentation)

Testing Checklist

  • All new tests pass (24/24)
  • All existing tests still pass
  • Ruff format passes
  • Ruff lint passes
  • Pyright type checking passes
  • PEP8 style checks pass
  • Import dependency rules pass
  • Tool registration checks pass

Future Enhancements

Potential improvements for future iterations:

  • Enforce 10-word limit in validation
  • Add summary quality metrics
  • Support multi-language summaries
  • Integration with telemetry/monitoring systems

Co-authored-by: openhands [email protected]


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant  Architectures  Base Image
java     amd64, arm64   eclipse-temurin:17-jdk
python   amd64, arm64   nikolaik/python-nodejs:python3.12-nodejs22
golang   amd64, arm64   golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:90574b1-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-90574b1-python \
  ghcr.io/openhands/agent-server:90574b1-python

All tags pushed for this build

ghcr.io/openhands/agent-server:90574b1-golang-amd64
ghcr.io/openhands/agent-server:90574b1-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:90574b1-golang-arm64
ghcr.io/openhands/agent-server:90574b1-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:90574b1-java-amd64
ghcr.io/openhands/agent-server:90574b1-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:90574b1-java-arm64
ghcr.io/openhands/agent-server:90574b1-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:90574b1-python-amd64
ghcr.io/openhands/agent-server:90574b1-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:90574b1-python-arm64
ghcr.io/openhands/agent-server:90574b1-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:90574b1-golang
ghcr.io/openhands/agent-server:90574b1-java
ghcr.io/openhands/agent-server:90574b1-python

About Multi-Architecture Support

  • Each variant tag (e.g., 90574b1-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 90574b1-python-amd64) are also available if needed

Implement optional 10-word summaries for agent actions, following the same pattern as the security risk analyzer feature.

Changes:
- Add summary field to ActionEvent with visualization support
- Add enable_action_summaries flag to ConversationState
- Create _extract_summary() method in Agent with validation
- Update tool schema methods to support add_summary_prediction flag
- Integrate summary extraction into agent step flow
- Add comprehensive tests for all components

The feature can be enabled by calling conversation.set_enable_action_summaries(True), which prompts the LLM to provide a brief summary with each tool call.

Co-authored-by: openhands <[email protected]>

Allow users to enable action summaries when creating a conversation:
- Add enable_action_summaries parameter to LocalConversation.__init__()
- Add enable_action_summaries parameter to ConversationState.create()
- Update documentation and example to show both initialization and dynamic enabling
- Keep set_enable_action_summaries() method for dynamic toggling

Usage:
  conversation = LocalConversation(
      agent=agent,
      workspace="/tmp",
      enable_action_summaries=True
  )

Co-authored-by: openhands <[email protected]>

Simplify configuration by removing the dynamic setter method.
Action summaries are now configured only at initialization time
via the enable_action_summaries parameter.

Co-authored-by: openhands <[email protected]>
@xingyaoww (Collaborator, Author) commented:

@OpenHands Review the diff against main and help me understand the best and simplest way to implement this. Do you think we can just always enable this "summary" feature, since it is just a few words and shouldn't hurt the cost too much?

If so, please implement it and remove all the unnecessary plumbing

@openhands-ai bot commented Dec 5, 2025

I'm on it! xingyaoww can track my progress at all-hands.dev

Remove enable_action_summaries flag and associated plumbing to simplify
the implementation. The summary field is now always requested from the LLM
and injected into tool schemas, providing better agent transparency at
minimal cost (~10-15 tokens per action).

Key changes:
- Remove enable_action_summaries from ConversationState
- Remove add_summary_prediction parameter from tool.py, llm.py, router, agent
- Remove enable_action_summaries from LocalConversation.__init__()
- Summary is always injected into tool schemas
- ActionEvent.summary can still be None if LLM doesn't provide it
- All tests updated to include summary field in mock tool calls

Co-authored-by: openhands <[email protected]>
@openhands-ai bot commented Dec 5, 2025

Summary

I've successfully simplified the action summary feature by always enabling the summary field and removing all unnecessary configuration plumbing.

Analysis & Decision

Question answered: Yes, we should always enable the summary feature!

The summary field adds only ~10-15 tokens per action (a few words), which is a minimal cost compared to the explainability it provides. This makes the feature much simpler: no complex configuration is needed.

Changes Made

Simplified Implementation:

  1. Removed configuration flag - enable_action_summaries removed from ConversationState
  2. Removed parameter plumbing - add_summary_prediction parameter removed from:
    • tool.py
    • llm.py
    • router/base.py
    • agent/agent.py
    • agent/utils.py
    • LocalConversation.__init__()
  3. Summary always injected - Tool schemas now always include the summary field
  4. Maintained flexibility - ActionEvent.summary can still be None if LLM doesn't provide it
  5. Updated all tests - Modified 13 test files to include summary field in mock tool calls
  6. Removed internal docs - Deleted IMPLEMENTATION_SUMMARY.md

Quality Assurance:

  • ✅ All 1352 tests passing
  • ✅ All pre-commit checks passing (ruff, pyright, pycodestyle)
  • ✅ Fixed all line length violations

Result

The implementation is now much cleaner with:

  • Less complexity - No configuration flags or conditional logic
  • Better transparency - Summary always available for every action
  • Minimal cost - ~10-15 tokens per action
  • Same functionality - Core feature preserved, just simplified

The changes have been committed and pushed to the feature/action-summary branch, updating PR #1339.

The summary field is always requested from LLM for transparency,
but if the LLM doesn't provide it or provides invalid data, we
gracefully handle it by returning None instead of raising errors.

This approach:
- Minimizes changes to existing tests (no need to add summary to all mocks)
- Makes the feature robust (doesn't break if LLM forgets summary)
- Still provides transparency when LLM does provide summaries
- Follows the principle of 'request but don't require'

Changes:
- _extract_summary now returns str | None and never raises errors
- summary field in tool schema has default=None (optional)
- Updated only 3 test files to account for summary field
- Reverted changes to 11 test files that no longer need summary

Co-authored-by: openhands <[email protected]>
@github-actions bot (Contributor) commented Dec 5, 2025

Coverage

Coverage Report

File                                                          Stmts   Miss  Cover
openhands-sdk/openhands/sdk/agent/agent.py                      206     72    65%
  Missing: 89, 93, 144, 148–149, 157–159, 169–170, 186–188, 195–197, 199, 203, 206–207, 209–210, 228, 255, 260, 271, 310, 315, 326, 329, 352, 362–363, 383, 408–410, 412, 424–425, 430–431, 453–454, 459, 471–472, 477–478, 517–518, 524–525, 529, 537–538, 546, 549–551, 556–557, 578, 585, 589–590, 628–630, 633–634, 638
openhands-sdk/openhands/sdk/agent/utils.py                       57     18    68%
  Missing: 63, 77, 83–84, 101–102, 105–107, 110, 168, 170–172, 174–175, 182, 217
openhands-sdk/openhands/sdk/event/llm_convertible/action.py      64     15    76%
  Missing: 88, 98–100, 112–118, 125–126, 155–156
openhands-sdk/openhands/sdk/llm/llm.py                          402    153    61%
  Missing: 344, 349, 353, 357–358, 361, 365–366, 377–378, 380–381, 385, 402, 420–423, 512–514, 535, 539, 554, 560–561, 585–586, 596, 621–626, 661–662, 665, 669, 681, 686–689, 698, 706–713, 717–720, 722, 735, 739–740, 742–743, 748–749, 751, 758, 761–766, 823–828, 885–886, 889–892, 934, 951, 1005, 1008, 1011–1019, 1023–1025, 1028, 1031–1033, 1040–1041, 1050, 1057–1059, 1063, 1065–1070, 1072–1089, 1092–1096, 1098–1099, 1105–1114
openhands-sdk/openhands/sdk/llm/router/base.py                   42     17    59%
  Missing: 44–45, 48, 76–77, 79, 82, 109–111, 115, 121–123, 126–127, 129
openhands-sdk/openhands/sdk/tool/tool.py                        146     27    81%
  Missing: 223, 255, 265, 280, 306, 315, 318–324, 341, 346–349, 351, 358–360, 432, 461, 464, 468, 476
TOTAL                                                         14896   7151    51%

Addressed feedback to provide concrete examples of what good action summaries
should look like. This helps guide LLMs in generating appropriate, concise
summaries that are action-oriented and specific.

Examples include:
- editing configuration file for deployment settings
- searching codebase for authentication function definitions
- installing required dependencies from package manifest
- running tests to verify bug fix
- viewing directory structure to locate source files

Co-authored-by: openhands <[email protected]>

When the LLM doesn't provide a summary (or provides invalid data),
we now generate a default summary in the format:
{tool_name}: {arguments}

This ensures transparency even when the LLM doesn't explicitly
provide a summary, making it easier for users to understand
what action is being taken.

Changes:
- Modified _extract_summary to accept tool_name parameter
- Generate default summary using tool name and JSON-serialized arguments
- Updated all tests to reflect new behavior
- All tests passing (107 agent tests, 10 extract_summary tests)

Co-authored-by: openhands <[email protected]>
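
A minimal sketch of the extraction-with-fallback behavior described above (the real _extract_summary in agent.py may differ; the JSON serialization of arguments here is an assumption):

import json

def _extract_summary(arguments: dict, tool_name: str) -> str:
    # Pop the summary so it never reaches the actual tool call.
    raw = arguments.pop("summary", None)
    if isinstance(raw, str) and raw.strip():
        return raw.strip()
    # Fallback: synthesize "{tool_name}: {arguments}" when the LLM
    # omits the summary or returns invalid data.
    return f"{tool_name}: {json.dumps(arguments)}"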

The assertion checking if 'summary' was removed from arguments is
redundant because _extract_summary always uses pop() to remove it.

Co-authored-by: openhands <[email protected]>
@blacksmith-sh bot (Contributor) commented Dec 21, 2025

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

2 similar comments

@xingyaoww (Collaborator, Author) commented:

@OpenHands please merge from main, resolve all the conflicts, and make this PR introduce as little change as possible while still implementing the same feature

@openhands-ai bot commented Jan 5, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Resolved conflicts in:
- openhands-sdk/openhands/sdk/agent/utils.py
- openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py
- openhands-sdk/openhands/sdk/tool/tool.py

Co-authored-by: openhands <[email protected]>
@openhands-ai bot commented Jan 5, 2026

Summary

I successfully merged origin/main into the feature/action-summary branch and resolved all merge conflicts. The changes have been pushed to the remote branch.

Merge Conflicts Resolved

  1. openhands-sdk/openhands/sdk/agent/utils.py: Combined docstring notes from both branches - kept main's security_risk documentation and added the summary field note.

  2. openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py: Merged docstring parameters - kept main's stuck_detection_thresholds documentation and the feature branch's secrets parameter.

  3. openhands-sdk/openhands/sdk/tool/tool.py: Fixed function call from _create_action_type_with_risk (private) to create_action_type_with_risk (public function used by other modules).

PR Changes Summary

The PR introduces a minimal action summary feature with 12 files changed (+487/-17 lines):

  • Core feature: Adds a summary field to ActionEvent that provides a brief (~10 word) description of what each action does
  • Always enabled: Summary field is always added to tool schemas for transparency (no configuration needed)
  • Graceful fallback: If LLM doesn't provide a summary, a default is generated in format {tool_name}: {arguments}
  • Visualization: Summary is displayed in action event visualization when present

Verification

  • ✅ All merge conflicts resolved
  • ✅ Pre-commit hooks pass (Ruff format, lint, pycodestyle, pyright)
  • ✅ All 19 summary-related tests pass
  • ✅ Tool definition tests pass (26 tests)
  • ✅ Tool execution error handling tests pass (2 tests)
  • ✅ Changes pushed to origin/feature/action-summary

The PR is now up-to-date with main and ready for review.

openhands-agent and others added 3 commits January 5, 2026 16:14
- Add verify_all_actions_have_summary() utility function in behavior_utils.py (sketched below)
- Update b04_each_tool_call_has_a_concise_explanation.py to check summary field presence
- Add unit tests for the new utility function

Co-authored-by: openhands <[email protected]>
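
A plausible shape for that utility, sketched with an import path inferred from the files listed earlier (the actual implementation in behavior_utils.py may differ):

from openhands.sdk.event.llm_convertible.action import ActionEvent

def verify_all_actions_have_summary(events: list) -> bool:
    # True only when every ActionEvent in the run carries a non-empty summary.
    actions = [e for e in events if isinstance(e, ActionEvent)]
    return all(isinstance(a.summary, str) and a.summary.strip() for a in actions)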
@xingyaoww (Collaborator, Author) commented:

@OpenHands review the diff against main and remove unnecessary tests

@openhands-ai bot commented Jan 7, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Simplified test files by removing redundant tests:
- test_extract_summary.py: Removed duplicate tests covered by parametrized test
- test_action_event_summary.py: Removed tests for Pydantic built-in functionality
- test_to_responses_tool_summary.py: Removed redundant security+summary combo tests

Reduced from 24 tests to 9 tests while maintaining full coverage of the
actual implementation logic.

Co-authored-by: openhands <[email protected]>
@openhands-ai bot commented Jan 7, 2026

Summary

I reviewed the diff against main and removed unnecessary tests from the action summary feature. The changes have been pushed to the feature/action-summary branch.

Changes Made

Simplified the test files by removing redundant tests:

  1. tests/sdk/agent/test_extract_summary.py:

    • Removed test_extract_summary_arguments_mutation (covered by parametrized test)
    • Removed test_extract_summary_default_generation (covered by parametrized test)
    • Reduced from 3 test functions to 1 parametrized test with 5 cases
  2. tests/sdk/event/test_action_event_summary.py:

    • Removed test_action_event_with_summary (tests Pydantic field assignment)
    • Removed test_action_event_without_summary (tests Pydantic default)
    • Removed test_action_event_summary_serialization (tests Pydantic's built-in model_dump/model_validate)
    • Kept only the 2 visualization tests that test actual implementation logic
  3. tests/sdk/tool/test_to_responses_tool_summary.py:

    • Removed test_to_responses_tool_summary_and_security (redundant combo test)
    • Removed test_to_openai_tool_summary_and_security (redundant combo test)
    • Kept only the 2 essential tests for summary field presence

Result

  • Before: 24 tests across 3 files
  • After: 9 tests across 3 files (5 parametrized + 2 visualization + 2 schema tests)

All tests pass and pre-commit hooks pass. The reduced test suite still provides full coverage of the actual implementation logic while following the repository guidelines to avoid testing Pydantic's built-in functionality.

@xingyaoww xingyaoww requested a review from enyst January 7, 2026 17:15
@xingyaoww xingyaoww marked this pull request as ready for review January 7, 2026 17:15
@xingyaoww (Collaborator, Author) commented:

@OpenHands help me fix the test:

TOTAL 14770 7079 52%
=========================== short test summary info ============================
FAILED tests/cross/test_remote_conversation_live_server.py::test_security_risk_field_with_live_server - AssertionError: Expected to find ActionEvent with finish tool even without security_risk
assert False
================== 1 failed, 53 passed, 14 warnings in 36.80s ==================

@xingyaoww xingyaoww requested review from malhotra5 and neubig January 7, 2026 17:15
@openhands-ai bot commented Jan 7, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@xingyaoww (Collaborator, Author) commented:

@OpenHands help me open a new PR in OpenHands/OpenHands-CLI that will use this branch of the SDK, and use the summary from the action to replace the current action summary generated by the string-truncation heuristic

@openhands-ai bot commented Jan 7, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai bot commented Jan 7, 2026

Summary

I've successfully created a new PR in OpenHands/OpenHands-CLI that uses the action summary feature from the SDK's feature/action-summary branch.

PR Created: OpenHands/OpenHands-CLI#303

Changes Made

  1. Updated SDK dependency (pyproject.toml):

    • Configured openhands-sdk and openhands-tools to use the feature/action-summary branch from OpenHands/software-agent-sdk (commit 19927a1)
  2. Modified _extract_meaningful_title() method (openhands_cli/tui/widgets/richlog_visualizer.py):

    • Added a check at the beginning of the method to prefer the LLM-generated summary field when available on events
    • Falls back to the existing string truncation heuristic when summary is not present
    • The summary is truncated to 70 characters if needed (see the sketch below)
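
A rough sketch of that preference order (the fallback helper name here is hypothetical; the real method contains the existing truncation logic inline):

def _extract_meaningful_title(self, event) -> str:
    # Prefer the LLM-generated summary when the SDK provides one.
    summary = getattr(event, "summary", None)
    if isinstance(summary, str) and summary.strip():
        title = summary.strip()
        return title if len(title) <= 70 else title[:69] + "…"
    # Otherwise fall back to the string-truncation heuristic
    # (represented here by a hypothetical helper).
    return self._fallback_truncated_title(event)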

Checklist

  • ✅ Uses the feature/action-summary branch of the SDK
  • ✅ Uses ActionEvent.summary instead of string truncation heuristic for action titles
  • ✅ Falls back gracefully when summary is not available
  • ✅ All 829 tests pass
  • ✅ Pre-commit hooks pass (ruff format, ruff lint, pycodestyle, pyright)
  • ✅ Changes are minimal and focused

Note

This PR is created as a draft since it depends on OpenHands/software-agent-sdk PR #1339 being merged first. Once that PR is merged, the SDK dependency should be updated to use the released version.

@openhands-ai bot commented Jan 7, 2026

Summary

The test test_security_risk_field_with_live_server is already passing on the current branch. I ran the test multiple times and it consistently passes:

tests/cross/test_remote_conversation_live_server.py::test_security_risk_field_with_live_server PASSED

The test failure you reported appears to have been a transient issue that has since been resolved. The test validates that:

  1. Actions execute successfully with security_risk provided
  2. Actions execute successfully without security_risk (defaults to UNKNOWN)
  3. ActionEvent always has a security_risk attribute

I also ran the full SDK and cross test suites (1706 tests) and all tests pass.

No code changes were needed - the branch feature/action-summary is already in a working state with all tests passing.

@xingyaoww (Collaborator, Author) commented:

@OpenHands please fix the failing actions on PR #1339 at branch feature/action-summary

@openhands-ai bot commented Jan 7, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

ActionT = TypeVar("ActionT", bound=Action)
ObservationT = TypeVar("ObservationT", bound=Observation)
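# Cache dynamically created Action subclasses so each augmented type
# (risk or summary variant) is built only once and then reused.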
_action_types_with_risk: dict[type, type] = {}
_action_types_with_summary: dict[type, type] = {}
Inline review comment (Collaborator):
nice with the caching!

@malhotra5 (Collaborator) left a comment:

LGTM!

The MCP tests were timing out in CI because uvx may need to download
packages on first run. Increased timeout from default 30s to 120s.

Co-authored-by: openhands <[email protected]>
@xingyaoww xingyaoww added the integration-test Runs the integration tests and comments the results label Jan 7, 2026
@github-actions bot (Contributor) commented Jan 7, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@openhands-ai bot commented Jan 7, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Agent Server

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1339 at branch `feature/action-summary`

Feel free to include any additional details that might help me get this PR into a better state.

@github-actions bot (Contributor) commented Jan 7, 2026

🧪 Integration Tests Results

Overall Success Rate: 98.0%
Total Cost: $2.10
Models Tested: 6
Timestamp: 2026-01-07 18:07:20 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model                                         Overall  Integration (Required)  Behavior (Optional)  Tests Passed  Skipped  Total  Cost   Tokens
litellm_proxy_deepseek_deepseek_chat          100.0%   100.0%                  N/A                  8/8           1        9      $0.07  691,007
litellm_proxy_mistral_devstral_2512           87.5%    87.5%                   N/A                  7/8           1        9      $0.19  460,315
litellm_proxy_claude_sonnet_4_5_20250929      100.0%   100.0%                  N/A                  9/9           0        9      $0.65  495,177
litellm_proxy_vertex_ai_gemini_3_pro_preview  100.0%   100.0%                  N/A                  9/9           0        9      $0.64  418,357
litellm_proxy_gpt_5.1_codex_max               100.0%   100.0%                  N/A                  8/8           1        9      $0.26  278,531
litellm_proxy_moonshot_kimi_k2_thinking       100.0%   100.0%                  N/A                  8/8           1        9      $0.29  438,950

📋 Detailed Results

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.07
  • Token Usage: prompt: 676,231, completion: 14,776, cache_read: 623,488
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_225c0eb_deepseek_run_N9_20260107_175757
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.19
  • Token Usage: prompt: 455,337, completion: 4,978
  • Run Suffix: litellm_proxy_mistral_devstral_2512_225c0eb_devstral_2512_run_N9_20260107_180133
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.009)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.65
  • Token Usage: prompt: 482,104, completion: 13,073, cache_read: 391,130, cache_write: 90,154, reasoning: 3,439
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_225c0eb_sonnet_run_N9_20260107_180138

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.64
  • Token Usage: prompt: 396,338, completion: 22,019, cache_read: 233,224, reasoning: 17,103
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_225c0eb_gemini_3_pro_run_N9_20260107_175813

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.26
  • Token Usage: prompt: 271,457, completion: 7,074, cache_read: 133,888, reasoning: 4,352
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_225c0eb_gpt51_codex_run_N9_20260107_175903
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.29
  • Token Usage: prompt: 427,021, completion: 11,929, cache_read: 361,216
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_225c0eb_kimi_k2_run_N9_20260107_175750
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@xingyaoww xingyaoww merged commit 1220936 into main Jan 7, 2026
46 of 47 checks passed
@xingyaoww xingyaoww deleted the feature/action-summary branch January 7, 2026 18:18