Conversation

@enyst enyst commented Dec 24, 2025

Summary

  • Replace GPT-5.1 Codex Max with GPT-5.2 Codex in .github/workflows/integration-runner.yml so integration tests exercise the latest Codex variant.
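
For context, the change amounts to swapping one model id in the workflow's matrix. A minimal sketch of what such an entry might look like — the actual key names and layout in `integration-runner.yml` may differ; this is illustrative only:

```yaml
# .github/workflows/integration-runner.yml (illustrative sketch; real keys may differ)
jobs:
  integration:
    strategy:
      matrix:
        model:
          # - litellm_proxy/gpt-5.1-codex-max   # removed
          - litellm_proxy/gpt-5.2-codex          # added
```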

Verification

  • Ran examples/01_standalone_sdk/01_hello_world.py locally against the eval proxy with model litellm_proxy/gpt-5.2-codex. The run reached the proxy and attempted to use the model, confirming that the model id resolves at the proxy; actual access to gpt-5.2-codex may depend on environment credentials. In CI, the integration workflow uses the managed LLM proxy and key.

Notes

  • No other tests modified. This PR only updates the workflow matrix entry.

Co-authored-by: openhands <[email protected]>



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base Image | Docs / Tags
--- | --- | --- | ---
java | amd64, arm64 | eclipse-temurin:17-jdk | Link
python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link
golang | amd64, arm64 | golang:1.21-bookworm | Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:7743988-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-7743988-python \
  ghcr.io/openhands/agent-server:7743988-python

All tags pushed for this build

ghcr.io/openhands/agent-server:7743988-golang-amd64
ghcr.io/openhands/agent-server:7743988-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:7743988-golang-arm64
ghcr.io/openhands/agent-server:7743988-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:7743988-java-amd64
ghcr.io/openhands/agent-server:7743988-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:7743988-java-arm64
ghcr.io/openhands/agent-server:7743988-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:7743988-python-amd64
ghcr.io/openhands/agent-server:7743988-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:7743988-python-arm64
ghcr.io/openhands/agent-server:7743988-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:7743988-golang
ghcr.io/openhands/agent-server:7743988-java
ghcr.io/openhands/agent-server:7743988-python

About Multi-Architecture Support

  • Each variant tag (e.g., 7743988-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 7743988-python-amd64) are also available if needed

…place GPT-5.1 Codex Max with GPT-5.2 Codex in integration-runner.yml matrix so CI exercises the latest codex family.

Co-authored-by: openhands <[email protected]>
@all-hands-bot all-hands-bot left a comment

Thanks!

@xingyaoww xingyaoww left a comment

Thanks!! (I was messing around with the bot account and forgot to log out 😓 )

@xingyaoww xingyaoww added the integration-test label (Runs the integration tests and comments the results) Dec 28, 2025
@github-actions

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions

🧪 Integration Tests Results

Overall Success Rate: 94.1%
Total Cost: $1.76
Models Tested: 6
Timestamp: 2025-12-28 19:10:17 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens
--- | --- | --- | --- | --- | --- | --- | --- | ---
litellm_proxy_mistral_devstral_2512 | 87.5% | 87.5% | N/A | 7/8 | 1 | 9 | $0.16 | 378,358
litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.36 | 564,891
litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.07 | 702,676
litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.46 | 268,469
litellm_proxy_gpt_5.1_codex_max | 77.8% | 77.8% | N/A | 7/9 | 0 | 9 | $0.19 | 208,981
litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.51 | 341,318
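
The headline 94.1% appears to be computed over all executed tests rather than by averaging the per-model percentages: summing the passed/run counts above gives 48/51 ≈ 94.1%, whereas the mean of the six rates would be ≈94.2%. A small sketch of that aggregation (counts copied from the table; the dict and variable names are ours):

```python
# Aggregate integration results: overall rate = total passed / total run,
# not the mean of per-model percentages.
results = {
    "mistral_devstral_2512": (7, 8),
    "moonshot_kimi_k2_thinking": (8, 8),
    "deepseek_deepseek_chat": (8, 8),
    "claude_sonnet_4_5_20250929": (9, 9),
    "gpt_5.1_codex_max": (7, 9),
    "vertex_ai_gemini_3_pro_preview": (9, 9),
}

passed = sum(p for p, _ in results.values())   # 48
run = sum(r for _, r in results.values())      # 51
overall = 100 * passed / run
print(f"{passed}/{run} = {overall:.1f}%")      # 48/51 = 94.1%
```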

📋 Detailed Results

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.16
  • Token Usage: prompt: 374,809, completion: 3,549
  • Run Suffix: litellm_proxy_mistral_devstral_2512_d6e2cc1_devstral_2512_run_N9_20251228_190101
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.0084)

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.36
  • Token Usage: prompt: 551,324, completion: 13,567, cache_read: 485,120
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_d6e2cc1_kimi_k2_run_N9_20251228_190100
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.07
  • Token Usage: prompt: 689,669, completion: 13,007, cache_read: 633,408
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_d6e2cc1_deepseek_run_N9_20251228_190101
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.46
  • Token Usage: prompt: 259,016, completion: 9,453, cache_read: 188,035, cache_write: 70,503, reasoning: 2,472
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_d6e2cc1_sonnet_run_N9_20251228_190100

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 77.8% (7/9)
  • Integration Tests (Required): 77.8% (7/9)
  • Total Cost: $0.19
  • Token Usage: prompt: 203,023, completion: 5,958, cache_read: 108,672, reasoning: 3,904
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_d6e2cc1_gpt51_codex_run_N9_20251228_190101

Failed Tests:

  • t09_token_condenser ⚠️ REQUIRED: Condensation not triggered. Token counting may not work. (Cost: $0.0059)
  • t06_github_pr_browsing ⚠️ REQUIRED: Agent's final answer does not contain the expected information about the PR content. Final answer preview: I don’t have network access here, so I can’t open that GitHub PR. If you paste the PR description or comments (especially @asadm’s), I can summarize and explain what’s happening.... (Cost: $0.01)

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.51
  • Token Usage: prompt: 323,623, completion: 17,695, cache_read: 196,601, reasoning: 12,394
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_d6e2cc1_gemini_3_pro_run_N9_20251228_190101

@xingyaoww

We probably need to merge it to test it

@xingyaoww xingyaoww merged commit 0d18f79 into main Dec 28, 2025
53 checks passed
@xingyaoww xingyaoww deleted the chore/update-integration-codex-5_2 branch December 28, 2025 19:25
xingyaoww added a commit that referenced this pull request Dec 28, 2025

Labels

ci, integration, integration-test (Runs the integration tests and comments the results)
