feat(examples): update critic example to demonstrate iterative refinement #1879

Draft
xingyaoww wants to merge 6 commits into main from feature/iterative-refinement-critic-example

Conversation

xingyaoww (Collaborator) commented Feb 2, 2026

Summary

Updates 34_critic_example.py to demonstrate iterative refinement with the critic model, showing how to shepherd an agent through complex, multi-step tasks.

What Changed

Updated Critic Example (examples/01_standalone_sdk/34_critic_example.py)

The example now demonstrates:

  1. Complex Multi-Step Task: Instead of a simple file-creation task, the agent is asked to create a complete Python project structure with:

    • A calculator_app package with multiple modules
    • Unit tests with at least 4 test cases
    • A README with documentation
    • Running and verifying tests pass
  2. Critic Result Collection: Uses callbacks to capture critic evaluations in real time:

    class CriticResultCollector:
        def __init__(self) -> None:
            self.results = []  # critic evaluations, in arrival order

        def callback(self, event: Event) -> None:
            # Critic results are attached to action and message events.
            if isinstance(event, (ActionEvent, MessageEvent)):
                if event.critic_result is not None:
                    # Capture the critic feedback for later processing.
                    self.results.append(event.critic_result)
  3. Iterative Refinement Loop: When the critic score is below the threshold (60% by default), the example sends follow-up prompts to guide the agent (see the sketch after this list):

    • Extracts potential issues from critic metadata
    • Crafts specific guidance based on what's incomplete
    • Continues until success threshold is met or max iterations reached
  4. Configurable Parameters:

    • CRITIC_SUCCESS_THRESHOLD (default: 0.6)
    • MAX_ITERATIONS (default: 3)
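
At its core, the refinement loop looks roughly like the sketch below. This is a simplified sketch rather than the exact example code: it assumes a Conversation-style API with send_message() and run(), the CriticResultCollector shown above, and a critic result that exposes a numeric success score; build_followup_prompt() is a hypothetical helper.

# Simplified sketch of the refinement loop, not the exact example code.
CRITIC_SUCCESS_THRESHOLD = 0.6
MAX_ITERATIONS = 3

collector = CriticResultCollector()
# conversation is assumed to be set up with callbacks=[collector.callback]

final_score = 0.0
for iteration in range(1, MAX_ITERATIONS + 1):
    if iteration == 1:
        conversation.send_message(INITIAL_TASK_PROMPT)
    else:
        # Craft follow-up guidance from the latest critic feedback.
        conversation.send_message(build_followup_prompt(collector.results[-1]))
    conversation.run()

    # Assumed shape: the critic result exposes a numeric success score.
    final_score = collector.results[-1].score if collector.results else 0.0
    print(f"Iteration {iteration} final score: {final_score:.3f}")
    if final_score >= CRITIC_SUCCESS_THRESHOLD:
        break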

Updated AGENTS.md

Added a <RUNNING_EXAMPLES> section documenting how to run SDK examples with LLM_API_KEY:

LLM_BASE_URL="https://llm-proxy.eval.all-hands.dev" LLM_API_KEY="$LLM_API_KEY" \
  uv run python examples/01_standalone_sdk/<example_name>.py
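
For example, to run this PR's updated example:

LLM_BASE_URL="https://llm-proxy.eval.all-hands.dev" LLM_API_KEY="$LLM_API_KEY" \
  uv run python examples/01_standalone_sdk/34_critic_example.py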

Testing

Verified the example runs successfully:

  • The agent completed the task with a critic score of 0.758 (75.8%)
  • The success threshold (60%) was met on the first iteration
  • All expected files were created and the tests passed

Example Output

======================================================================
🚀 Starting Iterative Refinement with Critic Model
======================================================================
Success threshold: 60%
Max iterations: 2

--- Iteration 1: Initial Task ---
...
📊 Critic Score: 0.758
   Details: Success: 0.76; {"success": 0.7577772736549377, ...

📈 Iteration 1 final score: 0.758
✅ Success threshold (60%) met!

======================================================================
📊 ITERATIVE REFINEMENT COMPLETE
======================================================================
Total iterations: 1
Final critic score: 0.758
Success: ✅ YES

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image                                    Docs / Tags
java      amd64, arm64    eclipse-temurin:17-jdk                        Link
python    amd64, arm64    nikolaik/python-nodejs:python3.13-nodejs22    Link
golang    amd64, arm64    golang:1.21-bookworm                          Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bd4e827-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-bd4e827-python \
  ghcr.io/openhands/agent-server:bd4e827-python

All tags pushed for this build

ghcr.io/openhands/agent-server:bd4e827-golang-amd64
ghcr.io/openhands/agent-server:bd4e827-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:bd4e827-golang-arm64
ghcr.io/openhands/agent-server:bd4e827-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:bd4e827-java-amd64
ghcr.io/openhands/agent-server:bd4e827-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:bd4e827-java-arm64
ghcr.io/openhands/agent-server:bd4e827-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:bd4e827-python-amd64
ghcr.io/openhands/agent-server:bd4e827-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:bd4e827-python-arm64
ghcr.io/openhands/agent-server:bd4e827-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:bd4e827-golang
ghcr.io/openhands/agent-server:bd4e827-java
ghcr.io/openhands/agent-server:bd4e827-python

About Multi-Architecture Support

  • Each variant tag (e.g., bd4e827-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., bd4e827-python-amd64) are also available if needed
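
For example, to pin the amd64 image explicitly rather than relying on the multi-arch manifest:

docker pull ghcr.io/openhands/agent-server:bd4e827-python-amd64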

…ment

Update 34_critic_example.py to demonstrate:
- Using the critic model to evaluate agent task completion
- Capturing critic results via callbacks
- Iterating with follow-up prompts when critic scores are low
- Complex multi-step task (creating a Python project structure)

Also add RUNNING_EXAMPLES section to AGENTS.md documenting how to run
examples with LLM_API_KEY.

Co-authored-by: openhands <[email protected]>
Comment on lines 100 to 129
def get_initial_task_prompt() -> str:
    """Generate a complex, multi-step task prompt.

    This task is designed to be challenging enough that the agent may
    not complete it perfectly on the first try, demonstrating the
    value of the critic in guiding improvements.
    """
    return (
        "Please help me create a complete Python project structure "
        "with the following:\n\n"
        "1. Create a directory called 'calculator_app'\n\n"
        "2. Inside calculator_app, create these files:\n"
        " - __init__.py (empty)\n"
        " - calculator.py - A Calculator class with add, subtract, "
        "multiply, divide methods\n"
        " - utils.py - Helper functions for input validation\n"
        " - main.py - CLI entry point that takes two numbers and an operation\n\n"
        "3. Create a tests/ directory with:\n"
        " - __init__.py (empty)\n"
        " - test_calculator.py - Unit tests for the Calculator class "
        "(at least 4 tests)\n\n"
        "4. Create a README.md with:\n"
        " - Project title and description\n"
        " - Installation instructions\n"
        " - Usage examples\n"
        " - Test running instructions\n\n"
        "5. Run the tests to verify everything works correctly\n\n"
        "Please complete ALL of these steps. After you're done, verify each "
        "file exists and the tests pass."
    )
xingyaoww (Collaborator, Author) commented:

@OpenHands
Write this directly as a multi-line Python string rather than a function.

Can you make this a MORE complex task (e.g. a system that requires a frontend, backend, login authentication, etc.) so the agent is more likely to fail?

Make sure you run it end-to-end and make everything work before reporting back.

I'm on it! xingyaoww can track my progress at all-hands.dev

Summary

I've successfully updated the critic example (examples/01_standalone_sdk/34_critic_example.py) as requested:

Changes Made

Converted get_initial_task_prompt() to a multi-line string constant

  • Replaced the function with INITIAL_TASK_PROMPT as a direct multi-line Python string (a condensed sketch follows the list of requirements below)

Made the task MORE complex (full-stack application)

  • Backend (Flask API): 6 REST endpoints with JWT authentication
    • User registration and login
    • Task CRUD operations (create, read, update, delete)
  • Frontend (HTML/CSS/JS): Login/register forms, task management UI
  • Database: SQLite with SQLAlchemy ORM
  • Authentication: Password hashing with werkzeug, JWT tokens with 24-hour expiration
  • Tests: Comprehensive pytest suite with 8+ auth tests and 8+ task API tests
  • Documentation: README with API examples
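
A condensed, hypothetical sketch of the new constant (the actual prompt in the example is longer and more detailed):

INITIAL_TASK_PROMPT = """Please build a complete full-stack task manager:

1. Flask backend with JWT-authenticated REST endpoints for registration,
   login, and task CRUD operations
2. HTML/CSS/JS frontend with login/register forms and a task management UI
3. SQLite database accessed through SQLAlchemy
4. Password hashing with werkzeug and JWT tokens with 24-hour expiration
5. A pytest suite covering authentication and the task API
6. A README.md documenting setup, usage, and the API

Run the tests, verify that everything works, and report the results."""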

Updated follow-up prompt template to match the new complex task with detailed checklists

Ran end-to-end verification - The example ran successfully twice:

  • Run 1: Critic score 0.679 (67.9%) - passed threshold
  • Run 2: Critic score 0.803 (80.3%) - passed threshold

Fixed minor issues:

  • Removed workspace cleanup that failed on non-empty directory
  • Fixed line length lint error

Verification

The changes have been committed and pushed to the feature/iterative-refinement-critic-example branch. The PR #1879 has been updated with the new commit.

The new complex task is significantly more challenging than the previous calculator app, requiring the agent to:

  1. Build a complete Flask backend with authentication
  2. Create a responsive frontend with JavaScript
  3. Implement proper security (password hashing, JWT tokens)
  4. Write comprehensive tests
  5. Create documentation

This better demonstrates the value of the critic model for iterative refinement, as agents are more likely to need multiple iterations to complete such a complex task.

- Replace simple calculator_app task with a full-stack task manager
  application requiring frontend, backend, and authentication
- Convert get_initial_task_prompt() function to INITIAL_TASK_PROMPT
  multi-line string constant as requested
- Update follow-up prompt template to match the new complex task
- Remove workspace cleanup that failed on non-empty directory

The new task is more challenging and better demonstrates the value
of the critic model for iterative refinement:
- Flask backend with JWT authentication
- HTML/CSS/JS frontend
- SQLite database with SQLAlchemy
- Comprehensive pytest test suite
- Full documentation

Tested end-to-end with critic scores of 0.679 and 0.803 (both
exceeding the 60% success threshold).

Co-authored-by: openhands <[email protected]>
- Replace simple calculator_app with full-stack task manager requiring:
  - Flask backend with 9 API endpoints
  - HTML/CSS/JS frontend with filtering and sorting
  - User authentication with JWT and password hashing
  - Task categories with custom colors
  - Task priority levels (low, medium, high)
  - 14+ comprehensive tests across 3 test files

- Convert get_initial_task_prompt() to INITIAL_TASK_PROMPT multi-line string
- Update follow-up prompt template with detailed checklist
- Raise success threshold to 70% to encourage iterative refinement
- Remove workspace cleanup that failed on non-empty directory

The more complex task better demonstrates the value of the critic model
for iterative refinement, as the agent may need multiple iterations to
complete all requirements.

Tested end-to-end: agent completed task with scores ranging from 0.67-0.80
depending on iteration, demonstrating the iterative refinement loop.

Co-authored-by: openhands <[email protected]>