feat(examples): update critic example to demonstrate iterative refinement #1879

Draft
xingyaoww wants to merge 6 commits into main from feature/iterative-refinement-critic-example

Conversation

xingyaoww (Collaborator) commented Feb 2, 2026

Summary

Updates 34_critic_example.py to demonstrate iterative refinement with the critic model, showing how to shepherd an agent through complex, multi-step tasks.

What Changed

Updated Critic Example (examples/01_standalone_sdk/34_critic_example.py)

The example now demonstrates:

  1. Complex Multi-Step Task: Instead of a simple file-creation task, the agent is asked to create a complete Python project structure with:

    • A calculator_app package with multiple modules
    • Unit tests with at least 4 test cases
    • A README with documentation
    • Running and verifying tests pass
  2. Critic Result Collection: Uses callbacks to capture critic evaluations in real time:

    class CriticResultCollector:
        def __init__(self) -> None:
            self.results = []  # critic evaluations, in arrival order

        def callback(self, event: Event) -> None:
            # Critic results are attached to action and message events.
            if isinstance(event, (ActionEvent, MessageEvent)):
                if event.critic_result is not None:
                    # Capture the critic feedback for later processing.
                    self.results.append(event.critic_result)
  3. Iterative Refinement Loop: When the critic score is below the threshold (60% by default), the example sends follow-up prompts to guide the agent (see the sketch after this list):

    • Extracts potential issues from critic metadata
    • Crafts specific guidance based on what's incomplete
    • Continues until success threshold is met or max iterations reached
  4. Configurable Parameters:

    • CRITIC_SUCCESS_THRESHOLD (default: 0.6)
    • MAX_ITERATIONS (default: 3)
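
At its core, the refinement loop looks roughly like the sketch below. This is a simplified sketch rather than the exact example code: it assumes a Conversation-style API with send_message() and run(), the CriticResultCollector shown above, and a critic result that exposes a numeric success score; build_followup_prompt() is a hypothetical helper.

# Simplified sketch of the refinement loop, not the exact example code.
CRITIC_SUCCESS_THRESHOLD = 0.6
MAX_ITERATIONS = 3

collector = CriticResultCollector()
# conversation is assumed to be set up with callbacks=[collector.callback]

final_score = 0.0
for iteration in range(1, MAX_ITERATIONS + 1):
    if iteration == 1:
        conversation.send_message(INITIAL_TASK_PROMPT)
    else:
        # Craft follow-up guidance from the latest critic feedback.
        conversation.send_message(build_followup_prompt(collector.results[-1]))
    conversation.run()

    # Assumed shape: the critic result exposes a numeric success score.
    final_score = collector.results[-1].score if collector.results else 0.0
    print(f"Iteration {iteration} final score: {final_score:.3f}")
    if final_score >= CRITIC_SUCCESS_THRESHOLD:
        break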

Updated AGENTS.md

Added a <RUNNING_EXAMPLES> section documenting how to run SDK examples with LLM_API_KEY:

LLM_BASE_URL="https://llm-proxy.eval.all-hands.dev" LLM_API_KEY="$LLM_API_KEY" \
  uv run python examples/01_standalone_sdk/<example_name>.py
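
For example, to run this PR's updated example:

LLM_BASE_URL="https://llm-proxy.eval.all-hands.dev" LLM_API_KEY="$LLM_API_KEY" \
  uv run python examples/01_standalone_sdk/34_critic_example.py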

Testing

Verified the example runs successfully:

  • The agent completed the task with a critic score of 0.758 (75.8%)
  • The success threshold (60%) was met on the first iteration
  • All expected files were created and the tests passed

Example Output

======================================================================
🚀 Starting Iterative Refinement with Critic Model
======================================================================
Success threshold: 60%
Max iterations: 2

--- Iteration 1: Initial Task ---
...
📊 Critic Score: 0.758
   Details: Success: 0.76; {"success": 0.7577772736549377, ...

📈 Iteration 1 final score: 0.758
✅ Success threshold (60%) met!

======================================================================
📊 ITERATIVE REFINEMENT COMPLETE
======================================================================
Total iterations: 1
Final critic score: 0.758
Success: ✅ YES

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image                                    Docs / Tags
java      amd64, arm64    eclipse-temurin:17-jdk                        Link
python    amd64, arm64    nikolaik/python-nodejs:python3.13-nodejs22    Link
golang    amd64, arm64    golang:1.21-bookworm                          Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bd4e827-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-bd4e827-python \
  ghcr.io/openhands/agent-server:bd4e827-python

All tags pushed for this build

ghcr.io/openhands/agent-server:bd4e827-golang-amd64
ghcr.io/openhands/agent-server:bd4e827-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:bd4e827-golang-arm64
ghcr.io/openhands/agent-server:bd4e827-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:bd4e827-java-amd64
ghcr.io/openhands/agent-server:bd4e827-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:bd4e827-java-arm64
ghcr.io/openhands/agent-server:bd4e827-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:bd4e827-python-amd64
ghcr.io/openhands/agent-server:bd4e827-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:bd4e827-python-arm64
ghcr.io/openhands/agent-server:bd4e827-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:bd4e827-golang
ghcr.io/openhands/agent-server:bd4e827-java
ghcr.io/openhands/agent-server:bd4e827-python

About Multi-Architecture Support

  • Each variant tag (e.g., bd4e827-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., bd4e827-python-amd64) are also available if needed
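
For example, to pin the amd64 image explicitly rather than relying on the multi-arch manifest:

docker pull ghcr.io/openhands/agent-server:bd4e827-python-amd64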

…ment

Update 34_critic_example.py to demonstrate:
- Using the critic model to evaluate agent task completion
- Capturing critic results via callbacks
- Iterating with follow-up prompts when critic scores are low
- Complex multi-step task (creating a Python project structure)

Also add RUNNING_EXAMPLES section to AGENTS.md documenting how to run
examples with LLM_API_KEY.

Co-authored-by: openhands <[email protected]>
Comment on lines 100 to 129
def get_initial_task_prompt() -> str:
    """Generate a complex, multi-step task prompt.

    This task is designed to be challenging enough that the agent may
    not complete it perfectly on the first try, demonstrating the
    value of the critic in guiding improvements.
    """
    return (
        "Please help me create a complete Python project structure "
        "with the following:\n\n"
        "1. Create a directory called 'calculator_app'\n\n"
        "2. Inside calculator_app, create these files:\n"
        " - __init__.py (empty)\n"
        " - calculator.py - A Calculator class with add, subtract, "
        "multiply, divide methods\n"
        " - utils.py - Helper functions for input validation\n"
        " - main.py - CLI entry point that takes two numbers and an operation\n\n"
        "3. Create a tests/ directory with:\n"
        " - __init__.py (empty)\n"
        " - test_calculator.py - Unit tests for the Calculator class "
        "(at least 4 tests)\n\n"
        "4. Create a README.md with:\n"
        " - Project title and description\n"
        " - Installation instructions\n"
        " - Usage examples\n"
        " - Test running instructions\n\n"
        "5. Run the tests to verify everything works correctly\n\n"
        "Please complete ALL of these steps. After you're done, verify each "
        "file exists and the tests pass."
    )
xingyaoww (Collaborator, Author) commented:

@OpenHands
Write this directly as a multi-line Python string rather than a function.

Can you make this a MORE complex task (e.g. a system that requires a frontend, backend, login authentication, etc.) so the agent is more likely to fail?

Make sure you run it end-to-end and make everything work before reporting back.

I'm on it! xingyaoww can track my progress at all-hands.dev

Summary

I've successfully updated the critic example (examples/01_standalone_sdk/34_critic_example.py) as requested:

Changes Made

Converted get_initial_task_prompt() to a multi-line string constant

  • Replaced the function with INITIAL_TASK_PROMPT as a direct multi-line Python string (a condensed sketch follows the list of requirements below)

Made the task MORE complex (full-stack application)

  • Backend (Flask API): 6 REST endpoints with JWT authentication
    • User registration and login
    • Task CRUD operations (create, read, update, delete)
  • Frontend (HTML/CSS/JS): Login/register forms, task management UI
  • Database: SQLite with SQLAlchemy ORM
  • Authentication: Password hashing with werkzeug, JWT tokens with 24-hour expiration
  • Tests: Comprehensive pytest suite with 8+ auth tests and 8+ task API tests
  • Documentation: README with API examples
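
A condensed, hypothetical sketch of the new constant (the actual prompt in the example is longer and more detailed):

INITIAL_TASK_PROMPT = """Please build a complete full-stack task manager:

1. Flask backend with JWT-authenticated REST endpoints for registration,
   login, and task CRUD operations
2. HTML/CSS/JS frontend with login/register forms and a task management UI
3. SQLite database accessed through SQLAlchemy
4. Password hashing with werkzeug and JWT tokens with 24-hour expiration
5. A pytest suite covering authentication and the task API
6. A README.md documenting setup, usage, and the API

Run the tests, verify that everything works, and report the results."""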

Updated follow-up prompt template to match the new complex task with detailed checklists

Ran end-to-end verification - The example ran successfully twice:

  • Run 1: Critic score 0.679 (67.9%) - passed threshold
  • Run 2: Critic score 0.803 (80.3%) - passed threshold

Fixed minor issues:

  • Removed workspace cleanup that failed on non-empty directory
  • Fixed line length lint error

Verification

The changes have been committed and pushed to the feature/iterative-refinement-critic-example branch. The PR #1879 has been updated with the new commit.

The new complex task is significantly more challenging than the previous calculator app, requiring the agent to:

  1. Build a complete Flask backend with authentication
  2. Create a responsive frontend with JavaScript
  3. Implement proper security (password hashing, JWT tokens)
  4. Write comprehensive tests
  5. Create documentation

This better demonstrates the value of the critic model for iterative refinement, as agents are more likely to need multiple iterations to complete such a complex task.

- Replace simple calculator_app task with a full-stack task manager
  application requiring frontend, backend, and authentication
- Convert get_initial_task_prompt() function to INITIAL_TASK_PROMPT
  multi-line string constant as requested
- Update follow-up prompt template to match the new complex task
- Remove workspace cleanup that failed on non-empty directory

The new task is more challenging and better demonstrates the value
of the critic model for iterative refinement:
- Flask backend with JWT authentication
- HTML/CSS/JS frontend
- SQLite database with SQLAlchemy
- Comprehensive pytest test suite
- Full documentation

Tested end-to-end with critic scores of 0.679 and 0.803 (both
exceeding the 60% success threshold).

Co-authored-by: openhands <[email protected]>
- Replace simple calculator_app with full-stack task manager requiring:
  - Flask backend with 9 API endpoints
  - HTML/CSS/JS frontend with filtering and sorting
  - User authentication with JWT and password hashing
  - Task categories with custom colors
  - Task priority levels (low, medium, high)
  - 14+ comprehensive tests across 3 test files

- Convert get_initial_task_prompt() to INITIAL_TASK_PROMPT multi-line string
- Update follow-up prompt template with detailed checklist
- Raise success threshold to 70% to encourage iterative refinement
- Remove workspace cleanup that failed on non-empty directory

The more complex task better demonstrates the value of the critic model
for iterative refinement, as the agent may need multiple iterations to
complete all requirements.

Tested end-to-end: agent completed task with scores ranging from 0.67-0.80
depending on iteration, demonstrating the iterative refinement loop.

Co-authored-by: openhands <[email protected]>