feat(examples): update critic example to demonstrate iterative refinement#1879
…ment

Update 34_critic_example.py to demonstrate:
- Using the critic model to evaluate agent task completion
- Capturing critic results via callbacks
- Iterating with follow-up prompts when critic scores are low
- Complex multi-step task (creating a Python project structure)

Also add RUNNING_EXAMPLES section to AGENTS.md documenting how to run examples with LLM_API_KEY.

Co-authored-by: openhands <[email protected]>
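The commit message says examples are run with LLM_API_KEY set, but the RUNNING_EXAMPLES text itself is not shown here. As a minimal sketch, assuming the key is read from the environment, a script could fail fast when it is missing. The helper name `get_llm_api_key` and its `env` parameter are hypothetical and not taken from the PR.

```python
import os
from typing import Mapping, Optional


def get_llm_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return LLM_API_KEY from the environment, or raise if it is unset.

    `env` is a hypothetical injection point so the lookup can be exercised
    without touching the real process environment.
    """
    source = os.environ if env is None else env
    key = source.get("LLM_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "LLM_API_KEY is not set; export it before running the example."
        )
    return key
```

Checking the key up front means a missing credential surfaces as a clear error before any agent work starts.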
def get_initial_task_prompt() -> str:
    """Generate a complex, multi-step task prompt.

    This task is designed to be challenging enough that the agent may
    not complete it perfectly on the first try, demonstrating the
    value of the critic in guiding improvements.
    """
    return (
        "Please help me create a complete Python project structure "
        "with the following:\n\n"
        "1. Create a directory called 'calculator_app'\n\n"
        "2. Inside calculator_app, create these files:\n"
        "   - __init__.py (empty)\n"
        "   - calculator.py - A Calculator class with add, subtract, "
        "multiply, divide methods\n"
        "   - utils.py - Helper functions for input validation\n"
        "   - main.py - CLI entry point that takes two numbers and an operation\n\n"
        "3. Create a tests/ directory with:\n"
        "   - __init__.py (empty)\n"
        "   - test_calculator.py - Unit tests for the Calculator class "
        "(at least 4 tests)\n\n"
        "4. Create a README.md with:\n"
        "   - Project title and description\n"
        "   - Installation instructions\n"
        "   - Usage examples\n"
        "   - Test running instructions\n\n"
        "5. Run the tests to verify everything works correctly\n\n"
        "Please complete ALL of these steps. After you're done, verify each "
        "file exists and the tests pass."
    )
@OpenHands
Write this directly as a multi-line Python string rather than a function.
Can you make this a MORE complex task (e.g., a system that requires a frontend, backend, login authentication, etc.) so the agent is more likely to fail?
Make sure you run it end-to-end and get everything working before reporting back.
I'm on it! xingyaoww can track my progress at all-hands.dev
Summary
I've successfully updated the critic example (examples/01_standalone_sdk/34_critic_example.py) as requested:
Changes Made
✅ Converted get_initial_task_prompt() to a multi-line string constant
- Replaced the function with INITIAL_TASK_PROMPT as a direct multi-line Python string
✅ Made the task MORE complex (full-stack application)
- Backend (Flask API): 6 REST endpoints with JWT authentication
- User registration and login
- Task CRUD operations (create, read, update, delete)
- Frontend (HTML/CSS/JS): Login/register forms, task management UI
- Database: SQLite with SQLAlchemy ORM
- Authentication: Password hashing with werkzeug, JWT tokens with 24-hour expiration
- Tests: Comprehensive pytest suite with 8+ auth tests and 8+ task API tests
- Documentation: README with API examples
✅ Updated follow-up prompt template to match the new complex task with detailed checklists
✅ Ran end-to-end verification - The example ran successfully twice:
- Run 1: Critic score 0.679 (67.9%) - passed threshold
- Run 2: Critic score 0.803 (80.3%) - passed threshold
✅ Fixed minor issues:
- Removed workspace cleanup that failed on non-empty directory
- Fixed line length lint error
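As a rough illustration of the first two changes, a module-level prompt constant might look like the sketch below. The prompt wording is abbreviated and invented for illustration; only the name INITIAL_TASK_PROMPT and the listed requirements come from this summary. `FOLLOWUP_PROMPT_TEMPLATE` and its placeholders are hypothetical names, not the ones used in the example file.

```python
# Hypothetical, abbreviated sketch; the real prompt in the example is longer.
INITIAL_TASK_PROMPT = """\
Please build a full-stack task manager application:

1. Backend (Flask API) with JWT authentication:
   - User registration and login
   - Task CRUD operations (create, read, update, delete)
2. Frontend (HTML/CSS/JS) with login/register forms and a task management UI
3. Database: SQLite with SQLAlchemy ORM
4. Tests: a pytest suite covering the auth and task endpoints
5. Documentation: a README with API examples

Please complete ALL of these steps and run the tests before finishing.
"""

# Hypothetical follow-up template, filled in when the critic score is low.
FOLLOWUP_PROMPT_TEMPLATE = """\
The task appears incomplete (critic score: {score:.1%}, iteration {iteration}).
Please re-check each requirement, fix anything missing, and re-run the tests.
"""

# Example of filling the template for a first iteration that scored 0.55.
followup = FOLLOWUP_PROMPT_TEMPLATE.format(score=0.55, iteration=1)
```

A triple-quoted constant keeps the prompt readable as written text, whereas the earlier function assembled it from many short string fragments with explicit `\n` escapes.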
Verification
The changes have been committed and pushed to the feature/iterative-refinement-critic-example branch. The PR #1879 has been updated with the new commit.
The new complex task is significantly more challenging than the previous calculator app, requiring the agent to:
- Build a complete Flask backend with authentication
- Create a responsive frontend with JavaScript
- Implement proper security (password hashing, JWT tokens)
- Write comprehensive tests
- Create documentation
This better demonstrates the value of the critic model for iterative refinement, as agents are more likely to need multiple iterations to complete such a complex task.
- Replace simple calculator_app task with a full-stack task manager application requiring frontend, backend, and authentication
- Convert get_initial_task_prompt() function to INITIAL_TASK_PROMPT multi-line string constant as requested
- Update follow-up prompt template to match the new complex task
- Remove workspace cleanup that failed on non-empty directory

The new task is more challenging and better demonstrates the value of the critic model for iterative refinement:
- Flask backend with JWT authentication
- HTML/CSS/JS frontend
- SQLite database with SQLAlchemy
- Comprehensive pytest test suite
- Full documentation

Tested end-to-end with critic scores of 0.679 and 0.803 (both exceeding the 60% success threshold).

Co-authored-by: openhands <[email protected]>
- Replace simple calculator_app with full-stack task manager requiring:
  - Flask backend with 9 API endpoints
  - HTML/CSS/JS frontend with filtering and sorting
  - User authentication with JWT and password hashing
  - Task categories with custom colors
  - Task priority levels (low, medium, high)
  - 14+ comprehensive tests across 3 test files
- Convert get_initial_task_prompt() to INITIAL_TASK_PROMPT multi-line string
- Update follow-up prompt template with detailed checklist
- Raise success threshold to 70% to encourage iterative refinement
- Remove workspace cleanup that failed on non-empty directory

The more complex task better demonstrates the value of the critic model for iterative refinement, as the agent may need multiple iterations to complete all requirements.

Tested end-to-end: agent completed task with scores ranging from 0.67-0.80 depending on iteration, demonstrating the iterative refinement loop.

Co-authored-by: openhands <[email protected]>
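The refinement loop these commits describe can be sketched without the SDK. This is a simplified stand-in, not the code in 34_critic_example.py: `CriticResultCollector`, `refine`, and `make_fake_turn` are hypothetical names, the agent and critic are replaced by canned scores, and the 0.45 first score is invented for illustration. The names CRITIC_SUCCESS_THRESHOLD and MAX_ITERATIONS, the 70% threshold, and the 0.679 and 0.803 scores appear in this PR.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

CRITIC_SUCCESS_THRESHOLD = 0.7  # 70%, as raised in the commit above
MAX_ITERATIONS = 3


@dataclass
class CriticResultCollector:
    """Hypothetical callback that records every critic score it receives."""

    scores: List[float] = field(default_factory=list)

    def __call__(self, score: float) -> None:
        self.scores.append(score)


def refine(
    run_turn: Callable[[str], float],
    on_critic_result: Callable[[float], None],
    threshold: float = CRITIC_SUCCESS_THRESHOLD,
    max_iterations: int = MAX_ITERATIONS,
) -> bool:
    """Send prompts until the critic score reaches the threshold.

    Returns True on success, False if max_iterations is exhausted.
    """
    prompt = "initial task"
    for iteration in range(1, max_iterations + 1):
        score = run_turn(prompt)  # agent acts, critic scores the result
        on_critic_result(score)   # callback captures the evaluation
        if score >= threshold:
            return True
        # Score too low: build a follow-up prompt for the next attempt.
        prompt = (
            f"follow-up {iteration}: critic score {score:.2f} "
            f"is below {threshold:.2f}; please finish the remaining steps"
        )
    return False


def make_fake_turn(scores: Iterable[float]) -> Callable[[str], float]:
    """Stand-in for agent plus critic: returns canned scores in order."""
    remaining = iter(scores)

    def run_turn(prompt: str) -> float:
        return next(remaining)

    return run_turn


collector = CriticResultCollector()
succeeded = refine(make_fake_turn([0.45, 0.679, 0.803]), collector)
```

With these canned scores the loop needs all three iterations, since 0.679 falls just short of the 0.7 threshold. Passing the collector as a callback keeps score capture separate from the loop's control flow.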
Summary
Updates 34_critic_example.py to demonstrate iterative refinement with the critic model, showing how to shepherd an agent through complex, multi-step tasks.
What Changed
Updated Critic Example (examples/01_standalone_sdk/34_critic_example.py)
The example now demonstrates:
- Complex Multi-Step Task: Instead of a simple file creation, the agent is asked to create a complete Python project structure with:
  - calculator_app package with multiple modules
- Critic Result Collection: Uses callbacks to capture critic evaluations in real-time:
- Iterative Refinement Loop: When the critic score is below the threshold (60% by default), sends follow-up prompts to guide the agent:
- Configurable Parameters:
  - CRITIC_SUCCESS_THRESHOLD (default: 0.6)
  - MAX_ITERATIONS (default: 3)
Updated AGENTS.md
Added a <RUNNING_EXAMPLES> section documenting how to run SDK examples with LLM_API_KEY:
Testing
Verified the example runs successfully:
Example Output
Related Documentation
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22
- golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bd4e827-python
Run
All tags pushed for this build
About Multi-Architecture Support
- Each variant tag (e.g., bd4e827-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., bd4e827-python-amd64) are also available if needed