BrowserGym: Green and White Agent Evaluation System

A system for evaluating web automation agents on BrowserGym benchmarks. The green agent evaluates white agents on standardized web tasks, measures performance, and generates reports.

Quick Start

# Install
make install

# Run white agent on task
cd agents
python green_agent.py --task miniwob.click-test --max_steps 50

# Run tests
make test-green

Architecture

Green Agent (Evaluator)

Location: agents/green_agent.py

Purpose: Evaluate white agents on benchmark tasks

Functions:

Loads white agents and initializes benchmark environments
Executes white agents on tasks
Collects metrics (success rate, steps taken, rewards)
Generates evaluation reports
Saves results to agents/results/

Configuration:

--task: Specific task to run (e.g., miniwob.click-test)
--max_steps: Maximum steps per task (default: 50)
--model_name: LLM model to use (default: gpt-4o-mini)

Key Features:

A2A protocol compliant
Supports all BrowserGym benchmarks
Produces detailed evaluation artifacts
Can run as standalone or as A2A server

White Agent (Competitor)

Location: agents/white_agent.py

Purpose: Perform web automation tasks

Required Interface:

from browsergym.experiments import Agent
from browsergym.core.action.highlevel import HighLevelActionSet

class WhiteAgent(Agent):
    def __init__(self):
        super().__init__()
        self.action_set = HighLevelActionSet(
            subsets=["chat", "bid", "nav", "tab", "infeas"],
            strict=False,
            multiaction=False
        )
    
    def obs_preprocessor(self, obs):
        # Process observation from environment
        return {
            "axtree_txt": flatten_axtree_to_str(obs["axtree_object"]),
            "goal_object": obs["goal_object"],
            "last_action": obs["last_action"],
            # ... other observations
        }
    
    def get_action(self, obs):
        # Decide action based on observation
        # Return (action_string, metadata_dict)
        return 'click("5")', {}

Components:

Action Set - Available actions:
- bid: Click elements by browser ID (e.g., click("5"))
- nav: Navigate URLs (goto(url), go_back(), go_forward())
- tab: Manage tabs (new_tab(), tab_close(), tab_focus(index))
- chat: Send messages (send_msg_to_user(message))
- infeas: Report infeasible tasks (report_infeasible())
Observation Types:
- axtree_object: Accessibility tree (page structure)
- dom_object: HTML DOM
- screenshot: Page screenshot (optional)
- goal_object: Task description
- last_action: Previous action taken
- last_action_error: Error from previous action
- open_pages_urls: All open tabs
- open_pages_titles: Tab titles
- active_page_index: Current tab index
Configuration:
- model_name: OpenAI model (default: gpt-4o-mini)
- chat_mode: Enable chat interactions (default: False)
- use_html: Use HTML DOM (default: False)
- use_axtree: Use accessibility tree (default: True)
- use_screenshot: Use screenshots (default: False)

Installation

Basic Installation

make install

This installs:

BrowserGym core
Python dependencies
Playwright with Chromium browser

Manual Installation

python3 -m venv .gym
source .gym/bin/activate
pip install -r requirements.txt
playwright install chromium

Benchmark-Specific Setup

# MiniWoB
make install-bm-miniwob

# WebArena (requires Docker)
make install-bm-webarena

# VisualWebArena (requires Docker)
make install-bm-visualwebarena

# WorkArena (requires ServiceNow instance)
make install-bm-workarena

Running Agents

Running White Agent on Tasks

Single Task

cd agents
python green_agent.py --task miniwob.click-test --max_steps 50

Multiple Tasks

cd agents
python green_agent.py --task miniwob.click-test --max_steps 50
python green_agent.py --task miniwob.click-dialog --max_steps 50
python green_agent.py --task miniwob.choose-date --max_steps 50

Custom Configuration

cd agents
python green_agent.py \
  --task webarena.shopping.0 \
  --max_steps 100 \
  --model_name gpt-4o

Running Green Agent Evaluation

The green agent evaluates white agents by running them on tasks and measuring performance.

Evaluate on Single Task

cd agents
python green_agent.py --task miniwob.click-test --max_steps 50

Results saved to agents/results/ directory with:

Task name
Success/failure
Reward (0 or 1)
Number of steps taken
Action history
Screenshots (if enabled)

Evaluate on Benchmark Suite

cd agents

# MiniWoB suite
for task in miniwob.click-test miniwob.click-dialog miniwob.choose-date; do
    python green_agent.py --task $task --max_steps 50
done

# WebArena suite
for task in webarena.shopping.0 webarena.reddit.1 webarena.gitlab.2; do
    python green_agent.py --task $task --max_steps 100
done

Using Makefile

# Run with specific tasks
make demo TASKS="miniwob.click-test miniwob.click-dialog"

Available Tasks

MiniWoB (simple web interactions):

miniwob.click-test
miniwob.click-dialog
miniwob.choose-date
miniwob.enter-text
miniwob.login-user
And 100+ more tasks

WebArena (complex web tasks):

webarena.shopping.0 - E-commerce tasks
webarena.reddit.1 - Social media tasks
webarena.gitlab.2 - Code repository tasks
webarena.wikipedia.3 - Information retrieval

VisualWebArena (visually-grounded tasks):

visualwebarena.398
visualwebarena.1
Requires screenshot processing

AssistantBench (real-world tasks):

assistantbench.find-gym
assistantbench.compare-hotels
assistantbench.book-restaurant

Testing

Run All Green Agent Tests

make test-green

Run Specific Test Categories

# Initialization tests (fast)
pytest tests/green_evaluator/test_green_evaluator.py::TestInitialization -v

# Task evaluation tests
pytest tests/green_evaluator/test_green_evaluator.py::TestTaskEvaluation -v

# Benchmark comparison tests
pytest tests/green_evaluator/test_green_evaluator.py::TestBenchmarkComparison -v

# A2A server tests
pytest tests/green_evaluator/test_green_evaluator.py::TestA2AServer -v

Run Fast Tests Only

pytest tests/green_evaluator/ -v -m "not slow"

Test Benchmark-Specific Functionality

# Test only MiniWoB
pytest tests/miniwob/ -v

# Test only AssistantBench
pytest tests/assistantbench/ -v

# Test only installed benchmarks
make test-benchmarks

Reproducing Benchmark Results

MiniWoB Benchmark

cd agents
python green_agent.py --task miniwob.click-test
python green_agent.py --task miniwob.click-dialog
python green_agent.py --task miniwob.choose-date

Expected results:

Success rate: 80-95% on simple tasks
Average steps: 2-5 for click tasks

WebArena Benchmark

Requires Docker services running.

Setup:

make install-bm-webarena

Run:

cd agents
python green_agent.py --task webarena.shopping.0 --max_steps 100
python green_agent.py --task webarena.reddit.1 --max_steps 100

Expected results:

Success rate: 40-60% on complex tasks
Average steps: 10-30 for shopping tasks

VisualWebArena Benchmark

Requires Docker services and screenshot processing.

Setup:

make install-bm-visualwebarena

Run:

cd agents
python green_agent.py --task visualwebarena.398 --max_steps 100

AssistantBench

cd agents
python green_agent.py --task assistantbench.find-gym
python green_agent.py --task assistantbench.compare-hotels

Expected results:

Success rate: 50-70%
Average steps: 15-25

Configuration

Environment Variables

Copy sample.env to .env:

cp sample.env .env

Required Variables:

# Basic Configuration
HOSTNAME=your-hostname-or-ip
OPENAI_API_KEY=sk-your-openai-api-key

# GitHub (for Docker publishing)
GITHUB_USERNAME=yourusername
GITHUB_TOKEN=ghp_your_token

# MiniWoB
MINIWOB_URL=http://${HOSTNAME}:8080

# WebArena
WA_SHOPPING=http://${HOSTNAME}:7770
WA_SHOPPING_ADMIN=http://${HOSTNAME}:7780/admin
WA_REDDIT=http://${HOSTNAME}:9999
WA_GITLAB=http://${HOSTNAME}:8023
WA_MAP=http://${HOSTNAME}:4444
WA_WIKIPEDIA=http://${HOSTNAME}:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing
WA_HOMEPAGE=http://${HOSTNAME}:4399

# VisualWebArena
VWA_CLASSIFIEDS=http://${HOSTNAME}:9980
VWA_CLASSIFIEDS_RESET_TOKEN=4b61655535e7ed388f0d40a93600254c
VWA_SHOPPING=http://${HOSTNAME}:7770
VWA_REDDIT=http://${HOSTNAME}:9999
VWA_WIKIPEDIA=http://${HOSTNAME}:8888
VWA_HOMEPAGE=http://${HOSTNAME}:80
DATASET=visualwebarena

# WorkArena (ServiceNow)
SNOW_INSTANCE_URL=https://your-instance.service-now.com
SNOW_INSTANCE_UNAME=admin
SNOW_INSTANCE_PWD=your-password

Docker Setup

Setup WebArena Services:

make install-bm-webarena

Setup VisualWebArena Services:

make install-bm-visualwebarena

Check Services Running:

docker ps

A2A Server Mode

Both agents can run as A2A (Agent-to-Agent) protocol servers for integration with AgentBeats platform.

Start Green Agent Server

cd agents
python green_agent.py --a2a-server --port 8000 --card-url http://YOUR_HOST:8000

Endpoints:

/.well-known/agent-card.json - Agent card
/sendMessage - Send assessment request
/sendMessageStream - Stream assessment updates
/getTask - Get task status
/cancelTask - Cancel running task
/health - Health check

Start White Agent Server

cd agents
python white_agent.py --a2a-server --port 8001 --card-url http://YOUR_HOST:8001

Endpoints:

/.well-known/agent-card.json - Agent card
/sendMessage - Send task request
/getTask - Get task status
/health - Health check

With Cloudflare Tunnel

For secure HTTPS access without opening firewall ports:

Terminal 1 - Green Tunnel:

cloudflared tunnel --url http://localhost:8000
# Note the URL: https://abc-123-def.trycloudflare.com

Terminal 2 - White Tunnel:

cloudflared tunnel --url http://localhost:8001
# Note the URL: https://ghi-456-jkl.trycloudflare.com

Terminal 3 - Green Agent:

cd agents
python green_agent.py --a2a-server --port 8000 \
  --card-url https://abc-123-def.trycloudflare.com

Terminal 4 - White Agent:

cd agents
python white_agent.py --a2a-server --port 8001 \
  --card-url https://ghi-456-jkl.trycloudflare.com

Register on AgentBeats

Go to https://v2.agentbeats.org
Register green agent with URL from tunnel
Register white agent with URL from tunnel
Create assessment between them

Docker Images

Build and publish agent images:

make containerize-agents

This creates:

ghcr.io/yourusername/browsergym-green-evaluator:v1.0
ghcr.io/yourusername/browsergym-white-agent:v1.0

Results Format

Results are saved in JSON format to agents/results/ directory:

{
  "task_name": "miniwob.click-test",
  "success": true,
  "reward": 1.0,
  "steps": 3,
  "max_steps": 50,
  "model": "gpt-4o-mini",
  "details": {
    "cum_reward": 1.0,
    "n_steps": 3,
    "action_history": ["click('5')", "click('submit')"],
    "screenshots": ["step_0.png", "step_1.png"]
  }
}

Troubleshooting

Installation Issues

Missing dependencies:

pip install -r requirements.txt
playwright install chromium

Import errors:

pip install -e .

Runtime Issues

Agent fails to start:

# Check Python version (3.10+ required)
python --version

# Reinstall dependencies
pip install -r requirements.txt --force-reinstall

Task execution timeout:

# Increase max_steps
cd agents
python green_agent.py --task miniwob.click-test --max_steps 100

OpenAI API errors:

# Check API key is set
echo $OPENAI_API_KEY

# Or set directly
export OPENAI_API_KEY=sk-your-key

Benchmark Issues

Benchmark not found:

# Install benchmark
make install-bm-miniwob

Docker services not running:

# Start WebArena services
cd classifieds_workspace/classifieds_docker_compose
docker-compose up -d

# Check services
docker ps

Port conflicts:

# Check what's using port
sudo lsof -i :8000

# Kill process
sudo kill -9 PID

A2A Server Issues

Agent card not loading:

# Test locally
curl http://localhost:8000/.well-known/agent-card.json

# Test through tunnel
curl https://your-tunnel-url/.well-known/agent-card.json

Firewall blocking:

# Open ports
sudo ufw allow 8000/tcp
sudo ufw allow 8001/tcp

Cloudflare tunnel not connecting:

# Restart tunnel
pkill cloudflared
cloudflared tunnel --url http://localhost:8000

Resources

BrowserGym: https://github.com/ServiceNow/BrowserGym
AgentBeats: https://agentbeats.dev
A2A Protocol: https://a2aprotocol.org
OpenAI API: https://platform.openai.com

License

See LICENSE file.

Name		Name	Last commit message	Last commit date
Latest commit History 830 Commits
.github/workflows		.github/workflows
agents		agents
browsergym		browsergym
dev		dev
docs		docs
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
AGENTBEATS_REGISTRATION_STEPS.md		AGENTBEATS_REGISTRATION_STEPS.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
green_evaluator_card.json		green_evaluator_card.json
green_evaluator_card.toml		green_evaluator_card.toml
manage_agents.sh		manage_agents.sh
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt
run_benchmark_suite.py		run_benchmark_suite.py
run_green.sh		run_green.sh
run_white.sh		run_white.sh
sample.env		sample.env
scenario.toml		scenario.toml

License

layahaasini/BrowserGym

Folders and files

Latest commit

History

Repository files navigation

BrowserGym: Green and White Agent Evaluation System

Table of Contents

Quick Start

Architecture

Green Agent (Evaluator)

White Agent (Competitor)

Installation

Basic Installation

Manual Installation

Benchmark-Specific Setup

Running Agents

Running White Agent on Tasks

Single Task

Multiple Tasks

Custom Configuration

Running Green Agent Evaluation

Evaluate on Single Task

Evaluate on Benchmark Suite

Using Makefile

Available Tasks

Testing

Run All Green Agent Tests

Run Specific Test Categories

Run Fast Tests Only

Test Benchmark-Specific Functionality

Reproducing Benchmark Results

MiniWoB Benchmark

WebArena Benchmark

VisualWebArena Benchmark

AssistantBench

Configuration

Environment Variables

Docker Setup

A2A Server Mode

Start Green Agent Server

Start White Agent Server

With Cloudflare Tunnel

Register on AgentBeats

Docker Images

Results Format

Troubleshooting

Installation Issues

Runtime Issues

Benchmark Issues

A2A Server Issues

Resources

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages