68 commits
e55db81
init commit for webarena verified
NicolasAG Sep 16, 2025
e8a5594
upd Makefile
NicolasAG Sep 17, 2025
8f33d10
adding the basic files
NicolasAG Sep 17, 2025
02dfc57
update dependencies
NicolasAG Sep 19, 2025
48acaeb
start adding integration with wa_verified
NicolasAG Sep 22, 2025
5be792d
upd readme
NicolasAG Sep 23, 2025
aae906c
use custom backend for webarena_verified
NicolasAG Sep 23, 2025
2b04c7d
pass the wa instance to the evaluator
NicolasAG Sep 23, 2025
0d7e8dc
pass the wa instance to the evaluator
NicolasAG Sep 23, 2025
b57c0f8
cleanup evaluator
NicolasAG Sep 26, 2025
0330f72
remove custom webarena verified instance
NicolasAG Oct 3, 2025
ab0437b
update requirements to latest wav code
NicolasAG Oct 3, 2025
bd43467
use simpler and cleaner wav eval
NicolasAG Oct 3, 2025
fecedb1
enable tracing
NicolasAG Oct 16, 2025
8fdebe6
fix wav
NicolasAG Oct 20, 2025
4bdfa7e
update to new webarena verified version
NicolasAG Oct 22, 2025
e59f754
update task name template to webarena_verified.templateID.taskID
NicolasAG Oct 22, 2025
5b05044
fix config
NicolasAG Oct 23, 2025
56574eb
fix csv file
NicolasAG Oct 23, 2025
81f930c
add webarena_verified backend
NicolasAG Oct 25, 2025
cbca5a2
fix wav tasks
NicolasAG Oct 25, 2025
8d4381b
do not check reachable if url is todo
NicolasAG Oct 25, 2025
63b4b07
fix tmp trace creation, update goal to prompt model to satisfy wav re…
NicolasAG Oct 27, 2025
b7f847a
create webarena_verified action space with special submit function to…
NicolasAG Oct 28, 2025
3e6b5b7
look for extra header file path in environment variable
NicolasAG Oct 29, 2025
525fd3b
undo special action set for webarena_verified
NicolasAG Oct 31, 2025
4272b5e
remove wav actions
NicolasAG Oct 31, 2025
fea25ed
load extra context headers for webarena(+lite)
NicolasAG Nov 3, 2025
fc090f0
update README
NicolasAG Nov 5, 2025
377dcca
update requirements
NicolasAG Nov 5, 2025
1f02f3f
update makefile and readme
NicolasAG Nov 5, 2025
f86a2b3
update readme
NicolasAG Nov 6, 2025
2bf2539
Merge remote-tracking branch 'origin/main' into wa_verified
NicolasAG Nov 6, 2025
df3bfa4
update requirements
NicolasAG Nov 6, 2025
bf6cd9a
update readme
NicolasAG Nov 6, 2025
76ab14e
update test
NicolasAG Nov 6, 2025
eae4152
black formater
NicolasAG Nov 6, 2025
afdf218
upd makefile
NicolasAG Nov 10, 2025
c3814bf
update to new webarena_verified dataset version
NicolasAG Nov 10, 2025
c0c0814
small debug
NicolasAG Nov 10, 2025
d7dc845
add massage of shopping_admin tasks
NicolasAG Nov 13, 2025
f7363c8
Merge remote-tracking branch 'origin/main' into wa_verified
NicolasAG Dec 2, 2025
b8a666a
assume all endpoints are running
NicolasAG Dec 2, 2025
49506a7
update to latest version before the public release
NicolasAG Dec 2, 2025
f326bbf
update instructions to fetch latest version before the public release
NicolasAG Dec 2, 2025
ced1021
exponential backoff
NicolasAG Dec 4, 2025
106a685
update README
NicolasAG Dec 4, 2025
0019f4e
compare json with the one in the library
NicolasAG Dec 4, 2025
e02a299
update install instructions
NicolasAG Dec 4, 2025
045d0e4
update makefile
NicolasAG Dec 9, 2025
5435db3
update pypi deployment with webarena-verified
amanjaiswal73892 Dec 9, 2025
e5c75ca
fix assets directory
amanjaiswal73892 Dec 9, 2025
c2d1536
fix task id template
NicolasAG Dec 12, 2025
75738c4
Merge branch 'wa_verified' of github.com:ServiceNow/BrowserGym into w…
NicolasAG Dec 12, 2025
55a57b0
remove task json file, use the one from the webarena-verified library…
NicolasAG Dec 12, 2025
cf11699
remove metadata and create it dynamically
NicolasAG Dec 15, 2025
29ce81b
do not hardcode revision number
NicolasAG Dec 15, 2025
4b73bb4
fix
NicolasAG Dec 15, 2025
ed6d668
run black formater
NicolasAG Dec 15, 2025
89b6460
fix format?
NicolasAG Dec 15, 2025
333d368
always create the metadata file
NicolasAG Dec 15, 2025
6535641
version-bump-dev
amanjaiswal73892 Dec 15, 2025
84c1246
Remove git dependency and add ins to install from source
amanjaiswal73892 Dec 15, 2025
ddeb2e7
version-bump-dev 0.14.3.dev3
amanjaiswal73892 Dec 15, 2025
0989834
Merge branch 'main' into wa_verified
amanjaiswal73892 Dec 16, 2025
e61b022
add webarena-verified package as a dependency
amanjaiswal73892 Jan 8, 2026
731852c
version-bump-dev 0.14.3.dev4
amanjaiswal73892 Jan 8, 2026
4367cc7
add webarena-verified in the dev requirements.txt
amanjaiswal73892 Jan 8, 2026
7 changes: 6 additions & 1 deletion .github/workflows/pypi.yml
Original file line number Diff line number Diff line change
@@ -28,8 +28,13 @@ jobs:

- name: Build a binary wheel and a source tarball (browsergym-webarena)
run: python3 -m build browsergym/webarena/ --outdir dist/

- name: Build a binary wheel and a source tarball (browsergym-webarenalite)
run: python3 -m build browsergym/webarenalite/ --outdir dist/
run: python3 -m build browsergym/webarenalite/ --outdir dist/

- name: Build a binary wheel and a source tarball (browsergym-webarena-verified)
run: python3 -m build browsergym/webarena_verified/ --outdir dist/

- name: Build a binary wheel and a source tarball (browsergym-visualwebarena)
run: python3 -m build browsergym/visualwebarena/ --outdir dist/

14 changes: 7 additions & 7 deletions Makefile
@@ -38,12 +38,12 @@ clean-miniwob:

help:
@echo "Available targets:"
@echo " install - Install project dependencies"
@echo " setup-miniwob - Setup MiniWoB++ dependencies"
@echo " install-demo - Install demo dependencies"
@echo " demo - Run demo agent"
@echo " test-core - Run core tests"
@echo " clean-miniwob - Remove MiniWoB++ directory"
@echo " help - Show this help message"
@echo " install - Install project dependencies"
@echo " setup-miniwob - Setup MiniWoB++ dependencies"
@echo " install-demo - Install demo dependencies"
@echo " demo - Run demo agent"
@echo " test-core - Run core tests"
@echo " clean-miniwob - Remove MiniWoB++ directory"
@echo " help - Show this help message"

.PHONY: install setup-miniwob install-demo demo test-core clean-miniwob help
3 changes: 3 additions & 0 deletions README.md
@@ -39,6 +39,7 @@ _Example of a GPT4-V agent executing openended tasks (top row, chat interactive)
BrowserGym includes the following benchmarks by default:
- [MiniWoB](https://miniwob.farama.org/)
- [WebArena](https://webarena.dev/)
- [WebArenaVerified](https://github.com/ServiceNow/platform-labs-webarena-verified)
- [VisualWebArena](https://jykoh.com/vwa)
- [WorkArena](https://github.com/ServiceNow/WorkArena)
- [AssistantBench](https://github.com/oriyor/assistantbench)
@@ -55,6 +56,7 @@ pip install browsergym-experiments # experiment utilities (agent, loop, benchma
pip install browsergym-core # core functionalities only (no benchmark, just the openended task)
pip install browsergym-miniwob # core + miniwob
pip install browsergym-webarena # core + webarena
pip install browsergym-webarena-verified # core + webarena_verified
pip install browsergym-visualwebarena # core + visualwebarena
pip install browsergym-workarena # core + workarena
pip install browsergym-assistantbench # core + assistantbench
@@ -69,6 +71,7 @@ playwright install chromium
Finally, each benchmark comes with its own specific setup that requires following additional steps.
- for MiniWoB++, see [miniwob/README.md](browsergym/miniwob/README.md)
- for WebArena, see [webarena/README.md](browsergym/webarena/README.md)
- for WebArenaVerified, see [webarena_verified/README.md](browsergym/webarena_verified/README.md)
- for VisualWebArena, see [visualwebarena/README.md](browsergym/visualwebarena/README.md)
- for WorkArena, see [WorkArena](https://github.com/ServiceNow/WorkArena)
- for AssistantBench, see [assistantbench/README.md](browsergym/assistantbench/README.md)
2 changes: 1 addition & 1 deletion browsergym/assistantbench/requirements.txt
@@ -1,4 +1,4 @@
browsergym-core==0.14.3.dev1
browsergym-core==0.14.3.dev4
datasets
scipy
numpy
Original file line number Diff line number Diff line change
@@ -69,7 +69,7 @@ def _normalize_number(text: str) -> str:


def _answer_to_bags(
answer: Union[str, List[str], Tuple[str, ...]]
answer: Union[str, List[str], Tuple[str, ...]],
) -> Tuple[List[str], List[Set[str]]]:
if isinstance(answer, (list, tuple)):
raw_spans = answer
2 changes: 1 addition & 1 deletion browsergym/core/src/browsergym/core/__init__.py
@@ -1,4 +1,4 @@
__version__ = "0.14.3.dev1"
__version__ = "0.14.3.dev4"

import playwright.sync_api

2 changes: 1 addition & 1 deletion browsergym/experiments/requirements.txt
@@ -1,3 +1,3 @@
browsergym-core==0.14.3.dev1
browsergym-core==0.14.3.dev4
tiktoken>=0.4
dataclasses-json
Original file line number Diff line number Diff line change
@@ -53,7 +53,13 @@ def make_action_set(self):


BenchmarkBackend = Literal[
"miniwob", "webarena", "visualwebarena", "workarena", "assistantbench", "weblinx"
"miniwob",
"webarena",
"webarena_verified",
"visualwebarena",
"workarena",
"assistantbench",
"weblinx",
]
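
Because `BenchmarkBackend` is a `Literal`, the set of accepted backend strings can also be checked at runtime. The following is a minimal sketch, not part of this PR: `is_known_backend` is a hypothetical helper, and the literal is copied locally so the snippet runs on its own.

```python
from typing import Literal, get_args

# Local copy of the literal from the diff above, so the sketch is self-contained.
BenchmarkBackend = Literal[
    "miniwob",
    "webarena",
    "webarena_verified",
    "visualwebarena",
    "workarena",
    "assistantbench",
    "weblinx",
]


def is_known_backend(name: str) -> bool:
    """Hypothetical helper: True if `name` is one of the declared backends."""
    # get_args() returns the tuple of string values listed in the Literal.
    return name in get_args(BenchmarkBackend)


if __name__ == "__main__":
    print(is_known_backend("webarena_verified"))
    print(is_known_backend("webarena-verified"))  # PyPI-style hyphen is not a backend name
```

Note that the backend name uses an underscore (`webarena_verified`), while the PyPI package uses a hyphen (`browsergym-webarena-verified`).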


Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
import numpy as np

from browsergym.experiments.benchmark.metadata.utils import (
task_list_from_metadata,
task_metadata,
@@ -132,6 +133,21 @@
),
task_metadata=task_metadata("webarena"),
),
"webarena_verified": lambda n_repeats=1: Benchmark(
name="webarena_verified",
high_level_action_set_args=DEFAULT_HIGHLEVEL_ACTION_SET_ARGS["webarena"],
is_multi_tab=True,
supports_parallel_seeds=False,
backends=["webarena_verified"],
env_args_list=make_env_args_list_from_repeat_tasks(
task_list=task_list_from_metadata(metadata=task_metadata("webarena_verified")),
max_steps=30,
n_repeats=n_repeats,
seeds_rng=np.random.RandomState(42),
),
task_metadata=task_metadata("webarena_verified"),
), # TODO: Add webarena-verified hard subsets by filtering tasks in
# https://github.com/ServiceNow/webarena-verified/blob/main/assets/dataset/subsets/webarena-verified-hard.json
"webarena_lite": lambda n_repeats=1: Benchmark(
name="webarena_lite",
high_level_action_set_args=DEFAULT_HIGHLEVEL_ACTION_SET_ARGS["webarena"],
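
The TODO above mentions a "hard" subset. One possible way to derive it is to filter task names by task id. This is only a sketch under a loud assumption: the subset is taken to be a plain collection of integer task ids, which may not match the real layout of `webarena-verified-hard.json`. `filter_by_task_ids` is a hypothetical helper, and the task names and revisions below are made up.

```python
from typing import Iterable, List


def filter_by_task_ids(task_names: Iterable[str], allowed_task_ids: Iterable[int]) -> List[str]:
    """Hypothetical helper: keep names whose task id is in `allowed_task_ids`.

    Names are assumed to follow the template used in this PR:
    webarena_verified.<intent_template_id>.<task_id>.<revision>
    """
    allowed = set(allowed_task_ids)
    kept = []
    for name in task_names:
        # Component 2 (0-based) of the dotted name is the task id.
        task_id = int(name.split(".")[2])
        if task_id in allowed:
            kept.append(name)
    return kept


if __name__ == "__main__":
    names = [
        "webarena_verified.23.410.2",   # made-up revisions, for illustration only
        "webarena_verified.330.533.1",
        "webarena_verified.87.561.1",
    ]
    print(filter_by_task_ids(names, allowed_task_ids=[410, 561]))
```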
@@ -252,7 +268,8 @@
backends=["assistantbench"],
env_args_list=make_env_args_list_from_repeat_tasks(
task_list=task_list_from_metadata(
metadata=task_metadata("assistantbench"), filter={"browsergym_split": "valid|test"}
metadata=task_metadata("assistantbench"),
filter={"browsergym_split": "valid|test"},
),
max_steps=30,
n_repeats=n_repeats,
Original file line number Diff line number Diff line change
@@ -1,4 +1,8 @@
import csv
import importlib.resources
import io
import json
import os
import pkgutil
from collections import defaultdict
from copy import deepcopy
@@ -9,7 +13,110 @@
from browsergym.experiments.loop import EnvArgs


def make_webarena_verified_metadata():
"""
Creates the webarena_verified.csv metadata file based on the original webarena.csv file and the webarena-verified.json file in the webarena-verified library.
"""
# Load the json file from the webarena-verified library
data = json.loads(
importlib.resources.files("webarena_verified")
.joinpath("assets/dataset/webarena-verified.json")
.read_text(encoding="utf-8")
)
# Create a mapping from task_id to intent_template_id and revision for efficient lookup. This is used to find the dependency task name.
task_id_to_template_id = {task["task_id"]: task["intent_template_id"] for task in data}
task_id_to_revision = {task["task_id"]: task["revision"] for task in data}

# Read the original webarena.csv and create a mapping from task_id to original task info
original_csv_path = os.path.join(os.path.dirname(__file__), "webarena.csv")
original_tasks = {}
with open(original_csv_path, "r", newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
task_id = int(row["task_id"])
original_tasks[task_id] = {
"requires_reset": row["requires_reset"],
"sites": row["sites"],
"eval_types": row["eval_types"],
"browsergym_split": row["browsergym_split"],
"depends_on": row["depends_on"],
}

# Create CSV data
csv_data = []
for task in data:
intent_template_id = task["intent_template_id"]
task_id = task["task_id"]
revision = task["revision"]

# Extract eval_types
new_eval_types = []
for evaluator_config in task.get("eval", []):
new_eval_types.append(evaluator_config["evaluator"])
assert len(new_eval_types) > 0, f"Task {task_id} has no evaluators"
new_eval_types_str = " ".join(new_eval_types)

# Extract new task sites
sites = task.get("sites", [])
sites_str = " ".join(sites) if sites else ""

# Get original task data for comparison and dependency copying
original_task = original_tasks.get(task_id, {})

# Assert that the new task sites match the original task sites
original_sites_str = original_task.get("sites", "")
assert (
sites_str == original_sites_str
), f"Task {task_id}: sites mismatch - JSON: {sites_str}, CSV: {original_sites_str}"

# Construct the dependency task name
if original_dependency := original_task.get("depends_on"):
dependency_task_id = int(original_dependency.split(".")[-1])
dependency_template_id = task_id_to_template_id[dependency_task_id]
dependency_revision = task_id_to_revision[dependency_task_id]
dependency_task_name = f"webarena_verified.{dependency_template_id}.{dependency_task_id}.{dependency_revision}"
else:
dependency_task_name = ""

# Create metadata row
row = {
"task_name": f"webarena_verified.{intent_template_id}.{task_id}.{revision}",
"requires_reset": str(
original_task.get("requires_reset", False)
), # copy original requires_reset
"sites": sites_str,
"eval_types": new_eval_types_str,
"task_id": str(task_id),
"browsergym_split": original_task.get(
"browsergym_split", "train"
), # copy original browsergym_split
"depends_on": dependency_task_name,
}
csv_data.append(row)

# Write CSV file
output_path = os.path.join(os.path.dirname(__file__), "webarena_verified.csv")
with open(output_path, "w", newline="", encoding="utf-8") as f:
fieldnames = [
"task_name",
"requires_reset",
"sites",
"eval_types",
"task_id",
"browsergym_split",
"depends_on",
]
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(csv_data)

print(f"Created {output_path} with {len(csv_data)} tasks")
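
The dependency handling in `make_webarena_verified_metadata` rewrites an original `depends_on` value into the new naming scheme. A minimal standalone sketch of that step follows. It assumes the original value ends in the numeric task id (for example `webarena.410`); `remap_dependency` is a hypothetical name, and the sample records are made up rather than taken from the dataset.

```python
from typing import Dict, List


def remap_dependency(original_dependency: str, tasks: List[Dict]) -> str:
    """Hypothetical helper: map an original dependency name to the verified name."""
    if not original_dependency:
        # No dependency in the original CSV: keep the field empty.
        return ""
    # The original name is assumed to end in ".<task_id>".
    dependency_task_id = int(original_dependency.split(".")[-1])
    by_id = {task["task_id"]: task for task in tasks}
    dep = by_id[dependency_task_id]  # KeyError if the dependency is unknown
    return (
        f"webarena_verified.{dep['intent_template_id']}."
        f"{dependency_task_id}.{dep['revision']}"
    )


if __name__ == "__main__":
    sample_tasks = [  # made-up records with the three fields the PR reads
        {"task_id": 410, "intent_template_id": 23, "revision": 2},
        {"task_id": 533, "intent_template_id": 330, "revision": 1},
    ]
    print(remap_dependency("webarena.410", sample_tasks))
    print(repr(remap_dependency("", sample_tasks)))
```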


def task_metadata(benchmark_name: str):
if benchmark_name == "webarena_verified":
make_webarena_verified_metadata()

return task_metadata_from_csv(
io.StringIO(pkgutil.get_data(__name__, f"{benchmark_name}.csv").decode("utf-8"))
)
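
Task names produced by this PR follow the template `webarena_verified.<intent_template_id>.<task_id>.<revision>`. The sketch below shows how such a name could be built and parsed back. `VerifiedTaskName`, `format_task_name` and `parse_task_name` are hypothetical names, and treating `revision` as an integer is an assumption the diff does not confirm.

```python
from typing import NamedTuple

PREFIX = "webarena_verified"


class VerifiedTaskName(NamedTuple):
    intent_template_id: int
    task_id: int
    revision: int  # assumption: revisions are integers


def format_task_name(intent_template_id: int, task_id: int, revision: int) -> str:
    """Build a name using the same template as the PR."""
    return f"{PREFIX}.{intent_template_id}.{task_id}.{revision}"


def parse_task_name(task_name: str) -> VerifiedTaskName:
    """Split a task name back into its three numeric components."""
    parts = task_name.split(".")
    if len(parts) != 4 or parts[0] != PREFIX:
        raise ValueError(f"Not a webarena_verified task name: {task_name!r}")
    return VerifiedTaskName(*(int(p) for p in parts[1:]))


if __name__ == "__main__":
    name = format_task_name(23, 410, 2)  # revision 2 is made up for illustration
    print(name)
    print(parse_task_name(name))
```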
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
import importlib.resources
import json
import logging
import multiprocessing as mp
import os
@@ -6,6 +8,7 @@
from typing import Literal

import numpy as np

from browsergym.experiments.loop import SEED_MAX, EnvArgs

logger = logging.getLogger(__name__)
@@ -103,6 +106,27 @@ def make_env_args_list_from_fixed_seeds(
return env_args_list


def get_webarena_verified_task_name(intent_template_id: int, task_id: int) -> str:
"""
Returns the task name (with revision) for a given intent template id and task id.
"""
# Load the json file from the webarena-verified library
data = json.loads(
importlib.resources.files("webarena_verified")
.joinpath("assets/dataset/webarena-verified.json")
.read_text(encoding="utf-8")
)
for task in data:
if task["intent_template_id"] == intent_template_id and task["task_id"] == task_id:
revision = task["revision"]
break
else:
raise ValueError(
f"No task found for intent template id {intent_template_id} and task id {task_id} in webarena-verified.json"
)
return f"webarena_verified.{intent_template_id}.{task_id}.{revision}"
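
`get_webarena_verified_task_name` re-reads and scans the JSON on every call, which happens eight times during warm-up. A possible alternative, sketched here and not part of this PR, builds a cached lookup table once. To stay runnable without the `webarena_verified` package, the sketch parses a small made-up JSON string instead of the packaged dataset; `_revision_index` and `cached_task_name` are hypothetical names.

```python
import json
from functools import lru_cache
from typing import Dict, Tuple

# Made-up stand-in for assets/dataset/webarena-verified.json.
SAMPLE_JSON = """
[
  {"task_id": 410, "intent_template_id": 23, "revision": 2},
  {"task_id": 533, "intent_template_id": 330, "revision": 1}
]
"""


@lru_cache(maxsize=1)
def _revision_index() -> Dict[Tuple[int, int], int]:
    """Parse the dataset once and index revisions by (template id, task id)."""
    data = json.loads(SAMPLE_JSON)
    return {(t["intent_template_id"], t["task_id"]): t["revision"] for t in data}


def cached_task_name(intent_template_id: int, task_id: int) -> str:
    """Same contract as the PR's function: return the name or raise ValueError."""
    try:
        revision = _revision_index()[(intent_template_id, task_id)]
    except KeyError:
        raise ValueError(
            f"No task found for intent template id {intent_template_id} "
            f"and task id {task_id}"
        ) from None
    return f"webarena_verified.{intent_template_id}.{task_id}.{revision}"


if __name__ == "__main__":
    print(cached_task_name(23, 410))
```

The linear scan in the PR is simple and likely fast enough for a handful of lookups, so the cache is an optional trade-off rather than a required change.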


def prepare_backend(backend: str):
match backend:
case "miniwob":
@@ -141,6 +165,35 @@ def prepare_backend(backend: str):
]
)

case "webarena_verified":
# register environments
import browsergym.webarena_verified

# full reset the instance (requires environment variables properly set up)
from browsergym.webarena.instance import WebArenaInstance

default_instance = WebArenaInstance()
default_instance.full_reset()

logger.info(
"Initiating WebArena instance warm-up. Some tasks will be pre-loaded (massaged) to trigger caching mechanisms and make the server more responsive."
)
massage_tasks(
[
get_webarena_verified_task_name(intent_template_id, task_id)
for intent_template_id, task_id in [
(23, 410), # reddit
(330, 533), # gitlab
(87, 561), # gitlab wiki
(88, 562), # gitlab reddit
(165, 574), # shopping
(16, 640), # reddit
(253, 680), # shopping_admin
(94, 740), # wiki map
]
]
)

case "visualwebarena":
# register environments
import browsergym.visualwebarena
1 change: 1 addition & 0 deletions browsergym/experiments/src/browsergym/experiments/loop.py
@@ -937,6 +937,7 @@ def _get_env_name(task_name: str):
elif task_name.startswith("webarena"):
import browsergym.webarena
import browsergym.webarenalite
import browsergym.webarena_verified
elif task_name.startswith("visualwebarena"):
import browsergym.visualwebarena
elif task_name.startswith("assistantbench"):
2 changes: 1 addition & 1 deletion browsergym/miniwob/requirements.txt
@@ -1 +1 @@
browsergym-core==0.14.3.dev1
browsergym-core==0.14.3.dev4
18 changes: 10 additions & 8 deletions browsergym/pyproject.toml
@@ -16,6 +16,7 @@ authors = [
{name = "Thibault Le Sellier De Chezelles"},
{name = "Tom Marty"},
{name = "Aman Jaiswal"},
{name = "Nicolas Gontier"},
]
readme = "README.md"
requires-python = ">3.10"
@@ -28,17 +29,18 @@ classifiers = [
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"License :: OSI Approved :: Apache Software License",
]
version="0.14.3.dev1"
version="0.14.3.dev4"
dependencies = [
"browsergym-core==0.14.3.dev1",
"browsergym-miniwob==0.14.3.dev1",
"browsergym-webarena==0.14.3.dev1",
"browsergym-visualwebarena==0.14.3.dev1",
"browsergym-assistantbench==0.14.3.dev1",
"browsergym-experiments==0.14.3.dev1",
"browsergym-core==0.14.3.dev4",
"browsergym-miniwob==0.14.3.dev4",
"browsergym-webarena==0.14.3.dev4",
"browsergym-visualwebarena==0.14.3.dev4",
"browsergym-assistantbench==0.14.3.dev4",
"browsergym-experiments==0.14.3.dev4",
"browsergym-workarena>=0.4.1",
"weblinx-browsergym>=0.0.2",
"browsergym-webarenalite==0.14.3.dev1"
"browsergym-webarenalite==0.14.3.dev4",
"browsergym-webarena-verified==0.14.3.dev4"
]
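
A version bump like this one touches many pinned `browsergym-*` entries, so a stale pin is easy to miss. The following sketch shows one way to flag exact pins that differ from the expected version. `mismatched_pins` is a hypothetical helper, not something this repository provides, and it only inspects `==` pins.

```python
import re
from typing import List

# Matches "name==version" requirement strings; other specifiers are ignored.
_PIN = re.compile(r"([A-Za-z0-9_.-]+)==(.+)")


def mismatched_pins(
    dependencies: List[str], expected_version: str, prefix: str = "browsergym-"
) -> List[str]:
    """Return exact pins with the given prefix whose version is not the expected one."""
    bad = []
    for dep in dependencies:
        match = _PIN.fullmatch(dep.strip())
        if match is None:
            continue  # e.g. ">=" ranges such as browsergym-workarena>=0.4.1
        name, version = match.groups()
        if name.startswith(prefix) and version != expected_version:
            bad.append(dep)
    return bad


if __name__ == "__main__":
    deps = [
        "browsergym-core==0.14.3.dev4",
        "browsergym-workarena>=0.4.1",
        "browsergym-webarenalite==0.14.3.dev1",  # deliberately stale for the demo
    ]
    print(mismatched_pins(deps, "0.14.3.dev4"))
```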

[tool.setuptools]
2 changes: 1 addition & 1 deletion browsergym/visualwebarena/requirements.txt
@@ -1,4 +1,4 @@
browsergym-core==0.14.3.dev1
browsergym-core==0.14.3.dev4
browsergym-webarena
libvisualwebarena==0.0.15
requests
2 changes: 1 addition & 1 deletion browsergym/webarena/requirements.txt
@@ -1,2 +1,2 @@
browsergym-core==0.14.3.dev1
browsergym-core==0.14.3.dev4
libwebarena==0.0.4