Release v0.5.3 · groq/openbench

0.5.3 (2025-12-08)

Features

add --max-tasks option for concurrent task execution in eval command (#279) (241e653)
add bbq benchmark (#255) (46f4744)
add ChartQAPro (#289) (677f7c7)
add configurable HuggingFace Hub config naming (#261) (8abe2ae)
add DocVQA benchmark (#297) (0dd0edf)
add fuzzy match suggestion for misspelled evals (#303) (625a7b3)
add ifbench benchmark (#326) (bd730c2)
add math EvalGroup (#263) (e0f4a9b)
add MathVista benchmark (#298) (5c50a8f)
add MMLU-Redux benchmark from lighteval (#321) (d22a587)
add MMVet V2 benchmark (#296) (66689de)
add OCRBench V2 benchmark (#295) (71f3589)
add optional extras for simpleqa and toxicity (#266) (2450ddf)
add sealqa benchmark (#283) (06b39e4)
add SMT 2024 benchmarks (#239) (5d9b475)
add tau bench, pass^k metric (#294) (2bb1242)
agentdojo: port agentdojo benchmark (#223) (1cf174c)
cli: added export command to export specific logs to hf (#265) (62e8d8c)
cvebench: added auto prepare env set up for cvebench (#259) (db238a3)
deepresearch-bench: add deepresearch bench (#288) (d2b4622)
docs: docs for unsupported providers (#312) (3a3d4b8)
docs: search capability benchmarks feature page (#287) (9dd27c1)
evals: add GSM8K benchmark with shared grade school math scorer (#322) (4559a67)
evals: add QA benchmarks and shared scorer (#323) (0ea3733)
factscore: added support for factscore (#258) (13aafd7)
gpt_oss: add GPT-OSS AIME benchmark, make --epochs optional and stop default 1 from being forced down (#284) (815f51b)
groq: implement configurable timeout for GroqAPI client (#271) (be492b6)
groq: streaming support (#313) (c1a20be)
m2s: added support for single turn conversion of 3 multi turn jailbreak datasets (mhj, safeMT, cosafe) (#222) (6b8f2b1)
PolygloToxicityPrompts: add multilingual toxicity evaluation (#262) (46de7ee)
provider: add helicone support (#275) (de6ab04)
provider: add SiliconFlow provider support (#269) (ce14070)
providers: add W&B Inference model provider (#264) (a02c34f)
rocketscience: add rocketscience benchmark support (#277) (73bcfc2)
simpleqa_verified: add SimpleQA Verified benchmark (#249) (8a512c4)
vllm: add openbench override for Inspect AI's built-in vllm provider that doesn't start a server (#272) (d0eff6f)
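
For readers curious how a feature like the fuzzy match suggestion (#303) can work, here is a minimal sketch using Python's standard-library `difflib`. It is not openbench's actual implementation: the function name `suggest_evals`, the example eval names, and the cutoff value are all assumptions chosen for illustration.

```python
from difflib import get_close_matches


def suggest_evals(name: str, known_evals: list[str], limit: int = 3) -> list[str]:
    """Return up to `limit` known eval names that closely resemble `name`.

    Hypothetical helper: the 0.6 cutoff is difflib's default similarity
    threshold, not a value taken from openbench.
    """
    return get_close_matches(name, known_evals, n=limit, cutoff=0.6)


if __name__ == "__main__":
    # Illustrative registry only; the real registry is much larger.
    registry = ["mmlu", "gsm8k", "humaneval"]
    print(suggest_evals("mmul", registry))
```

A caller would typically show the returned names in a "did you mean ...?" message when the list is non-empty, and fall back to a plain "unknown eval" error otherwise.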

Bug Fixes

add args to eval command (#276) (0e06988)
allow subtasks into eval group summary (#306) (ae82757)
deps: catch import warnings from optional deps (#327) (434fe88)
docs: markdown formatting issue (#314) (24af36f)
docs: reasoning-effort docs clarity (#278) (2644619)
docvqa: remove docvqa from config and dep group (#328) (162b8b5)
factscore import issues, vLLM timeout bug (#273) (1674528)
factscore: fix module level import error for optional dep (#274) (99594ff)
fix global import warning for optional dep (#307) (c44c8de)
friendliai token env name (#286) (a197828)
livemcpbench: catch errors on call_tool and route (#260) (0ab746d)
math: shorten math group (#268) (19cc66b)
refactor factscore (#300) (ab3e84e)
remove nonexistent docvqa import (#318) (90a15a2)
rename gpt_oss_aime to gpt_oss_aime25 (b378715)
run mmmu as task instead of aggregate of subsets (#315) (623fbed)
simpleqa_verified: silence mypy for optional kagglehub import (#257) (32a1ff4)
using huggingface instead of kagglehub for simpleqa_verified benchmark (#270) (8ee1efa)

Documentation

add groq configuration and embed updates email form (#301) (320a542)
add missing docstrings and type hints for code clarity (#221) (38d34a0)

Chores

fix deprecated methods for dataset loading with scripts (#267) (4c503f6)
GitHub Terraform: Create/Update .github/workflows/code-freeze-bypass.yaml [skip ci] (5b08987)
GitHub Terraform: Create/Update .github/workflows/code-freeze-bypass.yaml [skip ci] (aa1ab26)
groq: docs and tests for streaming; stream=true default (#319) (fa1a8d0)
pre-commit hook for test-registry-imports (#334) (85407e1)
prune README, move extra info to docs (#336) (f57a340)
push openbench-core to pyx (#292) (0d28e5d)
reduce math EvalGroup to most recent tasks only (420dcb9)
remove docvqa (#317) (9068c23)
rename openbench-core to openbench-core instead of openbench (#290) (2056f92)
upgrade numpy version and update uv lock (#281) (15f2dbf)

Refactor

create shared image loading utilities for multimodal tasks (#305) (905932f)
move pass^k to a custom metric rather than scorer (#310) (ed0eb8d)
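
The pass^k metric mentioned in #294 and #310 can be sketched as follows, assuming the definition used by tau-bench: for a task run `n` times with `c` successes, pass^k estimates the probability that `k` independent trials all succeed, computed as C(c, k) / C(n, k) and averaged over tasks. The function names below are hypothetical and this is not openbench's code.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials of one task all succeed.

    Assumes the tau-bench estimator C(c, k) / C(n, k).
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # math.comb returns 0 when k > num_successes, so the estimate is 0.0 there.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks given (num_trials, num_successes) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two tasks, four trials each: one with 3 successes, one with 4.
    print(mean_pass_hat_k([(4, 3), (4, 4)], k=2))
```

Unlike pass@k, which rewards at least one success among k attempts, pass^k rewards consistency, so it can only stay flat or decrease as k grows.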
