Releases
v0.5.3
0.5.3 (2025-12-08)
Features
add --max-tasks option for concurrent task execution in eval command (#279 ) (241e653 )
add bbq benchmark (#255 ) (46f4744 )
add ChartQAPro (#289 ) (677f7c7 )
add configurable HuggingFace Hub config naming (#261 ) (8abe2ae )
add DocVQA benchmark (#297 ) (0dd0edf )
add fuzzy match suggestion for misspelled evals (#303 ) (625a7b3 ) (see the sketch after this list)
add ifbench benchmark (#326 ) (bd730c2 )
add math EvalGroup (#263 ) (e0f4a9b )
add MathVista benchmark (#298 ) (5c50a8f )
add MMLU-Redux benchmark from lighteval (#321 ) (d22a587 )
add MMVet V2 benchmark (#296 ) (66689de )
add OCRBench V2 benchmark (#295 ) (71f3589 )
add optional extras for simpleqa and toxicity (#266 ) (2450ddf )
add sealqa benchmark (#283 ) (06b39e4 )
add SMT 2024 benchmarks (#239 ) (5d9b475 )
add tau bench, pass^k metric (#294 ) (2bb1242 )
agentdojo: port agentdojo benchmark (#223 ) (1cf174c )
cli: added export command to export specific logs to hf (#265 ) (62e8d8c )
cvebench: added automatic environment setup for cvebench (#259 ) (db238a3 )
deepresearch-bench: add deepresearch bench (#288 ) (d2b4622 )
docs: docs for unsupported providers (#312 ) (3a3d4b8 )
docs: search capability benchmarks feature page (#287 ) (9dd27c1 )
evals: add GSM8K benchmark with shared grade school math scorer (#322 ) (4559a67 )
evals: add QA benchmarks and shared scorer (#323 ) (0ea3733 )
factscore: added support for factscore (#258 ) (13aafd7 )
gpt_oss: add GPT-OSS AIME benchmark, make --epochs optional and stop the default of 1 from being forced (#284 ) (815f51b )
groq: implement configurable timeout for GroqAPI client (#271 ) (be492b6 )
groq: streaming support (#313 ) (c1a20be )
m2s: added support for single-turn conversion of 3 multi-turn jailbreak datasets (mhj, safeMT, cosafe) (#222 ) (6b8f2b1 )
PolygloToxicityPrompts: add multilingual toxicity evaluation (#262 ) (46de7ee )
provider: add helicone support (#275 ) (de6ab04 )
provider: add SiliconFlow provider support (#269 ) (ce14070 )
providers: add W&B Inference model provider (#264 ) (a02c34f )
rocketscience: add rocketscience benchmark support (#277 ) (73bcfc2 )
simpleqa_verified: add SimpleQA Verified benchmark (#249 ) (8a512c4 )
vllm: add openbench override for Inspect AI's built-in vllm provider that doesn't start a server (#272 ) (d0eff6f )
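The fuzzy match suggestion for misspelled evals (#303) can be approximated with Python's standard difflib. The snippet below is a minimal sketch of the general technique only; the eval names and the suggest_eval helper are illustrative, not openbench's actual registry or API.

```python
# Minimal sketch of fuzzy-match suggestions for misspelled eval names.
# Illustrative only: KNOWN_EVALS and suggest_eval() are hypothetical,
# not openbench's actual implementation.
from difflib import get_close_matches

KNOWN_EVALS = ["gsm8k", "mmlu_redux", "mathvista", "ifbench", "docvqa"]

def suggest_eval(name: str, candidates=KNOWN_EVALS, cutoff: float = 0.6) -> str | None:
    """Return the closest known eval name, or None if nothing is similar enough."""
    matches = get_close_matches(name.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

if __name__ == "__main__":
    typo = "mathvsita"
    suggestion = suggest_eval(typo)
    if suggestion:
        print(f"Unknown eval '{typo}'. Did you mean '{suggestion}'?")
```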
Bug Fixes
add args to eval command (#276 ) (0e06988 )
allow subtasks into eval group summary (#306 ) (ae82757 )
deps: catch import warnings from optional deps (#327 ) (434fe88 ) (see the sketch after this list)
docs: markdown formatting issue (#314 ) (24af36f )
docs: reasoning-effort docs clarity (#278 ) (2644619 )
docvqa: remove docvqa from config and dep group (#328 ) (162b8b5 )
factscore import issues, vLLM timeout bug (#273 ) (1674528 )
factscore: fix module level import error for optional dep (#274 ) (99594ff )
fix global import warning for optional dep (#307 ) (c44c8de )
friendliai token env name (#286 ) (a197828 )
livemcpbench: catch errors on call_tool and route (#260 ) (0ab746d )
math: shorten math group (#268 ) (19cc66b )
refactor factscore (#300 ) (ab3e84e )
remove nonexistent docvqa import (#318 ) (90a15a2 )
rename gpt_oss_aime to gpt_oss_aime25 (b378715 )
run mmmu as task instead of aggregate of subsets (#315 ) (623fbed )
simpleqa_verified: silence mypy for optional kagglehub import (#257 ) (32a1ff4 )
using huggingface instead of kagglehub for simpleqa_verified benchmark (#270 ) (8ee1efa )
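Several of the fixes above deal with optional dependencies failing or warning at module import time (#274, #307, #327). A common way to keep optional extras from breaking import is to guard the import and defer the failure to the point of use. The sketch below shows that general pattern with kagglehub as the optional dependency; it is illustrative only and not the actual openbench code.

```python
# General pattern for optional dependencies: do not fail (or warn) when the
# module is imported; raise a helpful error only when the feature is used.
# Illustrative sketch only -- the error message and helper are hypothetical.
try:
    import kagglehub  # optional extra
except ImportError:
    kagglehub = None

def download_dataset(handle: str) -> str:
    if kagglehub is None:
        raise RuntimeError(
            "This benchmark needs the optional 'kagglehub' extra. "
            "Install it with: pip install kagglehub"
        )
    return kagglehub.dataset_download(handle)
```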
Documentation
add groq configuration and embed updates email form (#301 ) (320a542 )
add missing docstrings and type hints for code clarity (#221 ) (38d34a0 )
Chores
fix deprecated methods for dataset loading with scripts (#267 ) (4c503f6 )
GitHub Terraform: Create/Update .github/workflows/code-freeze-bypass.yaml [skip ci] (5b08987 )
GitHub Terraform: Create/Update .github/workflows/code-freeze-bypass.yaml [skip ci] (aa1ab26 )
groq: docs and tests for streaming; stream=true default (#319 ) (fa1a8d0 )
pre-commit hook for test-registry-imports (#334 ) (85407e1 )
prune README, move extra info to docs (#336 ) (f57a340 )
push openbench-core to pyx (#292 ) (0d28e5d )
reduce math EvalGroup to most recent tasks only (420dcb9 )
remove docvqa (#317 ) (9068c23 )
rename core package to openbench-core instead of openbench (#290 ) (2056f92 )
upgrade numpy version and update uv lock (#281 ) (15f2dbf )
Refactor
create shared image loading utilities for multimodal tasks (#305 ) (905932f )
move pass^k to a custom metric rather than scorer (#310 ) (ed0eb8d ) (see the sketch below)
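For context on the pass^k change above: pass^k, as described in the tau-bench work referenced in #294, is the probability that all k independent attempts at a task succeed; with n trials per task of which c succeed, it has the unbiased estimator C(c, k) / C(n, k). The sketch below shows that computation under those assumptions; it is not openbench's metric implementation, and the example trial counts are made up.

```python
# Unbiased estimator of pass^k: the probability that all k i.i.d. trials of a
# task succeed, given n trials with c observed successes (per the tau-bench
# definition). Sketch only; not openbench's metric implementation.
from math import comb
from statistics import mean

def pass_hat_k(n: int, c: int, k: int) -> float:
    """C(c, k) / C(n, k); zero when fewer than k successes were observed."""
    if k > n:
        raise ValueError("k cannot exceed the number of trials n")
    return comb(c, k) / comb(n, k)

# Example: per-task success counts out of n=8 trials, aggregated with k=4.
successes_per_task = [8, 6, 3, 7]
score = mean(pass_hat_k(8, c, 4) for c in successes_per_task)
print(f"pass^4 = {score:.3f}")
```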