v0.5.3

@github-actions github-actions released this 09 Dec 00:50
54fa998

0.5.3 (2025-12-08)

Features

  • add --max-tasks option for concurrent task execution in eval command (#279) (241e653)
  • add bbq benchmark (#255) (46f4744)
  • add ChartQAPro (#289) (677f7c7)
  • add configurable HuggingFace Hub config naming (#261) (8abe2ae)
  • add DocVQA benchmark (#297) (0dd0edf)
  • add fuzzy match suggestion for misspelled evals (#303) (625a7b3)
  • add ifbench benchmark (#326) (bd730c2)
  • add math EvalGroup (#263) (e0f4a9b)
  • add MathVista benchmark (#298) (5c50a8f)
  • add MMLU-Redux benchmark from lighteval (#321) (d22a587)
  • add MMVet V2 benchmark (#296) (66689de)
  • add OCRBench V2 benchmark (#295) (71f3589)
  • add optional extras for simpleqa and toxicity (#266) (2450ddf)
  • add sealqa benchmark (#283) (06b39e4)
  • add SMT 2024 benchmarks (#239) (5d9b475)
  • add tau bench, pass^k metric (#294) (2bb1242)
  • agentdojo: port agentdojo benchmark (#223) (1cf174c)
  • cli: added export command to export specific logs to hf (#265) (62e8d8c)
  • cvebench: added auto prepare env set up for cvebench (#259) (db238a3)
  • deepresearch-bench: add deepresearch bench (#288) (d2b4622)
  • docs: docs for unsupported providers (#312) (3a3d4b8)
  • docs: search capability benchmarks feature page (#287) (9dd27c1)
  • evals: add GSM8K benchmark with shared grade school math scorer (#322) (4559a67)
  • evals: add QA benchmarks and shared scorer (#323) (0ea3733)
  • factscore: added support for factscore (#258) (13aafd7)
  • gpt_oss: add GPT-OSS AIME benchmark, make --epochs optional and stop default 1 from being forced down (#284) (815f51b)
  • groq: implement configurable timeout for GroqAPI client (#271) (be492b6)
  • groq: streaming support (#313) (c1a20be)
  • m2s: added support for single-turn conversion of 3 multi-turn jailbreak datasets (mhj, safeMT, cosafe) (#222) (6b8f2b1)
  • PolygloToxicityPrompts: add multilingual toxicity evaluation (#262) (46de7ee)
  • provider: add helicone support (#275) (de6ab04)
  • provider: add SiliconFlow provider support (#269) (ce14070)
  • providers: add W&B Inference model provider (#264) (a02c34f)
  • rocketscience: add rocketscience benchmark support (#277) (73bcfc2)
  • simpleqa_verified: add SimpleQA Verified benchmark (#249) (8a512c4)
  • vllm: add openbench override for Inspect AI's built-in vllm provider that doesn't start a server (#272) (d0eff6f)
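
As an illustration of the fuzzy match suggestion for misspelled evals (#303), the idea can be sketched with Python's standard-library `difflib`. This is a minimal sketch, not the project's actual implementation: the `KNOWN_EVALS` list and the `suggest_evals` helper are hypothetical names chosen for this example, and the cutoff value is an assumption.

```python
import difflib

# Hypothetical registry of eval names, for illustration only.
KNOWN_EVALS = ["gsm8k", "mathvista", "mmlu_redux", "ifbench", "bbq", "sealqa"]


def suggest_evals(name, known=KNOWN_EVALS, limit=3, cutoff=0.6):
    """Return up to `limit` known eval names that closely resemble `name`.

    Candidates scoring below `cutoff` (a similarity ratio between 0 and 1)
    are dropped, so an unrelated string yields an empty list.
    """
    return difflib.get_close_matches(name.lower(), known, n=limit, cutoff=cutoff)


if __name__ == "__main__":
    typo = "gsm8"
    matches = suggest_evals(typo)
    if matches:
        print(f"Unknown eval '{typo}'. Did you mean: {', '.join(matches)}?")
    else:
        print(f"Unknown eval '{typo}'.")
```

A similarity cutoff keeps the suggestion list short; a lower cutoff would surface more candidates at the cost of noisier hints.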

Bug Fixes

  • add args to eval command (#276) (0e06988)
  • allow subtasks into eval group summary (#306) (ae82757)
  • deps: catch import warnings from optional deps (#327) (434fe88)
  • docs: markdown formatting issue (#314) (24af36f)
  • docs: reasoning-effort docs clarity (#278) (2644619)
  • docvqa: remove docvqa from config and dep group (#328) (162b8b5)
  • factscore import issues, vLLM timeout bug (#273) (1674528)
  • factscore: fix module level import error for optional dep (#274) (99594ff)
  • fix global import warning for optional dep (#307) (c44c8de)
  • friendliai token env name (#286) (a197828)
  • livemcpbench: catch errors on call_tool and route (#260) (0ab746d)
  • math: shorten math group (#268) (19cc66b)
  • refactor factscore (#300) (ab3e84e)
  • remove nonexistent docvqa import (#318) (90a15a2)
  • rename gpt_oss_aime to gpt_oss_aime25 (b378715)
  • run mmmu as task instead of aggregate of subsets (#315) (623fbed)
  • simpleqa_verified: silence mypy for optional kagglehub import (#257) (32a1ff4)
  • using huggingface instead of kagglehub for simpleqa_verified benchmark (#270) (8ee1efa)

Documentation

  • add groq configuration and embed updates email form (#301) (320a542)
  • add missing docstrings and type hints for code clarity (#221) (38d34a0)

Chores

  • fix deprecated methods for dataset loading with scripts (#267) (4c503f6)
  • GitHub Terraform: Create/Update .github/workflows/code-freeze-bypass.yaml [skip ci] (5b08987)
  • GitHub Terraform: Create/Update .github/workflows/code-freeze-bypass.yaml [skip ci] (aa1ab26)
  • groq: docs and tests for streaming; stream=true default (#319) (fa1a8d0)
  • pre-commit hook for test-registry-imports (#334) (85407e1)
  • prune README, move extra info to docs (#336) (f57a340)
  • push openbench-core to pyx (#292) (0d28e5d)
  • reduce math EvalGroup to most recent tasks only (420dcb9)
  • remove docvqa (#317) (9068c23)
  • rename package to openbench-core instead of openbench (#290) (2056f92)
  • upgrade numpy version and update uv lock (#281) (15f2dbf)

Refactor

  • create shared image loading utilities for multimodal tasks (#305) (905932f)
  • move pass^k to a custom metric rather than scorer (#310) (ed0eb8d)
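
For readers unfamiliar with the pass^k metric mentioned above, here is a minimal sketch of how it is commonly estimated. It assumes the definition used in the tau-bench paper: for each task with `c` successes out of `n` independent trials, the chance that `k` randomly chosen trials all succeed is C(c, k) / C(n, k), averaged over tasks. The function name `pass_hat_k` and its signature are hypothetical and are not taken from this project's code.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k: the probability that k sampled trials ALL succeed.

    successes_per_task: successful-trial count for each task (hypothetical input shape)
    num_trials: number of trials run per task (n)
    k: number of trials that must all succeed
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("at least one task is required")

    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= num_trials:
            raise ValueError("success count must be between 0 and num_trials")
        # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
        total += comb(c, k) / comb(num_trials, k)
    return total / len(successes_per_task)


if __name__ == "__main__":
    # Three tasks, four trials each, with 4, 2, and 0 successes respectively.
    counts = [4, 2, 0]
    for k in (1, 2, 4):
        print(f"pass^{k} = {pass_hat_k(counts, num_trials=4, k=k):.4f}")
```

Unlike pass@k, which rewards at least one success among k attempts, pass^k falls as k grows unless the model succeeds consistently, which is why it is used as a reliability measure.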