Commit 54fa998

chore: release 0.5.3 (#256)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent f57a340 commit 54fa998

3 files changed: 93 additions, 2 deletions

.release-please-manifest.json

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 {
-  ".": "0.5.2"
+  ".": "0.5.3"
 }

CHANGELOG.md

Lines changed: 91 additions & 0 deletions
@@ -1,5 +1,96 @@
 # Changelog
 
+## [0.5.3](https://github.com/groq/openbench/compare/v0.5.2...v0.5.3) (2025-12-08)
+
+
+### Features
+
+* add --max-tasks option for concurrent task execution in eval command ([#279](https://github.com/groq/openbench/issues/279)) ([241e653](https://github.com/groq/openbench/commit/241e65392b04c747ab92b6ca7fcf3af825326869))
+* add bbq benchmark ([#255](https://github.com/groq/openbench/issues/255)) ([46f4744](https://github.com/groq/openbench/commit/46f4744fa0e381cc50202532499efb83be1aba0c))
+* add ChartQAPro ([#289](https://github.com/groq/openbench/issues/289)) ([677f7c7](https://github.com/groq/openbench/commit/677f7c7aa7f035f798e9135756cfadbebcac0368))
+* add configurable HuggingFace Hub config naming ([#261](https://github.com/groq/openbench/issues/261)) ([8abe2ae](https://github.com/groq/openbench/commit/8abe2aefb3fe11ec4bb9c36deaf1ee4a21cf2656))
+* add DocVQA benchmark ([#297](https://github.com/groq/openbench/issues/297)) ([0dd0edf](https://github.com/groq/openbench/commit/0dd0edf4fc606a6f967548083eae31149e72922f))
+* add fuzzy match suggestion for misspelled evals ([#303](https://github.com/groq/openbench/issues/303)) ([625a7b3](https://github.com/groq/openbench/commit/625a7b333c42ecd653a7d98ee15f154acc7e59d1))
+* add ifbench benchmark ([#326](https://github.com/groq/openbench/issues/326)) ([bd730c2](https://github.com/groq/openbench/commit/bd730c25fd95fb602f72d45a06cd3744cc08da0a))
+* add math EvalGroup ([#263](https://github.com/groq/openbench/issues/263)) ([e0f4a9b](https://github.com/groq/openbench/commit/e0f4a9b9ab9ebdc327b601bc5d8c1ee52c2e878d))
+* add MathVista benchmark ([#298](https://github.com/groq/openbench/issues/298)) ([5c50a8f](https://github.com/groq/openbench/commit/5c50a8fd68211257764aea2160dca61fd62c28b9))
+* add MMLU-Redux benchmark from lighteval ([#321](https://github.com/groq/openbench/issues/321)) ([d22a587](https://github.com/groq/openbench/commit/d22a587d98fb27a3c00e2d557805749e8f3b2bcb))
+* add MMVet V2 benchmark ([#296](https://github.com/groq/openbench/issues/296)) ([66689de](https://github.com/groq/openbench/commit/66689de4622f477f8d88cb37c0d541174a91a81d))
+* add OCRBench V2 benchmark ([#295](https://github.com/groq/openbench/issues/295)) ([71f3589](https://github.com/groq/openbench/commit/71f3589f6802d1df587f3c288be608bef622c0fb))
+* add optional extras for simpleqa and toxicity ([#266](https://github.com/groq/openbench/issues/266)) ([2450ddf](https://github.com/groq/openbench/commit/2450ddf76ad546784fbf63ff25f5f48715d65516))
+* add sealqa benchmark ([#283](https://github.com/groq/openbench/issues/283)) ([06b39e4](https://github.com/groq/openbench/commit/06b39e465cb73241e6cc4289c6417f1e347b7620))
+* add SMT 2024 benchmarks ([#239](https://github.com/groq/openbench/issues/239)) ([5d9b475](https://github.com/groq/openbench/commit/5d9b4752133aeb5562f0f22c793a391dcb70bde4))
+* add tau bench, pass^k metric ([#294](https://github.com/groq/openbench/issues/294)) ([2bb1242](https://github.com/groq/openbench/commit/2bb12420c87776cd0570b9aae5c6b9f09967408a))
+* **agentdojo:** port agentdojo benchmark ([#223](https://github.com/groq/openbench/issues/223)) ([1cf174c](https://github.com/groq/openbench/commit/1cf174c99fd03ed941073bb0e257e3ce8719b03d))
+* **cli:** added export command to export specific logs to hf ([#265](https://github.com/groq/openbench/issues/265)) ([62e8d8c](https://github.com/groq/openbench/commit/62e8d8c8dfc2f79f2fd1f031cbebce9979dcdf8a))
+* **cvebench:** added auto prepare env set up for cvebench ([#259](https://github.com/groq/openbench/issues/259)) ([db238a3](https://github.com/groq/openbench/commit/db238a3ce5417ebee9a89206ac860c52c83a2b92))
+* **deepresearch-bench:** add deepresearch bench ([#288](https://github.com/groq/openbench/issues/288)) ([d2b4622](https://github.com/groq/openbench/commit/d2b4622e4508901a3829af50806733a431c0152c))
+* **docs:** docs for unsupported providers ([#312](https://github.com/groq/openbench/issues/312)) ([3a3d4b8](https://github.com/groq/openbench/commit/3a3d4b81230d4b574cf106169c09fa2c7f9fce3d))
+* **docs:** search capability benchmarks feature page ([#287](https://github.com/groq/openbench/issues/287)) ([9dd27c1](https://github.com/groq/openbench/commit/9dd27c19ac49571ba42d543a2eeb0b5bb438591d))
+* **evals:** add GSM8K benchmark with shared grade school math scorer ([#322](https://github.com/groq/openbench/issues/322)) ([4559a67](https://github.com/groq/openbench/commit/4559a6731f6ddd2b3e5aa3a26806c31f689e5f28))
+* **evals:** add QA benchmarks and shared scorer ([#323](https://github.com/groq/openbench/issues/323)) ([0ea3733](https://github.com/groq/openbench/commit/0ea373319e27238b2162851f1807dfc95f98fe00))
+* **factscore:** added support for factscore ([#258](https://github.com/groq/openbench/issues/258)) ([13aafd7](https://github.com/groq/openbench/commit/13aafd783f975a94dc01b43a814b89b9f322651d))
+* **gpt_oss:** add GPT-OSS AIME benchmark, make --epochs optional and stop default 1 from being forced down ([#284](https://github.com/groq/openbench/issues/284)) ([815f51b](https://github.com/groq/openbench/commit/815f51bee6e294016677c501636e4ce20eaa4070))
+* **groq:** implement configurable timeout for GroqAPI client ([#271](https://github.com/groq/openbench/issues/271)) ([be492b6](https://github.com/groq/openbench/commit/be492b6d3478e00420b5a56ad45c2582bf36becf))
+* **groq:** streaming support ([#313](https://github.com/groq/openbench/issues/313)) ([c1a20be](https://github.com/groq/openbench/commit/c1a20be0e1eb6179e7d287109f09a7cb875b88fb))
+* **m2s:** added support for single turn conversion of 3 multi turn jailbreak datasets (mhj, safeMT, cosafe) ([#222](https://github.com/groq/openbench/issues/222)) ([6b8f2b1](https://github.com/groq/openbench/commit/6b8f2b1bf938659310ddbf7cc3f5c85644f096e5))
+* **PolygloToxicityPrompts:** add multilingual toxicity evaluation ([#262](https://github.com/groq/openbench/issues/262)) ([46de7ee](https://github.com/groq/openbench/commit/46de7ee0f6c1a34514a092e49f04ed9ada355dce))
+* **provider:** add helicone support ([#275](https://github.com/groq/openbench/issues/275)) ([de6ab04](https://github.com/groq/openbench/commit/de6ab04a50b4631c69c87716edf1432e3367967d))
+* **provider:** add SiliconFlow provider support ([#269](https://github.com/groq/openbench/issues/269)) ([ce14070](https://github.com/groq/openbench/commit/ce140708f5d9a6234c88e8b98c0ada3c2da7d590))
+* **providers:** add W&B Inference model provider ([#264](https://github.com/groq/openbench/issues/264)) ([a02c34f](https://github.com/groq/openbench/commit/a02c34fa693668d56795754b21ff59381eeda0b9))
+* **rocketscience:** add rocketscience benchmark support ([#277](https://github.com/groq/openbench/issues/277)) ([73bcfc2](https://github.com/groq/openbench/commit/73bcfc273292f3e4d603fb5eb56484cc3a4e85a4))
+* **simpleqa_verified:** add SimpleQA Verified benchmark ([#249](https://github.com/groq/openbench/issues/249)) ([8a512c4](https://github.com/groq/openbench/commit/8a512c48613a752574248c6acbb106a9bb8a2927))
+* **vllm:** add openbench override for Inspect AI's built-in vllm provider that doesn't start a server ([#272](https://github.com/groq/openbench/issues/272)) ([d0eff6f](https://github.com/groq/openbench/commit/d0eff6f3dc6e4ec329d087bc538fb0621c4d2b2e))
+
+
+### Bug Fixes
+
+* add args to eval command ([#276](https://github.com/groq/openbench/issues/276)) ([0e06988](https://github.com/groq/openbench/commit/0e06988a7e15a55041719b5cefc129a43aa77aa7))
+* allow subtasks into eval group summary ([#306](https://github.com/groq/openbench/issues/306)) ([ae82757](https://github.com/groq/openbench/commit/ae82757e892a2c9302dd962bf46871f8ae1a52bf))
+* **deps:** catch import warnings from optional deps ([#327](https://github.com/groq/openbench/issues/327)) ([434fe88](https://github.com/groq/openbench/commit/434fe8877d00a58717ab8c6ddbca2a6daa726ad0))
+* **docs:** markdown formatting issue ([#314](https://github.com/groq/openbench/issues/314)) ([24af36f](https://github.com/groq/openbench/commit/24af36f3f67ed4c51b0539b498e54c1840af31ed))
+* **docs:** reasoning-effort docs clarity ([#278](https://github.com/groq/openbench/issues/278)) ([2644619](https://github.com/groq/openbench/commit/2644619a6523f27f5e446f2df89d3ed3a473bad1))
+* **docvqa:** remove docvqa from config and dep group ([#328](https://github.com/groq/openbench/issues/328)) ([162b8b5](https://github.com/groq/openbench/commit/162b8b54a7632f78b6486d48044870520bbaf167))
+* factscore import issues, vLLM timeout bug ([#273](https://github.com/groq/openbench/issues/273)) ([1674528](https://github.com/groq/openbench/commit/1674528799fd0be99d6f684bc5d59ead66fa6fa8))
+* **factscore:** fix module level import error for optional dep ([#274](https://github.com/groq/openbench/issues/274)) ([99594ff](https://github.com/groq/openbench/commit/99594ff2868bbaae1f1ad5ccd7dd783636679698))
+* fix global import warning for optional dep ([#307](https://github.com/groq/openbench/issues/307)) ([c44c8de](https://github.com/groq/openbench/commit/c44c8dee98e757358f85a4e0baf98d9730fa4031))
+* friendliai token env name ([#286](https://github.com/groq/openbench/issues/286)) ([a197828](https://github.com/groq/openbench/commit/a197828a0ce3e9bb73330e33baab1df606fa7f2a))
+* **livemcpbench:** catch errors on call_tool and route ([#260](https://github.com/groq/openbench/issues/260)) ([0ab746d](https://github.com/groq/openbench/commit/0ab746dd6788048ab267cb2cee4615492c607cf9))
+* **math:** shorten math group ([#268](https://github.com/groq/openbench/issues/268)) ([19cc66b](https://github.com/groq/openbench/commit/19cc66b3779bd1442e49193a53af3e39201887f5))
+* refactor factscore ([#300](https://github.com/groq/openbench/issues/300)) ([ab3e84e](https://github.com/groq/openbench/commit/ab3e84ef91d3aba05281a790af30aa36adcecc91))
+* remove nonexistent docvqa import ([#318](https://github.com/groq/openbench/issues/318)) ([90a15a2](https://github.com/groq/openbench/commit/90a15a2bb5dd5010edaa595685d9c5ffab38419f))
+* rename gpt_oss_aime to gpt_oss_aime25 ([b378715](https://github.com/groq/openbench/commit/b3787156f4a09fa8ddff7f3f54daac98bbde8536))
+* run mmmu as task instead of aggregate of subsets ([#315](https://github.com/groq/openbench/issues/315)) ([623fbed](https://github.com/groq/openbench/commit/623fbed92900cb9e6b0118ccded45ff5689c45af))
+* **simpleqa_verified:** silence mypy for optional kagglehub import ([#257](https://github.com/groq/openbench/issues/257)) ([32a1ff4](https://github.com/groq/openbench/commit/32a1ff4582ba8228ab11837e550c0c6d2612f8a3))
+* using huggingface instead of kagglehub for simpleqa_verified benchmark ([#270](https://github.com/groq/openbench/issues/270)) ([8ee1efa](https://github.com/groq/openbench/commit/8ee1efaa3712e83d40c30550ec92f33078febc7b))
+
+
+### Documentation
+
+* add groq configuration and embed updates email form ([#301](https://github.com/groq/openbench/issues/301)) ([320a542](https://github.com/groq/openbench/commit/320a542883443a7e0658223636256eb4bfdfae0a))
+* add missing docstrings and type hints for code clarity ([#221](https://github.com/groq/openbench/issues/221)) ([38d34a0](https://github.com/groq/openbench/commit/38d34a0882c04284f3f4f2ed7bebfe2b73766d0d))
+
+
+### Chores
+
+* fix deprecated methods for dataset loading with scripts ([#267](https://github.com/groq/openbench/issues/267)) ([4c503f6](https://github.com/groq/openbench/commit/4c503f6ac10c192dee40a4c979aac01302661a11))
+* GitHub Terraform: Create/Update .github/workflows/code-freeze-bypass.yaml [skip ci] ([5b08987](https://github.com/groq/openbench/commit/5b0898792bba99e8ef086b1bc5fca3f829185864))
+* GitHub Terraform: Create/Update .github/workflows/code-freeze-bypass.yaml [skip ci] ([aa1ab26](https://github.com/groq/openbench/commit/aa1ab26680687e58f7fb1e2a0f7b0e5c79acc9bd))
+* **groq:** docs and tests for streaming; stream=true default ([#319](https://github.com/groq/openbench/issues/319)) ([fa1a8d0](https://github.com/groq/openbench/commit/fa1a8d0755b06c8c45289e05dc6226733419f0dc))
+* pre-commit hook for test-registry-imports ([#334](https://github.com/groq/openbench/issues/334)) ([85407e1](https://github.com/groq/openbench/commit/85407e16ff27921db0e6f947823738b7212f0ace))
+* prune README, move extra info to docs ([#336](https://github.com/groq/openbench/issues/336)) ([f57a340](https://github.com/groq/openbench/commit/f57a340a51fdf2e5d228e98e029e4e6f109f4689))
+* push openbench-core to pyx ([#292](https://github.com/groq/openbench/issues/292)) ([0d28e5d](https://github.com/groq/openbench/commit/0d28e5d4c19aed0eb8dc61d848efba954e3a6181))
+* reduce math EvalGroup to most recent tasks only ([420dcb9](https://github.com/groq/openbench/commit/420dcb9ef13c0c4afaa7b38c9616dae24d35766e))
+* remove docvqa ([#317](https://github.com/groq/openbench/issues/317)) ([9068c23](https://github.com/groq/openbench/commit/9068c2388942cb3ea493de68ab915d78d0003f4c))
+* rename openbench-core to openbench-core instead of openbench ([#290](https://github.com/groq/openbench/issues/290)) ([2056f92](https://github.com/groq/openbench/commit/2056f9209cb4ea2fc0e03181c24c4694dc7c7ede))
+* upgrade numpy version and update uv lock ([#281](https://github.com/groq/openbench/issues/281)) ([15f2dbf](https://github.com/groq/openbench/commit/15f2dbf6d6a67ff12b2910377179400b82b1d2b3))
+
+
+### Refactor
+
+* create shared image loading utilities for multimodal tasks ([#305](https://github.com/groq/openbench/issues/305)) ([905932f](https://github.com/groq/openbench/commit/905932f9c7ae61a55b19aab78e8198ce95c2eef2))
+* move pass^k to a custom metric rather than scorer ([#310](https://github.com/groq/openbench/issues/310)) ([ed0eb8d](https://github.com/groq/openbench/commit/ed0eb8d2ee83fec7914d391376a1fec5e54a142d))
+
 ## [0.5.2](https://github.com/groq/openbench/compare/v0.5.1...v0.5.2) (2025-10-16)
 
 
pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "openbench"
-version = "0.5.2"
+version = "0.5.3"
 requires-python = ">=3.10"
 description = "openbench - open source, replicable, and standardized evaluation infrastructure"
 readme = "README.md"
