Pinned
- laude-institute/terminal-bench (Public)
  A benchmark for LLMs on complicated tasks in the terminal
- lm-evaluation-harness (Public, forked from EleutherAI/lm-evaluation-harness)
  A framework for few-shot evaluation of language models.
  Python
- harbor (Public, forked from laude-institute/harbor)
  Harbor is a framework for running agent evaluations and creating and using RL environments.
  Python
- polybench-parsers (Public)
  Test output parsers for various programming languages and testing frameworks
  Python

