**Milestone Description**

This milestone aims to introduce an LLM-as-Judge evaluation framework that leverages a large language model to score or rank system outputs against predefined criteria. This approach enables more flexible and human-aligned evaluation for tasks where simple string matching or numeric scoring is insufficient.

Key objectives include:

- Designing a promptable evaluation API that uses an LLM to judge output quality
- Supporting customizable scoring rubrics and evaluation dimensions
- Allowing multiple judging strategies (e.g., numeric score, pairwise comparison, categorical labels)
- Ensuring reproducibility through temperature control, system prompts, and deterministic settings
- Providing baseline judge prompts for common tasks (e.g., helpfulness, correctness, style, safety)
- Adding utilities for batching, retries, and cost tracking during judge evaluations

This milestone will enable more human-like, instruction-aligned evaluation workflows across diverse tasks.
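As a rough illustration of the intended shape of such an API, the sketch below wires a scoring rubric, a deterministic prompt template, and a simple numeric-score parser around an injected model call. All names here (`JudgeRubric`, `llm_judge_score`, `call_model`) are hypothetical and not part of any existing interface.

```python
# A minimal sketch of a promptable judge API; the concrete LLM call is injected
# as a callable so the example runs without an API key. Names are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeRubric:
    """Scoring criteria and scale passed to the judge model."""
    criteria: str
    scale_min: int = 1
    scale_max: int = 10

def build_judge_prompt(question: str, answer: str, rubric: JudgeRubric) -> str:
    """Render a fixed judge prompt from the rubric (same inputs -> same prompt)."""
    return (
        f"Rate the answer on: {rubric.criteria}.\n"
        f"Respond with a single integer between {rubric.scale_min} and {rubric.scale_max}.\n\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )

def llm_judge_score(
    question: str,
    answer: str,
    rubric: JudgeRubric,
    call_model: Callable[[str], str],
) -> int:
    """Ask the judge model for a numeric score, parse it naively, and clamp it."""
    reply = call_model(build_judge_prompt(question, answer, rubric))
    digits = "".join(ch for ch in reply if ch.isdigit())  # naive parse of "Score: 8"
    score = int(digits) if digits else rubric.scale_min
    return max(rubric.scale_min, min(rubric.scale_max, score))

if __name__ == "__main__":
    # Stub judge model so the sketch is self-contained.
    fake_model = lambda prompt: "8"
    rubric = JudgeRubric(criteria="helpfulness and factual correctness")
    print(llm_judge_score("What is 2+2?", "4", rubric, fake_model))  # -> 8
```

In a real judge, the injected `call_model` would wrap an LLM client with temperature 0 and a fixed system prompt to keep runs reproducible; pairwise and categorical strategies would follow the same pattern with different prompt templates and parsers.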
This milestone focuses on adding support for evaluating mathematical problem-solving capabilities. The evaluation should rely on structured answers extracted from the model's response, typically using identifiable patterns such as LaTeX-style expressions (e.g., \boxed{...}).

**Key objectives include:**

- Implementing regex-based extraction of final answers from LLM responses
- Supporting commonly used mathematical answer formats (e.g., \boxed{}, \answer{}, inline math $...$)
- Adding evaluation logic to compare extracted answers with ground truth
- Providing an easy-to-use API for math-specific evaluation workflows
- Ensuring robust handling of formatting variations and whitespace normalization

Delivering this milestone will enable precise and automated evaluation for math-oriented tasks, problem sets, and structured-answer benchmarks.
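A minimal sketch of what regex-based extraction and comparison could look like, assuming a last-`\boxed{...}`-wins convention and simple whitespace normalization. The helper names are hypothetical, and nested braces inside `\boxed{}` are not handled here.

```python
# Sketch: extract the final \boxed{...} answer and compare it to ground truth.
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def normalize(answer: str) -> str:
    """Light normalization: drop whitespace and surrounding $...$ markers."""
    return re.sub(r"\s+", "", answer).strip("$")

def math_answers_match(response: str, ground_truth: str) -> bool:
    """Compare the extracted final answer against the ground-truth string."""
    extracted = extract_boxed_answer(response)
    return extracted is not None and normalize(extracted) == normalize(ground_truth)

if __name__ == "__main__":
    reply = "Summing both terms gives 42, so the answer is \\boxed{ 42 }."
    print(math_answers_match(reply, "42"))  # -> True
```

A production version would likely add patterns for `\answer{}` and inline `$...$`, plus symbolic equivalence checks rather than plain string comparison.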
This milestone aims to introduce logit-based evaluation capabilities into the package.

Key objectives include:

- Enabling access to model logits during inference or evaluation
- Providing APIs/utilities for computing evaluation scores directly from logits
- Supporting batch or large-scale evaluation workflows
- Improving precision and reproducibility for tasks involving likelihood comparison or token-level analysis

Delivering this milestone will expand the package's evaluation capabilities and allow more fine-grained, consistent, and research-friendly workflows.
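The sketch below illustrates one way such likelihood comparison might work: given a `(sequence_length, vocab_size)` logits array aligned with a candidate's token ids, it sums per-token log-probabilities and picks the most likely option. It uses plain NumPy and synthetic logits so it runs without a model; the function names are hypothetical.

```python
# Sketch: score candidate continuations directly from raw logits.
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_log_likelihood(logits: np.ndarray, token_ids: list[int]) -> float:
    """Sum the log-probabilities of the observed tokens under the logits."""
    log_probs = log_softmax(logits)
    return float(sum(log_probs[i, t] for i, t in enumerate(token_ids)))

def pick_best_option(option_logits: dict[str, np.ndarray],
                     option_tokens: dict[str, list[int]]) -> str:
    """Return the option whose tokens are most likely under its own logits."""
    scores = {
        name: sequence_log_likelihood(option_logits[name], option_tokens[name])
        for name in option_logits
    }
    return max(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy logits: two 3-token options over a 10-token vocabulary.
    logits = {"A": rng.normal(size=(3, 10)), "B": rng.normal(size=(3, 10))}
    tokens = {"A": [1, 4, 2], "B": [0, 3, 3]}
    print(pick_best_option(logits, tokens))
```

Because the score is computed deterministically from the logits rather than from sampled text, repeated runs on the same inputs yield identical results, which is what makes this style of evaluation attractive for likelihood comparison and token-level analysis.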