Product Context
Evals are increasingly becoming the preferred way for developers to test the behavior of their LLM-powered features. Imagine wiring evals into CI, bringing the same rapid feedback as test failures within a PR context.
How would we do it?
Sentry's vitest-evals (getsentry/vitest-evals#13) will output eval results. Each result contains a score, as in the block below:
```json
"meta": {
  "eval": {
    "scores": [
      {
        "score": 0.6,
        "metadata": {
          "rationale": "The submitted answer is a superset of the expert answer and is fully consistent with it. The expert answer identifies the root cause as a mismatch in the bottle ID passed to the `bottleById` function, which results in a 'Bottle not found' error. The submitted answer includes this same root cause but provides additional details, such as the specific IDs involved (3216 and 16720), and offers a comprehensive proposed solution and implementation strategy. This includes steps to inspect and correct the client-side code, verify parameter mapping, and test the fix. Therefore, the submission expands upon the expert's analysis without contradicting it."
        },
        "name": "Factuality2"
      }
    ],
    "avgScore": 0.6
  }
}
```
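As a rough sketch of the CI side, a step could parse that block and render a markdown summary for the PR. The `EvalMeta` shape below mirrors the JSON fragment above; everything else (function names, table layout) is an assumption for illustration:

```typescript
// Sketch only: parse the eval block and render a markdown summary.
// The interfaces mirror the JSON fragment shown above; anything
// beyond that fragment is an assumption.

interface EvalScore {
  score: number;
  name: string;
  metadata?: { rationale?: string };
}

interface EvalMeta {
  eval: { scores: EvalScore[]; avgScore: number };
}

function renderEvalSummary(meta: EvalMeta): string {
  const rows = meta.eval.scores
    .map((s) => `| ${s.name} | ${s.score.toFixed(2)} |`)
    .join("\n");
  return [
    "### Eval results",
    "",
    "| Scorer | Score |",
    "| --- | --- |",
    rows,
    "",
    `**Average:** ${meta.eval.avgScore.toFixed(2)}`,
  ].join("\n");
}

// Example input matching the fragment above (rationale elided).
const meta: EvalMeta = {
  eval: {
    scores: [
      { score: 0.6, name: "Factuality2", metadata: { rationale: "…" } },
    ],
    avgScore: 0.6,
  },
};

console.log(renderEvalSummary(meta));
```

The rationale text could also be folded into a collapsible `<details>` block so long explanations don't dominate the comment.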
Display the score, its metadata, and the scorer name on the PR (alongside test results) for every commit that uploads test results plus eval output.
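Posting that summary to the PR can go through the GitHub REST API (PR comments use the issues endpoint). The repo and PR number below are placeholders; a real CI job would read them from the environment (e.g. `GITHUB_REPOSITORY` and the event payload) and authenticate with `GITHUB_TOKEN`:

```typescript
// Sketch: build the GitHub REST request for posting a PR comment.
// Repo slug and PR number are hypothetical placeholders.

interface CommentRequest {
  url: string;
  method: "POST";
  body: { body: string };
}

function buildCommentRequest(
  repo: string, // "owner/name"
  prNumber: number,
  markdown: string,
): CommentRequest {
  // PR comments are created via the issues comments endpoint:
  // POST /repos/{owner}/{repo}/issues/{issue_number}/comments
  return {
    url: `https://api.github.com/repos/${repo}/issues/${prNumber}/comments`,
    method: "POST",
    body: { body: markdown },
  };
}

const req = buildCommentRequest("getsentry/example", 42, "### Eval results\n…");
console.log(req.url);

// A real run would then send it, roughly:
// await fetch(req.url, {
//   method: req.method,
//   headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` },
//   body: JSON.stringify(req.body),
// });
```

Updating a single sticky comment per PR (find-and-edit instead of posting anew on each commit) would keep the thread readable across many commits.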