Product Context
Evals are increasingly becoming the preferred way for developers to test the behavior of their LLM-powered features. Imagine wiring evals into CI, bringing the same rapid feedback as test failures within a PR context.
How would we do it?
Sentry's vitest-evals (getsentry/vitest-evals#13) will output eval results. Each result contains a score, as in the block below:
```json
"meta": {
  "eval": {
    "scores": [
      {
        "score": 0.6,
        "metadata": {
          "rationale": "The submitted answer is a superset of the expert answer and is fully consistent with it. The expert answer identifies the root cause as a mismatch in the bottle ID passed to the `bottleById` function, which results in a 'Bottle not found' error. The submitted answer includes this same root cause but provides additional details, such as the specific IDs involved (3216 and 16720), and offers a comprehensive proposed solution and implementation strategy. This includes steps to inspect and correct the client-side code, verify parameter mapping, and test the fix. Therefore, the submission expands upon the expert's analysis without contradicting it."
        },
        "name": "Factuality2"
      }
    ],
    "avgScore": 0.6
  }
}
```
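As a rough sketch of the CI side, a step could parse that block and render a markdown summary for the PR. The `EvalMeta` shape below mirrors the JSON fragment above; everything else (function names, table layout) is an assumption for illustration:

```typescript
// Sketch only: parse the eval block and render a markdown summary.
// The interfaces mirror the JSON fragment shown above; anything
// beyond that fragment is an assumption.

interface EvalScore {
  score: number;
  name: string;
  metadata?: { rationale?: string };
}

interface EvalMeta {
  eval: { scores: EvalScore[]; avgScore: number };
}

function renderEvalSummary(meta: EvalMeta): string {
  const rows = meta.eval.scores
    .map((s) => `| ${s.name} | ${s.score.toFixed(2)} |`)
    .join("\n");
  return [
    "### Eval results",
    "",
    "| Scorer | Score |",
    "| --- | --- |",
    rows,
    "",
    `**Average:** ${meta.eval.avgScore.toFixed(2)}`,
  ].join("\n");
}

// Example input matching the fragment above (rationale elided).
const meta: EvalMeta = {
  eval: {
    scores: [
      { score: 0.6, name: "Factuality2", metadata: { rationale: "…" } },
    ],
    avgScore: 0.6,
  },
};

console.log(renderEvalSummary(meta));
```

The rationale text could also be folded into a collapsible `<details>` block so long explanations don't dominate the comment.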
Display the score, its metadata, and the scorer name on the PR (alongside test results) for every commit that uploads test results plus eval output.
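Posting that summary to the PR can go through the GitHub REST API (PR comments use the issues endpoint). The repo and PR number below are placeholders; a real CI job would read them from the environment (e.g. `GITHUB_REPOSITORY` and the event payload) and authenticate with `GITHUB_TOKEN`:

```typescript
// Sketch: build the GitHub REST request for posting a PR comment.
// Repo slug and PR number are hypothetical placeholders.

interface CommentRequest {
  url: string;
  method: "POST";
  body: { body: string };
}

function buildCommentRequest(
  repo: string, // "owner/name"
  prNumber: number,
  markdown: string,
): CommentRequest {
  // PR comments are created via the issues comments endpoint:
  // POST /repos/{owner}/{repo}/issues/{issue_number}/comments
  return {
    url: `https://api.github.com/repos/${repo}/issues/${prNumber}/comments`,
    method: "POST",
    body: { body: markdown },
  };
}

const req = buildCommentRequest("getsentry/example", 42, "### Eval results\n…");
console.log(req.url);

// A real run would then send it, roughly:
// await fetch(req.url, {
//   method: req.method,
//   headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` },
//   body: JSON.stringify(req.body),
// });
```

Updating a single sticky comment per PR (find-and-edit instead of posting anew on each commit) would keep the thread readable across many commits.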