Description
Problem
LLM evaluations depend on abundant, diverse, and regularly refreshed test data, yet manual dataset creation is slow, costly, and fails to capture complex reasoning, multi-hop queries, edge cases, and adversarial prompts. The result is incomplete coverage, inconsistent regression testing, and limited confidence in real-world performance.
Solution
I propose a synthetic test data pipeline that generates domain-grounded evaluations using a knowledge graph to structure entities and relations, personas to vary intents and language styles, and both single-hop and multi-hop synthesizers to control reasoning depth. The pipeline produces test sets balanced across difficulty levels and scenarios for continuous regression testing, with traceable provenance from graph nodes, persona templates, and hop strategies.
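To make "balanced across difficulty levels and scenarios" concrete, here is a minimal sketch of the quota logic I have in mind; the bucket names and the `allocate` helper are illustrative, not an existing API:

```python
# Hypothetical distribution: the fraction of the test set that each
# (reasoning depth, difficulty) bucket should occupy.
QUERY_DISTRIBUTION = {
    ("single_hop", "easy"): 0.25,
    ("single_hop", "hard"): 0.25,
    ("multi_hop", "easy"): 0.25,
    ("multi_hop", "hard"): 0.25,
}

def allocate(total: int, dist: dict) -> dict:
    """Split a target test-set size into per-bucket sample counts."""
    counts = {k: int(total * p) for k, p in dist.items()}
    # Flooring leaves a few samples unassigned; hand them to the largest buckets.
    leftover = total - sum(counts.values())
    for k in sorted(dist, key=dist.get, reverse=True)[:leftover]:
        counts[k] += 1
    return counts

print(allocate(50, QUERY_DISTRIBUTION))  # two buckets get 13, two get 12
```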
Knowledge Graph
We ingest source content, split it into chunks with configurable chunking strategies, then run extractors to label each chunk with entities and topics. Each chunk becomes a node, and we add edges based on topic similarity and entity overlap to connect related nodes. This yields a compact, navigable graph that mirrors the domain's structure and preserves provenance back to the original content.
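A sketch of the graph construction, under the assumption that entity overlap and topic similarity are measured as Jaccard similarity over extracted label sets; `ChunkNode`, `build_edges`, and the thresholds are placeholders for illustration:

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class ChunkNode:
    node_id: int
    text: str  # original chunk text, kept for provenance
    entities: set[str] = field(default_factory=set)  # from an extractor (LLM or NER)
    topics: set[str] = field(default_factory=set)

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def build_edges(nodes: list[ChunkNode],
                entity_threshold: float = 0.2,
                topic_threshold: float = 0.3) -> list[tuple]:
    """Connect chunks whose extracted entities or topics overlap enough."""
    edges = []
    for a, b in itertools.combinations(nodes, 2):
        e_sim, t_sim = jaccard(a.entities, b.entities), jaccard(a.topics, b.topics)
        if e_sim >= entity_threshold or t_sim >= topic_threshold:
            edges.append((a.node_id, b.node_id,
                          {"entity_overlap": e_sim, "topic_similarity": t_sim}))
    return edges
```

Keeping the raw chunk text and the edge scores on the graph is what preserves provenance back to the source content.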
Scenario Generation
We sample graph nodes to assemble context passages, then use single‑hop and multi‑hop synthesizers to generate queries that require shallow or compositional reasoning. Personas vary intent, tone, and language style, while configurable question lengths and formats ensure balanced coverage. Each scenario outputs the user query, the source context used, and a golden/reference answer with provenance to the contributing nodes.
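Putting it together, a hedged sketch of one synthesis step, reusing the `ChunkNode` and edge structures from the graph sketch above; `ask_llm` stands in for whatever LLM call the pipeline uses, and the persona strings and prompts are purely illustrative:

```python
import random
from dataclasses import dataclass

@dataclass
class Scenario:
    query: str
    contexts: list[str]
    reference: str    # golden answer
    provenance: dict  # contributing node ids, persona, hop strategy

PERSONAS = [
    "new user asking in plain language",
    "domain expert using precise terminology",
]

def synthesize(nodes, edges, multi_hop: bool, ask_llm) -> Scenario:
    """Sample context from the graph, then prompt for a query and reference answer."""
    if multi_hop:
        # Multi-hop: pick an edge so the query must combine two connected chunks.
        a_id, b_id, _ = random.choice(edges)
        picked = [n for n in nodes if n.node_id in (a_id, b_id)]
    else:
        picked = [random.choice(nodes)]
    persona = random.choice(PERSONAS)
    context = "\n\n".join(n.text for n in picked)
    query = ask_llm(f"As a {persona}, write one question answerable only from:\n{context}")
    reference = ask_llm(f"Answer using the context only.\nContext:\n{context}\nQuestion: {query}")
    return Scenario(query, [n.text for n in picked], reference,
                    {"node_ids": [n.node_id for n in picked],
                     "persona": persona,
                     "hops": 2 if multi_hop else 1})
```

The provenance dict is the piece that makes each generated test case traceable back to its graph nodes, persona, and hop strategy for later regression analysis.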