Description
Problem
LLM evaluations depend on abundant, diverse, and regularly refreshed test data, yet manual dataset creation is slow, costly, and fails to capture complex reasoning, multi-hop queries, edge cases, and adversarial prompts. The result is incomplete coverage, inconsistent regression testing, and limited confidence in real-world performance.
Solution
I propose a synthetic test data pipeline that generates domain-grounded evaluations using a knowledge graph to structure entities and relations, personas to vary intents and language styles, and both single-hop and multi-hop synthesizers to control reasoning depth. The pipeline produces test sets balanced across difficulty levels and scenarios for continuous regression testing, with traceable provenance from graph nodes, persona templates, and hop strategies.
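To make "balanced across difficulty levels and scenarios" concrete, here is a minimal sketch of the quota logic I have in mind; the bucket names and the `allocate` helper are illustrative, not an existing API:

```python
# Hypothetical distribution: the fraction of the test set that each
# (reasoning depth, difficulty) bucket should occupy.
QUERY_DISTRIBUTION = {
    ("single_hop", "easy"): 0.25,
    ("single_hop", "hard"): 0.25,
    ("multi_hop", "easy"): 0.25,
    ("multi_hop", "hard"): 0.25,
}

def allocate(total: int, dist: dict) -> dict:
    """Split a target test-set size into per-bucket sample counts."""
    counts = {k: int(total * p) for k, p in dist.items()}
    # Flooring leaves a few samples unassigned; hand them to the largest buckets.
    leftover = total - sum(counts.values())
    for k in sorted(dist, key=dist.get, reverse=True)[:leftover]:
        counts[k] += 1
    return counts

print(allocate(50, QUERY_DISTRIBUTION))  # two buckets get 13, two get 12
```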
Knowledge Graph
We ingest source content, split it into chunks with configurable chunking strategies, then run extractors to label each chunk with entities and topics. Each chunk becomes a node, and we add edges based on topic similarity and entity overlap to connect related nodes. This yields a compact, navigable graph that mirrors the domain's structure and preserves provenance back to the original content.
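A sketch of the graph construction, under the assumption that entity overlap and topic similarity are measured as Jaccard similarity over extracted label sets; `ChunkNode`, `build_edges`, and the thresholds are placeholders for illustration:

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class ChunkNode:
    node_id: int
    text: str  # original chunk text, kept for provenance
    entities: set[str] = field(default_factory=set)  # from an extractor (LLM or NER)
    topics: set[str] = field(default_factory=set)

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def build_edges(nodes: list[ChunkNode],
                entity_threshold: float = 0.2,
                topic_threshold: float = 0.3) -> list[tuple]:
    """Connect chunks whose extracted entities or topics overlap enough."""
    edges = []
    for a, b in itertools.combinations(nodes, 2):
        e_sim, t_sim = jaccard(a.entities, b.entities), jaccard(a.topics, b.topics)
        if e_sim >= entity_threshold or t_sim >= topic_threshold:
            edges.append((a.node_id, b.node_id,
                          {"entity_overlap": e_sim, "topic_similarity": t_sim}))
    return edges
```

Keeping the raw chunk text and the edge scores on the graph is what preserves provenance back to the source content.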
Scenario Generation
We sample graph nodes to assemble context passages, then use single‑hop and multi‑hop synthesizers to generate queries that require shallow or compositional reasoning. Personas vary intent, tone, and language style, while configurable question lengths and formats ensure balanced coverage. Each scenario outputs the user query, the source context used, and a golden/reference answer with provenance to the contributing nodes.
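Putting it together, a hedged sketch of one synthesis step, reusing the `ChunkNode` and edge structures from the graph sketch above; `ask_llm` stands in for whatever LLM call the pipeline uses, and the persona strings and prompts are purely illustrative:

```python
import random
from dataclasses import dataclass

@dataclass
class Scenario:
    query: str
    contexts: list[str]
    reference: str    # golden answer
    provenance: dict  # contributing node ids, persona, hop strategy

PERSONAS = [
    "new user asking in plain language",
    "domain expert using precise terminology",
]

def synthesize(nodes, edges, multi_hop: bool, ask_llm) -> Scenario:
    """Sample context from the graph, then prompt for a query and reference answer."""
    if multi_hop:
        # Multi-hop: pick an edge so the query must combine two connected chunks.
        a_id, b_id, _ = random.choice(edges)
        picked = [n for n in nodes if n.node_id in (a_id, b_id)]
    else:
        picked = [random.choice(nodes)]
    persona = random.choice(PERSONAS)
    context = "\n\n".join(n.text for n in picked)
    query = ask_llm(f"As a {persona}, write one question answerable only from:\n{context}")
    reference = ask_llm(f"Answer using the context only.\nContext:\n{context}\nQuestion: {query}")
    return Scenario(query, [n.text for n in picked], reference,
                    {"node_ids": [n.node_id for n in picked],
                     "persona": persona,
                     "hops": 2 if multi_hop else 1})
```

The provenance dict is the piece that makes each generated test case traceable back to its graph nodes, persona, and hop strategy for later regression analysis.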