
Synthetic Test Data Set Generation #292

@cantemizyurek

Description

Problem

LLM evaluations depend on abundant, diverse, and regularly refreshed test data, yet manual dataset creation is slow, costly, and fails to capture complex reasoning, multi-hop queries, edge cases, and adversarial prompts. The result is incomplete coverage, inconsistent regression testing, and limited confidence in real-world performance.

Solution

I propose a synthetic test data pipeline that generates domain-grounded evaluations using a knowledge graph to structure entities and relations, personas to vary intents and language styles, and both single-hop and multi-hop synthesizers to control reasoning depth. The pipeline produces test sets balanced across difficulty levels and scenarios for continuous regression, with traceable provenance from graph nodes, persona templates, and hop strategies.
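To make the knobs concrete, here is a minimal sketch of what the pipeline's configuration surface could look like. Every name here (`Persona`, `TestSetConfig`, `difficultyMix`, and so on) is hypothetical and only illustrates the parameters described above, not an existing API:

```ts
// Hypothetical configuration surface for the proposed pipeline.
// All identifiers are illustrative, not an existing API.

interface Persona {
  name: string;
  intent: string; // e.g. "how-to", "comparison", "adversarial"
  style: string;  // tone / language style applied to generated queries
}

type HopStrategy = "single-hop" | "multi-hop";

interface TestSetConfig {
  personas: Persona[];
  hopStrategies: HopStrategy[];
  // Target share of each difficulty bucket; should sum to 1.
  difficultyMix: Record<"easy" | "medium" | "hard", number>;
  testSetSize: number;
}

const config: TestSetConfig = {
  personas: [
    { name: "new user", intent: "how-to", style: "informal, short sentences" },
    { name: "domain expert", intent: "edge case", style: "precise, technical" },
  ],
  hopStrategies: ["single-hop", "multi-hop"],
  difficultyMix: { easy: 0.3, medium: 0.5, hard: 0.2 },
  testSetSize: 200,
};
```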

Knowledge Graph

(Figure: knowledge graph construction)

We ingest source content, split it into chunks with configurable chunking strategies, then run extractors to label each chunk with entities and topics. Each chunk becomes a node, and edges connect nodes with high topic similarity or entity overlap. This yields a compact, navigable graph that mirrors the domain's structure and preserves provenance back to the original content.
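As an illustration, here is a TypeScript sketch of the edge-building step under the assumptions above. Entity/topic extraction is stubbed out (in practice an LLM or NER extractor would fill those sets in), and `jaccard`/`buildEdges` are hypothetical names, with Jaccard overlap standing in for whatever similarity measure the extractors support:

```ts
// Minimal sketch of the graph-building step described above.
// Each chunk becomes a node; edges come from label overlap.

interface ChunkNode {
  id: number;
  text: string; // provenance: the original chunk
  entities: Set<string>;
  topics: Set<string>;
}

interface Edge {
  source: number;
  target: number;
  weight: number;
}

// Jaccard overlap between two label sets.
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

// Connect nodes whose combined entity overlap and topic similarity
// clear a threshold, so edges mirror domain structure.
function buildEdges(nodes: ChunkNode[], threshold = 0.3): Edge[] {
  const edges: Edge[] = [];
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const weight =
        0.5 * jaccard(nodes[i].entities, nodes[j].entities) +
        0.5 * jaccard(nodes[i].topics, nodes[j].topics);
      if (weight >= threshold) edges.push({ source: i, target: j, weight });
    }
  }
  return edges;
}
```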

Scenario Generation

(Figure: scenario generation flow)

We sample graph nodes to assemble context passages, then use single‑hop and multi‑hop synthesizers to generate queries that require shallow or compositional reasoning. Personas vary intent, tone, and language style, while configurable question lengths and formats ensure balanced coverage. Each scenario outputs the user query, the source context used, and a golden/reference answer with provenance to the contributing nodes.
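Continuing the sketch above (reusing the hypothetical `ChunkNode`, `Edge`, and `Persona` types), scenario generation could look roughly like this. The LLM call is declared but not implemented; `synthesizeQuery` would prompt a model with the sampled context and the persona's intent and style:

```ts
// Sketch only: synthesizeQuery is a stub for the LLM-backed synthesizer.

interface Scenario {
  query: string;
  context: string[];    // source passages shown to the generator
  reference: string;    // golden/reference answer
  provenance: number[]; // contributing node ids
  hops: 1 | 2;
  persona: string;
}

declare function synthesizeQuery(
  context: string[],
  persona: Persona
): Promise<{ query: string; reference: string }>;

// Single-hop: one node. Multi-hop: follow an edge, so answering
// requires composing facts from two connected chunks.
async function generateScenario(
  nodes: ChunkNode[],
  edges: Edge[],
  persona: Persona,
  multiHop: boolean
): Promise<Scenario> {
  let picked: ChunkNode[];
  if (multiHop && edges.length > 0) {
    const e = edges[Math.floor(Math.random() * edges.length)];
    picked = [nodes[e.source], nodes[e.target]];
  } else {
    picked = [nodes[Math.floor(Math.random() * nodes.length)]];
  }
  const context = picked.map((n) => n.text);
  const { query, reference } = await synthesizeQuery(context, persona);
  return {
    query,
    context,
    reference,
    provenance: picked.map((n) => n.id),
    hops: picked.length === 2 ? 2 : 1,
    persona: persona.name,
  };
}
```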
