
This is Part 2 of a GraphRAG system, in which the user interacts with the data through two data stores: a vector database (ChromaDB) and a graph database (Neo4j). I exploit the probabilistic nature of vector embeddings and the determinism of a knowledge graph to minimise hallucination and maximise explainability. The domain is the electronic music genre.


pacoreyes/GraphRagPart2RetrievalAndChatbot


GraphRAG - Part 2: Orchestration, Retrieval, and Chatbot

Last update: January 21, 2026

[THIS PROJECT IS IN PROGRESS AND THE CHATBOT WILL BE DEPLOYED ON THE CLOUD FOR PUBLIC ACCESS]

This repo is the second part of a larger GraphRAG application. It demonstrates how the GraphRAG pattern supports intelligent question-answering over a music knowledge base. Part 1, the GraphRAG data pipeline, is available here: GraphRAG Part 1: Data Pipeline


About the Project

This system is a high-performance, privacy-focused Retrieval-Augmented Generation (RAG) agent that orchestrates three distinct data retrieval strategies based on user intent:

  • Graph Retrieval (Neo4j) — For deterministic facts and relationship traversals
  • Vector Retrieval (ChromaDB) — For semantic/contextual queries and "vibe" questions
  • Deep Metadata (JSON Sidecar) — For detailed attributes like barcodes, packaging, and social stats

The architecture follows a "Skinny Graph, Fat Context" philosophy: the graph database stays lean and optimized for traversals, while rich contextual data lives in appropriate external stores. All layers are unified via a strict Identity Fabric using Wikidata QIDs and MusicBrainz MBIDs.
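The Identity Fabric idea can be sketched as a single key object that every layer understands. This is a minimal illustration, assuming each graph node, sidecar file, and vector chunk carries the same MBID. The `EntityKey` class, the one-file-per-MBID sidecar layout, and the `mbid` metadata field are hypothetical names chosen for the example, not taken from the repository, and the identifiers are placeholders.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EntityKey:
    """Shared identifiers for one entity (hypothetical helper)."""

    qid: str   # Wikidata QID (placeholder value below)
    mbid: str  # MusicBrainz MBID, a UUID string (placeholder value below)

    def graph_params(self) -> dict:
        # Parameters for a Cypher lookup such as:
        #   MATCH (a:Artist {mbid: $mbid}) RETURN a
        return {"mbid": self.mbid}

    def sidecar_path(self, root: Path) -> Path:
        # Assumed layout: one JSON file per entity, named after its MBID.
        return root / f"{self.mbid}.json"

    def vector_filter(self) -> dict:
        # Assumed metadata field stored alongside each vector chunk.
        return {"mbid": self.mbid}


key = EntityKey(qid="Q0000", mbid="00000000-0000-0000-0000-000000000000")
print(key.graph_params())
print(key.sidecar_path(Path("sidecar")))
print(key.vector_filter())
```

Because all three lookups derive from the same key, a result from one store can be joined to the others without fuzzy name matching.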

Picture 1. Graph of Orchestration of LangGraph


Domain: Electronic Music

This system is specifically tuned for the Electronic Music domain. It captures the rich, interconnected history of electronic artists, from early pioneers to contemporary producers. The dataset encompasses a wide range of sub-genres—including Techno, House, Ambient, IDM, and Drum & Bass—modeling the complex relationships between artists, their releases, and the evolving taxonomy of electronic musical styles.


Tech Stack

| Category | Technology |
| --- | --- |
| Orchestration | LangGraph |
| Graph Database | Neo4j (cloud, Neo4j Aura) |
| Vector Store | ChromaDB |
| Embeddings | nomic-ai/nomic-embed-text-v1.5 |
| LLM Inference | MLX (Apple Silicon native) |
| Models | gpt-oss-20b-MLX-8bit (Router/Generalist), Gemma-3-4B-Instruct (Text-to-Cypher) |
| Web UI | Chainlit |
| Configuration | Pydantic / pydantic-settings |
| Logging | structlog |
| Package Manager | uv (Astral) |

Architecture Overview

The system functions as a Meta-Agent that orchestrates specialized retrievers based on user intent. It prioritizes precision (Graph) over approximation (Vector), and a Safety Net protocol falls back to semantic search or a "not found" response so that a failed retrieval does not turn into a fabricated answer.

Tiered Data Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                          USER QUERY                                     │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    ENTITY RESOLUTION & ROUTING                          │
│         (Generative NER + Neo4j Fulltext Index Disambiguation)          │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
            ┌───────────────────────┼───────────────────────┐
            │                       │                       │
            ▼                       ▼                       ▼
┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
│   TIER 1: GRAPH   │   │  TIER 2: SIDECAR  │   │  TIER 3: VECTOR   │
│      (Neo4j)      │   │      (JSON)       │   │    (ChromaDB)     │
│                   │   │                   │   │                   │
│ • Relationships   │   │ • Barcodes        │   │ • Biographies     │
│ • Topology        │   │ • Packaging       │   │ • Reviews         │
│ • Aggregations    │   │ • Social stats    │   │ • Semantic search │
└───────────────────┘   └───────────────────┘   └───────────────────┘
            │                       │                       │
            └───────────────────────┼───────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         FUSION & SYNTHESIS                              │
│                  (GPT-OSS 20B Context Window Fusion)                    │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                          FINAL ANSWER                                   │
└─────────────────────────────────────────────────────────────────────────┘

LangGraph Workflow

The orchestration follows a Perception → Strategy → Execution → Synthesis cycle implemented as a LangGraph state machine:

flowchart TD
    A[User Query] --> B[Entity Resolution]
    B --> C{Router}
    C -->|GRAPH_CYPHER| D[Generate Cypher]
    C -->|GRAPH_TOOL| E[Track Search]
    C -->|VECTOR_ONLY| F[Vector Search]
    C -->|SIDECAR| G[Fetch Metadata]

    D --> H[Execute Cypher]
    H -->|Success| I[Fusion]
    H -->|Error| D
    H -->|Empty/Fallback| F

    E -->|Success| I
    E -->|Fallback| F

    F --> I
    G --> I

    I --> J[Final Answer]
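The edges of the flowchart can be expressed in plain Python to make the retry and fallback behaviour explicit. This is a sketch only: it does not use the LangGraph API, and `run_pipeline` plus the callables passed into it are hypothetical stand-ins for the real nodes. The retry cap of 3 comes from the Safety Mechanisms section.

```python
from enum import Enum
from typing import Callable, Optional

MAX_CYPHER_ATTEMPTS = 3  # retry cap stated under Safety Mechanisms


class Route(str, Enum):
    GRAPH_CYPHER = "GRAPH_CYPHER"
    GRAPH_TOOL = "GRAPH_TOOL"
    VECTOR_ONLY = "VECTOR_ONLY"
    SIDECAR = "SIDECAR"


def run_pipeline(
    query: str,
    route: Route,
    generate_cypher: Callable[[str, Optional[str]], str],
    execute_cypher: Callable[[str], list],
    track_search: Callable[[str], list],
    vector_search: Callable[[str], list],
    fetch_metadata: Callable[[str], list],
) -> dict:
    """Follow the flowchart edges and return the context handed to Fusion."""
    if route is Route.GRAPH_CYPHER:
        error: Optional[str] = None
        for _ in range(MAX_CYPHER_ATTEMPTS):
            cypher = generate_cypher(query, error)
            try:
                rows = execute_cypher(cypher)
            except Exception as exc:  # Error edge: feed the message back, retry
                error = str(exc)
                continue
            if rows:  # Success edge
                return {"source": "graph", "context": rows}
            break  # Empty edge: stop retrying
        # Empty result or retries exhausted: fall back to vector search
        return {"source": "vector", "context": vector_search(query)}
    if route is Route.GRAPH_TOOL:
        rows = track_search(query)
        if rows:
            return {"source": "graph", "context": rows}
        return {"source": "vector", "context": vector_search(query)}
    if route is Route.SIDECAR:
        return {"source": "sidecar", "context": fetch_metadata(query)}
    return {"source": "vector", "context": vector_search(query)}


# Demo with stubs: the first Cypher attempt fails, the second succeeds.
attempts = []


def fake_generate(query: str, error: Optional[str]) -> str:
    attempts.append(error)
    return "MATCH (n) RETURN n"


def fake_execute(cypher: str) -> list:
    if len(attempts) == 1:
        raise ValueError("syntax error")
    return [{"n": 1}]


result = run_pipeline(
    "demo question", Route.GRAPH_CYPHER, fake_generate, fake_execute,
    lambda q: [], lambda q: ["passage"], lambda q: [],
)
print(result, attempts)
```

In the demo, the second call to `fake_generate` receives the error message from the first attempt, which is the self-correction loop in miniature.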

System Components

| Component | Model/Technology | Role |
| --- | --- | --- |
| Router | gpt-oss-20b (MLX) | Generative NER - extracts entities, types, and intent |
| Resolver | Neo4j Fulltext Indexes | Disambiguates entities via Lucene scoring |
| Graph Specialist | Gemma-3-4B-Instruct (Text-to-Cypher) | Generates read-only Cypher queries |
| Metadata Fetcher | Python + JSON | Retrieves deep metadata from sidecar files |
| Vector Specialist | ChromaDB + Nomic embeddings | Semantic search with "Walled Garden" filtering |
| Generalist | gpt-oss-20b (MLX) | Synthesizes final answers from combined context |
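As an illustration of the Resolver row, the sketch below builds a parameterised call to Neo4j's `db.index.fulltext.queryNodes` procedure and escapes Lucene query syntax in the user-supplied name. The index name `artist_names`, the `build_resolver_call` helper, and the result limit are assumptions made for the example. The resulting query and parameters would be passed to a Neo4j session, which is omitted here.

```python
import re

# Characters and operators with special meaning in Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')

RESOLVE_QUERY = (
    "CALL db.index.fulltext.queryNodes($index, $term) "
    "YIELD node, score "
    "RETURN node.id AS id, node.name AS name, score "
    "ORDER BY score DESC LIMIT $limit"
)


def escape_lucene(text: str) -> str:
    """Backslash-escape Lucene special characters so names match literally."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)


def build_resolver_call(name: str, index: str = "artist_names", limit: int = 5):
    """Return (cypher, params) for a fulltext lookup; the index name is assumed."""
    params = {"index": index, "term": escape_lucene(name), "limit": limit}
    return RESOLVE_QUERY, params


cypher, params = build_resolver_call("Selected Ambient Works 85-92")
print(cypher)
print(params)
```

Escaping matters because an unescaped `-` or `:` in a title would otherwise be parsed as a Lucene operator rather than as part of the name.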

Graph Schema

erDiagram
    Artist ||--o{ Genre : PLAYS_GENRE
    Artist ||--o{ Artist : SIMILAR_TO
    Artist ||--o| Country : FROM_COUNTRY
    Release ||--|| Artist : PERFORMED_BY
    Genre ||--o{ Genre : SUBGENRE_OF

    Artist {
        string id
        string name
        string mbid
        string qid
        string_list aliases
    }
    Release {
        string id
        string title
        int year
        string_list tracks
    }
    Genre {
        string id
        string name
        string_list aliases
    }
    Country {
        string id
        string name
    }

Safety Mechanisms

The architecture implements robust error handling and hallucination prevention:

  1. Self-Correction Loop: Cypher syntax errors are fed back to the model for retry (max 3 attempts)
  2. Vector Safety Net: Empty graph results trigger automatic fallback to semantic search
  3. Hallucination Prevention: Low-confidence results return a graceful "not found" response instead of fabricated answers
  4. Read-Only Enforcement: Generated Cypher queries are validated to prevent destructive operations
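Mechanisms 3 and 4 can be sketched with two small functions. This is a conservative keyword check written for illustration, not the repository's `security.py`. The clause list, the procedure allow-list, the 0.5 threshold, and the wording of the "not found" message are all assumptions. A keyword filter like this is best combined with database-level protection, such as a read transaction or a read-only user.

```python
import re

# Quoted string literals are blanked first so a title like 'Set' is not flagged.
_STRING_LITERAL = re.compile(r"'(?:\\.|[^'\\])*'|\"(?:\\.|[^\"\\])*\"")

# Cypher clauses that write to the graph or change the schema (assumed list).
_WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|FOREACH)\b",
    re.IGNORECASE,
)

# Procedure calls are rejected unless explicitly allow-listed (assumed list).
_ALLOWED_PROCEDURES = ("db.index.fulltext.queryNodes",)

NOT_FOUND = "I could not find that in the knowledge base."


def is_read_only(cypher: str) -> bool:
    """Return True only if no write clause or unlisted procedure call appears."""
    stripped = _STRING_LITERAL.sub("''", cypher)
    if _WRITE_CLAUSES.search(stripped):
        return False
    for proc in re.findall(r"\bCALL\s+([\w.]+)", stripped, re.IGNORECASE):
        if proc not in _ALLOWED_PROCEDURES:
            return False
    return True


def answer_or_not_found(answer: str, confidence: float, threshold: float = 0.5) -> str:
    """Return the answer only when confidence clears the (assumed) threshold."""
    return answer if confidence >= threshold else NOT_FOUND


print(is_read_only("MATCH (a:Artist {name: 'Kraftwerk'}) RETURN a"))
print(is_read_only("MATCH (a:Artist) DETACH DELETE a"))
print(answer_or_not_found("Some answer", confidence=0.2))
```

The check errs on the side of rejection: a query that merely looks like a write is refused, which is the safer failure mode for generated Cypher.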

Project Structure

src/
├── agent/
│   ├── graph.py           # LangGraph workflow definition
│   ├── state.py           # TypedDict state schema
│   ├── entity_resolver.py # Entity extraction & linking
│   ├── specialist.py      # Cypher generation (Gemma 4B)
│   ├── generalist.py      # Answer synthesis (GPT-OSS 20B)
│   └── security.py        # Cypher validation
├── utils/
│   ├── neo4j_helper.py    # Neo4j driver & queries
│   └── vector_helper.py   # ChromaDB client & embeddings
├── settings.py            # Pydantic configuration
└── schemas.py             # Orchestration schemas

tests/
├── unit_tests/            # Component tests
├── integration_tests/     # Full workflow tests
└── conftest.py            # Pytest fixtures

main.py                    # Chainlit UI entry point
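To give a sense of what a TypedDict state schema such as the one in `state.py` might contain, here is a minimal sketch. Every field name below is an assumption inferred from the workflow described above, not the actual schema.

```python
from typing import Any, Optional, TypedDict


class AgentState(TypedDict, total=False):
    """Hypothetical shared state passed between workflow nodes."""

    query: str                      # raw user question
    entities: list[dict[str, Any]]  # resolved entities with their identifiers
    route: str                      # GRAPH_CYPHER, GRAPH_TOOL, VECTOR_ONLY, or SIDECAR
    cypher: Optional[str]           # last generated Cypher query
    cypher_error: Optional[str]     # error fed back into the retry loop
    attempts: int                   # Cypher generation attempts so far
    graph_rows: list[dict[str, Any]]
    sidecar: dict[str, Any]
    passages: list[str]             # retrieved vector chunks
    answer: Optional[str]


def initial_state(query: str) -> AgentState:
    """Build the starting state for one question."""
    return AgentState(query=query, entities=[], attempts=0, answer=None)


state = initial_state("What country is the band Kraftwerk from?")
print(state)
```

Declaring the state with `total=False` lets each node fill in only the keys it owns.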

Question-oriented Design

The system is validated using a comprehensive pool of questions designed to test specific architectural components, from simple graph traversals to complex multi-hop reasoning.

Tier 1: Graph Topology (Fact Retrieval)

Tests the Cypher generation and Graph Specialist.

  1. What country is the band Kraftwerk from?
  2. List all subgenres of "Industrial Techno".
  3. Which artists are legally considered "similar to" Depeche Mode according to the graph?
  4. What year was the album "Violator" released?
  5. How many distinct genres are associated with Aphex Twin?
  6. Find the shortest path between Daft Punk and The Chemical Brothers.
  7. Which artist has the most releases in the database?
  8. List all artists who have released albums in 1997.
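A few of these questions can be paired with the Cypher one might expect from the Graph Specialist, which is useful as a golden set for tests. The queries below are assumptions derived from the Graph Schema section. In particular, the relationship directions, for example `(Artist)-[:FROM_COUNTRY]->(Country)`, are guesses, because the ER diagram does not state them.

```python
# (question, expected Cypher, parameters): a hypothetical golden set.
GOLDEN = [
    (
        "What country is the band Kraftwerk from?",
        "MATCH (a:Artist {name: $name})-[:FROM_COUNTRY]->(c:Country) "
        "RETURN c.name AS country",
        {"name": "Kraftwerk"},
    ),
    (
        'List all subgenres of "Industrial Techno".',
        "MATCH (g:Genre)-[:SUBGENRE_OF*1..]->(:Genre {name: $name}) "
        "RETURN DISTINCT g.name AS subgenre",
        {"name": "Industrial Techno"},
    ),
    (
        "How many distinct genres are associated with Aphex Twin?",
        "MATCH (a:Artist {name: $name})-[:PLAYS_GENRE]->(g:Genre) "
        "RETURN count(DISTINCT g) AS genres",
        {"name": "Aphex Twin"},
    ),
    (
        "Which artist has the most releases in the database?",
        "MATCH (r:Release)-[:PERFORMED_BY]->(a:Artist) "
        "RETURN a.name AS artist, count(r) AS releases "
        "ORDER BY releases DESC LIMIT 1",
        {},
    ),
]

for question, cypher, params in GOLDEN:
    print(question)
    print("  ", cypher, params)
```

Passing names as parameters rather than interpolating them into the query string keeps the generated Cypher cacheable and avoids injection through entity names.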

Tier 2: Deep Metadata (JSON Sidecar)

Tests the Entity Hydrator and specific attribute lookup from JSON files.

  1. What is the barcode for the 2006 Digipak re-release of "Speak & Spell"?
  2. How many Twitter followers did Depeche Mode have in 2021?
  3. What is the specific catalog number for the US vinyl release of "Music for the Masses"?
  4. Did the 2006 remaster of "Violator" come in a jewel case or digipak?
  5. What is the exact release date (YYYY-MM-DD) of the French edition of "Homework"?
  6. Which record label published the Japanese version of "Selected Ambient Works 85-92"?
  7. What are the packaging dimensions or format details for the "Exai" box set?
  8. Retrieve the ISRC codes for all tracks on the album "Mezzanine".
  9. What is the "packaging" type listed for the 1990 UK release of "Violator"?
  10. Find the release with the barcode "0094635797923".
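A sidecar lookup of this kind can be sketched with the standard library alone. The one-file-per-MBID layout, the field names `barcode` and `packaging`, and the helper names are assumptions for illustration. The demo writes a placeholder record to a temporary directory instead of reading real project data.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def load_sidecar(root: Path, mbid: Optional[str]) -> Optional[dict]:
    """Load the JSON record for an MBID, or None if it is missing or null."""
    if not mbid:
        return None
    path = root / f"{mbid}.json"  # assumed layout: <root>/<mbid>.json
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def get_attribute(root: Path, mbid: Optional[str], field: str) -> Optional[Any]:
    """Return one attribute from a sidecar record, or None when unavailable."""
    record = load_sidecar(root, mbid)
    if record is None:
        return None
    return record.get(field)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    demo_mbid = "00000000-0000-0000-0000-000000000000"  # placeholder MBID
    record = {"barcode": "0000000000000", "packaging": "Digipak"}  # placeholder data
    (root / f"{demo_mbid}.json").write_text(json.dumps(record), encoding="utf-8")

    found = get_attribute(root, demo_mbid, "packaging")
    missing_field = get_attribute(root, demo_mbid, "isrc")
    null_mbid = get_attribute(root, None, "barcode")

print(found, missing_field, null_mbid)
```

Returning `None` for a null MBID, a missing file, or a missing field gives the caller a single signal for the "not found" path.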

Tier 3: Semantic & Vibe (Vector Search)

Tests the Vector Specialist, embeddings, and "Walled Garden" filtering.

  1. Describe the political influence on Depeche Mode's sound in the early 1980s.
  2. What do critics say about the production quality of "Syro"?
  3. How did the break-up of Boards of Canada's previous band influence their sound?
  4. Find reviews that mention "claustrophobic atmosphere" in relation to Massive Attack.
  5. Summarize the critical reception of "Come to Daddy" at the time of its release.
  6. What are the recurring lyrical themes in Portishead's "Dummy"?
  7. Describe the "vibe" of early 90s Intelligent Dance Music (IDM).
  8. Find artist biographies that mention "Detroit" as a key influence.
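The README does not define "Walled Garden" filtering. One plausible reading is that vector search is restricted to chunks belonging to the entities the Resolver has already identified. Under that assumption, the sketch below builds the keyword arguments for ChromaDB's `collection.query`, using a metadata `where` filter. The `mbid` metadata field and the `build_vector_query` helper are hypothetical. The `search_query: ` prefix follows the task-prefix convention documented for Nomic embedding models, and whether this project applies it at query time is also an assumption.

```python
from typing import Any


def build_vector_query(
    question: str,
    allowed_mbids: list[str],
    k: int = 5,
    prefix: str = "search_query: ",
) -> dict[str, Any]:
    """Build kwargs for collection.query(**kwargs), restricted to known entities."""
    kwargs: dict[str, Any] = {
        "query_texts": [prefix + question],
        "n_results": k,
    }
    if allowed_mbids:
        # Only chunks whose (assumed) 'mbid' metadata matches a resolved entity.
        kwargs["where"] = {"mbid": {"$in": allowed_mbids}}
    return kwargs


query_kwargs = build_vector_query(
    'What do critics say about the production quality of "Syro"?',
    allowed_mbids=["00000000-0000-0000-0000-000000000000"],  # placeholder MBID
)
print(query_kwargs)
# With a real collection: results = collection.query(**query_kwargs)
```

When no entity is resolved, the filter is omitted and the search runs over the whole collection, which suits open "vibe" questions that name no artist.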

Tier 4: Hybrid Fusion (Multi-Source)

Tests the Orchestrator's ability to combine Graph, Vector, and Sidecar data.

  1. Did the album with the highest track count by Autechre receive positive reviews?
  2. Compare the critical reception of Depeche Mode's 1981 releases vs. their 1990 releases.
  3. Which artist from France has the most followers on Twitter?
  4. List the genres played by artists who use "sampling" heavily in their production (based on bios).
  5. How does the release frequency of Aphex Twin correlate with his critical acclaim over time?
  6. Find albums released in 1994 by artists who are "similar to" Massive Attack.
  7. Did the "Digipak" version of "Violator" get better reviews than the standard jewel case version?
  8. Which genre has the most artists with "political" themes in their biographies?
  9. List all releases by German artists that are mentioned as "influential" in vector search results.
  10. Who is the most popular artist (by social stats) in the "Glitch" genre?
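Fusion of this kind amounts to assembling one prompt from whichever sources returned data, with each block labelled by its origin so the answer can be traced back. The following is a sketch under that assumption. The `fuse_context` helper, the section labels, and the instruction wording are illustrative only.

```python
import json
from typing import Any


def fuse_context(
    question: str,
    graph_rows: list[dict[str, Any]],
    sidecar: dict[str, Any],
    passages: list[str],
) -> str:
    """Combine retrieved context into one prompt; return '' if nothing was found."""
    sections = []
    if graph_rows:
        sections.append("[GRAPH]\n" + "\n".join(str(row) for row in graph_rows))
    if sidecar:
        sections.append("[SIDECAR]\n" + json.dumps(sidecar, sort_keys=True))
    if passages:
        sections.append("[VECTOR]\n" + "\n".join(passages))
    if not sections:
        return ""  # the caller can answer "not found" instead of calling the LLM
    context = "\n\n".join(sections)
    return f"{context}\n\nQuestion: {question}\nAnswer using only the context above."


prompt = fuse_context(
    "What country is the band Kraftwerk from?",
    graph_rows=[{"country": "Germany"}],
    sidecar={},
    passages=[],
)
print(prompt)
```

Returning an empty string when every retriever comes back empty lets the orchestrator skip generation entirely, which removes the opportunity to fabricate an answer.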

Entity Resolution & Disambiguation

Tests the Router, Fulltext Indexes, and "Did You Mean" logic.

  1. Tell me about "Nirvana" (expecting clarification between US and UK bands).
  2. Who is "The Boss" of techno? (Testing alias/nickname resolution).
  3. List albums by "DM" (Testing acronym resolution for Depeche Mode).
  4. Information about the band "Burial" (Testing for potential ambiguity with other entities).
  5. Tell me about "Homework" (Is it the album by Daft Punk or something else?).
  6. Stats for "Prince" (Testing disambiguation if multiple artists exist).
  7. "The Twins" (Testing alias resolution for Aphex Twin or other entities).
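The "Did You Mean" decision can be sketched as a rule over the scored candidates returned by the fulltext index: resolve when one candidate clearly wins, ask for clarification when the top candidates are close, and report nothing when every score is weak. Lucene scores are not normalised, so the thresholds below are placeholders that would need tuning. The function name, return shape, and candidate names are hypothetical.

```python
def resolve_or_clarify(
    candidates: list[tuple[str, float]],
    min_score: float = 1.0,
    min_margin: float = 0.2,
) -> dict:
    """Decide between resolving, asking the user, or reporting no match.

    candidates: (display name, fulltext score) pairs in any order.
    min_margin: required relative lead of the best score over the runner-up.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return {"status": "not_found", "options": []}
    best_score = ranked[0][1]
    if len(ranked) > 1 and best_score - ranked[1][1] < min_margin * best_score:
        # Too close to call: offer the top few candidates to the user.
        return {"status": "clarify", "options": [name for name, _ in ranked[:3]]}
    return {"status": "resolved", "options": [ranked[0][0]]}


close = resolve_or_clarify([("Nirvana (US band)", 5.0), ("Nirvana (UK band)", 4.8)])
clear = resolve_or_clarify([("Candidate A", 6.0), ("Candidate B", 2.0)])
none = resolve_or_clarify([])
print(close, clear, none)
```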

Safety Net & Error Handling

Tests Hallucination Prevention and Fallback logic.

  1. What is the tracklist for the 2025 album by Daft Punk? (Should verify non-existence).
  2. Who played bass on the album "Unknown Pleasures" by Autechre? (False premise/hallucination check).
  3. What is the barcode for a release that doesn't exist?
  4. Tell me about the artist "FakeName123".
  5. What genre is the band "NonExistentEntity" associated with?
  6. Retrieve metadata for a null MBID.
  7. Generate a biography for an artist not in the database.
