Semantic Graph RAG: A SemRAG-Based Retrieval-Augmented Generation System

A Reproduction and Practical Implementation of the SemRAG Research Paper Architecture


Overview

This project is a Retrieval-Augmented Generation (RAG) system that implements the SemRAG architecture described in the original paper. It processes a provided PDF into semantic chunks, constructs a knowledge graph from extracted entities and relations, performs community detection, generates hierarchical community summaries, and executes both Local and Global Graph-based Retrieval when answering user queries. A local LLM (via Ollama) is used both for community report generation and for final answer synthesis.

This implementation follows the SemRAG paper's methodology as closely as possible, including semantic chunking, graph construction, community hierarchy formation, a dual retrieval strategy, and context fusion for LLM answering.


System Architecture

The system operates in two major phases:

  1. Initial Processing (Index Building)
  2. Query-Time Processing (Local + Global Search + Answer Generation)

Each stage is described below in detail.


Phase 1: Initial Processing

1. Semantic Chunking

Text is extracted from the provided PDF and transformed into semantically coherent chunks using the SemRAG methodology.

Steps:

  1. Extract raw text from PDF pages.
  2. Split text into sentences.
  3. Buffer-merge neighboring sentences to preserve contextual continuity.
  4. Convert merged sentences into embedding vectors.
  5. Compute cosine similarity between adjacent sentence embeddings.
  6. Group sentences into semantic chunks based on similarity thresholds.
  7. Split oversized chunks using overlapping windows to maintain coherence.

The result is a collection of high-quality semantic chunks aligned with the SemRAG specification.
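
Below is a minimal sketch of the boundary-detection part of this stage (steps 4-6), assuming the sentence-transformers library and an illustrative model name; the repository's actual logic lives in src/chunking/semantic_chunker.py and may differ in detail.

# Illustrative sketch only; theta mirrors the config parameter,
# everything else is an assumption rather than the repository's exact code.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding_model

def semantic_chunks(merged_sentences, theta=0.25):
    """Group buffer-merged sentences; open a new chunk whenever the semantic
    distance between adjacent sentence embeddings exceeds theta."""
    emb = model.encode(merged_sentences, normalize_embeddings=True)
    chunks, current = [], [merged_sentences[0]]
    for i in range(1, len(merged_sentences)):
        cosine = float(np.dot(emb[i - 1], emb[i]))   # embeddings are normalized
        if 1.0 - cosine > theta:                     # distance above threshold -> boundary
            chunks.append(" ".join(current))
            current = []
        current.append(merged_sentences[i])
    chunks.append(" ".join(current))
    return chunks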

2. Knowledge Graph Construction

For each semantic chunk:

  1. Entities are extracted using spaCy's NER model.
  2. Relations are extracted using dependency parsing (subject–verb–object patterns).
  3. Each entity becomes a node in the graph.
  4. Relations create edges between nodes, annotated with relationship types and supporting chunk references.

All extracted nodes and edges from all chunks form the global Knowledge Graph representing the document’s semantic structure.
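
A simplified sketch of this stage using spaCy and networkx is shown below; the function name and edge attributes are illustrative, and the repository's extractors in src/graph/ are more complete.

# Simplified sketch; entity_extractor.py and graph_builder.py implement the
# full logic. Names and attributes here are illustrative.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def add_chunk_to_graph(graph: nx.Graph, chunk_id: str, text: str) -> None:
    doc = nlp(text)
    # Entities become nodes.
    for ent in doc.ents:
        graph.add_node(ent.text, label=ent.label_)
    # Subject-verb-object dependency patterns become typed edges that
    # remember the chunk they came from.
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "dative")]
        for subj in subjects:
            for obj in objects:
                graph.add_edge(subj.text, obj.text,
                               relation=token.lemma_,  # relationship type
                               chunk=chunk_id)         # supporting chunk reference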

3. Community Detection and Community Report Generation

SemRAG emphasizes hierarchical community-based retrieval. This is reproduced as follows:

  1. Apply a community detection algorithm (Louvain) on the knowledge graph.

  2. Each community consists of semantically related entities.

  3. For each community:

    • Collect all nodes.
    • Collect all edges and their supporting chunks.
    • Collect a few sample chunks from each entity and relation for evidence.
  4. Produce a community report by prompting a local LLM to summarize:

    • Entities in the community
    • Relations between them
    • Supporting textual evidence

Each community report is embedded and stored for Global Search during query-time retrieval.
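
A condensed sketch of this stage is given below, assuming networkx 3.x (which ships a Louvain implementation) and a caller-supplied llm_summarize function wrapping the Ollama call; the prompt wording and names are illustrative.

# Condensed sketch; community_detector.py and summarizer.py handle evidence
# sampling and report storage in more detail.
import networkx as nx

def build_community_reports(graph: nx.Graph, llm_summarize):
    communities = nx.community.louvain_communities(graph, seed=42)
    reports = []
    for nodes in communities:
        sub = graph.subgraph(nodes)
        relations = [f"{u} --{d.get('relation', 'related_to')}--> {v}"
                     for u, v, d in sub.edges(data=True)]
        prompt = ("Write a short report summarizing this community.\n"
                  f"Entities: {', '.join(sorted(nodes))}\n"
                  f"Relations: {'; '.join(relations)}")
        reports.append(llm_summarize(prompt))   # one local LLM call per community
    return reports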


Phase 2: Query-Time Processing

When a user submits a question, the system embeds the query and executes two retrieval modes: Local Search and Global Search. The results of both are passed to the LLM.

1. Local Graph Search

Steps:

  1. Compare query embedding with all node embeddings using cosine similarity.
  2. Select nodes whose similarity exceeds threshold τ_e.
  3. Retrieve the semantic chunks associated with those nodes.
  4. Compare the query embedding with each retrieved chunk using cosine similarity.
  5. Select the top-k most relevant chunks.

Local Search corresponds to entity-driven retrieval using the knowledge graph.
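
A minimal sketch of Local Search follows; tau_e, tau_d, and k_local mirror the config parameters described later, all embeddings are assumed L2-normalized (so a dot product equals cosine similarity), and the container names are illustrative.

# Sketch only; node_embs maps entity -> embedding, node_chunks maps entity ->
# chunk ids, chunk_embs / chunk_texts are keyed by chunk id.
import numpy as np

def local_search(query_emb, node_embs, node_chunks, chunk_embs, chunk_texts,
                 tau_e=0.5, tau_d=0.5, k_local=5):
    # 1-2. Entities whose similarity with the query exceeds tau_e.
    matched = [n for n, e in node_embs.items()
               if float(np.dot(query_emb, e)) > tau_e]
    # 3. Chunks attached to the matched entities.
    candidates = {cid for n in matched for cid in node_chunks[n]}
    # 4-5. Score candidates against the query, drop those below tau_d,
    #      and keep the k_local best.
    scored = sorted(((float(np.dot(query_emb, chunk_embs[c])), c)
                     for c in candidates), reverse=True)
    return [chunk_texts[c] for score, c in scored[:k_local] if score > tau_d]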

2. Global Graph Search

Steps:

  1. Compare the query embedding with community report embeddings.
  2. Select the top-k most relevant communities.
  3. Retrieve all chunks associated with the selected communities.
  4. Compare the query embedding with those chunks using cosine similarity.
  5. Rank them globally and return the top-k chunks.

Global Search captures higher-level semantic associations beyond individual entity matches.
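
The corresponding sketch for Global Search, under the same assumptions (normalized embeddings, illustrative container names); top_k_communities and k_global mirror the config parameters.

# Sketch only; report_embs maps community id -> report embedding,
# community_chunks maps community id -> chunk ids.
import numpy as np

def global_search(query_emb, report_embs, community_chunks, chunk_embs,
                  chunk_texts, top_k_communities=3, k_global=5):
    # 1-2. Rank community reports against the query; keep the best communities.
    best = sorted(report_embs,
                  key=lambda cid: float(np.dot(query_emb, report_embs[cid])),
                  reverse=True)[:top_k_communities]
    # 3. Gather every chunk belonging to those communities.
    candidates = {c for cid in best for c in community_chunks[cid]}
    # 4-5. Rank the gathered chunks globally and return the top k_global.
    ranked = sorted(candidates,
                    key=lambda c: float(np.dot(query_emb, chunk_embs[c])),
                    reverse=True)
    return [chunk_texts[c] for c in ranked[:k_global]]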

3. Final Answer Generation

The system merges:

  • Top-k Local Search chunks
  • Top-k Global Search chunks
  • Selected community summaries

These are delivered as contextual evidence to the LLM along with the query. The LLM generates the final answer while grounding its output in the retrieved text fragments.
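
A sketch of this fusion step against a running Ollama server is shown below, using Ollama's standard /api/generate endpoint; the prompt wording and function signature are illustrative, not the repository's actual prompt templates.

# Sketch of context fusion + answer generation via Ollama's /api/generate.
import requests

def generate_answer(query, local_chunks, global_chunks, community_summaries,
                    host="http://localhost:11434", model="mistral"):
    context = "\n\n".join(local_chunks + global_chunks + community_summaries)
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\n"
              f"Question: {query}\nAnswer:")
    resp = requests.post(f"{host}/api/generate",
                         json={"model": model, "prompt": prompt, "stream": False},
                         timeout=600)
    resp.raise_for_status()
    return resp.json()["response"]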


Project Structure

rag/
├── data/
│   ├── Ambedkar_works.pdf          # QA source document
│   └── processed/
│       ├── chunks.json
│       ├── knowledge_graph.pkl
│       ├── community_summaries.json
│       └── embeddings.json
├── src/
│   ├── chunking/
│   │   ├── semantic_chunker.py
│   │   └── buffer_merger.py
│   ├── graph/
│   │   ├── entity_extractor.py
│   │   ├── graph_builder.py
│   │   ├── community_detector.py
│   │   └── summarizer.py
│   ├── retrieval/
│   │   ├── local_search.py
│   │   ├── global_search.py
│   │   └── ranker.py
│   ├── llm/
│   │   ├── llm_client.py
│   │   ├── prompt_templates.py
│   │   └── answer_generator.py
│   └── pipeline/
│       └── main.py                 # Main pipeline
├── tests/
└── config.yaml                     # Hyperparameters

Configuration Guide (config.yaml)

Each parameter in config.yaml controls a specific part of the SemRAG pipeline.

Chunking Parameters

  • embedding_model: SentenceTransformer model used throughout the pipeline.
  • buffer_size: Number of adjacent sentences merged around each sentence to preserve context.
  • theta: Semantic distance threshold for chunk boundary detection.
  • T_max: Maximum token limit of a chunk before it is recursively split.
  • overlap: Overlap between sub-chunks during recursive splitting.
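
For intuition, a minimal sketch of how T_max and overlap interact when an oversized chunk is split (whitespace tokenization is used here for brevity; the actual implementation may count tokens differently):

# Sketch of the recursive-splitting step: an oversized chunk becomes
# overlapping windows of at most T_max tokens.
def split_oversized(chunk, T_max=512, overlap=50):
    tokens = chunk.split()
    if len(tokens) <= T_max:
        return [chunk]
    step = T_max - overlap
    return [" ".join(tokens[i:i + T_max]) for i in range(0, len(tokens), step)]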

Knowledge Graph and Retrieval Parameters

  • tau_e: Threshold for entity-query similarity in Local Search.
  • tau_d: Threshold for chunk-query similarity in Local Search.
  • k_local: Number of chunks to return from Local Search.
  • k_global: Number of chunks to return from Global Search.
  • top_k_communities: Number of communities to explore in Global Search.

LLM and Storage Parameters

  • ollama_host: URL for the Ollama server.
  • ollama_model: Model used by Ollama for summarization and answer generation.
  • llm_parallelism: Number of parallel LLM calls for community summarization during index rebuild.
  • chunks_path: Path to the serialized chunk results.
  • kg_path: Path to the serialized knowledge graph.
  • community_path: Path to the community summaries and mappings.
  • embeddings_path: Path to the node and chunk embeddings.
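
Putting the three groups together, an illustrative config.yaml might look like the following; all values are placeholders, and the defaults shipped in the repository may differ.

# Illustrative values only; consult the repository's config.yaml for defaults.
embedding_model: all-MiniLM-L6-v2
buffer_size: 1
theta: 0.25
T_max: 512
overlap: 50

tau_e: 0.5
tau_d: 0.5
k_local: 5
k_global: 5
top_k_communities: 3

ollama_host: http://localhost:11434
ollama_model: mistral
llm_parallelism: 2

chunks_path: rag/data/processed/chunks.json
kg_path: rag/data/processed/knowledge_graph.pkl
community_path: rag/data/processed/community_summaries.json
embeddings_path: rag/data/processed/embeddings.json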

LLM Parallelism

During index rebuilding, the system may generate community summaries in parallel using multiple LLM calls. To ensure stable execution, the number of parallel workers must not exceed Ollama’s request capacity:

llm_parallelism ≤ OLLAMA_NUM_PARALLEL

If this condition is violated, requests may queue internally, causing slower performance or GPU memory issues.
The optimal value depends on available GPU memory and model size—larger GPUs can safely support higher parallelism.

Set Ollama parallelism before starting the server:

Windows (PowerShell):

$env:OLLAMA_NUM_PARALLEL="2"
ollama serve

Linux:

OLLAMA_NUM_PARALLEL=2 ollama serve

Match this value with llm_parallelism in config.yaml for best performance.
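
As a rough sketch, the parallel summarization stage can be thought of as a bounded thread pool whose size comes from llm_parallelism; the names below are illustrative, not the repository's exact API.

# Sketch only: community summaries are generated by a thread pool sized by
# llm_parallelism; llm_summarize is a placeholder for the Ollama call.
from concurrent.futures import ThreadPoolExecutor

def summarize_communities(prompts, llm_summarize, llm_parallelism=2):
    # Keep llm_parallelism <= OLLAMA_NUM_PARALLEL so requests are served
    # concurrently instead of queuing inside the Ollama server.
    with ThreadPoolExecutor(max_workers=llm_parallelism) as pool:
        return list(pool.map(llm_summarize, prompts))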


Setup Instructions

Follow these steps before running the system.

1. Ensure Python Environment

Recommended: Python 3.9+
Install dependencies:

pip install -r requirements.txt

2. Ollama Installation

Install Ollama from: https://ollama.com

Verify installation:

ollama --version

3. Download the Required Model

The given config uses the Mistral model:

ollama pull mistral

4. Install spaCy Model

python -m ensurepip --upgrade
python -m spacy download en_core_web_sm

5. Start Ollama Server

Ollama must be running in a separate terminal before executing the pipeline.

ollama serve

Building the Index (Initial Processing)

Run this once to prepare:

python -m rag.src.pipeline.main --rebuild-index

This performs:

  • Semantic chunking
  • Knowledge graph construction
  • Community detection
  • Community report generation
  • Embedding storage

For the provided demonstration PDF, this step has already been completed. Users who clone this repository and want to test the included PDF only need to run the interactive mode.


Running the System (Interactive Mode)

Once the index is built:

python -m rag.src.pipeline.main --run

This launches an interactive loop:

  • User enters a question
  • System performs Local + Global SemRAG retrieval
  • System generates an LLM-grounded answer
  • User may continue asking or type exit to quit

To launch the interactive loop while also displaying the prompts sent to the LLM:

python -m rag.src.pipeline.main --run --debug

Notes

  1. The index rebuild step may take several minutes to hours depending on hardware and length of the PDF.
  2. Ollama must be running at all times; otherwise, summarization and answer generation will fail.
  3. The system operates fully offline.

License and Use

This implementation is intended strictly for educational, evaluation, and research purposes to reproduce the SemRAG architecture for RAG-based question answering. Ensure compliance with licensing for all models and datasets used.


References

SemRAG: Semantic Knowledge-Augmented RAG for Improved Question-Answering
Authors: Kezhen Zhong, Basem Suleiman, Abdelkarim Erradi, Shijing Chen
arXiv preprint: https://arxiv.org/abs/2507.21110

