A Reproduction and Practical Implementation of the SemRAG Research Paper Architecture
This project is a Retrieval-Augmented Generation (RAG) system implementing the SemRAG architecture as described in the original paper. This system processes a provided PDF into semantic chunks, constructs a knowledge graph from extracted entities and relations, performs community detection, generates hierarchical community summaries, and executes both Local and Global Graph-based Retrieval when answering user queries. A local LLM (via Ollama) is used for both community report generation and final answer synthesis.
This implementation follows the SemRAG paper's methodology as closely as possible, including semantic chunking, graph construction, community hierarchy formation, dual retrieval strategy, and context fusion for LLM answering.
The system operates in two major phases:
- Initial Processing (Index Building)
- Query-Time Processing (Local + Global Search + Answer Generation)
Each stage is described below in detail.
In the semantic chunking stage, text is extracted from the provided PDF and transformed into semantically coherent chunks using the SemRAG methodology.
Steps:
- Extract raw text from PDF pages.
- Split text into sentences.
- Buffer-merge neighboring sentences to preserve contextual continuity.
- Convert merged sentences into embedding vectors.
- Compute cosine similarity between adjacent sentence embeddings.
- Group sentences into semantic chunks based on similarity thresholds.
- Split oversized chunks using overlapping windows to maintain coherence.
The result is a collection of high-quality semantic chunks aligned with the SemRAG specification.
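The boundary-detection step can be pictured with the minimal sketch below. It assumes normalized sentence-transformer embeddings and a fixed similarity threshold theta; the model name, default values, and function name are illustrative assumptions, not the repository's exact API, and the recursive T_max/overlap splitting is omitted.

```python
# Illustrative sketch of buffer-merged, similarity-based chunk boundary detection.
# Model name, defaults, and helper name are assumptions, not the repo's exact API.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding_model

def semantic_chunks(sentences, buffer_size=1, theta=0.75):
    if not sentences:
        return []
    # Buffer-merge: each sentence is embedded together with its neighbours
    # so that boundary decisions keep local context.
    merged = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    emb = model.encode(merged, normalize_embeddings=True)
    # With normalized vectors, cosine similarity of adjacent sentences is a dot product.
    sims = np.einsum("ij,ij->i", emb[:-1], emb[1:])
    chunks, current = [], [sentences[0]]
    for i, sim in enumerate(sims):
        if sim < theta:  # low adjacent similarity -> start a new semantic chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks
```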
Knowledge graph construction then operates on each semantic chunk:
- Entities are extracted using spaCy's NER model.
- Relations are extracted using dependency parsing (subject–verb–object patterns).
- Each entity becomes a node in the graph.
- Relations create edges between nodes, annotated with relationship types and supporting chunk references.
All extracted nodes and edges from all chunks form the global Knowledge Graph representing the document’s semantic structure.
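As a rough illustration, per-chunk extraction could populate a networkx graph as sketched below; the function name, attribute layout, and SVO heuristics are assumptions rather than the repository's exact modules.

```python
# Sketch of entity/relation extraction into a shared knowledge graph.
# Names and attribute conventions here are assumptions, not the repo's exact code.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def add_chunk_to_graph(graph: nx.Graph, chunk_id: str, text: str) -> None:
    doc = nlp(text)
    # Each named entity becomes a node, tagged with the chunks that mention it.
    for ent in doc.ents:
        graph.add_node(ent.text, label=ent.label_)
        graph.nodes[ent.text].setdefault("chunks", set()).add(chunk_id)
    # Simple subject-verb-object pattern over the dependency parse.
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [t for t in token.children if t.dep_ in ("nsubj", "nsubjpass")]
        objects = [t for t in token.children if t.dep_ in ("dobj", "attr", "dative")]
        for s in subjects:
            for o in objects:
                # Only connect tokens that belong to named entities; a fuller pipeline
                # would resolve each token back to its complete entity span.
                if s.ent_type_ and o.ent_type_:
                    graph.add_edge(s.text, o.text, relation=token.lemma_, chunk=chunk_id)
```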
SemRAG emphasizes hierarchical community-based retrieval. This is reproduced as follows:
- Apply a community detection algorithm (Louvain) on the knowledge graph.
- Each community consists of semantically related entities.
- For each community:
  - Collect all nodes.
  - Collect all edges and their supporting chunks.
  - Collect a few sample chunks from each entity and relation for evidence.
- Produce a community report by prompting a local LLM to summarize:
  - Entities in the community
  - Relations between them
  - Supporting textual evidence
Each community report is embedded and stored for Global Search during query-time retrieval.
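One way this stage could be wired together is sketched below, assuming networkx's built-in Louvain implementation and a generic llm callable; the prompt wording and data layout are illustrative assumptions.

```python
# Sketch of community detection plus report generation.
# The llm callable, chunks mapping, and prompt text are assumptions.
import networkx as nx

def build_community_reports(graph: nx.Graph, llm, chunks: dict) -> list:
    reports = []
    # Louvain partitioning; available as nx.community.louvain_communities in recent networkx.
    for i, nodes in enumerate(nx.community.louvain_communities(graph, seed=42)):
        sub = graph.subgraph(nodes)
        relations = [
            f"{u} -[{d.get('relation', 'related_to')}]-> {v}"
            for u, v, d in sub.edges(data=True)
        ]
        # A few supporting chunks as textual evidence.
        evidence = [chunks[d["chunk"]] for _, _, d in sub.edges(data=True) if "chunk" in d][:3]
        prompt = (
            "Summarize the following entities, their relations, and the supporting evidence.\n"
            f"Entities: {sorted(nodes)}\n"
            f"Relations: {relations}\n"
            f"Evidence: {evidence}"
        )
        reports.append({"community_id": i, "nodes": list(nodes), "report": llm(prompt)})
    return reports
```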
When a user submits a question, the system embeds the query and executes two retrieval modes: Local Search and Global Search. The results of both are passed to the LLM.
Local Search steps:
- Compare query embedding with all node embeddings using cosine similarity.
- Select nodes whose similarity exceeds threshold τ_e.
- Retrieve the semantic chunks associated with those nodes.
- Compare the query embedding with each retrieved chunk using cosine similarity (threshold τ_d).
- Select the top-k most relevant chunks.
Local Search corresponds to entity-driven retrieval using the knowledge graph.
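A compact sketch of this entity-driven retrieval is shown below. It assumes pre-computed, L2-normalized embeddings kept in plain dictionaries; the function signature and default thresholds are assumptions, not the repository's exact interface.

```python
# Sketch of Local Search: entity matching (tau_e), then chunk re-ranking (tau_d, k_local).
# Data layout and defaults are assumptions.
import numpy as np

def local_search(query_emb, node_embs, node_chunk_ids, chunk_embs, chunks,
                 tau_e=0.5, tau_d=0.3, k_local=5):
    # 1. Nodes whose (normalized) embedding similarity to the query exceeds tau_e.
    hit_nodes = [n for n, e in node_embs.items() if float(np.dot(query_emb, e)) >= tau_e]
    # 2. The semantic chunks that support those nodes.
    candidate_ids = {cid for n in hit_nodes for cid in node_chunk_ids.get(n, [])}
    # 3. Re-rank candidates against the query, drop weak ones, keep the top-k.
    scored = sorted(
        ((float(np.dot(query_emb, chunk_embs[cid])), cid) for cid in candidate_ids),
        reverse=True,
    )
    return [chunks[cid] for sim, cid in scored[:k_local] if sim >= tau_d]
```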
Global Search steps:
- Compare the query embedding with community report embeddings.
- Select the top-k most relevant communities.
- Retrieve all chunks associated with the selected communities.
- Compare the query embedding with those chunks using cosine similarity.
- Rank them globally and return the top-k chunks.
Global Search captures higher-level semantic associations beyond individual entity matches.
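Global Search can be sketched in the same style; the parameter names mirror config.yaml, while the data structures and signature are again assumptions.

```python
# Sketch of Global Search: community report matching, then global chunk re-ranking.
import numpy as np

def global_search(query_emb, report_embs, community_chunk_ids, chunk_embs, chunks,
                  top_k_communities=3, k_global=5):
    # 1. Rank community reports by similarity to the query and keep the top communities.
    ranked = sorted(report_embs, key=lambda c: -float(np.dot(query_emb, report_embs[c])))
    selected = ranked[:top_k_communities]
    # 2. Gather every chunk attached to the selected communities.
    candidate_ids = {cid for c in selected for cid in community_chunk_ids[c]}
    # 3. Globally re-rank the candidates and return the top-k chunks.
    best = sorted(candidate_ids, key=lambda cid: -float(np.dot(query_emb, chunk_embs[cid])))
    return [chunks[cid] for cid in best[:k_global]]
```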
During context fusion, the system merges:
- Top-k Local Search chunks
- Top-k Global Search chunks
- Selected community summaries
These are delivered as contextual evidence to the LLM along with the query. The LLM generates the final answer while grounding its output in the retrieved text fragments.
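The fusion and answering step might look like the sketch below, which calls Ollama's /api/generate REST endpoint directly; the prompt template and function signature are assumptions.

```python
# Sketch of context fusion and grounded answer generation via Ollama's REST API.
# Prompt wording and signature are assumptions; the payload follows /api/generate.
import requests

def answer(query, local_chunks, global_chunks, community_summaries,
           host="http://localhost:11434", model="mistral"):
    context = "\n\n".join(local_chunks + global_chunks + community_summaries)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```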
rag/
├── data/
│   ├── Ambedkar_works.pdf              # QA source document
│   └── processed/
│       ├── chunks.json
│       ├── knowledge_graph.pkl
│       ├── community_summaries.json
│       └── embeddings.json
├── src/
│   ├── chunking/
│   │   ├── semantic_chunker.py
│   │   └── buffer_merger.py
│   ├── graph/
│   │   ├── entity_extractor.py
│   │   ├── graph_builder.py
│   │   ├── community_detector.py
│   │   └── summarizer.py
│   ├── retrieval/
│   │   ├── local_search.py
│   │   ├── global_search.py
│   │   └── ranker.py
│   ├── llm/
│   │   ├── llm_client.py
│   │   ├── prompt_templates.py
│   │   └── answer_generator.py
│   └── pipeline/
│       └── main.py                     # Main pipeline
├── tests/
└── config.yaml                         # Hyperparameters
Each parameter in config.yaml controls a specific part of the SemRAG pipeline.
Chunking parameters:

| Parameter | Description |
|---|---|
| embedding_model | SentenceTransformer model used throughout the pipeline. |
| buffer_size | Number of adjacent sentences merged around each sentence to preserve context. |
| theta | Semantic distance threshold for chunk boundary detection. |
| T_max | Maximum token limit of a chunk before it is recursively split. |
| overlap | Overlap between sub-chunks during recursive splitting. |
Retrieval parameters:

| Parameter | Description |
|---|---|
| tau_e | Threshold for entity-query similarity in Local Search. |
| tau_d | Threshold for chunk-query similarity in Local Search. |
| k_local | Number of chunks to return from Local Search. |
| k_global | Number of chunks to return from Global Search. |
| top_k_communities | Number of communities to explore in Global Search. |
LLM and storage parameters:

| Parameter | Description |
|---|---|
| ollama_host | URL of the Ollama server. |
| ollama_model | Model used by Ollama for summarization and answer generation. |
| llm_parallelism | Number of parallel LLM calls for community summarization during index rebuild. |
| chunks_path | Path to the serialized chunk results. |
| kg_path | Path to the serialized knowledge graph. |
| community_path | Path to the community summaries and mappings. |
| embeddings_path | Path to the node and chunk embeddings. |
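As a quick sanity check (not part of the repository), the parameters listed above can be verified when loading config.yaml, for example:

```python
# Hypothetical check that config.yaml defines every parameter described above.
import yaml

REQUIRED_KEYS = {
    "embedding_model", "buffer_size", "theta", "T_max", "overlap",
    "tau_e", "tau_d", "k_local", "k_global", "top_k_communities",
    "ollama_host", "ollama_model", "llm_parallelism",
    "chunks_path", "kg_path", "community_path", "embeddings_path",
}

with open("config.yaml") as f:
    config = yaml.safe_load(f)

missing = REQUIRED_KEYS - set(config)
if missing:
    raise ValueError(f"config.yaml is missing parameters: {sorted(missing)}")
```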
During index rebuilding, the system may generate community summaries in parallel using multiple LLM calls. To ensure stable execution, the number of parallel workers must not exceed Ollama’s request capacity:
llm_parallelism ≤ OLLAMA_NUM_PARALLEL

If this condition is violated, requests may queue internally, causing slower performance or GPU memory issues. The optimal value depends on available GPU memory and model size; larger GPUs can safely support higher parallelism.

Windows (PowerShell):

$env:OLLAMA_NUM_PARALLEL="2"
ollama serve

Linux:

OLLAMA_NUM_PARALLEL=2 ollama serve

Match this value with llm_parallelism in config.yaml for best performance.
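One way to enforce this bound, assuming the community summaries are dispatched from a thread pool, is sketched below; the worker function is hypothetical.

```python
# Sketch: cap in-flight LLM calls at llm_parallelism so Ollama never receives
# more concurrent requests than OLLAMA_NUM_PARALLEL allows.
from concurrent.futures import ThreadPoolExecutor

def summarize_all(communities, summarize_one, llm_parallelism=2):
    # summarize_one(community) -> report text; assumed to make one blocking LLM call.
    with ThreadPoolExecutor(max_workers=llm_parallelism) as pool:
        return list(pool.map(summarize_one, communities))
```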
Follow these steps before running the system.
Recommended: Python 3.9+
Install dependencies:
pip install -r requirements.txt
Install Ollama from: https://ollama.com
Verify installation:
ollama --version

The given config uses the Mistral model:

ollama pull mistral

Download the spaCy NER model used for entity extraction:

python -m ensurepip --upgrade
python -m spacy download en_core_web_sm
Ollama must be running in a separate terminal before executing the pipeline.
ollama serve
Run this once to build the index:
python -m rag.src.pipeline.main --rebuild-index
This performs:
- Semantic chunking
- Knowledge graph construction
- Community detection
- Community report generation
- Embedding storage
For the provided demonstration PDF, this step has already been completed. Users who clone this repository and want to query the given PDF only need to run the interactive mode.
Once the index is built:
python -m rag.src.pipeline.main --run
This launches an interactive loop:
- User enters a question
- System performs Local + Global SemRAG retrieval
- System generates an LLM-grounded answer
- User may continue asking or type exit to quit
To launch the interactive loop with visibility into the prompts sent to the LLM:
python -m rag.src.pipeline.main --run --debug
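Conceptually, the loop behaves like the sketch below; the retrieval and answer helpers stand in for the repository's actual functions.

```python
# Rough sketch of the interactive query loop; helper functions are stand-ins
# (global_search is assumed here to also return the selected community summaries).
def interactive_loop(local_search, global_search, answer, debug=False):
    while True:
        query = input("Question (or 'exit'): ").strip()
        if query.lower() == "exit":
            break
        local_chunks = local_search(query)
        global_chunks, summaries = global_search(query)
        if debug:
            # --debug: show the context handed to the LLM.
            print("Retrieved context:\n", "\n".join(local_chunks + global_chunks + summaries))
        print(answer(query, local_chunks, global_chunks, summaries))
```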
- The index rebuild step may take several minutes to hours depending on hardware and length of the PDF.
- Ollama must be running at all times; otherwise, summarization and answer generation will fail.
- The system operates fully offline.
This implementation is intended strictly for educational, evaluation, and research purposes to reproduce the SemRAG architecture for RAG-based question answering. Ensure compliance with licensing for all models and datasets used.
SemRAG: Semantic Knowledge-Augmented RAG for Improved Question-Answering
Authors: Kezhen Zhong, Basem Suleiman, Abdelkarim Erradi, Shijing Chen
arXiv preprint: https://arxiv.org/abs/2507.21110