A Reproduction and Practical Implementation of the SemRAG Research Paper Architecture
This project is a Retrieval-Augmented Generation (RAG) system implementing the SemRAG architecture as described in the original paper. This system processes a provided PDF into semantic chunks, constructs a knowledge graph from extracted entities and relations, performs community detection, generates hierarchical community summaries, and executes both Local and Global Graph-based Retrieval when answering user queries. A local LLM (via Ollama) is used for both community report generation and final answer synthesis.
This implementation follows the SemRAG paper's methodology as closely as possible, including semantic chunking, graph construction, community hierarchy formation, dual retrieval strategy, and context fusion for LLM answering.
The system operates in two major phases:
- Initial Processing (Index Building)
- Query-Time Processing (Local + Global Search + Answer Generation)
Each stage is described below in detail.
In the semantic chunking stage, text is extracted from the provided PDF and transformed into semantically coherent chunks using the SemRAG methodology.
Steps:
- Extract raw text from PDF pages.
- Split text into sentences.
- Buffer-merge neighboring sentences to preserve contextual continuity.
- Convert merged sentences into embedding vectors.
- Compute cosine similarity between adjacent sentence embeddings.
- Group sentences into semantic chunks based on similarity thresholds.
- Split oversized chunks using overlapping windows to maintain coherence.
The result is a collection of high-quality semantic chunks aligned with the SemRAG specification.
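The boundary-detection step can be pictured with the minimal sketch below. It assumes normalized sentence-transformer embeddings and a fixed similarity threshold theta; the model name, default values, and function name are illustrative assumptions, not the repository's exact API, and the recursive T_max/overlap splitting is omitted.

```python
# Illustrative sketch of buffer-merged, similarity-based chunk boundary detection.
# Model name, defaults, and helper name are assumptions, not the repo's exact API.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding_model

def semantic_chunks(sentences, buffer_size=1, theta=0.75):
    if not sentences:
        return []
    # Buffer-merge: each sentence is embedded together with its neighbours
    # so that boundary decisions keep local context.
    merged = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    emb = model.encode(merged, normalize_embeddings=True)
    # With normalized vectors, cosine similarity of adjacent sentences is a dot product.
    sims = np.einsum("ij,ij->i", emb[:-1], emb[1:])
    chunks, current = [], [sentences[0]]
    for i, sim in enumerate(sims):
        if sim < theta:  # low adjacent similarity -> start a new semantic chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks
```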
Knowledge graph construction then operates on each semantic chunk:
- Entities are extracted using spaCy's NER model.
- Relations are extracted using dependency parsing (subject–verb–object patterns).
- Each entity becomes a node in the graph.
- Relations create edges between nodes, annotated with relationship types and supporting chunk references.
All extracted nodes and edges from all chunks form the global Knowledge Graph representing the document’s semantic structure.
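As a rough illustration, per-chunk extraction could populate a networkx graph as sketched below; the function name, attribute layout, and SVO heuristics are assumptions rather than the repository's exact modules.

```python
# Sketch of entity/relation extraction into a shared knowledge graph.
# Names and attribute conventions here are assumptions, not the repo's exact code.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def add_chunk_to_graph(graph: nx.Graph, chunk_id: str, text: str) -> None:
    doc = nlp(text)
    # Each named entity becomes a node, tagged with the chunks that mention it.
    for ent in doc.ents:
        graph.add_node(ent.text, label=ent.label_)
        graph.nodes[ent.text].setdefault("chunks", set()).add(chunk_id)
    # Simple subject-verb-object pattern over the dependency parse.
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [t for t in token.children if t.dep_ in ("nsubj", "nsubjpass")]
        objects = [t for t in token.children if t.dep_ in ("dobj", "attr", "dative")]
        for s in subjects:
            for o in objects:
                # Only connect tokens that belong to named entities; a fuller pipeline
                # would resolve each token back to its complete entity span.
                if s.ent_type_ and o.ent_type_:
                    graph.add_edge(s.text, o.text, relation=token.lemma_, chunk=chunk_id)
```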
SemRAG emphasizes hierarchical community-based retrieval. This is reproduced as follows:
- Apply a community detection algorithm (Louvain) on the knowledge graph.
- Each community consists of semantically related entities.
- For each community:
  - Collect all nodes.
  - Collect all edges and their supporting chunks.
  - Collect a few sample chunks from each entity and relation for evidence.
- Produce a community report by prompting a local LLM to summarize:
  - Entities in the community
  - Relations between them
  - Supporting textual evidence
Each community report is embedded and stored for Global Search during query-time retrieval.
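One way this stage could be wired together is sketched below, assuming networkx's built-in Louvain implementation and a generic llm callable; the prompt wording and data layout are illustrative assumptions.

```python
# Sketch of community detection plus report generation.
# The llm callable, chunks mapping, and prompt text are assumptions.
import networkx as nx

def build_community_reports(graph: nx.Graph, llm, chunks: dict) -> list:
    reports = []
    # Louvain partitioning; available as nx.community.louvain_communities in recent networkx.
    for i, nodes in enumerate(nx.community.louvain_communities(graph, seed=42)):
        sub = graph.subgraph(nodes)
        relations = [
            f"{u} -[{d.get('relation', 'related_to')}]-> {v}"
            for u, v, d in sub.edges(data=True)
        ]
        # A few supporting chunks as textual evidence.
        evidence = [chunks[d["chunk"]] for _, _, d in sub.edges(data=True) if "chunk" in d][:3]
        prompt = (
            "Summarize the following entities, their relations, and the supporting evidence.\n"
            f"Entities: {sorted(nodes)}\n"
            f"Relations: {relations}\n"
            f"Evidence: {evidence}"
        )
        reports.append({"community_id": i, "nodes": list(nodes), "report": llm(prompt)})
    return reports
```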
When a user submits a question, the system embeds the query and executes two retrieval modes: Local Search and Global Search. The results of both are passed to the LLM.
Local Search steps:
- Compare query embedding with all node embeddings using cosine similarity.
- Select nodes whose similarity exceeds threshold τ_e.
- Retrieve the semantic chunks associated with those nodes.
- Compare the query embedding with each retrieved chunk using cosine similarity (threshold τ_d).
- Select the top-k most relevant chunks.
Local Search corresponds to entity-driven retrieval using the knowledge graph.
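A compact sketch of this entity-driven retrieval is shown below. It assumes pre-computed, L2-normalized embeddings kept in plain dictionaries; the function signature and default thresholds are assumptions, not the repository's exact interface.

```python
# Sketch of Local Search: entity matching (tau_e), then chunk re-ranking (tau_d, k_local).
# Data layout and defaults are assumptions.
import numpy as np

def local_search(query_emb, node_embs, node_chunk_ids, chunk_embs, chunks,
                 tau_e=0.5, tau_d=0.3, k_local=5):
    # 1. Nodes whose (normalized) embedding similarity to the query exceeds tau_e.
    hit_nodes = [n for n, e in node_embs.items() if float(np.dot(query_emb, e)) >= tau_e]
    # 2. The semantic chunks that support those nodes.
    candidate_ids = {cid for n in hit_nodes for cid in node_chunk_ids.get(n, [])}
    # 3. Re-rank candidates against the query, drop weak ones, keep the top-k.
    scored = sorted(
        ((float(np.dot(query_emb, chunk_embs[cid])), cid) for cid in candidate_ids),
        reverse=True,
    )
    return [chunks[cid] for sim, cid in scored[:k_local] if sim >= tau_d]
```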
Global Search steps:
- Compare the query embedding with community report embeddings.
- Select the top-k most relevant communities.
- Retrieve all chunks associated with the selected communities.
- Compare the query embedding with those chunks using cosine similarity.
- Rank them globally and return the top-k chunks.
Global Search captures higher-level semantic associations beyond individual entity matches.
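Global Search can be sketched in the same style; the parameter names mirror config.yaml, while the data structures and signature are again assumptions.

```python
# Sketch of Global Search: community report matching, then global chunk re-ranking.
import numpy as np

def global_search(query_emb, report_embs, community_chunk_ids, chunk_embs, chunks,
                  top_k_communities=3, k_global=5):
    # 1. Rank community reports by similarity to the query and keep the top communities.
    ranked = sorted(report_embs, key=lambda c: -float(np.dot(query_emb, report_embs[c])))
    selected = ranked[:top_k_communities]
    # 2. Gather every chunk attached to the selected communities.
    candidate_ids = {cid for c in selected for cid in community_chunk_ids[c]}
    # 3. Globally re-rank the candidates and return the top-k chunks.
    best = sorted(candidate_ids, key=lambda cid: -float(np.dot(query_emb, chunk_embs[cid])))
    return [chunks[cid] for cid in best[:k_global]]
```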
During context fusion, the system merges:
- Top-k Local Search chunks
- Top-k Global Search chunks
- Selected community summaries
These are delivered as contextual evidence to the LLM along with the query. The LLM generates the final answer while grounding its output in the retrieved text fragments.
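The fusion and answering step might look like the sketch below, which calls Ollama's /api/generate REST endpoint directly; the prompt template and function signature are assumptions.

```python
# Sketch of context fusion and grounded answer generation via Ollama's REST API.
# Prompt wording and signature are assumptions; the payload follows /api/generate.
import requests

def answer(query, local_chunks, global_chunks, community_summaries,
           host="http://localhost:11434", model="mistral"):
    context = "\n\n".join(local_chunks + global_chunks + community_summaries)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```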
rag/
├── data/
│   ├── Ambedkar_works.pdf              # QA source document
│   └── processed/
│       ├── chunks.json
│       ├── knowledge_graph.pkl
│       ├── community_summaries.json
│       └── embeddings.json
├── src/
│   ├── chunking/
│   │   ├── semantic_chunker.py
│   │   └── buffer_merger.py
│   ├── graph/
│   │   ├── entity_extractor.py
│   │   ├── graph_builder.py
│   │   ├── community_detector.py
│   │   └── summarizer.py
│   ├── retrieval/
│   │   ├── local_search.py
│   │   ├── global_search.py
│   │   └── ranker.py
│   ├── llm/
│   │   ├── llm_client.py
│   │   ├── prompt_templates.py
│   │   └── answer_generator.py
│   └── pipeline/
│       └── main.py                     # Main pipeline
├── tests/
└── config.yaml                         # Hyperparameters
Each parameter in config.yaml controls a specific part of the SemRAG pipeline.
Chunking parameters:

| Parameter | Description |
|---|---|
| embedding_model | SentenceTransformer model used throughout the pipeline. |
| buffer_size | Number of adjacent sentences merged around each sentence to preserve context. |
| theta | Semantic distance threshold for chunk boundary detection. |
| T_max | Maximum token limit of a chunk before it is recursively split. |
| overlap | Overlap between sub-chunks during recursive splitting. |
Retrieval parameters:

| Parameter | Description |
|---|---|
| tau_e | Threshold for entity-query similarity in Local Search. |
| tau_d | Threshold for chunk-query similarity in Local Search. |
| k_local | Number of chunks to return from Local Search. |
| k_global | Number of chunks to return from Global Search. |
| top_k_communities | Number of communities to explore in Global Search. |
LLM and storage parameters:

| Parameter | Description |
|---|---|
| ollama_host | URL of the Ollama server. |
| ollama_model | Model used by Ollama for summarization and answer generation. |
| llm_parallelism | Number of parallel LLM calls for community summarization during index rebuild. |
| chunks_path | Path to the serialized chunk results. |
| kg_path | Path to the serialized knowledge graph. |
| community_path | Path to the community summaries and mappings. |
| embeddings_path | Path to the node and chunk embeddings. |
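As a quick sanity check (not part of the repository), the parameters listed above can be verified when loading config.yaml, for example:

```python
# Hypothetical check that config.yaml defines every parameter described above.
import yaml

REQUIRED_KEYS = {
    "embedding_model", "buffer_size", "theta", "T_max", "overlap",
    "tau_e", "tau_d", "k_local", "k_global", "top_k_communities",
    "ollama_host", "ollama_model", "llm_parallelism",
    "chunks_path", "kg_path", "community_path", "embeddings_path",
}

with open("config.yaml") as f:
    config = yaml.safe_load(f)

missing = REQUIRED_KEYS - set(config)
if missing:
    raise ValueError(f"config.yaml is missing parameters: {sorted(missing)}")
```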
During index rebuilding, the system may generate community summaries in parallel using multiple LLM calls. To ensure stable execution, the number of parallel workers must not exceed Ollama’s request capacity:
llm_parallelism ≤ OLLAMA_NUM_PARALLEL

If this condition is violated, requests may queue internally, causing slower performance or GPU memory issues. The optimal value depends on available GPU memory and model size; larger GPUs can safely support higher parallelism.

Windows (PowerShell):

$env:OLLAMA_NUM_PARALLEL="2"
ollama serve

Linux:

OLLAMA_NUM_PARALLEL=2 ollama serve

Match this value with llm_parallelism in config.yaml for best performance.
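One way to enforce this bound, assuming the community summaries are dispatched from a thread pool, is sketched below; the worker function is hypothetical.

```python
# Sketch: cap in-flight LLM calls at llm_parallelism so Ollama never receives
# more concurrent requests than OLLAMA_NUM_PARALLEL allows.
from concurrent.futures import ThreadPoolExecutor

def summarize_all(communities, summarize_one, llm_parallelism=2):
    # summarize_one(community) -> report text; assumed to make one blocking LLM call.
    with ThreadPoolExecutor(max_workers=llm_parallelism) as pool:
        return list(pool.map(summarize_one, communities))
```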
Follow these steps before running the system.
Recommended: Python 3.9+
Install dependencies:
pip install -r requirements.txt
Install Ollama from: https://ollama.com
Verify installation:
ollama --version

The given config uses the Mistral model:

ollama pull mistral

Download the spaCy NER model used for entity extraction:

python -m ensurepip --upgrade
python -m spacy download en_core_web_sm
Ollama must be running in a separate terminal before executing the pipeline.
ollama serve
Run this once to build the index:
python -m rag.src.pipeline.main --rebuild-index
This performs:
- Semantic chunking
- Knowledge graph construction
- Community detection
- Community report generation
- Embedding storage
For the provided demonstration PDF, this step has already been completed. Users who clone this repository and want to query the given PDF only need to run the interactive mode.
Once the index is built:
python -m rag.src.pipeline.main --run
This launches an interactive loop:
- User enters a question
- System performs Local + Global SemRAG retrieval
- System generates an LLM-grounded answer
- User may continue asking or type exit to quit
To launch the interactive loop with visibility into the prompts sent to the LLM:
python -m rag.src.pipeline.main --run --debug
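Conceptually, the loop behaves like the sketch below; the retrieval and answer helpers stand in for the repository's actual functions.

```python
# Rough sketch of the interactive query loop; helper functions are stand-ins
# (global_search is assumed here to also return the selected community summaries).
def interactive_loop(local_search, global_search, answer, debug=False):
    while True:
        query = input("Question (or 'exit'): ").strip()
        if query.lower() == "exit":
            break
        local_chunks = local_search(query)
        global_chunks, summaries = global_search(query)
        if debug:
            # --debug: show the context handed to the LLM.
            print("Retrieved context:\n", "\n".join(local_chunks + global_chunks + summaries))
        print(answer(query, local_chunks, global_chunks, summaries))
```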
- The index rebuild step may take several minutes to hours depending on hardware and length of the PDF.
- Ollama must be running at all times; otherwise, summarization and answer generation will fail.
- The system operates fully offline.
This implementation is intended strictly for educational, evaluation, and research purposes to reproduce the SemRAG architecture for RAG-based question answering. Ensure compliance with licensing for all models and datasets used.
SemRAG: Semantic Knowledge-Augmented RAG for Improved Question-Answering
Authors: Kezhen Zhong, Basem Suleiman, Abdelkarim Erradi, Shijing Chen
arXiv preprint: https://arxiv.org/abs/2507.21110