RAG Module

A high-performance Retrieval-Augmented Generation (RAG) system for extracting structured insights from personal documents (CVs, resumes, publications, and technical reports).

Key Components

Semantic Intelligence: Powered by sentence-transformers/all-mpnet-base-v2, which embeds document chunks and queries so they can be matched by meaning rather than by exact keywords.
Diverse Retrieval: Implements Maximal Marginal Relevance (MMR) to provide non-redundant, diverse information from complex documents.
Structured Extraction: Heuristic engine that categorizes findings into Technical Areas, Core Skills, Key Concepts, and Multi-word Phrases.
High Performance: Uses FAISS (Facebook AI Similarity Search) for fast, local vector search.
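
To illustrate the Diverse Retrieval component, here is a minimal sketch of MMR selection in plain Python. It is not the code in rag_tool.py: the function names (cosine, mmr_select), the lambda_mult parameter, and the toy 2-D vectors are all assumptions made for illustration. In the real module the vectors would be sentence-transformers embeddings and the candidates would come from a FAISS search.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,  # hypothetical knob: 1.0 = pure relevance, 0.0 = pure diversity
) -> List[int]:
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            # Redundancy = similarity to the closest already-selected chunk.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy example: candidates 0 and 1 are near-duplicates, candidate 2 is different.
query_vec = [1.0, 1.0]
chunk_vecs = [[1.0, 0.9], [1.0, 0.88], [0.2, 1.0]]
print(mmr_select(query_vec, chunk_vecs, k=2, lambda_mult=0.5))
```

With lambda_mult=0.5 the near-duplicate chunk is skipped in favor of the more distinct one; with lambda_mult=1.0 the selection falls back to plain relevance ranking.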

Quick Start

# 1. Add your documents to data/personal_docs/
# 2. Run automated setup and ingestion
.\scripts\setup_and_run.bat

Folder Structure

rag/
├── src/                      # Core Logic
│   ├── ingest_documents.py   # Document ingestion & FAISS index builder
│   ├── rag_tool.py           # CrewAI-compatible extraction tool
│   ├── inspect_rag.py        # Utility to verify index quality
│   └── logging_config.py     # Centralized logging engine
├── config/                   # Configuration
│   └── config.yaml           # Extraction & Search parameters
├── scripts/                  # Automation
│   └── setup_and_run.bat     # One-click environment & data setup
├── docs/                     # Documentation
│   ├── SETUP.md              # Detailed installation guide
│   └── EXTERNAL_STORAGE.md   # Privacy & External storage guide
├── data/                     # Data Storage (gitignored)
│   ├── personal_docs/        # Input sources
│   └── vector_db/            # Local FAISS index
├── logs/                     # Execution logs
└── requirements.txt          # Dependency manifest

Features

✓ Multi-format Support: PDF, DOCX, TXT, and MD ingestion.
✓ Deterministic Metadata: Tracks source files and document types for every insight.
✓ MMR Search: Reduces redundant results by penalizing chunks that are too similar to ones already retrieved.
✓ Performance Optimized: Includes hf_xet for accelerated model downloads.
✓ Local & Private: No external API calls are made for document processing or storage.
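
As a rough illustration of the multi-format and metadata features above, the sketch below walks an input folder, keeps only the four supported file types, and records a source file and document type for each. The names SUPPORTED_TYPES and discover_documents, and the metadata keys, are hypothetical; ingest_documents.py may organize this differently.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical mapping from file extension to a document-type label.
SUPPORTED_TYPES = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".txt": "text",
    ".md": "markdown",
}


def discover_documents(root: str) -> List[Dict[str, str]]:
    """List supported files under root with source and type metadata.

    Paths are sorted so that repeated runs over the same folder
    produce the same ordering.
    """
    records: List[Dict[str, str]] = []
    for path in sorted(Path(root).rglob("*")):
        doc_type = SUPPORTED_TYPES.get(path.suffix.lower())
        if path.is_file() and doc_type is not None:
            records.append({"source": path.name, "doc_type": doc_type})
    return records


if __name__ == "__main__":
    # data/personal_docs/ is the input folder named in Quick Start.
    for record in discover_documents("data/personal_docs"):
        print(record)
```

Unsupported files (images, spreadsheets, and so on) are skipped silently in this sketch; a real ingestion step would probably log them.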

Requirements

Python 3.8+
uv package manager (recommended for faster environment setup and dependency installation)
Windows (scripts optimized for PowerShell/CMD)

Documentation

Setup Guide (docs/SETUP.md) - Getting started from scratch.
External Storage Guide (docs/EXTERNAL_STORAGE.md) - How to keep your data outside the Git repo.
Retrieval Mechanics - Detailed look at how the extraction works.

trahulkumar/Expertise-Insight-RAG