A high-performance, private RAG system for extracting structured technical insights and expertise from professional documents using MPNet embeddings, FAISS, and MMR diversity search.

trahulkumar/Expertise-Insight-RAG

RAG Module

A high-performance Retrieval-Augmented Generation (RAG) system for extracting structured insights from personal documents (CVs, resumes, publications, and technical reports).

Key Components

  • Semantic Intelligence: Embeds text with sentence-transformers/all-mpnet-base-v2, which maps each chunk to a 768-dimensional vector so retrieval matches on meaning rather than exact keywords.
  • Diverse Retrieval: Implements Maximal Marginal Relevance (MMR) to provide non-redundant, diverse information from complex documents.
  • Structured Extraction: A heuristic engine categorizes findings into Technical Areas, Core Skills, Key Concepts, and Multi-word Phrases.
  • High Performance: Uses FAISS (Facebook AI Similarity Search) for fast vector search that runs entirely on the local machine.
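
The MMR step listed above can be illustrated with a minimal, dependency-free sketch. This is not the repository's implementation: the function names `cosine` and `mmr_select`, the `lambda_mult` parameter, and the toy 2-D vectors are all assumptions made for illustration. In the real system, the vectors would be MPNet embeddings and the candidates would come from the FAISS index.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    lambda_mult (hypothetical name) trades relevance against diversity:
    1.0 ranks purely by similarity to the query, lower values penalize
    candidates that resemble documents already selected.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], float("-inf")
        for i in candidates:
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns both near-duplicates. At `0.5`, the second pick skips the duplicate in favor of the less similar document.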

Quick Start

# 1. Add your documents to data/personal_docs/
# 2. Run automated setup and ingestion
.\scripts\setup_and_run.bat

Folder Structure

rag/
├── src/                      # Core Logic
│   ├── ingest_documents.py   # Document ingestion & FAISS index builder
│   ├── rag_tool.py           # CrewAI-compatible extraction tool
│   ├── inspect_rag.py        # Utility to verify index quality
│   └── logging_config.py     # Centralized logging engine
├── config/                   # Configuration
│   └── config.yaml           # Extraction & Search parameters
├── scripts/                  # Automation
│   └── setup_and_run.bat     # One-click environment & data setup
├── docs/                     # Documentation
│   ├── SETUP.md              # Detailed installation guide
│   └── EXTERNAL_STORAGE.md   # Privacy & External storage guide
├── data/                     # Data Storage (gitignored)
│   ├── personal_docs/        # Input sources
│   └── vector_db/            # Local FAISS index
├── logs/                     # Execution logs
└── requirements.txt          # Dependency manifest

Features

  • Multi-format Support: PDF, DOCX, TXT, and MD ingestion.
  • Deterministic Metadata: Tracks source files and document types for every insight.
  • MMR Search: Reduces repeated information by penalizing retrieved chunks that are too similar to ones already selected.
  • Performance Optimized: Includes hf_xet for accelerated model downloads.
  • Local & Private: No external API calls are made for document processing or storage.
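
As a rough illustration of the multi-format and metadata features above, the sketch below tags every chunk with its source file and a document type derived from the file extension. It is an assumption-laden example, not the code in ingest_documents.py: the `Chunk` dataclass, the `SUPPORTED_TYPES` mapping, the helper names, and the chunk size and overlap values are all hypothetical, and text extraction from PDF or DOCX files is left out.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Hypothetical mapping from file extension to a document-type label.
SUPPORTED_TYPES = {".pdf": "pdf", ".docx": "docx", ".txt": "text", ".md": "markdown"}


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str    # file name the chunk came from
    doc_type: str  # label derived from the file extension
    chunk_id: int  # position of the chunk within its source file


def doc_type_for(path: Path) -> str:
    """Map a file path to a document type, rejecting unsupported formats."""
    ext = path.suffix.lower()
    if ext not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return SUPPORTED_TYPES[ext]


def chunk_text(text: str, path: Path, size: int = 200, overlap: int = 50) -> List[Chunk]:
    """Split already-extracted text into overlapping chunks with metadata.

    The same text and path always yield the same chunks and ids, which is
    what makes the metadata deterministic in this sketch.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    doc_type = doc_type_for(path)
    step = size - overlap
    chunks: List[Chunk] = []
    start = 0
    while start < len(text):
        chunks.append(Chunk(text[start:start + size], path.name, doc_type, len(chunks)))
        if start + size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("a" * 450, Path("data/personal_docs/cv.PDF"))
for c in chunks:
    print(c.chunk_id, c.source, c.doc_type, len(c.text))
```

Keeping the source and type on each chunk means any extracted insight can be traced back to the file it came from.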

Requirements

  • Python 3.8+
  • uv package manager (recommended for faster environment setup and dependency installation)
  • Windows (scripts optimized for PowerShell/CMD)

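Documentation

  • docs/SETUP.md: Detailed installation guide
  • docs/EXTERNAL_STORAGE.md: Privacy and external storage guide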
