A high-performance Retrieval-Augmented Generation (RAG) system for extracting structured insights from personal documents (CVs, resumes, publications, and technical reports).
- Semantic Intelligence: Powered by `sentence-transformers/all-mpnet-base-v2` for superior context understanding.
- Diverse Retrieval: Implements Maximal Marginal Relevance (MMR) to provide non-redundant, diverse information from complex documents.
- Structured Extraction: Heuristic engine that categorizes findings into Technical Areas, Core Skills, Key Concepts, and Multi-word Phrases.
- High Performance: Uses FAISS (Facebook AI Similarity Search) for fast, fully local vector search.
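
To make the Diverse Retrieval point concrete, here is a minimal, dependency-free sketch of MMR selection. It is not the code in `rag_tool.py`; the function names `cosine` and `mmr_select` and the `lambda_mult` parameter are illustrative assumptions. Each step picks the chunk that best balances similarity to the query against similarity to chunks already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, chunk_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k chunks chosen by Maximal Marginal Relevance.

    lambda_mult (hypothetical name) trades relevance against diversity:
    1.0 is pure relevance ranking, lower values penalize redundancy more.
    """
    relevance = [cosine(query_vec, v) for v in chunk_vecs]
    selected = []
    candidates = list(range(len(chunk_vecs)))

    while candidates and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to any chunk already picked.
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


if __name__ == "__main__":
    query = [1.0, 1.0]
    # Chunks 0 and 1 are near-duplicates; chunk 2 covers a different direction.
    chunks = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
    print(mmr_select(query, chunks, k=2, lambda_mult=0.5))
    print(mmr_select(query, chunks, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the near-duplicate chunk is skipped in favor of the distinct one; with `lambda_mult=1.0` the selection falls back to plain relevance ranking.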
# 1. Add your documents to data/personal_docs/
# 2. Run automated setup and ingestion
.\scripts\setup_and_run.bat

rag/
├── src/ # Core Logic
│ ├── ingest_documents.py # Document ingestion & FAISS index builder
│ ├── rag_tool.py # CrewAI-compatible extraction tool
│ ├── inspect_rag.py # Utility to verify index quality
│ └── logging_config.py # Centralized logging engine
├── config/ # Configuration
│ └── config.yaml # Extraction & Search parameters
├── scripts/ # Automation
│ └── setup_and_run.bat # One-click environment & data setup
├── docs/ # Documentation
│ ├── SETUP.md # Detailed installation guide
│ └── EXTERNAL_STORAGE.md # Privacy & External storage guide
├── data/ # Data Storage (gitignored)
│ ├── personal_docs/ # Input sources
│ └── vector_db/ # Local FAISS index
├── logs/ # Execution logs
└── requirements.txt # Dependency manifest
- ✓ Multi-format Support: PDF, DOCX, TXT, and MD ingestion.
- ✓ Deterministic Metadata: Tracks source files and document types for every insight.
- ✓ MMR Search: Prevents "information loops" by ensuring retrieved chunks are distinct.
- ✓ Performance Optimized: Includes `hf_xet` for accelerated model downloads.
- ✓ Local & Private: No external API calls are made for document processing or storage.
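
As a rough illustration of how multi-format ingestion and per-file metadata could fit together, the sketch below walks an input folder and records the source file and a type label for each supported document. It is an assumption-laden example, not the logic in `ingest_documents.py`: the `SUPPORTED` mapping, the `collect_documents` name, and the choice to derive `doc_type` from the file extension are all hypothetical.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a document-type label.
SUPPORTED = {".pdf": "pdf", ".docx": "docx", ".txt": "text", ".md": "markdown"}


def collect_documents(root):
    """List supported files under root with source and type metadata.

    Paths are sorted so repeated runs yield records in the same order.
    """
    records = []
    for path in sorted(Path(root).rglob("*")):
        doc_type = SUPPORTED.get(path.suffix.lower())
        if path.is_file() and doc_type:
            records.append({"source": path.name, "doc_type": doc_type})
    return records


if __name__ == "__main__":
    for record in collect_documents("data/personal_docs"):
        print(record)
```

Files with unsupported extensions are skipped silently in this sketch; a real pipeline would more likely log them.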
- Python 3.8+
- UV Package Manager (recommended for 10x faster setup)
- Windows (scripts optimized for PowerShell/CMD)
- Setup Guide - Getting started from scratch.
- External Storage Guide - How to keep your data outside the Git repo.
- Retrieval Mechanics - Detailed look at how the extraction works.