A lightweight Streamlit-based project that ingests PDF documents, chunks them, computes embeddings, stores them in ChromaDB, and answers user queries using a retrieval-augmented generation (RAG) flow.
- Main entry: `main.py` (starts the Streamlit UI)
- Frontend: `Webapp/Frontend/app.py` (Streamlit app)
- Core services: `CoreServices/` (chunking, embeddings, retrieval, vector store)
- Vector persistence: `VectorStore/` (Chroma persistence path)
- Python 3.10+ (use a virtual environment)
- Git (optional)
- Create and activate a venv (Windows PowerShell):

  ```powershell
  python -m venv .venv
  & .\.venv\Scripts\Activate.ps1
  ```

  On macOS / Linux:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install packages:

  ```bash
  python -m pip install --upgrade pip
  python -m pip install -r requirements.txt
  ```

  Note: some packages (like sentence-transformers, torch, chromadb, or PyMuPDF) may require additional system dependencies; refer to the corresponding package docs if installation fails.
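The repository's `requirements.txt` is authoritative; as a rough sketch based only on packages this README mentions, it will contain something like:

```text
streamlit
sentence-transformers
torch
chromadb
PyMuPDF
pydantic
openai
azure-identity
azure-keyvault-secrets
```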
The project runs a Streamlit UI that accepts PDF uploads and exposes a text input for queries.
- From repo root, run:

  ```bash
  python main.py
  ```

  This executes Streamlit via the `main.py` helper, which internally runs:

  ```bash
  python -m streamlit run Webapp/Frontend/app.py
  ```

- Or you can run Streamlit directly:

  ```bash
  python -m streamlit run Webapp/Frontend/app.py
  ```

Open the browser URL shown by Streamlit (usually http://localhost:8501).
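For reference, a minimal sketch of such a launcher, assuming `main.py` simply shells out to Streamlit via `subprocess` (the actual file may do more):

```python
# Launcher sketch (an assumption, not the repo's exact main.py).
# Runs the Streamlit frontend with the same interpreter that invoked this script.
import subprocess
import sys

if __name__ == "__main__":
    subprocess.run(
        [sys.executable, "-m", "streamlit", "run", "Webapp/Frontend/app.py"],
        check=True,
    )
```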
The app optionally reads several environment variables for model selection and secrets. Examples:
- `TRANSFORMER_MODEL_NAME` — model used by `EmbeddingManager` in `ProcessDocument` (sentence-transformers name, e.g., `all-MiniLM-L6-v2`).
- `Embedding_Model_Name` — used by `ProcessSearchResults` when creating embeddings for queries.
- `Model` — LLM model name used when calling the LLM (default `gpt-5` in code).
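A sketch of how these variables can be read (the fallback values here are illustrative assumptions, not necessarily the code's real defaults):

```python
import os

# Fallbacks are illustrative; check the source for the actual defaults.
transformer_model = os.environ.get("TRANSFORMER_MODEL_NAME", "all-MiniLM-L6-v2")
embedding_model = os.environ.get("Embedding_Model_Name", transformer_model)
llm_model = os.environ.get("Model", "gpt-5")
```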
Azure Key Vault integration for retrieving a secret (e.g., OpenAI key):
- `KeyVault_Name`
- `secret_name`
- `AZURE_TENANT_ID`
- `AZURE_CLIENT_ID`
- `AZURE_CLIENT_SECRET`
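A minimal sketch of that lookup using the standard `azure-identity` and `azure-keyvault-secrets` packages (an assumption about the implementation; `DefaultAzureCredential` picks up the `AZURE_*` variables from the environment):

```python
import os

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Build the vault URL from KeyVault_Name, then fetch the secret named by secret_name.
vault_url = f"https://{os.environ['KeyVault_Name']}.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
api_key = client.get_secret(os.environ["secret_name"]).value
```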
If you do not use Azure Key Vault, the LLM call will attempt to use an empty API key and may not work. Set environment variables using your shell or a .env loader.
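For example, with python-dotenv (one possible .env loader, not necessarily what this project uses):

```python
# Assumes `pip install python-dotenv` and a .env file in the repo root, e.g.:
#   Model=gpt-5
#   TRANSFORMER_MODEL_NAME=all-MiniLM-L6-v2
from dotenv import load_dotenv

load_dotenv()  # reads .env and populates os.environ for the current process
```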
- `main.py` — entry point that launches Streamlit
- `Webapp/Frontend/app.py` — Streamlit UI (file upload, submit query)
- `CoreServices/` — core logic
  - `DocumentChunker.py` — load PDFs & split into chunks
  - `EmbeddingManager.py` — sentence-transformers wrapper
  - `VectoreStore.py` — wrapper for adding docs to ChromaDB
  - `ChromaManager.py` — manages Chroma collection and persistence
  - `Retriever.py` — retrieval pipeline using embeddings + Chroma
  - `ProcessDocument.py` — end-to-end pdf -> chunks -> embeddings -> store
  - `ProcessSearchResults.py` — processes queries and calls LLM
  - `Models/schema.py` — Pydantic models (e.g., `PDF_Chunk`)
- `VectorStore/` — Chroma persistence files
- `Webapp/Uploads/` — uploaded PDFs
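For orientation, a sketch of what `Models/schema.py`'s `PDF_Chunk` looks like, inferred from the fields used elsewhere in this README (the actual file is authoritative):

```python
from pydantic import BaseModel

class PDF_Chunk(BaseModel):
    id: str
    text: str
    metadata: str   # currently declared as str; dict metadata is discussed near the end of this README
    source: str
    page_number: int
```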
- Run the full pipeline by uploading PDFs via the Streamlit UI (the UI triggers `ProcessDocument.process()` after saving files).

- Manually run document processing from Python:

  ```python
  from CoreServices.ProcessDocument import ProcessDocument

  p = ProcessDocument()
  p.process()
  ```

- Query using the search flow (example):

  ```python
  from CoreServices.ProcessSearchResults import ProcessSearchResults

  result = ProcessSearchResults(query="What skills are needed for X?", top_k=3)
  print(result.process_query_results())
  ```

- Convert `DocumentChunker` output into `PDF_Chunk` instances (two approaches):

  - If you keep `metadata` as a string in `schema.PDF_Chunk`:

    ```python
    import json
    from CoreServices import DocumentChunker
    from CoreServices.Models.schema import PDF_Chunk

    chunker = DocumentChunker.DocumentChunker()  # reuse one instance
    chunks = chunker.generate_pdf_chunks(chunker.load_pdf())
    models = [PDF_Chunk(id=c['id'], text=c['text'], metadata=json.dumps(c['metadata']),
                        source=c['source'], page_number=c['page_number']) for c in chunks]
    ```

  - Or prefer modifying the schema to accept dict metadata and then instantiate directly:

    ```python
    # after changing the schema to: metadata: dict[str, Any]
    from CoreServices.Models.schema import PDF_Chunk

    models = [PDF_Chunk(**c) for c in chunks]
    ```
- Metadata mismatch: `DocumentChunker.generate_pdf_chunks()` returns `metadata` as a dict, while `Models/schema.PDF_Chunk` currently declares `metadata: str`. Consider updating the type to `dict[str, Any]` for clarity.

- `ProcessDocument.process()` currently expects `pdf_chunks` to be objects with a `.text` attribute (it does `generated_texts = [chunk.text for chunk in pdf_chunks]`), while `DocumentChunker` produces a list of dicts (`chunk['text']`). This can raise an `AttributeError`. Workaround: modify `ProcessDocument` to use `chunk['text']`, or change the chunk output to dataclass objects (see the sketch after this list).

- ChromaDB data persists in the `VectorStore` folder (see `ChromaManager`). If the collection is empty or not found, check permissions and confirm `chromadb` client initialization.

- If Streamlit is not found when running `main.py`, ensure Streamlit is installed in the active environment: `pip install streamlit`.
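A minimal sketch of the dataclass workaround mentioned above (the `Chunk` name and sample data are hypothetical):

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical container so `chunk.text` attribute access works on
# DocumentChunker's dict output without changing ProcessDocument itself.
@dataclass
class Chunk:
    id: str
    text: str
    metadata: dict[str, Any]
    source: str
    page_number: int

# Sample dict in DocumentChunker's output shape (hypothetical values).
chunks = [{"id": "1", "text": "example text", "metadata": {"page": 1},
           "source": "doc.pdf", "page_number": 1}]
chunk_objects = [Chunk(**c) for c in chunks]
generated_texts = [c.text for c in chunk_objects]  # attribute access now works
```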