QandAUsingLLM — Developer Guide 📚

A lightweight Streamlit-based project that ingests PDF documents, chunks them, computes embeddings, stores them in ChromaDB, and answers user queries using a retrieval-augmented generation (RAG) flow.


Quick summary

  • Main entry: main.py (starts the Streamlit UI)
  • Frontend: Webapp/Frontend/app.py (Streamlit app)
  • Core services: CoreServices/ (chunking, embeddings, retrieval, vector store)
  • Vector persistence: VectorStore/ (Chroma persistence path)

Prerequisites

  • Python 3.10+ (use a virtual environment)
  • Git (optional)

Install dependencies (from requirements.txt)

  1. Create and activate a venv (Windows PowerShell):
python -m venv .venv
& .\.venv\Scripts\Activate.ps1

On macOS / Linux:

python -m venv .venv
source .venv/bin/activate

  2. Install packages:
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Note: Some packages (like sentence-transformers, torch, chromadb, or PyMuPDF) may require additional system dependencies; refer to the corresponding package docs if installation fails.


Run the application

The project runs a Streamlit UI that accepts PDF uploads and exposes a text input for queries.

  1. From repo root, run:
python main.py

This executes Streamlit via the main.py helper. Equivalently, you can run Streamlit directly:

python -m streamlit run Webapp/Frontend/app.py

Open the browser URL shown by Streamlit (usually http://localhost:8501).
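
For reference, a launcher like main.py often reduces to a single subprocess call. A minimal sketch of such a helper (the actual main.py may differ):

import subprocess
import sys

# Launch the Streamlit UI with the interpreter from the active environment,
# so the app sees the packages installed in your venv.
subprocess.run(
    [sys.executable, "-m", "streamlit", "run", "Webapp/Frontend/app.py"],
    check=True,
)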


Environment variables & configuration

The app optionally reads several environment variables for model selection and secrets. Examples:

  • TRANSFORMER_MODEL_NAME — model used by EmbeddingManager in ProcessDocument (sentence-transformers name, e.g., all-MiniLM-L6-v2).
  • Embedding_Model_Name — used by ProcessSearchResults when creating embeddings for queries.
  • Model — LLM model name used when calling the LLM (default gpt-5 in code).

For Azure Key Vault integration (used to retrieve a secret such as the OpenAI API key), set:

  • KeyVault_Name
  • secret_name
  • AZURE_TENANT_ID
  • AZURE_CLIENT_ID
  • AZURE_CLIENT_SECRET
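
A minimal sketch of fetching such a secret with the azure-identity and azure-keyvault-secrets packages (this mirrors the variable list above; the project's actual Key Vault code may differ):

import os

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential picks up AZURE_TENANT_ID, AZURE_CLIENT_ID,
# and AZURE_CLIENT_SECRET from the environment.
vault_url = f"https://{os.environ['KeyVault_Name']}.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
api_key = client.get_secret(os.environ['secret_name']).value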

If you do not use Azure Key Vault, the LLM call will attempt to use an empty API key and may not work. Set environment variables using your shell or a .env loader.
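
Purely as an illustration, the variables can also be set from Python before the app reads them (the values below are placeholders; only the TRANSFORMER_MODEL_NAME example comes from the list above):

import os

# Placeholder values for local development; adjust to your setup.
os.environ.setdefault("TRANSFORMER_MODEL_NAME", "all-MiniLM-L6-v2")
os.environ.setdefault("Embedding_Model_Name", "all-MiniLM-L6-v2")  # assumption: same model for queries
os.environ.setdefault("Model", "gpt-5")  # default named in the code, per the list above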


Project structure (high level)

  • main.py — entry point that launches Streamlit
  • Webapp/Frontend/app.py — Streamlit UI (file upload, submit query)
  • CoreServices/ — core logic
    • DocumentChunker.py — load PDFs & split into chunks
    • EmbeddingManager.py — sentence-transformers wrapper
    • VectoreStore.py — wrapper for adding docs to ChromaDB
    • ChromaManager.py — manages Chroma collection and persistence
    • Retriever.py — retrieval pipeline using embeddings + Chroma
    • ProcessDocument.py — end-to-end pdf -> chunks -> embeddings -> store
    • ProcessSearchResults.py — processes queries and calls LLM
    • Models/schema.py — Pydantic models (e.g., PDF_Chunk)
  • VectorStore/ — Chroma persistence files
  • Webapp/Uploads/ — uploaded PDFs

Examples / Developer Usage

  • Run the full pipeline by uploading PDFs via the Streamlit UI (the UI triggers ProcessDocument.process() after saving the files).

  • Manually run document processing from Python:

from CoreServices.ProcessDocument import ProcessDocument
p = ProcessDocument()
p.process()

  • Query using the search flow (example):

from CoreServices.ProcessSearchResults import ProcessSearchResults
result = ProcessSearchResults(query="What skills are needed for X?", top_k=3)
print(result.process_query_results())

  • Convert DocumentChunker output into PDF_Chunk instances (two approaches):

  1. If you keep metadata as a string in schema.PDF_Chunk:

import json
from CoreServices import DocumentChunker
from CoreServices.Models.schema import PDF_Chunk

chunker = DocumentChunker.DocumentChunker()
chunks = chunker.generate_pdf_chunks(chunker.load_pdf())
models = [PDF_Chunk(id=c['id'], text=c['text'], metadata=json.dumps(c['metadata']), source=c['source'], page_number=c['page_number']) for c in chunks]

  2. Alternatively, modify the schema to accept dict metadata and instantiate directly:

# after changing the schema field to: metadata: dict[str, Any]
from CoreServices.Models.schema import PDF_Chunk
models = [PDF_Chunk(**c) for c in chunks]  # chunks from the previous example

Troubleshooting & Notes

  • Metadata mismatch: DocumentChunker.generate_pdf_chunks() returns metadata as a dict, while Models/schema.PDF_Chunk currently declares metadata: str. Consider updating the type to dict[str, Any], as sketched below.

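A minimal sketch of the updated Pydantic model (field names follow the example in the Examples section; field types other than metadata are assumptions):

from typing import Any
from pydantic import BaseModel

class PDF_Chunk(BaseModel):
    id: str
    text: str
    metadata: dict[str, Any]  # previously declared as: metadata: str
    source: str
    page_number: int
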
  • ProcessDocument.process() currently expects pdf_chunks to be objects with a .text attribute (it builds generated_texts = [chunk.text for chunk in pdf_chunks]), while DocumentChunker produces a list of dicts (chunk['text']), so this line raises an AttributeError. Workaround: modify ProcessDocument to read chunk['text'], or change the chunker output to objects; see the sketch below.

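A sketch of the dict-based workaround (the surrounding code in ProcessDocument.process() may differ):

# Read chunk text from dicts rather than attributes:
generated_texts = [chunk['text'] for chunk in pdf_chunks]

# Or handle both shapes defensively:
generated_texts = [
    chunk['text'] if isinstance(chunk, dict) else chunk.text
    for chunk in pdf_chunks
]
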
  • ChromaDB data persists in the VectorStore folder (see ChromaManager). If the collection is empty or not found, check filesystem permissions and confirm the chromadb client initializes correctly; a quick check is sketched below.

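A quick way to inspect the persisted store with the chromadb client (the persistence path and collection name here are assumptions; check ChromaManager.py for the actual values):

import chromadb

client = chromadb.PersistentClient(path="VectorStore")  # assumed persistence path
collection = client.get_or_create_collection("pdf_chunks")  # assumed collection name
print(collection.count())  # 0 means nothing has been ingested yet
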
  • If Streamlit is not found when running main.py, ensure Streamlit is installed in the active environment: pip install streamlit.

About

Building a CPU-only PDF Q&A system using Hugging Face, ChromaDB vector search, and Python.
