A lightweight Streamlit-based project that ingests PDF documents, chunks them, computes embeddings, stores them in ChromaDB, and answers user queries using a retrieval-augmented generation (RAG) flow.
- Main entry: `main.py` (starts the Streamlit UI)
- Frontend: `Webapp/Frontend/app.py` (Streamlit app)
- Core services: `CoreServices/` (chunking, embeddings, retrieval, vector store)
- Vector persistence: `VectorStore/` (Chroma persistence path)
- Python 3.10+ (use a virtual environment)
- Git (optional)
- Create and activate a venv (Windows PowerShell):

  ```powershell
  python -m venv .venv
  & .\.venv\Scripts\Activate.ps1
  ```

  On macOS / Linux:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install packages:

  ```bash
  python -m pip install --upgrade pip
  python -m pip install -r requirements.txt
  ```

  Note: some packages (like sentence-transformers, torch, chromadb, or PyMuPDF) may require additional system dependencies; refer to the corresponding package docs if installation fails.
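The repository's `requirements.txt` is authoritative; as a rough sketch based only on packages this README mentions, it will contain something like:

```text
streamlit
sentence-transformers
torch
chromadb
PyMuPDF
pydantic
openai
azure-identity
azure-keyvault-secrets
```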
The project runs a Streamlit UI that accepts PDF uploads and exposes a text input for queries.
- From repo root, run:

  ```bash
  python main.py
  ```

  This executes Streamlit via the `main.py` helper, which internally runs:

  ```bash
  python -m streamlit run Webapp/Frontend/app.py
  ```

- Or you can run Streamlit directly:

  ```bash
  python -m streamlit run Webapp/Frontend/app.py
  ```

Open the browser URL shown by Streamlit (usually http://localhost:8501).
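For reference, a minimal sketch of such a launcher, assuming `main.py` simply shells out to Streamlit via `subprocess` (the actual file may do more):

```python
# Launcher sketch (an assumption, not the repo's exact main.py).
# Runs the Streamlit frontend with the same interpreter that invoked this script.
import subprocess
import sys

if __name__ == "__main__":
    subprocess.run(
        [sys.executable, "-m", "streamlit", "run", "Webapp/Frontend/app.py"],
        check=True,
    )
```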
The app optionally reads several environment variables for model selection and secrets. Examples:
- `TRANSFORMER_MODEL_NAME` — model used by `EmbeddingManager` in `ProcessDocument` (sentence-transformers name, e.g., `all-MiniLM-L6-v2`).
- `Embedding_Model_Name` — used by `ProcessSearchResults` when creating embeddings for queries.
- `Model` — LLM model name used when calling the LLM (default `gpt-5` in code).
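A sketch of how these variables can be read (the fallback values here are illustrative assumptions, not necessarily the code's real defaults):

```python
import os

# Fallbacks are illustrative; check the source for the actual defaults.
transformer_model = os.environ.get("TRANSFORMER_MODEL_NAME", "all-MiniLM-L6-v2")
embedding_model = os.environ.get("Embedding_Model_Name", transformer_model)
llm_model = os.environ.get("Model", "gpt-5")
```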
Azure Key Vault integration for retrieving a secret (e.g., OpenAI key):
- `KeyVault_Name`
- `secret_name`
- `AZURE_TENANT_ID`
- `AZURE_CLIENT_ID`
- `AZURE_CLIENT_SECRET`
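A minimal sketch of that lookup using the standard `azure-identity` and `azure-keyvault-secrets` packages (an assumption about the implementation; `DefaultAzureCredential` picks up the `AZURE_*` variables from the environment):

```python
import os

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Build the vault URL from KeyVault_Name, then fetch the secret named by secret_name.
vault_url = f"https://{os.environ['KeyVault_Name']}.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
api_key = client.get_secret(os.environ["secret_name"]).value
```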
If you do not use Azure Key Vault, the LLM call will attempt to use an empty API key and may not work. Set environment variables using your shell or a .env loader.
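For example, with python-dotenv (one possible .env loader, not necessarily what this project uses):

```python
# Assumes `pip install python-dotenv` and a .env file in the repo root, e.g.:
#   Model=gpt-5
#   TRANSFORMER_MODEL_NAME=all-MiniLM-L6-v2
from dotenv import load_dotenv

load_dotenv()  # reads .env and populates os.environ for the current process
```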
- `main.py` — entry point that launches Streamlit
- `Webapp/Frontend/app.py` — Streamlit UI (file upload, submit query)
- `CoreServices/` — core logic
  - `DocumentChunker.py` — load PDFs & split into chunks
  - `EmbeddingManager.py` — sentence-transformers wrapper
  - `VectoreStore.py` — wrapper for adding docs to ChromaDB
  - `ChromaManager.py` — manages Chroma collection and persistence
  - `Retriever.py` — retrieval pipeline using embeddings + Chroma
  - `ProcessDocument.py` — end-to-end pdf -> chunks -> embeddings -> store
  - `ProcessSearchResults.py` — processes queries and calls LLM
  - `Models/schema.py` — Pydantic models (e.g., `PDF_Chunk`)
- `VectorStore/` — Chroma persistence files
- `Webapp/Uploads/` — uploaded PDFs
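For orientation, a sketch of what `Models/schema.py`'s `PDF_Chunk` looks like, inferred from the fields used elsewhere in this README (the actual file is authoritative):

```python
from pydantic import BaseModel

class PDF_Chunk(BaseModel):
    id: str
    text: str
    metadata: str   # currently declared as str; dict metadata is discussed near the end of this README
    source: str
    page_number: int
```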
- Run the full pipeline by uploading PDFs via the Streamlit UI (the UI triggers `ProcessDocument.process()` after saving files).

- Manually run document processing from Python:

  ```python
  from CoreServices.ProcessDocument import ProcessDocument

  p = ProcessDocument()
  p.process()
  ```

- Query using the search flow (example):

  ```python
  from CoreServices.ProcessSearchResults import ProcessSearchResults

  result = ProcessSearchResults(query="What skills are needed for X?", top_k=3)
  print(result.process_query_results())
  ```

- Convert `DocumentChunker` output into `PDF_Chunk` instances (two approaches):

  - If you keep `metadata` as a string in `schema.PDF_Chunk`:

    ```python
    import json
    from CoreServices import DocumentChunker
    from CoreServices.Models.schema import PDF_Chunk

    chunker = DocumentChunker.DocumentChunker()  # reuse one instance
    chunks = chunker.generate_pdf_chunks(chunker.load_pdf())
    models = [PDF_Chunk(id=c['id'], text=c['text'], metadata=json.dumps(c['metadata']),
                        source=c['source'], page_number=c['page_number']) for c in chunks]
    ```

  - Or prefer modifying the schema to accept dict metadata and then instantiate directly:

    ```python
    # after changing the schema to: metadata: dict[str, Any]
    from CoreServices.Models.schema import PDF_Chunk

    models = [PDF_Chunk(**c) for c in chunks]
    ```
- Metadata mismatch: `DocumentChunker.generate_pdf_chunks()` returns `metadata` as a dict, while `Models/schema.PDF_Chunk` currently declares `metadata: str`. Consider updating the type to `dict[str, Any]` for clarity.

- `ProcessDocument.process()` currently expects `pdf_chunks` to be objects with a `.text` attribute (it does `generated_texts = [chunk.text for chunk in pdf_chunks]`), while `DocumentChunker` produces a list of dicts (`chunk['text']`). This can raise an `AttributeError`. Workaround: modify `ProcessDocument` to use `chunk['text']`, or change the chunk output to dataclass objects (see the sketch after this list).

- ChromaDB data persists in the `VectorStore` folder (see `ChromaManager`). If the collection is empty or not found, check permissions and confirm `chromadb` client initialization.

- If Streamlit is not found when running `main.py`, ensure Streamlit is installed in the active environment: `pip install streamlit`.
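A minimal sketch of the dataclass workaround mentioned above (the `Chunk` name and sample data are hypothetical):

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical container so `chunk.text` attribute access works on
# DocumentChunker's dict output without changing ProcessDocument itself.
@dataclass
class Chunk:
    id: str
    text: str
    metadata: dict[str, Any]
    source: str
    page_number: int

# Sample dict in DocumentChunker's output shape (hypothetical values).
chunks = [{"id": "1", "text": "example text", "metadata": {"page": 1},
           "source": "doc.pdf", "page_number": 1}]
chunk_objects = [Chunk(**c) for c in chunks]
generated_texts = [c.text for c in chunk_objects]  # attribute access now works
```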