Multimodal RAG System for Test Case Generation

AI-powered test case and use case generation using Retrieval-Augmented Generation with safety guardrails.

Overview

This is a production-ready RAG system that generates test cases and use cases from documentation. It processes multimodal documents (text, PDF, DOCX, images), retrieves relevant context using hybrid search, and generates structured outputs with three-layer validation guards.

Key Features

Context-Grounded Generation: All outputs verified against source documents
Three-Layer Guards: Prompt injection detection, evidence threshold checking, hallucination prevention
Multimodal Support: Text, Markdown, PDF, DOCX, and images via OCR
Hybrid Search: Combines vector similarity (semantic) and BM25 (keyword matching)
RESTful API: Auto-generated documentation with OpenAPI/Swagger

Quick Start

Prerequisites

Python 3.10+
Groq API Key (free at console.groq.com)

Installation

git clone <your-repo-url>
cd multimodal-rag

python -m venv venv
venv\Scripts\activate  # Windows: venv\Scripts\activate

pip install -r requirements.txt

copy sample.env .env
# Edit .env and add your GROQ_API_KEY

python main.py

Server starts at http://localhost:8000. View API docs at http://localhost:8000/docs.

Architecture

System Overview

┌─────────────────────────────────────────────────────┐
│                  CLIENT REQUEST                      │
│            (HTTP POST with query/files)              │
└────────────────────┬────────────────────────────────┘
                     ▼
┌─────────────────────────────────────────────────────┐
│              FASTAPI APPLICATION LAYER               │
│   Routes: /upload, /generate/*, /query, /stats      │
└────────────────────┬────────────────────────────────┘
                     │
        ┌────────────┼────────────┬─────────────┐
        ▼            ▼            ▼             ▼
┌─────────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐
│ INGESTION   │ │RETRIEVAL │ │GENERATION│ │ GUARDS  │
│  PIPELINE   │ │  SYSTEM  │ │  ENGINE  │ │ SYSTEM  │
└─────────────┘ └──────────┘ └──────────┘ └─────────┘

Component Details

1. Ingestion Pipeline

Purpose: Transform raw documents into searchable vector embeddings.

Process Flow:

Document → Parser → Text Extraction → Chunker → Embeddings → Vector Store

Components:

File Handler: Routes files to appropriate parser based on extension
Parsers:
- TextParser: .txt, .md files (native Python)
- PDFParser: .pdf files (PyPDF2)
- DocxParser: .docx files (python-docx)
- ImageParser: .png, .jpg files (Tesseract OCR)
Chunker: Splits text into 512-token chunks with 50-token overlap
Embedding Model: all-MiniLM-L6-v2 (384-dimensional vectors)
Vector Store: ChromaDB with HNSW index

Key Decisions:

Chunk size 512: Balance between context and embedding model limits
Overlap 50: Prevents context loss at boundaries
Local embeddings: No API calls, faster processing

2. Retrieval System

Purpose: Find most relevant document chunks for a given query.

Hybrid Search Algorithm:

Query → [Vector Search] + [BM25 Search] → Score Fusion → Top-K Results

Vector Search (Semantic):

Embeds query using same model as documents
Performs cosine similarity search in ChromaDB
Captures semantic meaning (e.g., "login" matches "authentication")

BM25 Search (Keyword):

Tokenizes query and documents
Uses Okapi BM25 ranking function
Captures exact keyword matches (e.g., "API_KEY_123")

Score Fusion:

# Normalize both scores to [0, 1] range
vector_norm = vector_score / max_vector_score
bm25_norm = bm25_score / max_bm25_score

# Combine with alpha weight (default: 0.3)
hybrid_score = (alpha * vector_norm) + ((1 - alpha) * bm25_norm)

Why Hybrid:

Pure vector misses exact keywords
Pure BM25 misses semantic similarity
Hybrid provides best of both approaches

3. Generation Engine

Purpose: Generate structured test cases/use cases using LLM.

Components:

LLM Client: Abstraction over Groq API (llama-3.3-70b-versatile)
Prompt Templates: Centralized in src/config/prompts.py
Use Case Generator: Generates preconditions, steps, expected results
Test Case Generator: Generates test ID, priority, type, steps
Output Formatter: Parses and validates JSON responses

Generation Flow:

1. Retrieve top-k relevant chunks
2. Construct prompt with context + query
3. Call LLM with temperature=0.2 (deterministic)
4. Parse JSON response
5. Validate structure and fields

LLM Configuration:

Model: llama-3.3-70b-versatile (Groq)
Temperature: 0.2 (low for consistency)
Max tokens: 2000
Response format: JSON

4. Guards System

Purpose: Multi-layer validation to ensure safety and quality.

Guard 1: Prompt Injection Guard

When: Before retrieval
Purpose: Detect malicious queries attempting to manipulate system
Method: 20+ regex patterns matching known attack vectors
Action: Block high-risk queries, return 400 error

Guard 2: Evidence Threshold Guard

When: After retrieval, before LLM call
Purpose: Ensure sufficient context quality
Method: Weighted confidence score from retrieval results

Formula:

confidence = sum(score * weight for score, weight in zip(scores, [1.0, 0.8, 0.6, 0.4, 0.2]))
sufficient = confidence >= MIN_EVIDENCE_CONFIDENCE

Action: Block generation if confidence too low, provide recommendation

Guard 3: Hallucination Guard

When: After generation
Purpose: Verify output is grounded in source documents
Method:
- Extract statements from generated output
- Calculate semantic similarity with context chunks
- Compute grounding ratio
Action: Flag warning if <70% of statements are grounded

Guard Orchestration:

Query → Guard 1 ✓ → Retrieval → Guard 2 ✓ → LLM → Guard 3 ✓ → Response
        (block)              (block)              (warn)

Data Flow: Complete Request Lifecycle

Document Upload Flow

1. POST /upload with multipart file
2. Validate file type and size
3. Save to data/uploads/
4. Detect file type, route to parser
5. Parse: Extract text + metadata
6. Chunk: Split into 512-token pieces
7. Embed: Generate 384-dim vectors (batched)
8. Store: Add to ChromaDB collection
9. Index: Build BM25 inverted index
10. Return: chunks_indexed, embedding_dim

Generation Request Flow

1. POST /generate/test-cases with query
2. Guard 1: Check for prompt injection
   → If unsafe: Return 400 error
3. Embed query using same model
4. Hybrid Search:
   a. Vector search: ChromaDB similarity query
   b. BM25 search: Okapi BM25 ranking
   c. Normalize scores to [0, 1]
   d. Fuse: hybrid = 0.3*vector + 0.7*bm25
   e. Sort by hybrid score, select top-k
5. Guard 2: Calculate confidence
   → If insufficient: Return error with recommendation
6. Construct LLM prompt:
   - System: "You are a QA engineer..."
   - User: "Context: [chunks]\nQuery: {query}\nOutput: JSON"
7. Call Groq API with llama-3.3-70b
8. Parse JSON response
9. Guard 3: Verify grounding
   → If not grounded: Log warning (still return)
10. Format response with validation metadata
11. Return JSON to client

Technical Specifications

Embedding Model:

Name: sentence-transformers/all-MiniLM-L6-v2
Dimensions: 384
Max sequence length: 256 tokens
Performance: ~200ms for 32 queries (batch)

Vector Database:

Type: ChromaDB (embedded)
Index: HNSW (Hierarchical Navigable Small World)
Distance metric: Cosine similarity
Persistence: SQLite backend

Search Parameters:

TOP_K: Number of chunks to retrieve (default: 10)
SIMILARITY_THRESHOLD: Minimum cosine similarity (default: 0.5)
HYBRID_ALPHA: Vector vs BM25 weight (default: 0.3)
MIN_EVIDENCE_CONFIDENCE: Guard threshold (default: 0.5)

LLM Settings:

Provider: Groq
Model: llama-3.3-70b-versatile
Speed: ~300 tokens/second
Temperature: 0.2 (deterministic outputs)
Max tokens: 2000
Response format: JSON mode

Performance Characteristics:

Document parsing: 200-500ms (depends on file size)
Embedding generation: 100-200ms per batch of 32
Vector search: 20-50ms
BM25 search: 10-30ms
LLM generation: 1-3s (network + processing)
Total end-to-end: 2-4s per request

Configuration

Edit .env file:

# LLM Configuration
GROQ_API_KEY=your_api_key_here
GROQ_MODEL=llama-3.3-70b-versatile
LLM_TEMPERATURE=0.2
MAX_TOKENS=2000

# Retrieval
TOP_K=10
SIMILARITY_THRESHOLD=0.5
HYBRID_ALPHA=0.3

# Guards
MIN_EVIDENCE_CONFIDENCE=0.5
ENABLE_HALLUCINATION_GUARD=true
ENABLE_PROMPT_INJECTION_GUARD=true
ENABLE_EVIDENCE_THRESHOLD=true

# Chunking
CHUNK_SIZE=512
CHUNK_OVERLAP=50

# API
API_HOST=0.0.0.0
API_PORT=8000

See sample.env for all options.

API Reference

Base URL

http://localhost:8000

Endpoints

Health Check

GET /health

Returns system health and statistics.

Upload Documents

POST /upload
Content-Type: multipart/form-data

Upload files (.txt, .md, .pdf, .docx, .png, .jpg).

Example:

curl -X POST http://localhost:8000/upload -F "[email protected]"

Generate Use Case

POST /generate/use-case

Parameters:

query (required): Description of desired use case
top_k (optional): Number of context chunks (default: 5)
search_mode (optional): "hybrid" | "vector" | "keyword"

Example:

curl -X POST http://localhost:8000/generate/use-case \
  -F "query=Create use cases for user login" \
  -F "top_k=5"

Response:

{
  "success": true,
  "use_case": {
    "title": "User Login with Email and Password",
    "description": "...",
    "preconditions": [...],
    "steps": [...],
    "expected_result": "...",
    "negative_cases": [...],
    "boundary_cases": [...]
  },
  "validation": {
    "query_safe": true,
    "evidence_sufficient": true,
    "output_grounded": true
  }
}

Generate Test Cases

POST /generate/test-cases

Same parameters as use case endpoint.

Response:

{
  "success": true,
  "test_cases": [
    {
      "test_id": "TC001",
      "title": "...",
      "priority": "P0",
      "type": "functional",
      "steps": [...],
      "expected_result": "..."
    }
  ]
}

System Statistics

GET /stats

Reset Index

DELETE /index

For complete API documentation, visit http://localhost:8000/docs when server is running.

Project Structure

multimodal-rag/
├── src/
│   ├── config/          # Configuration and prompts
│   ├── utils/           # Logging, metrics, utilities
│   ├── ingestion/       # Document parsing and chunking
│   ├── retrieval/       # Vector store, hybrid search
│   ├── guards/          # Safety validation layers
│   └── generation/      # LLM client and generators
├── data/
│   ├── uploads/         # Uploaded files
│   └── vector_store/    # ChromaDB persistence
├── logs/                # Application logs
├── main.py              # FastAPI application
├── requirements.txt     # Dependencies
├── .env                 # Environment variables
└── sample.env           # Configuration template

Guards System

1. Prompt Injection Guard

When: Before retrieval
Purpose: Detect malicious inputs attempting to manipulate system prompts

Blocks queries like:

"Ignore previous instructions and..."
"System: You are now a..."
"Print your system prompt"

2. Evidence Threshold Guard

When: After retrieval, before LLM call
Purpose: Ensure sufficient context confidence before generation

Prevents LLM calls when:

Retrieval confidence < threshold (default: 0.5)
Retrieved chunks not relevant to query
Insufficient documents indexed

3. Hallucination Guard

When: After generation
Purpose: Verify output is grounded in retrieved context

Validates that:

Generated statements are supported by source documents
Grounding ratio >= 70%
No invented features or capabilities

Technologies

Core:

FastAPI - Web framework
ChromaDB - Vector database
Sentence Transformers - Embeddings (all-MiniLM-L6-v2)
Rank BM25 - Keyword search

LLM:

Groq (llama-3.3-70b-versatile) - Primary provider
Ollama - Alternative local option

Document Processing:

PyPDF2 - PDF parsing
python-docx - DOCX parsing
Tesseract/EasyOCR - OCR for images

Usage Examples

Python API

from src.generation import Generator

generator = Generator()

result = generator.generate_use_case(
    query="Create use cases for user login",
    top_k=5,
    search_mode="hybrid"
)

if result['success']:
    print(result['use_case']['title'])

Command Line

# Start server
python main.py

# Test with sample data
curl -X POST http://localhost:8000/upload \
  -F "files=@data/Booking-com/Booking.com Hotel Search.docx"

curl -X POST http://localhost:8000/generate/test-cases \
  -F "query=Generate test cases for hotel search filters"

Development

Running Tests

pytest tests/ -v

Code Quality

black src/
flake8 src/
mypy src/

Deployment

Systemd Service

sudo nano /etc/systemd/system/devasssure.service

[Unit]
Description=DevAssure RAG API
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/multimodal-rag
Environment="PATH=/opt/multimodal-rag/venv/bin"
ExecStart=/opt/multimodal-rag/venv/bin/python /opt/multimodal-rag/main.py
Restart=always

[Install]
WantedBy=multi-user.target

sudo systemctl enable devasssure
sudo systemctl start devasssure

Docker

FROM python:3.10-slim
WORKDIR /app
RUN apt-get update && apt-get install -y tesseract-ocr
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "main.py"]

Troubleshooting

Import Errors

pip install -r requirements.txt

ChromaDB Lock

pkill -f "python main.py"
rm data/vector_store/chroma.sqlite3-wal
python main.py

Tesseract Not Found

# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
# Linux:
sudo apt install tesseract-ocr

# Update .env:
TESSERACT_PATH=/usr/bin/tesseract

Performance

Typical metrics (16GB RAM, 8-core CPU):

Operation	Time
Upload PDF (10 pages)	2.5s
Chunk + Embed (100 chunks)	3.8s
Hybrid Search	45ms
LLM Generation	2.1s
End-to-End Query	2.2s

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
data		data
src		src
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
sample.env		sample.env

isatyamks/multimodal-rag

Folders and files

Latest commit

History

Repository files navigation