This project offers a production-ready RAG (Retrieval-Augmented Generation) API built on FastAPI and powered by the high-performance vLLM engine.
The system reads news documents from the `data/` folder, indexes them with LlamaIndex and FAISS, and generates embeddings with Ollama (e.g., `nomic-embed-text`). Incoming queries are answered at high speed by a model served by vLLM, such as `microsoft/Phi-3-mini-4k-instruct`.
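The indexing flow can be pictured with the short sketch below. It is an illustrative outline, not the project's own `src/` code; the 768-dimensional FAISS index matches `nomic-embed-text`'s default output size, and the imports assume the llama-index FAISS and Ollama integration packages are installed.

```python
# Illustrative indexing sketch (the project's actual logic lives in src/).
import faiss
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

# Embeddings come from the locally running Ollama server.
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text", base_url="http://localhost:11434"
)

# Load the .txt news documents and build a FAISS-backed vector index.
documents = SimpleDirectoryReader("data/42bin_haber/news").load_data()
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatL2(768))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# Persist so the index can be reloaded from ./storage on the next start.
index.storage_context.persist(persist_dir="./storage")
```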
- FastAPI (`api/`): Asynchronous server that communicates with the outside world. Provides the `/api/v1/generate` endpoint.
- Lifespan Management: When the server starts (`@asynccontextmanager`), the entire RAG pipeline (documents, index, vLLM engine) is loaded into memory (see the sketch after this list).
- Core Logic (`src/`):
  - `src/data_loader.py`: Reads `.txt` files from the `data/42bin_haber/news` folder.
  - `src/indexing.py`: Creates or loads the FAISS vector database from `./storage`.
  - `src/llm_engine.py`: Isolated class that starts and manages the vLLM engine.
  - `src/pipeline.py`: The `ProductionRAG` class; manages the retrieve (find context), plan (create a plan), and generate (produce the answer) steps.
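The startup pattern can be sketched roughly as follows. This is a minimal illustration of the lifespan idea using FastAPI's real lifespan API; the `ProductionRAG()` constructor and the `answer()` method used here are assumed names, not the repository's exact interface.

```python
# Minimal lifespan sketch (illustrative only, not the repo's actual code).
from contextlib import asynccontextmanager

from fastapi import FastAPI
from pydantic import BaseModel

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load documents, the FAISS index, and the vLLM engine once,
    # so every request reuses them instead of re-initializing.
    from src.pipeline import ProductionRAG  # module name taken from the layout above
    state["rag"] = ProductionRAG()          # constructor signature is an assumption
    yield
    # Shutdown: drop references; add explicit engine cleanup here if needed.
    state.clear()

app = FastAPI(lifespan=lifespan)

class GenerateRequest(BaseModel):
    question: str
    max_tokens: int = 512
    temperature: float = 0.5

@app.post("/api/v1/generate")
async def generate(req: GenerateRequest):
    # retrieve -> plan -> generate, handled inside src/pipeline.py
    return {"answer": state["rag"].answer(req.question)}  # answer() is assumed
```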
- NVIDIA GPU: A CUDA-enabled GPU is required for vLLM.
- Docker: Docker and the NVIDIA Container Toolkit are required.
- Ollama: Ollama must be installed and running to serve the `nomic-embed-text` embedding model:

```bash
ollama pull nomic-embed-text
```

- Clone the Repository:

```bash
git clone https://github.com/AbdulSametTurkmenoglu/vllm_rag_api.git
cd vllm_rag_api
```
- Add Data: Copy all your `.txt` files from the `42bin_haber/news` folder to `data/42bin_haber/news/`.

- Virtual Environment and Libraries:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

- .env File: Copy the `.env.example` file to `.env` and make sure `OLLAMA_BASE_URL` is correct.
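For reference, a minimal `.env` might contain just the Ollama address. Only `OLLAMA_BASE_URL` is named in this README; keep any other entries from `.env.example` as provided, and note that the URL below assumes a default local Ollama install.

```env
# Base URL where the Ollama server is reachable (default local install).
OLLAMA_BASE_URL=http://localhost:11434
```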
- Add Data: Add your documents to the `data/` folder.

- Build Docker Image:

```bash
docker build -t vllm-rag-api .
```

- Run Docker Container:
  - `--gpus all`: GPU access for vLLM.
  - `-v ./data:/app/data`: Mounts your local `data` folder into the container.
  - `-v ./storage:/app/storage`: Mounts the `storage` folder so the index persists across runs.
  - `--network host`: Allows the container to reach Ollama at `localhost:11434`.

```bash
docker run -d --gpus all -p 8000:8000 \
  -v ./data:/app/data \
  -v ./storage:/app/storage \
  --network host \
  --name rag_api \
  vllm-rag-api
```

To run the server locally instead of in Docker, make sure Ollama is running, then execute:
```bash
python run_server.py
```

The server will start at http://0.0.0.0:8000.
To test whether the RAG logic works without starting the API:

```bash
python run_local_test.py
```

While the server (local or Docker) is running, you can access the Swagger UI at http://localhost:8000/docs.
Example curl request:
```bash
curl -X 'POST' \
  'http://localhost:8000/api/v1/generate' \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "What are the latest developments in Turkey'\''s economy?",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
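The same call from Python, as a small client sketch using the third-party `requests` package (the URL and JSON fields mirror the curl example above):

```python
# Equivalent request from Python; requires `pip install requests`.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/generate",
    json={
        "question": "What are the latest developments in Turkey's economy?",
        "max_tokens": 512,
        "temperature": 0.5,
    },
    timeout=120,  # generation can take a while, especially on first load
)
resp.raise_for_status()
print(resp.json())
```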