Production-Ready vLLM RAG API

This project offers a production-ready RAG (Retrieval-Augmented Generation) API running on FastAPI, utilizing the high-performance vLLM engine.

The system reads news documents from the data/ folder, indexes them using LlamaIndex and FAISS, and generates embeddings with Ollama (e.g., nomic-embed-text). Incoming queries are answered at high speed by a model running on vLLM, such as microsoft/Phi-3-mini-4k-instruct.
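
The load-index-embed flow can be sketched roughly like this (a minimal sketch, not necessarily how src/indexing.py implements it; it assumes the llama-index-vector-stores-faiss and llama-index-embeddings-ollama integrations and the 768-dimensional nomic-embed-text embedding):

import faiss
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

# Embeddings are produced by Ollama; nomic-embed-text returns 768-dimensional vectors.
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Read the .txt news documents.
documents = SimpleDirectoryReader("data/42bin_haber/news").load_data()

# Build a FAISS-backed vector index and persist it to ./storage for reuse.
faiss_index = faiss.IndexFlatL2(768)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
index.storage_context.persist(persist_dir="./storage")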

Architecture

  1. FastAPI (api/): Asynchronous server that exposes the service to the outside world and provides the /api/v1/generate endpoint.
  2. Lifespan Management: An @asynccontextmanager lifespan hook loads the entire RAG pipeline (documents, index, vLLM engine) into memory when the server starts.
  3. Core Logic (src/):
    • src/data_loader.py: Reads the .txt files from the data/42bin_haber/news folder.
    • src/indexing.py: Creates the FAISS vector database in ./storage, or loads it if it already exists.
    • src/llm_engine.py: Isolated class that starts and manages the vLLM engine.
    • src/pipeline.py: The ProductionRAG class; manages the retrieve (find context), plan (draft an answer plan), and generate (produce the answer) steps. A sketch of how these pieces fit together follows this list.
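
How these pieces connect at startup and at request time can be sketched as follows (illustrative only; apart from the module paths, class name, endpoint, request fields, and the retrieve/plan/generate steps named above, the exact signatures are assumptions):

from contextlib import asynccontextmanager

from fastapi import FastAPI
from pydantic import BaseModel

from src.pipeline import ProductionRAG  # actual constructor arguments may differ


class GenerateRequest(BaseModel):
    question: str
    max_tokens: int = 512
    temperature: float = 0.5


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load documents, build/load the FAISS index, and start the vLLM engine once,
    # so every request reuses the same in-memory pipeline.
    app.state.rag = ProductionRAG()
    yield
    # Any vLLM engine shutdown/cleanup would go here.


app = FastAPI(lifespan=lifespan)


@app.post("/api/v1/generate")
async def generate(req: GenerateRequest):
    rag = app.state.rag
    # retrieve -> plan -> generate, as implemented in src/pipeline.py
    context = rag.retrieve(req.question)
    plan = rag.plan(req.question, context)
    answer = rag.generate(req.question, context, plan,
                          max_tokens=req.max_tokens,
                          temperature=req.temperature)
    return {"answer": answer}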

Installation

1. Prerequisites

  • NVIDIA GPU: A CUDA-enabled GPU is required for vLLM.
  • Docker: Docker and NVIDIA Container Toolkit.
  • Ollama: Ollama must be installed and running, since it serves the nomic-embed-text embedding model (a quick availability check is sketched after this list). Pull the model with:
    ollama pull nomic-embed-text
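
A quick way to confirm that Ollama is reachable and the embedding model has been pulled (a minimal sketch against Ollama's standard /api/tags endpoint on the default port; adjust the URL if your OLLAMA_BASE_URL differs):

import json
import urllib.request

# Ollama's /api/tags endpoint lists the locally available models.
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = [m["name"] for m in json.load(resp)["models"]]

print("nomic-embed-text available:",
      any(name.startswith("nomic-embed-text") for name in models))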

2. Local Installation

  1. Clone the Repository:
    git clone https://github.com/AbdulSametTurkmenoglu/vllm_rag_api.git
    cd vllm_rag_api
  2. Add Data: Copy all your .txt files from the 42bin_haber/news folder to data/42bin_haber/news/.
  3. Virtual Environment and Libraries:
    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  4. .env File: Copy .env.example to .env and make sure OLLAMA_BASE_URL points at your Ollama server (an example is shown below this list).
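
For reference, a minimal .env might look like this (OLLAMA_BASE_URL is the only variable this README mentions; .env.example may define others):

    OLLAMA_BASE_URL=http://localhost:11434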

3. Docker Installation (Recommended)

  1. Add Data: Add your documents to the data/ folder.
  2. Build the Docker Image:
    docker build -t vllm-rag-api .
  3. Run the Docker Container:
    • --gpus all: GPU access for vLLM.
    • -v ./data:/app/data: Mounts your local data folder into the container.
    • -v ./storage:/app/storage: Mounts the storage folder so the index persists across restarts.
    • --network host: Lets the container reach Ollama at localhost:11434. Note that with host networking the -p 8000:8000 mapping is ignored; the API is served directly on the host's port 8000.
    docker run -d --gpus all -p 8000:8000 \
      -v ./data:/app/data \
      -v ./storage:/app/storage \
      --network host \
      --name rag_api \
      vllm-rag-api

Usage

1. Start the API Server (Local)

Make sure Ollama is running, then execute:

python run_server.py

The server will start at http://0.0.0.0:8000.

2. Test the RAG Core (Local)

To test if the RAG logic works without starting the API:

python run_local_test.py

3. Using the API

While the server (local or Docker) is running, you can access the Swagger UI at http://localhost:8000/docs.

Example curl request:

curl -X 'POST' \
  'http://localhost:8000/api/v1/generate' \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "What are the latest developments in Turkey'\''s economy?",
    "max_tokens": 512,
    "temperature": 0.5
  }'
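
The same request from Python (a minimal sketch; it assumes the requests package is available in your environment, which may not be listed in requirements.txt):

import requests

resp = requests.post(
    "http://localhost:8000/api/v1/generate",
    json={
        "question": "What are the latest developments in Turkey's economy?",
        "max_tokens": 512,
        "temperature": 0.5,
    },
    timeout=120,  # generation can take a while, especially on the first request
)
resp.raise_for_status()
print(resp.json())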
