This project offers a production-ready RAG (Retrieval-Augmented Generation) API built on FastAPI and powered by the high-performance vLLM engine.
The system reads news documents from the `data/` folder, indexes them with LlamaIndex and FAISS, and generates embeddings with Ollama (e.g., `nomic-embed-text`). Incoming queries are answered at high speed by a model served by vLLM, such as `microsoft/Phi-3-mini-4k-instruct`.
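The indexing flow can be pictured with the short sketch below. It is an illustrative outline, not the project's own `src/` code; the 768-dimensional FAISS index matches `nomic-embed-text`'s default output size, and the imports assume the llama-index FAISS and Ollama integration packages are installed.

```python
# Illustrative indexing sketch (the project's actual logic lives in src/).
import faiss
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

# Embeddings come from the locally running Ollama server.
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text", base_url="http://localhost:11434"
)

# Load the .txt news documents and build a FAISS-backed vector index.
documents = SimpleDirectoryReader("data/42bin_haber/news").load_data()
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatL2(768))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# Persist so the index can be reloaded from ./storage on the next start.
index.storage_context.persist(persist_dir="./storage")
```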
- FastAPI (`api/`): Asynchronous server that communicates with the outside world. Provides the `/api/v1/generate` endpoint.
- Lifespan Management: When the server starts (`@asynccontextmanager`), the entire RAG pipeline (documents, index, vLLM engine) is loaded into memory (see the sketch after this list).
- Core Logic (`src/`):
  - `src/data_loader.py`: Reads `.txt` files from the `data/42bin_haber/news` folder.
  - `src/indexing.py`: Creates or loads the FAISS vector database from `./storage`.
  - `src/llm_engine.py`: Isolated class that starts and manages the vLLM engine.
  - `src/pipeline.py`: The `ProductionRAG` class; manages the retrieve (find context), plan (create a plan), and generate (produce the answer) steps.
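The startup pattern can be sketched roughly as follows. This is a minimal illustration of the lifespan idea using FastAPI's real lifespan API; the `ProductionRAG()` constructor and the `answer()` method used here are assumed names, not the repository's exact interface.

```python
# Minimal lifespan sketch (illustrative only, not the repo's actual code).
from contextlib import asynccontextmanager

from fastapi import FastAPI
from pydantic import BaseModel

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load documents, the FAISS index, and the vLLM engine once,
    # so every request reuses them instead of re-initializing.
    from src.pipeline import ProductionRAG  # module name taken from the layout above
    state["rag"] = ProductionRAG()          # constructor signature is an assumption
    yield
    # Shutdown: drop references; add explicit engine cleanup here if needed.
    state.clear()

app = FastAPI(lifespan=lifespan)

class GenerateRequest(BaseModel):
    question: str
    max_tokens: int = 512
    temperature: float = 0.5

@app.post("/api/v1/generate")
async def generate(req: GenerateRequest):
    # retrieve -> plan -> generate, handled inside src/pipeline.py
    return {"answer": state["rag"].answer(req.question)}  # answer() is assumed
```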
- NVIDIA GPU: A CUDA-enabled GPU is required for vLLM.
- Docker: Docker and the NVIDIA Container Toolkit are required.
- Ollama: Ollama must be installed and running to serve the `nomic-embed-text` embedding model:

```bash
ollama pull nomic-embed-text
```

- Clone the Repository:

```bash
git clone https://github.com/AbdulSametTurkmenoglu/vllm_rag_api.git
cd vllm_rag_api
```
- Add Data: Copy all your `.txt` files from the `42bin_haber/news` folder to `data/42bin_haber/news/`.

- Virtual Environment and Libraries:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

- .env File: Copy the `.env.example` file to `.env` and make sure `OLLAMA_BASE_URL` is correct.
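For reference, a minimal `.env` might contain just the Ollama address. Only `OLLAMA_BASE_URL` is named in this README; keep any other entries from `.env.example` as provided, and note that the URL below assumes a default local Ollama install.

```env
# Base URL where the Ollama server is reachable (default local install).
OLLAMA_BASE_URL=http://localhost:11434
```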
- Add Data: Add your documents to the `data/` folder.

- Build Docker Image:

```bash
docker build -t vllm-rag-api .
```

- Run Docker Container:
  - `--gpus all`: GPU access for vLLM.
  - `-v ./data:/app/data`: Mounts your local `data` folder into the container.
  - `-v ./storage:/app/storage`: Mounts the `storage` folder so the index persists across runs.
  - `--network host`: Allows the container to reach Ollama at `localhost:11434`.

```bash
docker run -d --gpus all -p 8000:8000 \
  -v ./data:/app/data \
  -v ./storage:/app/storage \
  --network host \
  --name rag_api \
  vllm-rag-api
```

To run the server locally instead of in Docker, make sure Ollama is running, then execute:
```bash
python run_server.py
```

The server will start at http://0.0.0.0:8000.
To test whether the RAG logic works without starting the API:

```bash
python run_local_test.py
```

While the server (local or Docker) is running, you can access the Swagger UI at http://localhost:8000/docs.
Example curl request:
```bash
curl -X 'POST' \
  'http://localhost:8000/api/v1/generate' \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "What are the latest developments in Turkey'\''s economy?",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
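The same call from Python, as a small client sketch using the third-party `requests` package (the URL and JSON fields mirror the curl example above):

```python
# Equivalent request from Python; requires `pip install requests`.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/generate",
    json={
        "question": "What are the latest developments in Turkey's economy?",
        "max_tokens": 512,
        "temperature": 0.5,
    },
    timeout=120,  # generation can take a while, especially on first load
)
resp.raise_for_status()
print(resp.json())
```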