A lightweight REST API server powered by FastAPI & PyTorch.
It runs seamlessly on GPU (CUDA) if available, and safely falls back to CPU.
🚀 Includes a demo model, performance probes, a web control panel, and an extensible Plugin System for real AI models.
- ✅ Auto-selects GPU or CPU at runtime (`DEVICE` in `.env`).
- ✅ Clean FastAPI endpoints with Swagger UI (`/docs`) and ReDoc (`/redoc`).
- ✅ Interactive Control Panel (`/ui`) & Model Size Calculator (`/tools/model-size`).
- ✅ Example TinyNet model + benchmarks (`/matmul`, `/infer`).
- ✅ Extensible Plugin System with ready-to-use AI models.
- ✅ Works on Windows/Linux/macOS (CPU) & CUDA (NVIDIA GPUs).
```text
app/
  main.py            # FastAPI app & endpoints
  runtime.py         # device selection, CUDA info, warmup
  toy_model.py       # TinyNet demo model
  plugins/           # Modular plugins (bart, clip, resnet18, ner, etc.)
  templates/         # HTML templates for UI
scripts/
  install_torch.py   # Auto-installs correct PyTorch (CPU/GPU)
  test_api.py        # Quick client to test endpoints
requirements.txt
README.md
```
Windows (PowerShell):

```powershell
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
```

Linux/macOS:

```bash
python3.12 -m venv .venv
source .venv/bin/activate
```

Install dependencies, fetch the matching PyTorch build, and start the server:

```bash
pip install -r requirements.txt
python -m scripts.install_torch
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Open in browser:
- Swagger UI → http://localhost:8000/docs
- Control Panel → http://localhost:8000/ui
- Plugin Console → http://localhost:8000/plugins/ui
- Infer Client → http://localhost:8000/infer-client
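
Once the server is up, a quick smoke test also works from Python. This is a minimal sketch; it assumes the server from the quickstart above is running on localhost:8000 and that the `requests` package is installed (it is not necessarily part of `requirements.txt`):

```python
# quick_check.py - minimal sketch: ping /health and /cuda on a local NeuroServe instance.
# Assumes the server is running on localhost:8000 and requests is installed.
import requests

BASE_URL = "http://localhost:8000"

health = requests.get(f"{BASE_URL}/health", timeout=5)
print("health:", health.status_code, health.json())

cuda = requests.get(f"{BASE_URL}/cuda", timeout=5)
print("cuda/device info:", cuda.json())
```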
Configuration lives in `.env`:

```env
# Prefer the first CUDA GPU if available
DEVICE=cuda:0

# Force CPU mode (override)
# DEVICE=cpu

# Warmup matrix size (for benchmarks)
WARMUP_MATMUL_SIZE=1024
```
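
The `DEVICE` value drives the GPU/CPU auto-selection listed in the features. The actual logic lives in `app/runtime.py`; the sketch below only illustrates the fallback idea, with a hypothetical helper name:

```python
# Illustrative sketch only (the real device selection lives in app/runtime.py).
import os

import torch

def pick_device() -> torch.device:  # hypothetical helper, not the project's API
    """Honor DEVICE from the environment, falling back to CPU if CUDA is unusable."""
    requested = os.getenv("DEVICE", "cuda:0")
    if requested.startswith("cuda") and not torch.cuda.is_available():
        # A GPU was requested but none is usable: fall back safely to CPU.
        return torch.device("cpu")
    return torch.device(requested)

print(pick_device())  # e.g. cuda:0 on a GPU box, cpu otherwise
```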
Available endpoints:

| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /cuda | CUDA / device info |
| GET | /env | Environment summary |
| GET | /env/full | Full environment + GPU list |
| GET | /env/system | OS/CPU/RAM system info |
| POST | /matmul | Matrix multiply benchmark |
| POST | /infer | TinyNet inference |
| POST | /inference | Generic plugin inference |
| GET | /plugins | List available plugins |
| GET | /ui | Interactive control panel |
| GET | /tools/model-size | Model size calculator (UI) |
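
The POST benchmarks can also be driven programmatically. The sketch below times a /matmul round trip with the same `{"n": 2048}` payload used in the curl example further down; it assumes a local server and an installed `requests` package:

```python
# Sketch: call the /matmul benchmark endpoint and report the round-trip time.
# Assumes the server is on localhost:8000 and requests is installed.
import time

import requests

start = time.perf_counter()
resp = requests.post("http://localhost:8000/matmul", json={"n": 2048}, timeout=60)
elapsed = time.perf_counter() - start

print("status:", resp.status_code)
print("server response:", resp.json())
print(f"round-trip: {elapsed:.2f}s")
```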
NeuroServe uses a modular plugin system under app/plugins/.
Available plugins:
- bart → Text summarization (facebook/bart-large-cnn)
- clip → Text ↔ Image embeddings & similarity
- distilbert → Sentiment classification
- resnet18 → Image classification (ImageNet)
- ner → Named Entity Recognition
- tinyllama → Lightweight text generation
- translator / translator_m2m → Translation
- pdf_reader → Extract text from PDFs
- dummy → Ping test
- dichfoto_proxy → Forwarding proxy
🔧 Build your own Plugin (5 min)
```python
from app.plugins.base import AIPlugin

class Plugin(AIPlugin):
    tasks = ["hello"]

    def load(self):
        print("[plugin] hello ready")

    def infer(self, payload):
        name = payload.get("name", "world")
        return {"task": "hello", "message": f"Hello, {name}!"}
```

Add it in `app/plugins/hello/` with a `manifest.json`.
Matrix Multiply
```bash
curl -X POST http://localhost:8000/matmul -H "Content-Type: application/json" -d '{"n": 2048}'
```

BART Summarization

```bash
curl -X POST http://localhost:8000/inference -H "Content-Type: application/json" -d '{"provider":"bart","task":"summarize","text":"Deep learning is a subfield..."}'
```

NER Entity Extraction

```bash
curl -X POST http://localhost:8000/inference -H "Content-Type: application/json" -d '{"provider":"ner","task":"extract-entities","text":"Barack Obama was born in Hawaii."}'
```

- Torch import error → Ensure Python 3.12 and rerun `python -m scripts.install_torch`.
- No GPU detected → The server falls back to CPU; force CPU with `DEVICE=cpu`.
- CUDA mismatch → Reinstall torch with a matching CUDA runtime.
- Out of Memory (OOM) → Reduce `max_length` or use CPU mode.
- Windows Execution Policy → `Set-ExecutionPolicy -Scope CurrentUser RemoteSigned`
- File upload endpoints (images/text).
- Docker images (CPU & GPU).
- Extended plugin demo UI.
- CI/CD with pre-commit hooks & GitHub Actions.
- Real AI plugins (BART, CLIP, ResNet, DistilBERT, NER, etc.).
MIT © 2025 TamerOnLine