mlx-openai-server

A high-performance OpenAI-compatible API server for MLX models. Run text, vision, audio, and image generation models locally on Apple Silicon with a drop-in OpenAI replacement.

Note: Requires macOS with M-series chips (MLX is optimized for Apple Silicon).

Key Features

  • 🚀 OpenAI-compatible API - Drop-in replacement for OpenAI services
  • 🖼️ Multimodal support - Text, vision, audio, and image generation/editing
  • 🎨 Image models - Flux-series generation (schnell, dev, krea-dev, flux2-klein) and editing (kontext), plus Qwen, Z-Image Turbo, and Fibo
  • 🔌 Easy integration - Works with existing OpenAI client libraries
  • ⚡ Performance - Configurable quantization (4/8/16-bit) and context length
  • 🎛️ LoRA adapters - Apply fine-tuned LoRA adapters to image generation and editing
  • 📈 Queue management - Built-in request queuing and monitoring

Installation

Prerequisites

  • macOS with Apple Silicon (M-series)
  • Python 3.11+

Quick Install

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install from PyPI
pip install mlx-openai-server

# Or install from GitHub
pip install git+https://github.com/cubist38/mlx-openai-server.git

Optional: Whisper Support

For audio transcription models, install ffmpeg:

brew install ffmpeg

Quick Start

Start the Server

# Text-only or multimodal models
mlx-openai-server launch \
  --model-path <path-to-mlx-model> \
  --model-type <lm|multimodal>

# Image generation (Flux-series)
mlx-openai-server launch \
  --model-type image-generation \
  --model-path <path-to-flux-model> \
  --config-name flux-dev \
  --quantize 8

# Image editing
mlx-openai-server launch \
  --model-type image-edit \
  --model-path <path-to-flux-model> \
  --config-name flux-kontext-dev \
  --quantize 8

# Embeddings
mlx-openai-server launch \
  --model-type embeddings \
  --model-path <embeddings-model-path>

# Whisper (audio transcription)
mlx-openai-server launch \
  --model-type whisper \
  --model-path mlx-community/whisper-large-v3-mlx
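
The examples in this README assume the server is reachable at http://localhost:8000. As with other OpenAI-compatible servers, the standard models endpoint makes a quick sanity check (assuming the default port):

# Should list the currently loaded model
curl http://localhost:8000/v1/models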

Server Parameters

  • --model-path: Path to MLX model (local or HuggingFace repo)
  • --model-type: lm, multimodal, image-generation, image-edit, embeddings, or whisper
  • --config-name: For image models - flux-schnell, flux-dev, flux-krea-dev, flux-kontext-dev, flux2-klein-4b, flux2-klein-9b, flux2-klein-edit-4b, flux2-klein-edit-9b, qwen-image, qwen-image-edit, z-image-turbo, fibo
  • --quantize: Quantization level - 4, 8, or 16 (image models)
  • --context-length: Max sequence length for memory optimization
  • --max-concurrency: Concurrent requests (default: 1)
  • --queue-timeout: Request timeout in seconds (default: 300)
  • --lora-paths: Comma-separated LoRA adapter paths (image models)
  • --lora-scales: Comma-separated LoRA scales (must match the number of paths; see the example below)
  • --log-level: DEBUG, INFO, WARNING, ERROR, CRITICAL (default: INFO)
  • --no-log-file: Disable file logging (console only)
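
These flags can be combined in a single launch. For example (the model path, LoRA paths, and scales below are placeholders):

mlx-openai-server launch \
  --model-type image-generation \
  --model-path <path-to-flux-model> \
  --config-name flux-dev \
  --quantize 8 \
  --lora-paths /path/to/lora_a,/path/to/lora_b \
  --lora-scales 0.8,0.6 \
  --max-concurrency 2 \
  --log-level DEBUG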

Supported Model Types

  1. Text-only (lm) - Language models via mlx-lm
  2. Multimodal (multimodal) - Text, images, audio via mlx-vlm
  3. Image generation (image-generation) - Flux-series, Qwen Image, Z-Image Turbo, Fibo
  4. Image editing (image-edit) - Flux kontext, Qwen Image Edit
  5. Embeddings (embeddings) - Text embeddings via mlx-embeddings
  6. Whisper (whisper) - Audio transcription (requires ffmpeg)

Image Model Configurations

Generation:

  • flux-schnell - Fast (4 steps, no guidance)
  • flux-dev - Balanced (25 steps, 3.5 guidance)
  • flux-krea-dev - High quality (28 steps, 4.5 guidance)
  • flux2-klein-4b / flux2-klein-9b - Flux 2 Klein models
  • qwen-image - Qwen image generation (50 steps, 4.0 guidance)
  • z-image-turbo - Z-Image Turbo
  • fibo - Fibo model

Editing:

  • flux-kontext-dev - Context-aware editing (28 steps, 2.5 guidance)
  • flux2-klein-edit-4b / flux2-klein-edit-9b - Flux 2 Klein editing
  • qwen-image-edit - Qwen image editing (50 steps, 4.0 guidance)

Using the API

The server provides OpenAI-compatible endpoints. Use standard OpenAI client libraries:

Text Completion

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)
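
Because the server is a drop-in OpenAI replacement, streaming should work through the same client. A minimal sketch, assuming the server honors the standard stream parameter:

stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}],
    stream=True
)
for chunk in stream:
    # Each chunk carries an incremental delta; content can be None on some chunks
    print(chunk.choices[0].delta.content or "", end="", flush=True)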

Vision (Multimodal)

import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("image.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode('utf-8')

response = client.chat.completions.create(
    model="local-multimodal",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        ]
    }]
)
print(response.choices[0].message.content)

Image Generation

import openai
import base64
from io import BytesIO
from PIL import Image

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.images.generate(
    prompt="A serene landscape with mountains and a lake at sunset",
    model="local-image-generation-model",
    size="1024x1024"
)

image_data = base64.b64decode(response.data[0].b64_json)
image = Image.open(BytesIO(image_data))
image.show()

Image Editing

import openai
import base64
from io import BytesIO
from PIL import Image

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("image.png", "rb") as f:
    result = client.images.edit(
        image=f,
        prompt="make it like a photo in 1800s",
        model="flux-kontext-dev"
    )

image_data = base64.b64decode(result.data[0].b64_json)
image = Image.open(BytesIO(image_data))
image.show()

Function Calling

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

messages = [{"role": "user", "content": "What is the weather in Tokyo?"}]
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather in a given city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"}
            },
            "required": ["city"]
        }
    }
}]

completion = client.chat.completions.create(
    model="local-model",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

if completion.choices[0].message.tool_calls:
    tool_call = completion.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

Embeddings

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.embeddings.create(
    model="local-model",
    input=["The quick brown fox jumps over the lazy dog"]
)

print(f"Embedding dimension: {len(response.data[0].embedding)}")

Structured Outputs (JSON Schema)

import openai
import json

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "Address",
        "schema": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "state": {"type": "string"},
                "zip": {"type": "string"}
            },
            "required": ["street", "city", "state", "zip"]
        }
    }
}

completion = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Format: 1 Hacker Wy Menlo Park CA 94025"}],
    response_format=response_format
)

address = json.loads(completion.choices[0].message.content)
print(json.dumps(address, indent=2))
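
Audio Transcription (Whisper)

With a whisper model launched (see Quick Start), transcription should work through the standard OpenAI audio API. A sketch, with the model name and audio file as placeholders:

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="local-whisper",
        file=f
    )
print(transcript.text)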

Advanced Configuration

Parser Configuration

For models requiring custom parsing (tool calls, reasoning):

mlx-openai-server launch \
  --model-path <path-to-model> \
  --model-type lm \
  --tool-call-parser qwen3 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice

Available parsers: qwen3, glm4_moe, qwen3_coder, qwen3_moe, qwen3_next, qwen3_vl, harmony, minimax_m2

Message Converters

For models requiring message format conversion:

mlx-openai-server launch \
  --model-path <path-to-model> \
  --model-type lm \
  --message-converter glm4_moe

Available converters: glm4_moe, minimax_m2, nemotron3_nano, qwen3_coder

Custom Chat Templates

mlx-openai-server launch \
  --model-path <path-to-model> \
  --model-type lm \
  --chat-template-file /path/to/template.jinja
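
Chat templates use the Jinja format that Hugging Face tokenizers expect. An illustrative minimal template is shown below; real models ship their own templates with their own special tokens:

{% for message in messages %}
<|{{ message['role'] }}|>
{{ message['content'] }}
{% endfor %}
{% if add_generation_prompt %}<|assistant|>{% endif %}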

Request Queue System

The server includes a request queue system with monitoring:

# Check queue status
curl http://localhost:8000/v1/queue/stats

Response:

{
  "status": "ok",
  "queue_stats": {
    "running": true,
    "queue_size": 3,
    "max_queue_size": 100,
    "active_requests": 1,
    "max_concurrency": 1
  }
}
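
The same endpoint is easy to poll programmatically, for example before submitting a heavy request. A sketch using the requests package:

import requests

stats = requests.get("http://localhost:8000/v1/queue/stats").json()
queue = stats["queue_stats"]
print(f"{queue['queue_size']}/{queue['max_queue_size']} queued, "
      f"{queue['active_requests']} active")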

Example Notebooks

Check the examples/ directory for comprehensive guides:

  • audio_examples.ipynb - Audio processing
  • embedding_examples.ipynb - Text embeddings
  • lm_embeddings_examples.ipynb - Language model embeddings
  • vlm_embeddings_examples.ipynb - Vision-language embeddings
  • vision_examples.ipynb - Vision capabilities
  • image_generations.ipynb - Image generation
  • image_edit.ipynb - Image editing
  • structured_outputs_examples.ipynb - JSON schema outputs
  • simple_rag_demo.ipynb - RAG pipeline demo

Large Models

For large models that approach or exceed available RAM, you can improve performance on macOS 15.0+ by running:

bash configure_mlx.sh

This raises the system's wired memory limit for better performance.
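
The script itself is not reproduced here, but on recent macOS releases the wired limit is typically adjusted through the iogpu.wired_limit_mb sysctl. An illustrative (assumed) equivalent, with the value in MB chosen for your machine:

# Example only: allow up to ~32 GB of wired memory (resets on reboot)
sudo sysctl iogpu.wired_limit_mb=32768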

Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

Follow Conventional Commits for commit messages.

License

MIT License - see LICENSE file for details.

Acknowledgments

Built on top of MLX and its ecosystem libraries, including mlx-lm (text), mlx-vlm (multimodal), and mlx-embeddings (embeddings).